|
1 .. Author : Nishanth |
|
2 Internal Reviewer 1 : |
|
3 Internal Reviewer 2 : |
|
4 External Reviewer : |
|
5 |
1 Hello friends and welcome to the tutorial on Parsing Data |
6 Hello friends and welcome to the tutorial on Parsing Data |
2 |
7 |
3 {{{ Show the slide containing title }}} |
8 {{{ Show the slide containing title }}} |
4 |
9 |
5 {{{ Show the slide containing the outline slide }}} |
10 {{{ Show the slide containing the outline slide }}} |
6 |
11 |
7 In this tutorial, we shall learn |
12 In this tutorial, we shall learn |
8 |
13 |
9 * What is parsing data |
14 * What we mean by parsing data |
10 * the string operations required for parsing data |
15 * the string operations required for parsing data |
11 * datatype conversion |
16 * datatype conversion |
12 |
17 |
|
18 #[Puneeth]: Changed a few things, here. |
|
19 |
|
20 #[Puneeth]: I don't like the way the term "parsing data" has been used, all |
|
21 through the script. See if that can be changed. |
|
22 |
13 Let us have a look at the problem |
23 Let us have a look at the problem |
14 |
24 |
15 {{{ Show the slide containing problem statement. }}} |
25 {{{ Show the slide containing problem statement. }}} |
16 |
26 |
17 There is an input file containing huge no.of records. Each record corresponds |
27 There is an input file containing a huge number of records. Each record corresponds |
18 to a student. |
28 to a student. |
19 |
29 |
20 {{{ show the slide explaining record structure }}} |
30 {{{ show the slide explaining record structure }}} |
21 As you can see, each record consists of fields separated by a ";". The first |
31 As you can see, each record consists of fields separated by a ";". The first |
22 field is region code, then roll number, then name, marks of second language, |
32 field is region code, then roll number, then name, marks of second language, |
26 Our job is to calculate the mean of all the maths marks in the region "B". |
36 Our job is to calculate the mean of all the maths marks in the region "B". |
27 |
37 |
28 #[Nishanth]: Please note that I am not telling anything about AA since they do |
38 #[Nishanth]: Please note that I am not telling anything about AA since they do |
29 not know about any if/else yet. |
39 not know about any if/else yet. |
30 |
40 |
31 |
41 #[Puneeth]: Should we talk pass/fail etc? I think we should make the problem |
32 So what exactly is parsing data? |
42 simple and leave out all the columns after total marks. |
33 |
43 |
34 |
44 Now, what is parsing data? |
35 Parsing data is all about reading the data and converting it into a form which |
45 |
36 can be used for computations. In our case, that is numbers. |
46 From the input file, we can see that the data we have is in the form of |
|
47 text. Parsing this data is all about reading it and converting it into a form |
|
48 which can be used for computations -- in our case, a sequence of numbers. |
|
49 |
|
50 #[Puneeth]: should the word tokenizing, be used? Should it be defined before |
|
51 using it? |
37 |
52 |
38 We can clearly see that the problem involves reading files and tokenizing. |
53 We can clearly see that the problem involves reading files and tokenizing. |
39 |
54 |
40 .. #[[Amit:Definition of Tokenizing here.]] |
55 #[Puneeth]: the sentence above seems kinda redundant. |
|
56 |
41 Let us learn about tokenizing strings. Let us define a string first. Type |
57 Let us learn about tokenizing strings. Let us define a string first. Type |
42 :: |
58 :: |
43 |
59 |
44 line = "parse this string" |
60 line = "parse this string" |
45 |
61 |
46 We are now going to split this string on whitespace. |
62 We are now going to split this string on whitespace. |
47 :: |
63 :: |
48 |
64 |
49 line.split() |
65 line.split() |
50 |
66 |
51 As you can see, we get a list of strings. Which means, when split is called |
67 As you can see, we get a list of strings. Which means, when ``split`` is called |
52 without any arguments, it splits on whitespace. In simple words, all the spaces |
68 without any arguments, it splits on whitespace. In simple words, all the spaces |
53 are treated as one big space. |
69 are treated as one big space. |
54 |
70 |
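The whitespace behaviour described above can be checked with a small sketch. The variable `spaced_line` and its contents are made up for illustration and are not part of the tutorial's data.

```python
# Hypothetical sample line with irregular spacing (spaces, a tab, a trailing space).
spaced_line = "parse   this\tstring "

# With no argument, split() treats any run of whitespace as a single separator
# and ignores whitespace at the start and end of the string.
tokens = spaced_line.split()
print(tokens)
```

The result contains only the three words, with no empty strings.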
55 split also can split on a string of our choice. This is achieved by passing |
71 ``split`` also can split on a string of our choice. This is achieved by passing |
56 that as an argument. But first let us define a sample record from the file. |
72 that as an argument. But first let us define a sample record from the file. |
57 :: |
73 :: |
58 |
74 |
59 record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;" |
75 record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;" |
60 record.split(';') |
76 record.split(';') |
61 |
77 |
62 We can see that the string is split on ';' and we get each field separately. |
78 We can see that the string is split on ';' and we get each field separately. |
63 We can also observe that empty strings appear in the list wherever there are |
79 We can also observe that empty strings appear in the list wherever there are |
64 two semicolons without anything in between. |
80 two semicolons without anything in between. |
65 |
81 |
66 Hence split splits on whitespace if called without an argument and splits on |
82 To recap, ``split`` splits on whitespace if called without an argument and |
67 the given argument if it is called with an argument. |
83 splits on the given argument if it is called with an argument. |
68 |
84 |
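As a rough illustration of the recap, the sample record from the script can be split on ';' and inspected. The name `fields` is an assumption chosen for this sketch.

```python
# Sample record taken from the script above.
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"

# Splitting on ';' keeps every field, including the empty ones at the end.
fields = record.split(';')

print(fields[0])    # region code
print(fields[5])    # the field the script later reads as the maths mark
print(fields[-3:])  # trailing empty strings produced by the final ';;;'
print(len(fields))
```

Since the record contains eleven semicolons, the list has twelve entries, the last three of which are empty strings.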
69 {{{ Pause here and try out the following exercises }}} |
85 {{{ Pause here and try out the following exercises }}} |
70 |
86 |
71 %% 1 %% split the variable line using a space as argument. Is it the same as |
87 %% 1 %% split the variable line using a space as argument. Is it the same as |
72 splitting without an argument? |
88 splitting without an argument? |
74 {{{ continue from paused state }}} |
90 {{{ continue from paused state }}} |
75 |
91 |
76 We see that when we split on space, multiple whitespaces are not clubbed as one |
92 We see that when we split on space, multiple whitespaces are not clubbed as one |
77 and there is an empty string every time there are two consecutive spaces. |
93 and there is an empty string every time there are two consecutive spaces. |
78 |
94 |
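The difference between the two calls can be seen side by side in a short sketch. The string `double_spaced` is a hypothetical example with two spaces between the first two words.

```python
# Hypothetical line with two consecutive spaces after "parse".
double_spaced = "parse  this string"

# Splitting on a single space yields an empty string between the two spaces.
by_space = double_spaced.split(' ')

# Splitting without an argument clubs the consecutive spaces together.
by_whitespace = double_spaced.split()

print(by_space)
print(by_whitespace)
```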
79 Now that we know how to split a string, we can split the record and retreive each |
95 Now that we know how to split a string, we can split the record and retrieve |
80 field separately. But there is one problem. The region code "B" and a "B" |
96 each field separately. But there is one problem. The region code "B" and a "B" |
81 surrounded by whitespace are treated as two different regions. We must find a |
97 surrounded by whitespace are treated as two different regions. We must find a |
82 way to remove all the whitespace around a string so that "B" and a "B" with |
98 way to remove all the whitespace around a string so that "B" and a "B" with |
83 whitespace are treated as the same. |
99 whitespace are treated as the same. |
84 |
100 |
85 This is possible by using the =strip= method of strings. Let us define a |
101 This is possible by using the ``strip`` method of strings. Let us define a |
86 string by typing |
102 string by typing |
87 :: |
103 :: |
88 |
104 |
89 unstripped = " B " |
105 unstripped = " B " |
90 unstripped.strip() |
106 unstripped.strip() |
108 |
124 |
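The region-code problem that motivates `strip` can be sketched as follows. The padded name in `padded_name` is a hypothetical value used only to show that inner spaces are left alone.

```python
unstripped = " B "

# Without stripping, the padded code does not compare equal to "B".
matches_raw = (unstripped == "B")

# After stripping, the surrounding whitespace is gone and the comparison succeeds.
matches_stripped = (unstripped.strip() == "B")

# strip() only removes whitespace at the two ends, not inside the string.
padded_name = "  JOSEPH RAJ S  "
clean_name = padded_name.strip()

print(matches_raw, matches_stripped, clean_name)
```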
109 By now we know enough to separate fields from the record and to strip out any |
125 By now we know enough to separate fields from the record and to strip out any |
110 whitespace. The only roadblock we now have is conversion of string to float. |
126 whitespace. The only roadblock we now have is conversion of string to float. |
111 |
127 |
112 The splitting and stripping operations are done on a string and their result is |
128 The splitting and stripping operations are done on a string and their result is |
113 also a string, hence the marks that we have are still strings and mathematical |
129 also a string. Hence the marks that we have are still strings and mathematical |
114 operations on them are not possible. We must convert them into integers or floats |
130 operations are not possible on them. We must convert them into numbers |
115 |
131 (integers or floats), before we can perform mathematical operations on them. |
116 We shall look at converting strings into floats. We define an float string |
132 |
117 first. Type |
133 We shall look at converting strings into floats. We define a float string |
|
134 first. Type |
118 :: |
135 :: |
119 |
136 |
120 mark_str = "1.25" |
137 mark_str = "1.25" |
121 mark = float(mark_str) |
138 mark = float(mark_str) |
122 type(mark_str) |
139 type(mark_str) |
123 type(mark) |
140 type(mark) |
124 |
141 |
125 We can see that the string is converted to a float. We can perform mathematical |
142 We can see that the string is converted to a float. We can perform mathematical |
126 operations on it now. |
143 operations on it now. |
127 |
144 |
128 {{{ Pause here and try out the following exercises }}} |
145 {{{ Pause here and try out the following exercises }}} |
129 |
146 |
130 %% 3 %% What happens if you do int("1.25")? |
147 %% 3 %% What happens if you do int("1.25")? |
131 |
148 |
132 {{{ continue from paused state }}} |
149 {{{ continue from paused state }}} |
133 |
150 |
134 .. #[[Amit:I think there should be some interaction first here about the |
|
135 problem before we conclude to talking about the result.]] |
|
136 It raises an error since converting a float string into an integer directly is |
151 It raises an error since converting a float string into an integer directly is |
137 not possible. It involves an intermediate step of converting to float. |
152 not possible. It involves an intermediate step of converting to float. |
138 :: |
153 :: |
139 |
154 |
140 dcml_str = "1.25" |
155 dcml_str = "1.25" |
141 flt = float(dcml_str) |
156 flt = float(dcml_str) |
142 flt |
157 flt |
143 number = int(flt) |
158 number = int(flt) |
144 number |
159 number |
145 |
160 |
146 Using =int= it is possible to convert floats into integers. |
161 Using ``int`` it is also possible to convert floats into integers. |
147 |
162 |
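The conversion rules discussed above can be summarised in a small sketch. The `try`/`except` is used here only to capture the error without stopping the script, and the names `direct_failed` and `via_float` are assumptions made for this example.

```python
dcml_str = "1.25"

# Converting a float string directly to an integer raises a ValueError.
try:
    int(dcml_str)
    direct_failed = False
except ValueError:
    direct_failed = True

# Going through float first works; int() then drops the fractional part.
via_float = int(float(dcml_str))

# int() truncates towards zero rather than rounding.
truncated_positive = int(1.99)
truncated_negative = int(-1.99)

print(direct_failed, via_float, truncated_positive, truncated_negative)
```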
148 Now that we have all the machinery required to parse the file, let us solve the |
163 Now that we have all the machinery required to parse the file, let us solve the |
149 problem. We first read the file line by line and parse each record. We see if |
164 problem. We first read the file line by line and parse each record. We see if |
150 the region code is B and store the marks accordingly. |
165 the region code is B and store the marks accordingly. |
151 :: |
166 :: |
160 math_mark_str = fields[5] |
175 math_mark_str = fields[5] |
161 math_mark = float(math_mark_str) |
176 math_mark = float(math_mark_str) |
162 |
177 |
163 if region_code == "B": |
178 if region_code == "B": |
164 math_marks_B.append(math_mark) |
179 math_marks_B.append(math_mark) |
165 .. #[[Amit:This intuitively does not seem to be what you wanted]] |
180 |
166 |
181 |
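The whole procedure can be sketched end to end on a few in-memory records, which stand in for the lines of the input file. Only the first record comes from the script; the other two, including their names and marks, are hypothetical. Taking the maths mark from `fields[5]` follows the script's code. The last two lines anticipate the mean computation described next.

```python
# Stand-in for the lines of the input file; the two "B" records are made up.
records = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;",
    "B;015164;HYPOTHETICAL ONE;070;055;60;45;50;280;;;",
    " B ;015165;HYPOTHETICAL TWO;065;050;80;40;55;290;;;",
]

math_marks_B = []
for record in records:
    fields = record.split(';')

    # Strip the region code so that " B " and "B" are treated as the same region.
    region_code = fields[0].strip()

    # Convert the maths mark from a string to a float.
    math_mark = float(fields[5].strip())

    if region_code == "B":
        math_marks_B.append(math_mark)

# Mean = sum of the marks divided by how many there are.
mean_B = sum(math_marks_B) / len(math_marks_B)
print(mean_B)
```

With these sample records the list holds two marks, 60.0 and 80.0, so the mean is 70.0.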
167 Now we have all the maths marks of region "B" in the list math_marks_B. |
182 Now we have all the maths marks of region "B" in the list math_marks_B. |
168 To get the mean, we just have to sum the marks and divide by the length. |
183 To get the mean, we just have to sum the marks and divide by the length. |
169 :: |
184 :: |
170 |
185 |
177 we have learnt |
192 we have learnt |
178 |
193 |
179 * how to tokenize a string using various delimiters |
194 * how to tokenize a string using various delimiters |
180 * how to get rid of extra white space around a string |
195 * how to get rid of extra white space around a string |
181 * how to convert from one type to another |
196 * how to convert from one type to another |
182 .. #[[Amit:one datatype to another may be better.]] |
|
183 * how to parse input data and perform computations on it |
197 * how to parse input data and perform computations on it |
184 |
198 |
185 {{{ Show the "sponsored by FOSSEE" slide }}} |
199 {{{ Show the "sponsored by FOSSEE" slide }}} |
186 |
200 |
187 #[Nishanth]: Will add this line after all of us fix on one. |
201 #[Nishanth]: Will add this line after all of us fix on one. |
188 This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India |
202 This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India |
189 |
203 |
190 Hope you have enjoyed and found it useful. |
204 Hope you have enjoyed and found it useful. |
191 Thankyou |
205 Thank you |
192 |
206 |
193 .. Author : Nishanth |
|
194 Internal Reviewer 1 : Amit Sethi |
|
195 Internal Reviewer 2 : |
|
196 External Reviewer : |
|