27 |
27 |
28 #[Nishanth]: Please note that I am not telling anything about AA since they do |
28 #[Nishanth]: Please note that I am not telling anything about AA since they do |
29 not know about any if/else yet. |
29 not know about any if/else yet. |
30 |
30 |
31 |
31 |
32 Now what is parsing data. |
32 So what exactly is parsing data? |
33 |
33 |
34 From the input file, we can see that there is data in the form of text. Hence |
34 |
35 parsing data is all about reading the data and converting it into a form which |
35 Parsing data is all about reading the data and converting it into a form which |
36 can be used for computations. In our case, that is numbers. |
36 can be used for computations. In our case, that is numbers. |
37 |
37 |
38 We can clearly see that the problem involves reading files and tokenizing. |
38 We can clearly see that the problem involves reading files and tokenizing. |
39 |
39 |
|
40 .. #[[Amit:Definition of Tokenizing here.]] |
40 Let us learn about tokenizing strings. Let us define a string first. Type |
41 Let us learn about tokenizing strings. Let us define a string first. Type |
41 :: |
42 :: |
42 |
43 |
43 line = "parse this string" |
44 line = "parse this string" |
44 |
45 |
73 {{{ continue from paused state }}} |
74 {{{ continue from paused state }}} |
74 |
75 |
75 We see that when we split on space, multiple whitespaces are not clubbed as one |
76 We see that when we split on space, multiple whitespaces are not clubbed as one |
76 and there is an empty string every time there are two consecutive spaces. |
77 and there is an empty string every time there are two consecutive spaces. |
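
A minimal sketch of this behaviour follows. The sample string with extra spaces is an assumption made for illustration; it is not taken from the input file.

```python
# Hypothetical sample: two spaces after "parse", three after "this".
line = "parse  this   string"

# Splitting on a single space keeps an empty string between
# every pair of consecutive spaces.
tokens_on_space = line.split(" ")

# Splitting without an argument treats any run of whitespace
# as a single separator.
tokens_default = line.split()

print(tokens_on_space)
print(tokens_default)
```

Here `tokens_on_space` contains empty strings, while `tokens_default` holds only the three words.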
77 |
78 |
78 Now that we know splitting a string, we can split the record and retrieve each |
79 Now that we know how to split a string, we can split the record and retrieve each |
79 field separately. But there is one problem. The region code "B" and a "B" |
80 field separately. But there is one problem. The region code "B" and a "B" |
80 surrounded by whitespace are treated as two different regions. We must find a |
81 surrounded by whitespace are treated as two different regions. We must find a |
81 way to remove all the whitespace around a string so that "B" and a "B" with |
82 way to remove all the whitespace around a string so that "B" and a "B" with |
82 whitespace are treated as the same. |
83 whitespace are treated as the same. |
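
The problem can be sketched as below. The variable names are hypothetical, and `strip` is shown only as one way of removing the surrounding whitespace.

```python
# Two region codes that should mean the same region.
region_a = "B"
region_b = "  B  "   # same code, padded with spaces

# A plain comparison treats them as different strings.
same_before = region_a == region_b

# strip() returns a copy without leading and trailing whitespace.
same_after = region_a == region_b.strip()

print(same_before)
print(same_after)
```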
83 |
84 |
107 |
108 |
108 By now we know enough to separate fields from the record and to strip out any |
109 By now we know enough to separate fields from the record and to strip out any |
109 whitespace. The only roadblock left is converting a string to a float. |
110 whitespace. The only roadblock left is converting a string to a float. |
110 |
111 |
111 The splitting and stripping operations are done on a string and their result is |
112 The splitting and stripping operations are done on a string and their result is |
112 also a string. hence the marks that we have are still strings and mathematical |
113 also a string, hence the marks that we have are still strings and mathematical |
113 operations are not possible. We must convert them into integers or floats. |
114 operations on them are not possible. We must convert them into integers or floats. |
114 |
115 |
115 We shall look at converting strings into floats. We define a float string |
116 We shall look at converting strings into floats. We define a float string |
116 first. Type |
117 first. Type |
117 :: |
118 :: |
118 |
119 |
119 mark_str = "1.25" |
120 mark_str = "1.25" |
120 mark = int(mark_str) |
121 mark = float(mark_str) |
121 type(mark_str) |
122 type(mark_str) |
122 type(mark) |
123 type(mark) |
123 |
124 |
124 We can see that the string is converted to a float. We can perform mathematical |
125 We can see that the string is converted to a float. We can perform mathematical |
125 operations on them now. |
126 operations on it now. |
126 |
127 |
127 {{{ Pause here and try out the following exercises }}} |
128 {{{ Pause here and try out the following exercises }}} |
128 |
129 |
129 %% 3 %% What happens if you do int("1.25") |
130 %% 3 %% What happens if you do int("1.25") |
130 |
131 |
131 {{{ continue from paused state }}} |
132 {{{ continue from paused state }}} |
132 |
133 |
|
134 .. #[[Amit:I think there should be some interaction first here about the |
|
135 problem before we conclude to talking about the result.]] |
133 It raises an error since converting a float string into integer directly is |
136 It raises an error since converting a float string into integer directly is |
134 not possible. It involves an intermediate step of converting to float. |
137 not possible. It involves an intermediate step of converting to float. |
135 :: |
138 :: |
136 |
139 |
137 dcml_str = "1.25" |
140 dcml_str = "1.25" |
138 flt = float(dcml_str) |
141 flt = float(dcml_str) |
139 flt |
142 flt |
140 number = int(flt) |
143 number = int(flt) |
141 number |
144 number |
142 |
145 |
143 Using =int= it is also possible to convert float into integers. |
146 Using =int= it is possible to convert float into integers. |
144 |
147 |
145 Now that we have all the machinery required to parse the file, let us solve the |
148 Now that we have all the machinery required to parse the file, let us solve the |
146 problem. We first read the file line by line and parse each record. We see if |
149 problem. We first read the file line by line and parse each record. We see if |
147 the region code is B and store the marks accordingly. |
150 the region code is B and store the marks accordingly. |
148 :: |
151 :: |
157 math_mark_str = fields[5] |
160 math_mark_str = fields[5] |
158 math_mark = float(math_mark_str) |
161 math_mark = float(math_mark_str) |
159 |
162 |
160 if region_code == "AA": |
163 if region_code == "AA": |
161 math_marks_B.append(math_mark) |
164 math_marks_B.append(math_mark) |
162 |
165 .. #[[Amit:This intuitively does not seem to be what you wanted]] |
163 |
166 |
164 Now we have all the maths marks of region "B" in the list math_marks_B. |
167 Now we have all the maths marks of region "B" in the list math_marks_B. |
165 To get the mean, we just have to sum the marks and divide by the length. |
168 To get the mean, we just have to sum the marks and divide by the length. |
166 :: |
169 :: |
167 |
170 |
174 we have learnt |
177 we have learnt |
175 |
178 |
176 * how to tokenize a string using various delimiters |
179 * how to tokenize a string using various delimiters |
177 * how to get rid of extra whitespace around a string |
180 * how to get rid of extra whitespace around a string |
178 * how to convert from one type to another |
181 * how to convert from one type to another |
|
182 .. #[[Amit:one datatype to another may be better.]] |
179 * how to parse input data and perform computations on it |
183 * how to parse input data and perform computations on it |
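
The points above can be put together in one short sketch. It makes several assumptions that are not confirmed by the script: the records are made up, the fields are separated by ";", and the region code is the first field. Only `fields[5]` for the maths mark and the list name `math_marks_B` come from the script. A list of strings stands in for the lines of the input file.

```python
# Hypothetical records standing in for lines read from the input file.
records = [
    "A;015162;JOHN;041;042;83;042;479;PASS",
    " B ;015163;MARY;052;061;70;055;501;PASS",
    "B;015164;RAVI;048;050;90.5;060;520;PASS",
]

math_marks_B = []
for record in records:
    fields = record.split(";")           # tokenize on the assumed delimiter
    region_code = fields[0].strip()      # " B " and "B" become the same code
    math_mark = float(fields[5].strip()) # convert the mark string to a float

    if region_code == "B":
        math_marks_B.append(math_mark)

# Mean = sum of the marks divided by how many marks there are.
math_marks_mean = sum(math_marks_B) / len(math_marks_B)
print(math_marks_mean)
```

With these made-up records, only the last two belong to region "B", so the mean is computed over two marks.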
180 |
184 |
181 {{{ Show the "sponsored by FOSSEE" slide }}} |
185 {{{ Show the "sponsored by FOSSEE" slide }}} |
182 |
186 |
183 #[Nishanth]: Will add this line after all of us fix on one. |
187 #[Nishanth]: Will add this line after all of us fix on one. |