.. Author              : Nishanth
   Internal Reviewer 1 :
   Internal Reviewer 2 :
   External Reviewer   :

Hello friends and welcome to the tutorial on Parsing Data.

{{{ Show the slide containing title }}}

{{{ Show the slide containing the outline slide }}}

In this tutorial, we shall learn

 * what we mean by parsing data
 * the string operations required for parsing data
 * datatype conversion

#[Puneeth]: Changed a few things, here.

#[Puneeth]: I don't like the way the term "parsing data" has been used, all
through the script. See if that can be changed.

Let us have a look at the problem.

{{{ Show the slide containing problem statement. }}}

There is an input file containing a huge number of records. Each record
corresponds to a student.

{{{ Show the slide explaining record structure }}}

As you can see, each record consists of fields separated by a ";". The first
field is the region code, then the roll number, then the name, then the marks
in second language, first language, maths, science and social, then the total
marks, then pass/fail indicated by P or F, and finally W if the result is
withheld and empty otherwise.

Our job is to calculate the mean of all the maths marks in the region "B".

#[Nishanth]: Please note that I am not telling anything about AA since they do
not know about any if/else yet.

#[Puneeth]: Should we talk pass/fail etc? I think we should make the problem
simple and leave out all the columns after total marks.

Now, what is parsing data?

From the input file, we can see that the data we have is in the form of
text. Parsing this data is all about reading it and converting it into a form
which can be used for computations -- in our case, a sequence of numbers.

#[Puneeth]: should the word tokenizing, be used? Should it be defined before
using it?

We can clearly see that the problem involves reading files and tokenizing.

#[Puneeth]: the sentence above seems kinda redundant.

Let us learn about tokenizing strings. Let us define a string first. Type
::

    line = "parse this          string"

We are now going to split this string on whitespace.
::

    line.split()

As you can see, we get a list of strings. This means that when ``split`` is
called without any arguments, it splits on whitespace. In simple words, any run
of consecutive spaces is treated as one big space.

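A minimal sketch of this behaviour, using an illustrative string of our own
(``spaced`` is not part of the input file): tabs and newlines count as
whitespace too, and leading or trailing whitespace does not produce empty
strings.

```python
# Illustrative string: three spaces, then a tab, then a trailing newline.
spaced = "parse" + " " * 3 + "this\tstring\n"

# With no argument, split() treats every run of whitespace
# (spaces, tabs, newlines) as a single separator.
words = spaced.split()
print(words)  # ['parse', 'this', 'string']
```
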
``split`` can also split on a string of our choice. This is achieved by passing
that string as an argument. But first, let us define a sample record from the
file.
::

    record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
    record.split(';')

We can see that the string is split on ';' and we get each field separately.
We can also observe that empty strings appear in the list, since there are
consecutive semicolons without anything in between.

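The same behaviour can be seen on a shorter record. The string ``sample``
below is made up for illustration; it follows the same semicolon-separated
layout but is not a line from the input file.

```python
# Made-up record with an empty third field and a trailing semicolon.
sample = "B;042;;47;"

fields = sample.split(";")
print(fields)       # ['B', '042', '', '47', '']
print(len(fields))  # 5
```

Every semicolon marks a boundary, so a record with n semicolons always yields
n + 1 fields. This keeps the position of each field fixed even when some
fields are empty.
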
To recap, ``split`` splits on whitespace if called without an argument and
splits on the given argument if it is called with an argument.

{{{ Pause here and try out the following exercises }}}

%% 1 %% Split the variable ``line`` using a space as the argument. Is it the
same as splitting without an argument?

{{{ continue from paused state }}}

We see that when we split on a space, multiple spaces are not clubbed as one
and there is an empty string every time there are two consecutive spaces.

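The difference between the two calls can be sketched as follows. The string
``gapped`` is an assumed example, built with explicit repetition so that the
number of spaces is unambiguous.

```python
# "a", two spaces, "b", three spaces, "c"
gapped = "a" + " " * 2 + "b" + " " * 3 + "c"

no_arg = gapped.split()       # runs of spaces act as one separator
on_space = gapped.split(" ")  # every single space is a separator

print(no_arg)    # ['a', 'b', 'c']
print(on_space)  # ['a', '', 'b', '', '', 'c']
```
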
Now that we know how to split a string, we can split the record and retrieve
each field separately. But there is one problem. The region code "B" and a "B"
surrounded by whitespace are treated as two different regions. We must find a
way to remove all the whitespace around a string so that "B" and a "B" with
whitespace around it are treated as the same.

This is possible by using the ``strip`` method of strings. Let us define a
string by typing
::

    unstripped = " B "
    unstripped.strip()

We can see that ``strip`` removes all the whitespace around the sentence.

{{{ Pause here and try out the following exercises }}}

%% 2 %% What happens to the whitespace inside the sentence when it is stripped?

{{{ continue from paused state }}}

Type
::

    a_str = " white space "
    a_str.strip()

We see that only the whitespace around the sentence is removed and anything
inside remains unaffected.

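A small sketch of how stripping helps with the region code comparison. The
value ``raw_code`` is hypothetical and stands for a field that arrived with
stray whitespace; ``padded`` is likewise an illustrative string.

```python
# Hypothetical region code field with surrounding spaces and a newline.
raw_code = "  B \n"

print(raw_code == "B")          # False
print(raw_code.strip() == "B")  # True

# Whitespace between the words survives; only the ends are trimmed.
padded = "  white" + " " * 3 + "space  "
stripped = padded.strip()
print(repr(stripped))
```
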
By now we know enough to separate fields from the record and to strip out any
whitespace. The only roadblock we now have is the conversion of strings to
floats.

The splitting and stripping operations are done on a string and their results
are also made up of strings. Hence the marks that we have are still strings and
mathematical operations are not possible on them. We must convert them into
numbers (integers or floats) before we can perform mathematical operations on
them.

We shall look at converting strings into floats. We define a float string
first. Type
::

    mark_str = "1.25"
    mark = float(mark_str)
    type(mark_str)
    type(mark)

We can see that the string is converted to a float. We can now perform
mathematical operations on it.

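To see why the conversion matters, here is a short sketch with two made-up
mark strings (``first_str`` and ``second_str`` are illustrative names). Adding
the strings joins them, while adding the converted floats gives the arithmetic
sum.

```python
first_str = "47"
second_str = "72"

# "+" on strings concatenates them.
joined = first_str + second_str
print(joined)  # 4772

# After conversion, "+" performs addition.
total = float(first_str) + float(second_str)
print(total)  # 119.0

# float() also accepts surrounding whitespace and leading zeros.
padded_mark = float(" 083 ")
print(padded_mark)  # 83.0
```
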
{{{ Pause here and try out the following exercises }}}

%% 3 %% What happens if you do int("1.25")?

{{{ continue from paused state }}}

It raises an error since converting a float string into an integer directly is
not possible. It involves an intermediate step of converting to float.
::

    dcml_str = "1.25"
    flt = float(dcml_str)
    flt
    number = int(flt)
    number

Using ``int`` it is also possible to convert floats into integers. The
fractional part is discarded, so 1.25 becomes 1.

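The following sketch, with values chosen only for illustration, shows that
``int`` drops the fractional part rather than rounding to the nearest integer.

```python
# Convert the string to a float first, then to an integer.
low = int(float("1.25"))
high = int(float("1.99"))
negative = int(float("-1.99"))

print(low)       # 1
print(high)      # 1
print(negative)  # -1
```
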
Now that we have all the machinery required to parse the file, let us solve the
problem. We first read the file line by line and parse each record. We see if
the region code is B and store the marks accordingly.
::

    math_marks_B = []  # an empty list to store the marks
    for line in open("/home/fossee/sslc1.txt"):
        fields = line.split(";")

        # strip the region code so that " B " and "B" are treated alike
        region_code = fields[0]
        region_code_stripped = region_code.strip()

        # the maths mark is the sixth field; convert it to a number
        math_mark_str = fields[5]
        math_mark = float(math_mark_str)

        if region_code_stripped == "B":
            math_marks_B.append(math_mark)

Now we have all the maths marks of region "B" in the list ``math_marks_B``.
To get the mean, we just have to sum the marks and divide by the length.
::

    math_marks_mean = sum(math_marks_B) / len(math_marks_B)
    math_marks_mean

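For readers who do not have the input file at hand, the whole computation can
be sketched on a few in-memory records. The first record is the sample used
earlier; the two region "B" records, including their names and marks, are
invented for illustration. The list ``records`` stands in for the lines of the
file.

```python
# Stand-in for the file: one real sample record and two invented ones.
records = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;",
    " B ;015164;STUDENT ONE;070;065;80;75;72;362;P;;",
    "B;015165;STUDENT TWO;060;055;61;58;50;284;P;;",
]

math_marks_B = []  # maths marks of region "B" only
for line in records:
    fields = line.split(";")
    region_code_stripped = fields[0].strip()
    math_mark = float(fields[5])
    if region_code_stripped == "B":
        math_marks_B.append(math_mark)

# The marks are floats, so the division is not truncated.
math_marks_mean = sum(math_marks_B) / len(math_marks_B)
print(math_marks_B)     # [80.0, 61.0]
print(math_marks_mean)  # 70.5
```

Note that the second record has its region code written as " B " and is still
counted, because the comparison is made on the stripped value.
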
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

 * how to tokenize a string using various delimiters
 * how to get rid of extra whitespace around a string
 * how to convert from one type to another
 * how to parse input data and perform computations on it

{{{ Show the "sponsored by FOSSEE" slide }}}

#[Nishanth]: Will add this line after all of us fix on one.
This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India.

Hope you have enjoyed and found it useful.
Thank you.

Questions
=========

1. How do you split the string "Guido;Rossum;Python" to get the words?

   Answer: line.split(';')

2. line.split() and line.split(' ') are the same.

   a. True
   #. False

   Answer: False

3. What is the output of the following code::

     line = "Hello;;;World;;"
     sub_strs = line.split(";")
     print len(sub_strs)

   Answer: 6

4. What is the output of "   Hello    World   ".strip()?

   a. "Hello World"
   #. "Hello    World"
   #. "   Hello    World"
   #. "Hello    World   "

   Answer: "Hello    World"

5. What does "It is a cold night".strip("It") produce?
   Hint: Read the documentation of strip

   a. "is a cold night"
   #. " is a cold nigh"
   #. "It is a cold nigh"
   #. "is a cold nigh"

   Answer: " is a cold nigh"

6. What does int("20") produce?

   a. "20"
   #. 20.0
   #. 20
   #. Error

   Answer: 20

7. What does int("20.0") produce?

   a. 20
   #. 20.0
   #. Error
   #. "20"

   Answer: Error

8. What is the value of float(3/2)?

   a. 1.0
   #. 1.5
   #. 1
   #. Error

   Answer: 1.0

9. What does float("3/2") produce?

   a. 1.0
   #. 1.5
   #. 1
   #. Error

   Answer: Error

10. See if there is a function available in pylab to calculate the mean.
    Hint: Use tab completion