parsing data is all about reading the data and converting it into a form which
can be used for computations. In our case, that is numbers.

We can clearly see that the problem involves reading files and tokenizing.

Let us learn about tokenizing strings. Let us define a string first. Type::

    line = "parse this string"

We are now going to split this string on whitespace::

    line.split()

As you can see, we get a list of strings. This means that when split is called
without any arguments, it splits on whitespace. In simple words, any run of
consecutive spaces is treated as one big space.

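The "one big space" behaviour is easier to see with a string that has uneven whitespace. The following is only a small sketch; the variable names ``spaced_line``, ``tokens`` and ``tokens_on_space`` are made up for this illustration.

```python
# Hypothetical string with runs of spaces and a tab between the words
spaced_line = "parse   this\tstring  "

# With no argument, split treats any run of whitespace as one separator
# and ignores whitespace at the start and end of the string
tokens = spaced_line.split()
print(tokens)

# Passing a single space explicitly splits on every individual space,
# so empty strings appear wherever two spaces are adjacent
tokens_on_space = spaced_line.split(" ")
print(tokens_on_space)
```

Comparing the two printed lists shows why calling split without arguments is usually the more convenient choice for whitespace-separated text.
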
split can also split on a string of our choice. This is achieved by passing
that string as an argument. But first, let's define a sample record from the
file::

    record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
    record.split(';')

We can see that the string is split on ';' and we get each field separately.

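A short sketch of what splitting the sample record on ';' gives us. Unlike splitting on whitespace, an explicit separator keeps empty fields and leaves spaces inside a field untouched. The name ``fields`` is chosen here only for illustration.

```python
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"

# Split on the ';' separator; every ';' starts a new field
fields = record.split(";")

# Number of fields, including the empty ones at the end
print(len(fields))

# Spaces inside a field are kept as they are
print(fields[2])

# Consecutive separators produce empty strings
print(fields[-3:])
```

Note that the three trailing semicolons in the record show up as three empty strings at the end of the list.
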
surrounded by whitespace are treated as two different regions. We must find a
way to remove all the whitespace around a string so that "B" and a "B" with
whitespace around it are treated as the same.

This is possible by using the =strip= method of strings. Let us define a
string by typing::

    unstripped = " B "
    unstripped.strip()

We can see that strip removes all the whitespace around the sentence.

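The effect of strip on a comparison can be sketched as follows. The names ``stripped`` and ``padded_name`` are made up for this example, and the padded name is only illustrative data.

```python
unstripped = " B "

# strip returns a new string; the original is left unchanged
stripped = unstripped.strip()

# Without stripping, the comparison with "B" fails
print(unstripped == "B")

# After stripping, the two compare equal
print(stripped == "B")

# strip only removes whitespace at the two ends (spaces, tabs, newlines);
# whitespace between words is kept
padded_name = "  JOSEPH RAJ S \n"
print(padded_name.strip())
```

This is why a region code read from a file is usually stripped before it is compared with a fixed value.
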
The splitting and stripping operations are done on a string, and their results
are strings as well (a list of strings, in the case of split). Hence the marks
that we have are still strings, and mathematical operations on them are not
possible. We must convert them into integers or floats.

We shall look at converting strings into floats. We define a float string
first. Type::

    mark_str = "1.25"
    mark = float(mark_str)
    mark_str
    mark

%% 3 %% What happens if you do int("1.25")

{{{ continue from paused state }}}

It raises an error since converting a float string into an integer directly is
not possible. It involves an intermediate step of converting to float::

    dcml_str = "1.25"
    flt = float(dcml_str)
    flt
    number = int(flt)

Using =int= it is also possible to convert floats into integers; the
fractional part is discarded.

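The conversions discussed above can be put together in one small sketch. The flag ``failed`` is a name introduced here only to record whether the direct conversion raised an error.

```python
dcml_str = "1.25"

# Converting a float string directly to an integer raises ValueError
failed = False
try:
    int(dcml_str)
except ValueError as err:
    failed = True
    print("int() failed:", err)

# Going through float first works
flt = float(dcml_str)
number = int(flt)
print(flt, number)

# int drops the fractional part instead of rounding
print(int(1.99), int(-1.99))

# Both int and float ignore whitespace around the digits
print(int(" 47 "), float(" 1.25 "))
```

Since int truncates rather than rounds, marks with a fractional part are better kept as floats when computing a mean.
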
Now that we have all the machinery required to parse the file, let us solve the
problem. We first read the file line by line and parse each record. We see if
the region code is B and store the marks accordingly::

    math_marks_B = []  # an empty list to store the marks
    for line in open("/home/fossee/sslc1.txt"):
        fields = line.split(";")

        # (the lines that set region_code and math_mark from fields
        # are not part of this excerpt)

        if region_code == "B":
            math_marks_B.append(math_mark)

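The whole loop can be tried without the data file by running it over a few lines held in a list. This is only a sketch under stated assumptions: the field positions ``REGION_FIELD`` and ``MATH_FIELD`` are guesses about the record layout, every record except the first is made up, and skipping non-numeric marks is a choice made for this example rather than something the text prescribes.

```python
# Records in the same ';'-separated layout as the sample record.
# Only the first line comes from the text; the others are made up.
sample_lines = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n",
    " B ;000001;STUDENT ONE;070;055;61;58;66;310;;;\n",
    "B;000002;STUDENT TWO;081;049;AA;63;70;263;;;\n",
    "B;000003;STUDENT THREE;064;052;38;41;59;254;;;\n",
]

REGION_FIELD = 0  # assumption: the region code is the first field
MATH_FIELD = 5    # assumption: the maths mark is the sixth field

math_marks_B = []  # an empty list to store the marks
for line in sample_lines:
    fields = line.split(";")

    # Strip the region code so that " B " and "B" compare equal
    region_code = fields[REGION_FIELD].strip()
    if region_code != "B":
        continue

    try:
        math_mark = float(fields[MATH_FIELD].strip())
    except ValueError:
        # Non-numeric entry such as "AA"; skipped in this sketch
        continue

    math_marks_B.append(math_mark)

print(math_marks_B)
```

To run the same logic on the real data, ``sample_lines`` would be replaced by the open file object, once the field positions have been checked against the actual file.
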
Now we have all the maths marks of region "B" in the list math_marks_B.
To get the mean, we just have to sum the marks and divide by the length::

    math_marks_mean = sum(math_marks_B) / len(math_marks_B)
    math_marks_mean

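A small sketch of the mean computation on made-up marks. The values in the list are hypothetical, and the check for an empty list is an addition for this example so that a file with no matching records does not cause a division by zero.

```python
# Hypothetical marks, stored as floats
math_marks_B = [61.0, 38.0, 72.0]

if math_marks_B:
    # Sum of the marks divided by the number of marks
    math_marks_mean = sum(math_marks_B) / len(math_marks_B)
else:
    # No marks were collected, so there is no mean to report
    math_marks_mean = None

print(math_marks_mean)
```

Storing the marks as floats also matters under Python 2, where dividing one integer by another with ``/`` discards the fractional part of the result.
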
{{{ Show summary slide }}}