.. Objectives
.. ----------

.. A - Students and teachers from Science and engineering backgrounds
   B -
   C -
   D -

.. Prerequisites
.. -------------

..   1. Getting started with lists

.. Author              : Nishanth Amuluru
   Internal Reviewer   :
   External Reviewer   :
   Checklist OK?       : <put date stamp here, if OK> [2010-10-05]

Script
------

Hello friends and welcome to the tutorial on Parsing Data

{{{ Show the slide containing title }}}

{{{ Show the slide containing the outline slide }}}

In this tutorial, we shall learn

 * What we mean by parsing data
 * the string operations required for parsing data
 * datatype conversion

#[Puneeth]: Changed a few things, here.

#[Puneeth]: I don't like the way the term "parsing data" has been used, all
through the script. See if that can be changed.

Let us have a look at the problem

{{{ Show the slide containing problem statement. }}}

There is an input file containing a large number of records. Each record
corresponds to a student.

{{{ show the slide explaining record structure }}}
As you can see, each record consists of fields separated by a ";". The first
field is the region code, then the roll number, then the name, the marks in
second language, first language, maths, science and social, the total marks,
pass/fail indicated by P or F, and finally W if the result is withheld and
empty otherwise.

Our job is to calculate the mean of all the maths marks in the region "B".

#[Nishanth]: Please note that I am not telling anything about AA since they do
not know about any if/else yet.

#[Puneeth]: Should we talk pass/fail etc? I think we should make the problem
simple and leave out all the columns after total marks.

Now, what is parsing data?

From the input file, we can see that the data we have is in the form of
text. Parsing this data is all about reading it and converting it into a form
which can be used for computations -- in our case, a sequence of numbers.

#[Puneeth]: should the word tokenizing, be used? Should it be defined before
using it?

We can clearly see that the problem involves reading files and tokenizing.

#[Puneeth]: the sentence above seems kinda redundant.

Let us learn about tokenizing strings. Let us define a string first. Type
::

    line = "parse this           string"

We are now going to split this string on whitespace.
::

    line.split()

As you can see, we get a list of strings. This means that when ``split`` is
called without any arguments, it splits on whitespace. In simple words, all
the consecutive spaces are treated as one big space.

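Whitespace here means more than the space character. As a small sketch, using
a made-up string that is not part of the input file, we can check that tabs
and newlines are handled the same way when ``split`` is called without an
argument.

```python
# Hypothetical sample text mixing two spaces, a tab and a trailing newline.
text = "parse  this\tstring\n"

# With no argument, split() treats any run of whitespace as one separator
# and ignores whitespace at the two ends of the string.
tokens = text.split()
print(tokens)
```

The newline at the end matters for us, since every line read from a file
usually ends with one.
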
``split`` can also split on a string of our choice. This is achieved by
passing that string as an argument. But first, let us define a sample record
from the file.
::

    record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
    record.split(';')

We can see that the string is split on ';' and we get each field separately.
We can also observe that empty strings appear in the list wherever there are
two semicolons without anything in between.

To recap, ``split`` splits on whitespace if called without an argument and
splits on the given argument if it is called with an argument.

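To see which position holds which field, we could pair the split values with
names. The following is only a sketch: the names in ``field_names`` are
hypothetical labels chosen to follow the record structure described earlier,
they do not come from the file itself.

```python
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"

# Hypothetical field names, in the order given by the record structure.
field_names = ["region", "roll_no", "name", "second_lang", "first_lang",
               "maths", "science", "social", "total", "result", "withheld"]

fields = record.split(";")

# zip stops at the shorter sequence, so the extra empty string produced
# by the final semicolon is simply left out.
labelled = dict(zip(field_names, fields))
print(labelled["region"], labelled["maths"])
```

With this labelling, the maths marks sit at index 5 of the list, which is the
index we will need when solving the problem.
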
{{{ Pause here and try out the following exercises }}}

%% 1 %% split the variable line using a space as argument. Is it the same as
splitting without an argument?

{{{ continue from paused state }}}

We see that when we split on a space, multiple whitespaces are not clubbed as
one and there is an empty string every time there are two consecutive spaces.

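The difference is easiest to see side by side. Here is a minimal sketch with a
shorter, made-up string that has exactly two spaces before the last word.

```python
line = "parse this  string"   # note the two spaces before "string"

no_arg = line.split()         # runs of whitespace count as one separator
with_space = line.split(" ")  # every single space is a separator

print(no_arg)
print(with_space)
```

The second list contains one empty string, coming from the gap between the two
consecutive spaces.
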
Now that we know how to split a string, we can split the record and retrieve
each field separately. But there is one problem. The region code "B" and a "B"
surrounded by whitespace are treated as two different regions. We must find a
way to remove all the whitespace around a string so that "B" and a "B" with
white spaces are treated as the same.

This is possible by using the ``strip`` method of strings. Let us define a
string by typing
::

    unstripped = " B "
    unstripped.strip()

We can see that ``strip`` removes all the whitespace around the string.

{{{ Pause here and try out the following exercises }}}

%% 2 %% What happens to the white space inside the sentence when it is stripped?

{{{ continue from paused state }}}

Type
::

    a_str = " white space "
    a_str.strip()

We see that only the whitespace around the sentence is removed and the
whitespace inside remains unaffected.

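As a short sketch of why this matters for our problem, we can compare a
made-up, padded region code with "B" before and after stripping. The related
methods ``lstrip`` and ``rstrip``, which strip only one end, are shown for
comparison; we do not need them for the problem.

```python
region = "  B \n"   # hypothetical region code with padding and a newline

both_ends = region.strip()    # whitespace removed from both ends
left_only = region.lstrip()   # whitespace removed from the left end only
right_only = region.rstrip()  # whitespace removed from the right end only

print(repr(both_ends), repr(left_only), repr(right_only))

# The comparison succeeds only after stripping.
print(region == "B", both_ends == "B")
```
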
By now we know enough to separate fields from the record and to strip out any
white space. The only road block we now have is the conversion of strings to
floats.

The splitting and stripping operations are done on a string and their results
are also strings. Hence the marks that we have are still strings and
mathematical operations are not possible on them. We must convert them into
numbers (integers or floats), before we can perform mathematical operations on
them.

We shall look at converting strings into floats. We define a float string
first. Type
::

    mark_str = "1.25"
    mark = float(mark_str)
    type(mark_str)
    type(mark)

We can see that the string is converted to a float. We can perform mathematical
operations on it now.

{{{ Pause here and try out the following exercises }}}

%% 3 %% What happens if you do int("1.25")

{{{ continue from paused state }}}

It raises an error since converting a float string into an integer directly is
not possible. It involves an intermediate step of converting to float.
::

    dcml_str = "1.25"
    flt = float(dcml_str)
    flt
    number = int(flt)
    number

Using ``int``, it is also possible to convert floats into integers. The
fractional part is discarded, so ``number`` is 1.

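The conversions discussed so far can be put together in one small sketch. It
uses ``try`` and ``except``, which this tutorial has not covered, only to let
the script continue after the error; the variable names are made up for the
illustration.

```python
mark_str = " 47 "   # hypothetical mark with padding around it

# float() accepts surrounding whitespace in the string.
mark = float(mark_str)
print(mark)

# int() refuses a string that contains a decimal point.
raised = False
try:
    int("1.25")
except ValueError as err:
    raised = True
    print("int failed:", err)

# Going through float first works; int() then drops the fractional part.
whole = int(float("1.25"))
print(whole, int(2.99), int(-2.99))
```

Note that ``int`` does not round. It truncates towards zero, for positive as
well as negative numbers.
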
Now that we have all the machinery required to parse the file, let us solve the
problem. We first read the file line by line and parse each record. We check
whether the region code is B and store the marks accordingly.
::

    math_marks_B = [] # an empty list to store the marks
    for line in open("/home/fossee/sslc1.txt"):
        fields = line.split(";")

        region_code = fields[0]
        region_code_stripped = region_code.strip()

        math_mark_str = fields[5]
        math_mark = float(math_mark_str)

        if region_code_stripped == "B":
            math_marks_B.append(math_mark)

Now we have all the maths marks of region "B" in the list math_marks_B.
To get the mean, we just have to sum the marks and divide by the length.
::

    math_marks_mean = sum(math_marks_B) / len(math_marks_B)
    math_marks_mean

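For trying the whole procedure without the input file, here is a
self-contained sketch that works on a few records held in a list. Apart from
the first one, the records are invented for this illustration. The sketch also
assumes that "AA" in a marks field means that there is no number to convert,
so such marks are skipped; the script above does not make this check.

```python
# Hypothetical records in the same layout as the input file.
sample_records = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;",
    " B ;015164;HYPOTHETICAL NAME 1;070;065;80;75;60;350;P;;",
    "B;015165;HYPOTHETICAL NAME 2;055;060;AA;50;45;210;;;",
    "B;015166;HYPOTHETICAL NAME 3;060;058;70;66;61;315;P;;",
]

math_marks_B = []  # an empty list to store the marks
for line in sample_records:
    fields = line.split(";")

    region_code = fields[0].strip()
    math_mark_str = fields[5].strip()

    # Assumption: "AA" is not a number, so we leave such marks out.
    if region_code == "B" and math_mark_str != "AA":
        math_marks_B.append(float(math_mark_str))

math_marks_mean = sum(math_marks_B) / len(math_marks_B)
print(math_marks_mean)
```

Only two of the sample records contribute to the mean: the record from region
"A" and the record with "AA" in the maths field are both left out.
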
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

 * how to tokenize a string using various delimiters
 * how to get rid of extra white space around a string
 * how to convert from one type to another
 * how to parse input data and perform computations on it

{{{ Show the "sponsored by FOSSEE" slide }}}

#[Nishanth]: Will add this line after all of us fix on one.
This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India

Hope you have enjoyed and found it useful.
Thank you
223 |