author: amit
date: Wed, 22 Sep 2010 14:56:22 +0530
changeset 179: 1d04b6c5ff44
parent 140: bc023595e167
child 197: 97d859b70f51
permissions: -rw-r--r--
Hello friends and welcome to the tutorial on Parsing Data

{{{ Show the slide containing title }}}

{{{ Show the slide containing the outline slide }}}

In this tutorial, we shall learn

* what parsing data means
* the string operations required for parsing data
* datatype conversion

Let us have a look at the problem

{{{ Show the slide containing problem statement. }}}

There is an input file containing a huge number of records. Each record
corresponds to a student.

{{{ show the slide explaining record structure }}}
As you can see, each record consists of fields separated by a ";". The first
field is the region code, then the roll number, then the name, the marks in
second language, first language, maths, science and social, the total marks,
pass/fail indicated by P or F, and finally W if the result is withheld and
empty otherwise.

Our job is to calculate the mean of all the maths marks in the region "B".

#[Nishanth]: Please note that I am not telling anything about AA since they do
not know about any if/else yet.

So what exactly is parsing data?

Parsing data is all about reading the data and converting it into a form which
can be used for computations. In our case, that is numbers.

We can clearly see that the problem involves reading files and tokenizing.

.. #[[Amit:Definition of Tokenizing here.]]

Tokenizing means breaking a string into smaller pieces, called tokens, at a
chosen separator. Let us learn about tokenizing strings. Let us define a
string first. Type
::

    line = "parse    this     string"

We are now going to split this string on whitespace.
::

    line.split()

As you can see, we get a list of strings. This means that when split is called
without any arguments, it splits on whitespace. In simple words, all the
consecutive spaces are treated as one big space.

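The behaviour described above can be sketched as follows. The string `spaced` is a hypothetical example chosen to mix spaces, a tab and a newline; it is not part of the data file.

```python
# Hypothetical example string with a run of spaces, a tab and a newline.
spaced = "parse   this\tstring\n"

# With no argument, split() treats any run of whitespace as one separator
# and ignores whitespace at the start and end of the string.
tokens = spaced.split()
print(tokens)  # ['parse', 'this', 'string']
```
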
split can also split on a string of our choice. This is achieved by passing
that string as an argument. But first let's define a sample record from the
file.
::

    record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
    record.split(';')

We can see that the string is split on ';' and we get each field separately.
We can also observe that empty strings appear in the list wherever there are
two semicolons without anything in between.

Hence split splits on whitespace if called without an argument and splits on
the given argument if it is called with an argument.

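As a small illustration of splitting on a delimiter, here is a sketch that uses a shortened, hypothetical record (`short_record` is an assumption for brevity; real records have more fields).

```python
# Hypothetical shortened record; the real file has more fields per record.
short_record = "A;015163;JOSEPH RAJ S;47;;"

# Every ';' is a boundary, so two adjacent semicolons produce an empty string.
fields = short_record.split(";")
print(fields)       # ['A', '015163', 'JOSEPH RAJ S', '47', '', '']
print(len(fields))  # 6
```

Note that the spaces inside the name are kept, because only ';' is treated as a separator here.
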
{{{ Pause here and try out the following exercises }}}

%% 1 %% Split the variable line using a space as the argument. Is it the same
as splitting without an argument?

{{{ continue from paused state }}}

We see that when we split on a space, multiple whitespaces are not clubbed as
one and there is an empty string every time there are two consecutive spaces.

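The difference between the two calls can be seen side by side in this sketch. The string `text` is a hypothetical example with two spaces and then three spaces between the words.

```python
# Hypothetical string: two spaces after "parse", three spaces after "this".
text = "parse  this   string"

by_space = text.split(" ")   # every single space is a separator
by_whitespace = text.split() # runs of whitespace count as one separator

print(by_space)       # ['parse', '', 'this', '', '', 'string']
print(by_whitespace)  # ['parse', 'this', 'string']
```
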
Now that we know how to split a string, we can split the record and retrieve
each field separately. But there is one problem. The region code "B" and a "B"
surrounded by whitespace are treated as two different regions. We must find a
way to remove all the whitespace around a string so that "B" and a "B" with
white space around it are treated as the same.

This is possible by using the =strip= method of strings. Let us define a
string by typing
::

    unstripped = " B "
    unstripped.strip()

We can see that strip removes all the whitespace around the string.

{{{ Pause here and try out the following exercises }}}

%% 2 %% What happens to the white space inside the sentence when it is stripped?

{{{ continue from paused state }}}

Type
::

    a_str = " white space "
    a_str.strip()

We see that only the whitespace around the sentence is removed and anything
inside remains unaffected.

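A short sketch of both observations about strip follows. The names `raw_codes` and `padded` are hypothetical values made up for illustration.

```python
# Hypothetical region codes as they might come out of split(";").
raw_codes = [" B ", "B", "B  "]

for code in raw_codes:
    # strip() returns a new string; the original string is left unchanged.
    print(repr(code), "->", repr(code.strip()))

# Whitespace between the words survives; only the ends are trimmed.
padded = "  white   space  "
stripped = padded.strip()
print(repr(stripped))  # 'white   space'
```

After stripping, all three region codes compare equal to "B", which is what the comparison in the final program relies on.
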
By now we know enough to separate fields from the record and to strip out any
white space. The only road block we now have is the conversion of strings to
floats.

The splitting and stripping operations are done on strings and their results
are also strings. Hence the marks that we have are still strings and
mathematical operations on them are not possible. We must convert them into
integers or floats.

We shall look at converting strings into floats. We define a float string
first. Type
::

    mark_str = "1.25"
    mark = float(mark_str)
    type(mark_str)
    type(mark)

We can see that the string is converted to a float. We can perform
mathematical operations on it now.

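To see why the conversion matters, the sketch below contrasts the same operation on the string and on the float. The variables `repeated`, `doubled` and `padded_mark` are hypothetical names added for illustration.

```python
mark_str = "1.25"

# On a string, * means repetition, not multiplication.
repeated = mark_str * 2
print(repeated)  # 1.251.25

# After conversion, * is ordinary arithmetic.
mark = float(mark_str)
doubled = mark * 2
print(doubled)   # 2.5

# float() also tolerates whitespace around the digits.
padded_mark = float(" 47 ")
print(padded_mark)  # 47.0
```
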
{{{ Pause here and try out the following exercises }}}

%% 3 %% What happens if you do int("1.25")?

{{{ continue from paused state }}}

.. #[[Amit:I think there should be some interaction first here about the
   problem before we conclude to talking about the result.]]

It raises an error since converting a float string into an integer directly is
not possible. It involves an intermediate step of converting to float.
::

    dcml_str = "1.25"
    flt = float(dcml_str)
    flt
    number = int(flt)
    number

Using =int= it is possible to convert floats into integers. Note that the
fractional part is discarded.

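The error and the two-step conversion can be sketched like this. The try/except construct is not covered in this tutorial and is used here only so that the example runs to the end; the names `error_seen`, `truncated` and `negative` are hypothetical.

```python
error_seen = False
try:
    int("1.25")  # a string with a decimal point is rejected by int()
except ValueError as err:
    error_seen = True
    print("int() refused the string:", err)

# Going through float first works; int() then drops the fractional part.
truncated = int(float("1.99"))
negative = int(float("-1.99"))
print(truncated)  # 1
print(negative)   # -1
```

Since =int= truncates towards zero rather than rounding, 1.99 becomes 1 and -1.99 becomes -1.
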
Now that we have all the machinery required to parse the file, let us solve the
problem. We first read the file line by line and parse each record. We see if
the region code is B and store the marks accordingly.
::

    math_marks_B = [] # an empty list to store the marks
    for line in open("/home/fossee/sslc1.txt"):
        fields = line.split(";")

        region_code = fields[0]
        region_code_stripped = region_code.strip()

        math_mark_str = fields[5]
        math_mark = float(math_mark_str)

        if region_code_stripped == "B":
            math_marks_B.append(math_mark)

.. #[[Amit:This intuitively does not seem to be what you wanted]]

Now we have all the maths marks of region "B" in the list math_marks_B.
To get the mean, we just have to sum the marks and divide by the length.
::

    math_marks_mean = sum(math_marks_B) / len(math_marks_B)
    math_marks_mean

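The whole procedure can be tried without the data file using the sketch below. Everything in `sample_lines` except the first record is hypothetical data made up for illustration, and the sketch assumes that a maths field of "AA" marks a missing score that should be skipped; neither is taken from sslc1.txt.

```python
# Hypothetical records in the same layout; only the first is from the tutorial.
sample_lines = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n",
    " B ;015164;EXAMPLE ONE;070;065;80;60;75;350;P;;\n",
    "B;015165;EXAMPLE TWO;055;060;AA;58;61;234;;;\n",
    "B;015166;EXAMPLE THREE;090;088;95;91;89;453;P;;\n",
]

math_marks_B = []  # maths marks of region "B" only
for line in sample_lines:
    fields = line.split(";")
    region_code = fields[0].strip()
    math_mark_str = fields[5].strip()
    # Assumption: "AA" is not a number, so skip it instead of calling float().
    if region_code == "B" and math_mark_str != "AA":
        math_marks_B.append(float(math_mark_str))

math_marks_mean = sum(math_marks_B) / len(math_marks_B)
print(math_marks_B)     # [80.0, 95.0]
print(math_marks_mean)  # 87.5
```

The second record shows why stripping is needed: its region code is " B " with surrounding spaces, yet it is still counted as region "B".
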
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

* how to tokenize a string using various delimiters
* how to get rid of extra white space around a string
* how to convert from one datatype to another
.. #[[Amit:one datatype to another may be better.]]
* how to parse input data and perform computations on it

{{{ Show the "sponsored by FOSSEE" slide }}}

#[Nishanth]: Will add this line after all of us fix on one.
This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India

Hope you have enjoyed and found it useful.
Thank you

.. Author              : Nishanth
   Internal Reviewer 1 : Amit Sethi
   Internal Reviewer 2 :
   External Reviewer   :