Hello friends and welcome to the tutorial on Parsing Data

{{{ Show the slide containing title }}}

{{{ Show the slide containing the outline slide }}}

In this tutorial, we shall learn

 * what parsing data means
 * the string operations required for parsing data
 * datatype conversion

Let us have a look at the problem.

{{{ Show the slide containing problem statement. }}}

There is an input file containing a huge number of records. Each record
corresponds to a student.

{{{ show the slide explaining record structure }}}
As you can see, each record consists of fields separated by a ";". The first
field is the region code, then the roll number, then the name, the marks in
second language, first language, maths, science and social, the total marks,
pass/fail indicated by P or F, and finally W if withheld and empty otherwise.

Our job is to calculate the mean of all the maths marks in the region "B".

#[Nishanth]: Please note that I am not telling anything about AA since they do
             not know about any if/else yet.

Now, what is parsing data?

From the input file, we can see that the data is in the form of text. Parsing
data is all about reading this text and converting it into a form which can be
used for computations. In our case, that is numbers.

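To see why this conversion matters, here is a small sketch using two mark values written as text. The variable names are only illustrative, and the conversion itself (=float=) is covered later in this tutorial.

```python
# Marks read from a text file arrive as strings, not numbers.
first = "47"
second = "72"

# Adding strings joins them together instead of doing arithmetic.
joined = first + second

# After conversion to numbers, addition behaves as expected.
total = float(first) + float(second)

print(joined)
print(total)
```

The first result is still text, while the second is a number that can be used in further calculations.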
We can clearly see that the problem involves reading files and tokenizing.

Let us learn about tokenizing strings. Let us define a string first. Type::

    line = "parse this     string"

We are now going to split this string on whitespace::

    line.split()

As you can see, we get a list of strings. This means that when split is called
without any arguments, it splits on whitespace. In simple words, all the
consecutive spaces are treated as one big space.

split can also split on a string of our choice. This is achieved by passing
that string as an argument. But first, let us define a sample record from the
file::

    record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
    record.split(';')

We can see that the string is split on ';' and we get each field separately.
We can also observe that empty strings appear in the list, since the record
ends with semicolons that have nothing between them.

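As a rough sketch, the pieces returned by split can be paired with the field order from the record structure described earlier. The labels in field_names are chosen here for illustration only; they are not present in the file.

```python
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
fields = record.split(";")

# Labels chosen for illustration, following the field order described earlier.
field_names = ["region code", "roll number", "name", "second language",
               "first language", "maths", "science", "social", "total",
               "pass/fail", "withheld"]

# zip stops at the shorter list, so the extra empty string produced by the
# trailing semicolon is simply ignored.
for name, value in zip(field_names, fields):
    print(name + ": " + repr(value))

maths_field = fields[5]  # the maths marks sit at index 5
```

Counting from zero, the region code is at index 0 and the maths marks are at index 5, which are the two fields the problem needs.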
Hence split splits on whitespace if called without an argument, and splits on
the given argument if it is called with one.

{{{ Pause here and try out the following exercises }}}

%% 1 %% Split the variable line using a space as the argument. Is it the same
as splitting without an argument?

{{{ continue from paused state }}}

We see that when we split on a space, multiple whitespaces are not clubbed as
one, and there is an empty string every time there are two consecutive spaces.

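The difference can be sketched side by side. The string spaced below is a made-up example with three spaces between its last two words.

```python
# A sample string with three spaces between the last two words
# (the exact string is an assumption made for this illustration).
spaced = "parse this   string"

no_argument = spaced.split()     # a run of whitespace acts as one separator
on_space = spaced.split(" ")     # every single space is a separator

print(no_argument)
print(on_space)
```

Splitting without an argument gives three words, while splitting on a single space also gives two empty strings for the gaps between the three consecutive spaces.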
Now that we know how to split a string, we can split the record and retrieve
each field separately. But there is one problem. The region code "B" and a "B"
surrounded by whitespace are treated as two different regions. We must find a
way to remove all the whitespace around a string so that "B" and a "B" with
whitespace around it are treated as the same.

This is possible by using the =strip= method of strings. Let us define a
string by typing::

    unstripped = " B "
    unstripped.strip()

We can see that strip removes all the whitespace around the sentence.

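A minimal sketch of why this matters for the region code: the comparison with "B" only succeeds once the surrounding whitespace is stripped.

```python
unstripped = " B "

# Without stripping, the extra spaces make the comparison fail.
raw_matches = (unstripped == "B")

# After stripping, both forms of the region code compare as equal.
stripped_matches = (unstripped.strip() == "B")

print(raw_matches)
print(stripped_matches)
```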
{{{ Pause here and try out the following exercises }}}

%% 2 %% What happens to the whitespace inside the sentence when it is stripped?

{{{ continue from paused state }}}

Type::

    a_str = " white space "
    a_str.strip()

We see that only the whitespace around the sentence is removed, and anything
inside remains unaffected.

By now we know enough to separate fields from the record and to strip out any
whitespace. The only roadblock we now have is the conversion of strings to
floats.

The splitting and stripping operations are done on a string and they also
produce strings. Hence the marks that we have are still strings, and
mathematical operations on them are not possible. We must convert them into
integers or floats.

We shall look at converting strings into floats. We define a float string
first. Type::

    mark_str = "1.25"
    mark = float(mark_str)
    mark_str
    mark

We can see that the string is converted to a float. We can perform
mathematical operations on it now.

{{{ Pause here and try out the following exercises }}}

%% 3 %% What happens if you do int("1.25")?

{{{ continue from paused state }}}

It raises an error, since a float string cannot be converted into an integer
directly. It involves an intermediate step of converting to float::

    dcml_str = "1.25"
    flt = float(dcml_str)
    flt
    number = int(flt)
    number

Using =int=, it is also possible to convert floats into integers. The
fractional part is dropped in the process.

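A short sketch of both behaviours: the error raised by =int= on a float string, and =int= dropping the fractional part of a float. It uses try/except, which this tutorial does not cover, only so that the snippet runs to the end. The helper name to_int is made up for illustration.

```python
def to_int(text):
    """Convert a numeric string such as "1.25" to an integer via float."""
    return int(float(text))

try:
    int("1.25")
except ValueError as error:
    # int() refuses a string that contains a decimal point.
    message = str(error)
    print("int failed: " + message)

converted = to_int("1.25")   # the string goes through float first
truncated = int(1.99)        # int drops the fraction, it does not round

print(converted)
print(truncated)
```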
Now that we have all the machinery required to parse the file, let us solve the
problem. We first read the file line by line and parse each record. We check
whether the region code is B and store the marks accordingly::

    math_marks_B = [] # an empty list to store the marks
    for line in open("/home/fossee/sslc1.txt"):
        fields = line.split(";")

        # the region code is the first field; strip surrounding whitespace
        region_code = fields[0]
        region_code_stripped = region_code.strip()

        # the maths marks are the sixth field; convert the string to a float
        math_mark_str = fields[5]
        math_mark = float(math_mark_str)

        # keep the marks only for students of region "B"
        if region_code_stripped == "B":
            math_marks_B.append(math_mark)

Now we have all the maths marks of region "B" in the list math_marks_B.
To get the mean, we just have to sum the marks and divide by the length::

    math_marks_mean = sum(math_marks_B) / len(math_marks_B)
    math_marks_mean

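The same steps can be gathered into one function, sketched below under a few assumptions. The function name mean_math_marks and the four sample records are made up for illustration and are not taken from the real file. The sketch also assumes that "AA" in a marks field stands for a missing mark that should be skipped; that reading is an assumption of this sketch.

```python
def mean_math_marks(lines, region):
    """Return the mean maths mark for one region, or None if there are none."""
    marks = []
    for line in lines:
        fields = line.split(";")
        if fields[0].strip() != region:
            continue
        math_mark_str = fields[5].strip()
        # Assumption: "AA" marks a missing value, so such records are skipped.
        if math_mark_str == "AA":
            continue
        marks.append(float(math_mark_str))
    if len(marks) == 0:
        return None
    return sum(marks) / len(marks)

# Made-up records in the same layout as the sample record.
sample_lines = [
    "B;000001;STUDENT ONE;080;070;40;050;060;300;P;;\n",
    " B ;000002;STUDENT TWO;080;070;60;050;060;320;P;;\n",
    "A;000003;STUDENT THREE;080;070;90;050;060;350;P;;\n",
    "B;000004;STUDENT FOUR;080;070;AA;050;060;260;;;\n",
]

print(mean_math_marks(sample_lines, "B"))
```

With the real file, open("/home/fossee/sslc1.txt") could be passed as lines, since iterating over a file yields one line at a time.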
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt
 * how to tokenize a string using various delimiters
 * how to get rid of extra whitespace around a string
 * how to convert from one type to another
 * how to parse input data and perform computations on it

{{{ Show the "sponsored by FOSSEE" slide }}}

#[Nishanth]: Will add this line after all of us fix on one.
This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India.

Hope you have enjoyed it and found it useful.
Thank you.

.. Author              : Nishanth
   Internal Reviewer 1 :
   Internal Reviewer 2 :
   External Reviewer   :