|
1 .. Author : Nishanth |
|
2 Internal Reviewer 1 : |
|
3 Internal Reviewer 2 : |
|
4 External Reviewer : |
|
5 |
1 Hello friends and welcome to the tutorial on Parsing Data |
6 Hello friends and welcome to the tutorial on Parsing Data |
2 |
7 |
3 {{{ Show the slide containing title }}} |
8 {{{ Show the slide containing title }}} |
4 |
9 |
5 {{{ Show the slide containing the outline slide }}} |
10 {{{ Show the slide containing the outline slide }}} |
6 |
11 |
7 In this tutorial, we shall learn |
12 In this tutorial, we shall learn |
8 |
13 |
9 * What is parsing data |
14 * What we mean by parsing data |
10 * the string operations required for parsing data |
15 * the string operations required for parsing data |
11 * datatype conversion |
16 * datatype conversion |
12 |
17 |
|
18 #[Puneeth]: Changed a few things, here. |
|
19 |
|
20 #[Puneeth]: I don't like the way the term "parsing data" has been used, all |
|
21 through the script. See if that can be changed. |
|
22 |
13 Let us have a look at the problem |
23 Let us have a look at the problem |
14 |
24 |
15 {{{ Show the slide containing problem statement. }}} |
25 {{{ Show the slide containing problem statement. }}} |
16 |
26 |
17 There is an input file containing huge no.of records. Each record corresponds |
27 There is an input file containing a huge number of records. Each record corresponds |
18 to a student. |
28 to a student. |
19 |
29 |
20 {{{ show the slide explaining record structure }}} |
30 {{{ show the slide explaining record structure }}} |
21 As you can see, each record consists of fields separated by a ";". The first |
31 As you can see, each record consists of fields separated by a ";". The first |
22 field is region code, then roll number, then name, marks of second language, |
32 field is region code, then roll number, then name, marks of second language, |
26 Our job is to calculate the mean of all the maths marks in the region "B". |
36 Our job is to calculate the mean of all the maths marks in the region "B". |
27 |
37 |
28 #[Nishanth]: Please note that I am not telling anything about AA since they do |
38 #[Nishanth]: Please note that I am not telling anything about AA since they do |
29 not know about any if/else yet. |
39 not know about any if/else yet. |
30 |
40 |
|
41 #[Puneeth]: Should we talk pass/fail etc? I think we should make the problem |
|
42 simple and leave out all the columns after total marks. |
31 |
43 |
32 Now, what is parsing data? |
44 Now, what is parsing data? |
33 |
45 |
34 From the input file, we can see that there is data in the form of text. Hence |
46 From the input file, we can see that the data we have is in the form of |
35 parsing data is all about reading the data and converting it into a form which |
47 text. Parsing this data is all about reading it and converting it into a form |
36 can be used for computations. In our case, that is numbers. |
48 which can be used for computations -- in our case, a sequence of numbers. |
|
49 |
|
50 #[Puneeth]: should the word tokenizing, be used? Should it be defined before |
|
51 using it? |
37 |
52 |
38 We can clearly see that the problem involves reading files and tokenizing. |
53 We can clearly see that the problem involves reading files and tokenizing. |
39 |
54 |
|
55 #[Puneeth]: the sentence above seems kinda redundant. |
|
56 |
40 Let us learn about tokenizing strings. Let us define a string first. Type |
57 Let us learn about tokenizing strings. Let us define a string first. Type |
41 :: |
58 :: |
42 |
59 |
43 line = "parse this string" |
60 line = "parse this string" |
44 |
61 |
45 We are now going to split this string on whitespace. |
62 We are now going to split this string on whitespace. |
46 :: |
63 :: |
47 |
64 |
48 line.split() |
65 line.split() |
49 |
66 |
50 As you can see, we get a list of strings. Which means, when split is called |
67 As you can see, we get a list of strings. Which means, when ``split`` is called |
51 without any arguments, it splits on whitespace. In simple words, all the spaces |
68 without any arguments, it splits on whitespace. In simple words, all the spaces |
52 are treated as one big space. |
69 are treated as one big space. |
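The behaviour described above can be checked with a small sketch. The variable ``spaced`` is a hypothetical example, not part of the original script; it holds the same words separated by a mix of spaces, a tab and a newline.

```python
# Splitting without an argument splits on any run of whitespace.
line = "parse this string"
tokens = line.split()
print(tokens)            # ['parse', 'this', 'string']

# Hypothetical string: same words, but with extra spaces, a tab and a newline.
spaced = "  parse \t this\nstring  "
spaced_tokens = spaced.split()
print(spaced_tokens)     # ['parse', 'this', 'string']
```

Leading and trailing whitespace is dropped as well, so both calls give the same list.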
53 |
70 |
54 split can also split on a string of our choice. This is achieved by passing |
71 ``split`` can also split on a string of our choice. This is achieved by passing |
55 that as an argument. But first let's define a sample record from the file. |
72 that as an argument. But first let's define a sample record from the file. |
56 :: |
73 :: |
57 |
74 |
58 record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;" |
75 record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;" |
59 record.split(';') |
76 record.split(';') |
60 |
77 |
61 We can see that the string is split on ';' and we get each field separately. |
78 We can see that the string is split on ';' and we get each field separately. |
62 We can also observe that empty strings appear in the list wherever there are |
79 We can also observe that empty strings appear in the list wherever there are |
63 two semicolons without anything in between. |
80 two semicolons without anything in between. |
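As a quick check of the point about empty strings, the sketch below splits the sample record and inspects the result. The name ``fields`` is introduced here only for illustration.

```python
# Split the sample record on ';' and look at what comes back.
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
fields = record.split(';')

print(fields[0])      # A
print(fields[2])      # JOSEPH RAJ S
print(len(fields))    # 12
print(fields[-3:])    # ['', '', '']
```

The record ends with three semicolons, so the last three entries of the list are empty strings. Note that the name field keeps its internal spaces, because we split only on ';'.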
64 |
81 |
65 Hence split splits on whitespace if called without an argument and splits on |
82 To recap, ``split`` splits on whitespace if called without an argument and |
66 the given argument if it is called with an argument. |
83 splits on the given argument if it is called with an argument. |
67 |
84 |
68 {{{ Pause here and try out the following exercises }}} |
85 {{{ Pause here and try out the following exercises }}} |
69 |
86 |
70 %% 1 %% split the variable line using a space as argument. Is it the same as |
87 %% 1 %% split the variable line using a space as argument. Is it the same as |
71 splitting without an argument? |
88 splitting without an argument? |
73 {{{ continue from paused state }}} |
90 {{{ continue from paused state }}} |
74 |
91 |
75 We see that when we split on a space, consecutive spaces are not clubbed as one |
92 We see that when we split on a space, consecutive spaces are not clubbed as one |
76 and there is an empty string every time there are two consecutive spaces. |
93 and there is an empty string every time there are two consecutive spaces. |
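The difference can be seen side by side in a short sketch. The string ``double_spaced`` is a hypothetical example with two spaces between the first two words; it is not the ``line`` variable defined earlier.

```python
# Hypothetical string with two spaces between "parse" and "this".
double_spaced = "parse  this string"

on_space = double_spaced.split(' ')   # split on a single space character
on_default = double_spaced.split()    # split on runs of whitespace

print(on_space)     # ['parse', '', 'this', 'string']
print(on_default)   # ['parse', 'this', 'string']
```

When the string has only single spaces between words and none at either end, the two calls return the same list.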
77 |
94 |
78 Now that we know splitting a string, we can split the record and retreive each |
95 Now that we know how to split a string, we can split the record and retrieve |
79 field separately. But there is one problem. The region code "B" and a "B" |
96 each field separately. But there is one problem. The region code "B" and a "B" |
80 surrounded by whitespace are treated as two different regions. We must find a |
97 surrounded by whitespace are treated as two different regions. We must find a |
81 way to remove all the whitespace around a string so that "B" and a "B" with |
98 way to remove all the whitespace around a string so that "B" and a "B" with |
82 whitespace are treated as the same. |
99 whitespace are treated as the same. |
83 |
100 |
84 This is possible by using the =strip= method of strings. Let us define a |
101 This is possible by using the ``strip`` method of strings. Let us define a |
85 string by typing |
102 string by typing |
86 :: |
103 :: |
87 |
104 |
88 unstripped = " B " |
105 unstripped = " B " |
89 unstripped.strip() |
106 unstripped.strip() |
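A minimal sketch of why stripping matters for the region code: the comparison with "B" fails before stripping and succeeds after it. The names ``stripped``, ``matches_before`` and ``matches_after`` are introduced here only for illustration.

```python
# strip() removes whitespace from both ends and returns a new string.
unstripped = " B "
stripped = unstripped.strip()

matches_before = (unstripped == "B")   # False: the spaces are part of the string
matches_after = (stripped == "B")      # True: the spaces are gone

print(repr(stripped))                  # 'B'
print(matches_before, matches_after)   # False True
```

The original string is left unchanged, since strings cannot be modified in place.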
108 By now we know enough to separate fields from the record and to strip out any |
125 By now we know enough to separate fields from the record and to strip out any |
109 whitespace. The only roadblock we now have is conversion of string to float. |
126 whitespace. The only roadblock we now have is conversion of string to float. |
110 |
127 |
111 The splitting and stripping operations are done on a string and their results are |
128 The splitting and stripping operations are done on a string and their results are |
112 also strings. Hence the marks that we have are still strings and mathematical |
129 also strings. Hence the marks that we have are still strings and mathematical |
113 operations are not possible. We must convert them into integers or floats |
130 operations are not possible on them. We must convert them into numbers |
114 |
131 (integers or floats), before we can perform mathematical operations on them. |
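To see why the conversion is needed, here is a small sketch with two hypothetical marks held as strings. Adding the strings joins them together, while adding the converted values gives the arithmetic sum. It uses the built-in ``float`` function.

```python
# Hypothetical marks, as they would come out of split(): still strings.
first = "83"
second = "42"

joined = first + second                # string concatenation, not addition
total = float(first) + float(second)   # convert first, then add

print(joined)   # 8342
print(total)    # 125.0
```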
115 We shall look at converting strings into floats. We define an float string |
132 |
116 first. Type |
133 We shall look at converting strings into floats. We define a float string |
|
134 first. Type |
117 :: |
135 :: |
118 |
136 |
119 mark_str = "1.25" |
137 mark_str = "1.25" |
120 mark = float(mark_str) |
138 mark = float(mark_str) |
121 type(mark_str) |
139 type(mark_str) |