.. parsing_data.rst
   author amit, Wed, 22 Sep 2010 14:56:22 +0530
   changeset 179 1d04b6c5ff44, parent 140 bc023595e167, child 197 97d859b70f51
   First Review for parsing_data.rst

Hello friends and welcome to the tutorial on Parsing Data

{{{ Show the slide containing title }}}

{{{ Show the slide containing the outline slide }}}

In this tutorial, we shall learn

 * what parsing data is
 * the string operations required for parsing data
 * datatype conversion

Let us have a look at the problem.

{{{ Show the slide containing problem statement. }}}

There is an input file containing a huge number of records. Each record
corresponds to a student.

{{{ show the slide explaining record structure }}}

As you can see, each record consists of fields separated by a ";". The first
field is the region code, then the roll number, then the name, the marks in
second language, first language, maths, science and social, the total marks,
pass/fail indicated by P or F, and finally W if the result is withheld and
empty otherwise.

Our job is to calculate the mean of all the maths marks in the region "B".

#[Nishanth]: Please note that I am not telling anything about AA since they do
             not know about any if/else yet.

So what exactly is parsing data?

Parsing data is all about reading the data and converting it into a form which
can be used for computations. In our case, that is numbers.

We can clearly see that the problem involves reading files and tokenizing.
Tokenizing means breaking a string into smaller pieces, called tokens, at a
chosen separator.

.. #[[Amit:Definition of Tokenizing here.]]

Let us learn about tokenizing strings. Let us define a string first. Type
::

    line = "parse this           string"

We are now going to split this string on whitespace.
::

    line.split()

As you can see, we get a list of strings. This means that when split is called
without any arguments, it splits on whitespace. In simple words, each run of
consecutive spaces is treated as one big space.

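As a quick check of this behaviour, the sketch below splits the tutorial's string and then a second, made-up string that contains a tab and a newline. The second string and the variable names tokens, mixed and mixed_tokens are only illustrations and are not part of the tutorial data.

```python
# The string from the tutorial: one space, then a long run of spaces.
line = "parse this           string"
tokens = line.split()
print(tokens)        # ['parse', 'this', 'string']

# Hypothetical second example: tabs and newlines also count as whitespace.
mixed = "parse\tthis \n string"
mixed_tokens = mixed.split()
print(mixed_tokens)  # ['parse', 'this', 'string']
```

In both cases no empty strings appear in the result, however long the run of whitespace is.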
split can also split on a string of our choice. This is achieved by passing
that string as an argument. But first, let us define a sample record from the
file.
::

    record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
    record.split(';')

We can see that the string is split on ';' and we get each field separately.
We can also observe that empty strings appear at the end of the list, since the
record ends with semicolons that have nothing between or after them.

Hence split splits on whitespace if called without an argument and splits on
the given argument if it is called with an argument.

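To make the result concrete, here is the same call on the sample record with a few fields picked out by index. The variable name fields is chosen only for this illustration.

```python
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
fields = record.split(";")

print(len(fields))   # 12
print(fields[0])     # A
print(fields[2])     # JOSEPH RAJ S  (the spaces inside the name are kept)
print(fields[-3:])   # ['', '', '']
```

Note that the name stays in one piece: once an argument is given, split cuts only at that argument and no longer at whitespace.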
{{{ Pause here and try out the following exercises }}}

%% 1 %% Split the variable line using a space as the argument. Is it the same
        as splitting without an argument?

{{{ continue from paused state }}}

We see that when we split on a space, multiple whitespaces are not clubbed as
one and there is an empty string every time there are two consecutive spaces.

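The difference is easiest to see on a very short string. The sketch below uses a made-up string with exactly two spaces between the letters; the names text, no_arg and with_space are only illustrations.

```python
text = "a  b"   # hypothetical string: exactly two spaces between a and b

no_arg = text.split()         # splits on runs of whitespace
with_space = text.split(" ")  # splits at every single space

print(no_arg)      # ['a', 'b']
print(with_space)  # ['a', '', 'b']
```

The empty string in the second result is what lies between the two consecutive spaces.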
Now that we know how to split a string, we can split the record and retrieve
each field separately. But there is one problem. The region code "B" and a "B"
surrounded by whitespace are treated as two different regions. We must find a
way to remove all the whitespace around a string so that "B" and a "B" with
whitespace around it are treated as the same.

This is possible by using the =strip= method of strings. Let us define a
string by typing
::

    unstripped = "     B    "
    unstripped.strip()

We can see that strip removes all the whitespace around the sentence.

{{{ Pause here and try out the following exercises }}}

%% 2 %% What happens to the whitespace inside the sentence when it is stripped?

{{{ continue from paused state }}}

Type
::

    a_str = "         white      space            "
    a_str.strip()

We see that only the whitespace around the sentence is removed and anything
inside remains unaffected.

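A small sketch of why stripping matters for the region code. The string padded is a made-up example with two spaces in every gap, and the names stripped_code and stripped_text are only illustrations.

```python
unstripped = "     B    "      # the region code from the tutorial
padded = "  white  space  "    # hypothetical string, two spaces everywhere

stripped_code = unstripped.strip()
stripped_text = padded.strip()

print(repr(stripped_code))     # 'B'
print(repr(stripped_text))     # 'white  space'

# The comparison that matters for the region code:
print(unstripped == "B")       # False
print(stripped_code == "B")    # True
```

Only the stripped value compares equal to "B", which is why the region code has to be stripped before it is tested.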
By now we know enough to separate fields from the record and to strip out any
whitespace. The only roadblock we now have is the conversion of a string to a
float.

The splitting and stripping operations are done on a string and their results
are also strings, hence the marks that we have are still strings and
mathematical operations on them are not possible. We must convert them into
integers or floats.

We shall look at converting strings into floats. We define a string that holds
a float first. Type
::

    mark_str = "1.25"
    mark = float(mark_str)
    type(mark_str)
    type(mark)

We can see that the string is converted to a float. We can perform mathematical
operations on it now.

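The sketch below shows what the conversion buys us. It reuses mark_str and mark from above; the names total and repeated are only illustrations.

```python
mark_str = "1.25"
mark = float(mark_str)

print(type(mark_str).__name__)  # str
print(type(mark).__name__)      # float

# Arithmetic works on the float ...
total = mark + 1
print(total)      # 2.25

# ... whereas multiplying the string only repeats its characters.
repeated = mark_str * 2
print(repeated)   # 1.251.25
```

This is why the marks must be converted before we sum them.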
{{{ Pause here and try out the following exercises }}}

%% 3 %% What happens if you do int("1.25")?

{{{ continue from paused state }}}

.. #[[Amit:I think there should be some interaction first here about the
   problem before we conclude to talking about the result.]]

It raises an error since converting a float string into an integer directly is
not possible. It involves an intermediate step of converting to float.
::

    dcml_str = "1.25"
    flt = float(dcml_str)
    flt
    number = int(flt)
    number

Using =int= it is possible to convert floats into integers. The fractional part
is simply dropped.

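For those who want to see the error without stopping the session, here is a minimal sketch. It uses try/except, which goes beyond what this tutorial covers, and the names direct_ok and truncated are only illustrations.

```python
dcml_str = "1.25"

try:
    int(dcml_str)
    direct_ok = True
except ValueError:
    # int() refuses a string that contains a decimal point
    direct_ok = False

number = int(float(dcml_str))   # go through float first
truncated = int(float("1.99"))  # the fraction is dropped, not rounded

print(direct_ok)   # False
print(number)      # 1
print(truncated)   # 1
```

If rounding is wanted instead of truncation, the built-in round can be applied to the float before calling int.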
Now that we have all the machinery required to parse the file, let us solve the
problem. We first read the file line by line and parse each record. We check
whether the region code is B and store the marks accordingly.
::

    math_marks_B = [] # an empty list to store the marks
    for line in open("/home/fossee/sslc1.txt"):
        fields = line.split(";")

        # the region code is the first field; strip whitespace around it
        region_code = fields[0]
        region_code_stripped = region_code.strip()

        # the maths marks are the sixth field
        math_mark_str = fields[5]
        math_mark = float(math_mark_str)

        if region_code_stripped == "B":
            math_marks_B.append(math_mark)

.. #[[Amit:This intuitively does not seem to be what you wanted]]

Now we have all the maths marks of region "B" in the list math_marks_B.
To get the mean, we just have to sum the marks and divide by the length.
::

    math_marks_mean = sum(math_marks_B) / len(math_marks_B)
    math_marks_mean

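The whole computation can be tried without the real data file. The sketch below is an illustration under stated assumptions, not the tutorial's own solution: only the first record is the tutorial's sample, the other three are made up, the records are written to a temporary file, and a maths mark recorded as "AA" is assumed to be non-numeric and is skipped.

```python
import os
import tempfile

# First record is the tutorial's sample; the rest are made up for this sketch.
sample = (
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n"
    " B ;015164;HYPOTHETICAL ONE;050;060;70;65;80;325;P;;\n"
    "B;015165;HYPOTHETICAL TWO;050;060;AA;65;80;255;;;\n"
    "B;015166;HYPOTHETICAL THREE;050;060;50;65;80;305;P;;\n"
)

# Write the sample to a temporary file so the loop reads a real file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

math_marks_B = []
with open(path) as records:
    for line in records:
        fields = line.split(";")
        region_code = fields[0].strip()
        math_mark_str = fields[5].strip()
        # Assumption: a non-numeric mark such as "AA" is simply skipped.
        if region_code == "B" and math_mark_str != "AA":
            math_marks_B.append(float(math_mark_str))

os.remove(path)

math_marks_mean = sum(math_marks_B) / len(math_marks_B)
print(math_marks_B)     # [70.0, 50.0]
print(math_marks_mean)  # 60.0
```

Note that the division fails with a ZeroDivisionError if no record of region "B" is found, so a real script may want to check that the list is not empty first.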
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

 * how to tokenize a string using various delimiters
 * how to get rid of extra whitespace around a string
 * how to convert from one datatype to another
 * how to parse input data and perform computations on it

.. #[[Amit:one datatype to another may be better.]]

{{{ Show the "sponsored by FOSSEE" slide }}}

#[Nishanth]: Will add this line after all of us fix on one.
This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India.

Hope you have enjoyed it and found it useful.
Thank you

.. Author              : Nishanth
   Internal Reviewer 1 : Amit Sethi
   Internal Reviewer 2 :
   External Reviewer   :