parsing_data.rst
author nishanth
Wed, 15 Sep 2010 18:50:17 +0530
changeset 133 bc93dd9d22c5
child 134 543c1cc488ca
permissions -rw-r--r--
initial commit parsing_data
Hello friends and welcome to the tutorial on Parsing Data

{{{ Show the slide containing title }}}

{{{ Show the slide containing the outline slide }}}

In this tutorial, we shall learn

 * what parsing data means
 * the string operations required for parsing data
 * datatype conversion

Let us have a look at the problem.

{{{ Show the slide containing problem statement. }}}

There is an input file containing a huge number of records. Each record
corresponds to a student.

{{{ show the slide explaining record structure }}}

As you can see, each record consists of fields separated by a ";". The first
field is the region code, then the roll number, then the name, the marks in
second language, first language, maths, science and social, the total marks,
pass/fail indicated by P or F, and finally W if the result is withheld and
empty otherwise.

Our job is to calculate the mean of all the maths marks in the region "B".

#[Nishanth]: Please note that I am not telling anything about AA since they do
             not know about any if/else yet.

Now, what is parsing data?

From the input file, we can see that the data is in the form of text. Parsing
data is all about reading that text and converting it into a form which can be
used for computations. In our case, that form is numbers.

We can clearly see that the problem involves reading files and tokenizing.

Let us learn about tokenizing strings. Let us define a string first. Type::

    line = "parse this           string"

We are now going to split this string on whitespace::

    line.split()

As you can see, we get a list of strings. This means that when split is called
without any arguments, it splits on whitespace. In simple words, all the
consecutive spaces are treated as one big space.

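To make the result concrete, here is a small sketch of the same call together with the list it returns. The name ``tokens`` is introduced here only for illustration; it is not part of the session above.

```python
# Same string as above: several consecutive spaces between the words.
line = "parse this           string"

# With no argument, split() treats any run of whitespace as one separator
# and ignores whitespace at the start and end of the string.
tokens = line.split()
print(tokens)  # ['parse', 'this', 'string']
```
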
split can also split on a string of our choice. This is achieved by passing
that string as an argument. But first, let us define a sample record from the
file::

    record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
    record.split(';')

We can see that the string is split on ';' and we get each field separately.
We can also observe that an empty string appears in the list wherever there are
two semicolons without anything in between.

Hence split splits on whitespace if called without an argument and splits on
the given argument if it is called with an argument.

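As a sketch of how the list lines up with the record structure described earlier, each field can be picked out by its position. The variable names ``roll_number`` and ``name`` are chosen here for illustration only.

```python
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
fields = record.split(';')

# Positions follow the record structure: region code, roll number, name,
# second language, first language, maths, ...
region_code = fields[0]    # 'A'
roll_number = fields[1]    # '015163'
name = fields[2]           # 'JOSEPH RAJ S'
math_mark_str = fields[5]  # '47', still a string at this point

# Adjacent semicolons produce empty strings, so this record ends with
# three empty fields.
print(len(fields))   # 12
print(fields[-3:])   # ['', '', '']
```

Note that the spaces inside the name are left alone, since only ';' acts as the separator here.
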
{{{ Pause here and try out the following exercises }}}

%% 1 %% Split the variable line using a space as argument. Is it the same as
        splitting without an argument?

{{{ continue from paused state }}}

We see that when we split on a space, multiple whitespaces are not clubbed as
one, and there is an empty string every time there are two consecutive spaces.

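The difference between the two calls can be sketched side by side. The names ``on_space``, ``on_whitespace`` and ``non_empty`` are illustrative, and the last step uses a list comprehension, which this tutorial has not introduced.

```python
line = "parse this           string"

# Splitting on a single space: every space is its own separator, so each
# pair of consecutive spaces leaves an empty string in the result.
on_space = line.split(' ')

# Splitting with no argument: runs of whitespace are collapsed.
on_whitespace = line.split()

print('' in on_space)        # True
print('' in on_whitespace)   # False

# Dropping the empty strings recovers the same tokens.
non_empty = [token for token in on_space if token != '']
print(non_empty == on_whitespace)  # True
```
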
Now that we know how to split a string, we can split the record and retrieve
each field separately. But there is one problem. The region code "B" and a "B"
surrounded by whitespace are treated as two different regions. We must find a
way to remove all the whitespace around a string so that "B" and a "B" with
white space around it are treated as the same.

This is possible by using the =strip= method of strings. Let us define a
string by typing::

    unstripped = "     B    "
    unstripped.strip()

We can see that strip removes all the whitespace around the sentence.

{{{ Pause here and try out the following exercises }}}

%% 2 %% What happens to the white space inside the sentence when it is stripped?

{{{ continue from paused state }}}

Type::

    a_str = "         white      space            "
    a_str.strip()

We see that only the whitespace around the sentence is removed and anything
inside remains unaffected.

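A short sketch of what strip does and does not touch, using the two strings from above. The name ``stripped`` is introduced only for this illustration.

```python
unstripped = "     B    "
a_str = "         white      space            "

# Whitespace at both ends is removed, so the padded region code
# compares equal to a plain "B".
print(unstripped.strip() == "B")   # True

stripped = a_str.strip()
# The ends are clean ...
print(stripped.startswith("white"), stripped.endswith("space"))  # True True
# ... but the run of spaces between the two words is still there.
print("  " in stripped)            # True

# strip() returns a new string; the original is left unchanged.
print(a_str == stripped)           # False
```
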
By now we know enough to separate fields from the record and to strip out any
white space. The only roadblock we now have is the conversion of a string to a
float.

The splitting and stripping operations are done on a string and their result is
also a string. Hence the marks that we have are still strings, and mathematical
operations on them are not possible. We must convert them into integers or
floats.

We shall look at converting strings into floats. We define a float string
first. Type::

    mark_str = "1.25"
    mark = float(mark_str)
    mark_str
    mark

We can see that the string is converted to a float. We can now perform
mathematical operations on it.

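A minimal sketch of the string-to-float conversion, showing that the converted value behaves like a number while the original string does not. The name ``joined`` is illustrative.

```python
mark_str = "1.25"
mark = float(mark_str)

# On the string, + joins text; on the float, + adds numbers.
joined = mark_str + "1"
print(joined)      # 1.251
print(mark + 1)    # 2.25

# float() also accepts integer-looking text and ignores surrounding
# whitespace.
print(float("  47 "))  # 47.0
```
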
{{{ Pause here and try out the following exercises }}}

%% 3 %% What happens if you do int("1.25")?

{{{ continue from paused state }}}

It raises an error since converting a float string into an integer directly is
not possible. It involves an intermediate step of converting to float::

    dcml_str = "1.25"
    flt = float(dcml_str)
    flt
    number = int(flt)
    number

Using =int= it is also possible to convert floats into integers.

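The two behaviours of int can be sketched as follows. The try/except construct is not covered in this tutorial and is used here only so that the example runs to the end; ``direct_ok`` is an illustrative name.

```python
dcml_str = "1.25"

# int() on a string accepts only integer text, so this raises ValueError.
try:
    int(dcml_str)
    direct_ok = True
except ValueError:
    direct_ok = False
print(direct_ok)   # False

# Going through float first works; int() then drops the fractional part.
number = int(float(dcml_str))
print(number)      # 1

# int() truncates towards zero rather than rounding.
print(int(2.99), int(-2.99))  # 2 -2
```
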
Now that we have all the machinery required to parse the file, let us solve the
problem. We first read the file line by line and parse each record. We check
whether the region code is B and store the marks accordingly::

    math_marks_B = [] # an empty list to store the marks
    for line in open("/home/fossee/sslc1.txt"):
        # fields in a record are separated by ';'
        fields = line.split(";")

        # ignore any whitespace around the region code
        region_code = fields[0]
        region_code_stripped = region_code.strip()

        # maths is the sixth field
        math_mark_str = fields[5]
        math_mark = float(math_mark_str)

        if region_code_stripped == "B":
            math_marks_B.append(math_mark)

Now we have all the maths marks of region "B" in the list math_marks_B.
To get the mean, we just have to sum the marks and divide by the length::

    math_marks_mean = sum(math_marks_B) / len(math_marks_B)
    math_marks_mean

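Putting the pieces together, here is a self-contained sketch that works on a few records held in a list instead of the file, so that it can be tried without the input file. Only the first record comes from this tutorial; the others are made up for illustration, and the function name ``mean_math_mark`` is hypothetical. As an added assumption, it skips records whose maths field cannot be converted (for example AA), a case the walkthrough above does not handle.

```python
def mean_math_mark(lines, region):
    """Return the mean maths mark for one region, or None if there are none."""
    marks = []
    for line in lines:
        fields = line.split(";")
        if len(fields) < 6:
            continue  # too few fields to contain a maths mark
        if fields[0].strip() != region:
            continue  # record belongs to another region
        try:
            marks.append(float(fields[5].strip()))
        except ValueError:
            continue  # e.g. "AA" in place of a mark
    if not marks:
        return None
    return sum(marks) / len(marks)


# Sample data: only the first record is from the tutorial; the rest are
# invented to exercise the code.
sample_lines = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;",
    "B;000001;STUDENT ONE;070;060;80;075;065;350;P;;",
    "  B ;000002;STUDENT TWO;055;045;60;050;040;250;P;;",
    "B;000003;STUDENT THREE;040;035;AA;030;020;125;F;;",
]

print(mean_math_mark(sample_lines, "B"))  # 70.0
```

The same function could be called with an open file in place of ``sample_lines``, since iterating over a file also yields one line at a time.
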
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

 * how to tokenize a string using various delimiters
 * how to get rid of extra white space around a string
 * how to convert from one type to another
 * how to parse input data and perform computations on it

{{{ Show the "sponsored by FOSSEE" slide }}}

#[Nishanth]: Will add this line after all of us fix on one.
This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India

Hope you have enjoyed and found it useful.
Thank you

.. Author              : Nishanth
   Internal Reviewer 1 :
   Internal Reviewer 2 :
   External Reviewer   :