parsing_data.rst
       
     1 Hello friends and welcome to the tutorial on Parsing Data.
       
     2 
       
     3 {{{ Show the slide containing title }}}
       
     4 
       
     5 {{{ Show the outline slide }}}
       
     6 
       
     7 In this tutorial, we shall learn
       
     8 
       
     9  * what parsing data is
       
    10  * the string operations required for parsing data
       
    11  * datatype conversion
       
    12 
       
    13 Let us have a look at the problem.
       
    14 
       
    15 {{{ Show the slide containing problem statement. }}}
       
    16 
       
    17 There is an input file containing a huge number of records. Each record
       
    18 corresponds to a student.
       
    19 
       
    20 {{{ show the slide explaining record structure }}}
       
    21 As you can see, each record consists of fields separated by a ";". The first
       
    22 field is the region code, then roll number, name, marks of second language,
       
    23 first language, maths, science and social, total marks, pass/fail indicated by
       
    24 P or F, and finally W if the result is withheld and empty otherwise.
       
    25 
       
    26 Our job is to calculate the mean of all the maths marks in the region "B".
       
    27 
       
    28 #[Nishanth]: Please note that I am not telling anything about AA since they do
       
    29              not know about any if/else yet.
       
    30 
       
    31 
       
    32 Now, what is parsing data?
       
    33 
       
    34 From the input file, we can see that there is data in the form of text. Hence
       
    35 parsing data is all about reading the data and converting it into a form which
       
    36 can be used for computations. In our case, that is numbers.
       
    37 
       
    38 We can clearly see that the problem involves reading files and tokenizing.
       
    39 
       
    40 Let us learn about tokenizing strings. Let us define a string first. Type
       
    41 ::
       
    42 
       
    43     line = "parse this           string"
       
    44 
       
    45 We are now going to split this string on whitespace.
       
    46 ::
       
    47 
       
    48     line.split()
       
    49 
       
    50 As you can see, we get a list of strings. This means that when split is called
       
    51 without any arguments, it splits on whitespace. In simple words, any run of
       
    52 consecutive whitespace is treated as one big separator.
       
    53 
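As a small illustration of this behaviour, the sketch below splits the same
string, and then a second string that mixes tabs and newlines with spaces.
The variable ``mixed`` is hypothetical and is not part of the tutorial's data.

```python
line = "parse this           string"

# With no argument, split() treats any run of whitespace as one separator.
words = line.split()
print(words)

# Hypothetical second string: tabs and newlines count as whitespace too,
# and leading or trailing whitespace produces no empty strings.
mixed = "  parse\tthis \n string  "
print(mixed.split())
```

Both calls should give the same three words.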
       
    54 split can also split on a string of our choice. This is achieved by passing
       
    55 that string as an argument. But first let us define a sample record from the file.
       
    56 ::
       
    57 
       
    58     record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
       
    59     record.split(';')
       
    60 
       
    61 We can see that the string is split on ';' and we get each field separately.
       
    62 We can also observe that empty strings appear in the list wherever there are
       
    63 two semicolons with nothing in between, and after the final semicolon.
       
    64 
       
    65 Hence split splits on whitespace if called without an argument and splits on
       
    66 the given argument if it is called with an argument.
       
    67 
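To make the result of splitting on ';' easier to read, the sketch below pairs
each piece of the sample record with a label. The names in ``FIELD_NAMES`` are
illustrative assumptions that follow the field order in the problem statement.
The file itself carries no such names.

```python
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
fields = record.split(";")

# Illustrative labels (an assumption of this sketch; the file has no header),
# listed in the order given in the problem statement.
FIELD_NAMES = [
    "region_code", "roll_number", "name",
    "second_language", "first_language", "maths", "science", "social",
    "total", "pass_fail", "withheld",
]

# zip() stops at the shorter sequence, so the extra empty string produced
# by the final ';' is simply left out.
labelled = dict(zip(FIELD_NAMES, fields))

print(len(fields))          # number of pieces, one more than the labels
print(labelled["maths"])    # the maths mark, still a string
print(fields.count(""))     # how many empty strings the split produced
```

Note that every value is still a string at this point, including the marks.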
       
    68 {{{ Pause here and try out the following exercises }}}
       
    69 
       
    70 %% 1 %% Split the variable line using a space as the argument. Is it the same
       
    71         as splitting without an argument?
       
    72 
       
    73 {{{ continue from paused state }}}
       
    74 
       
    75 We see that when we split on a space, multiple spaces are not clubbed as one
       
    76 and there is an empty string every time there are two consecutive spaces.
       
    77 
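The difference between the two calls can be checked side by side. This is a
minimal sketch. The list comprehension at the end is only a compact way of
dropping the empty strings and is not something the tutorial has introduced.

```python
line = "parse this           string"

no_arg = line.split()        # runs of whitespace collapse into one separator
on_space = line.split(" ")   # every single space is a separator

print(no_arg)
print(on_space)

# Dropping the empty strings from on_space gives the same words as no_arg.
non_empty = [piece for piece in on_space if piece != ""]
print(non_empty == no_arg)
```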
       
    78 Now that we know how to split a string, we can split the record and retrieve
       
    79 each field separately. But there is one problem. The region code "B" and a "B"
       
    80 surrounded by whitespace are treated as two different regions. We must find a
       
    81 way to remove all the whitespace around a string so that "B" and a "B" with
       
    82 whitespace around it are treated as the same.
       
    83 
       
    84 This is possible by using the ``strip`` method of strings. Let us define a
       
    85 string by typing
       
    86 ::
       
    87 
       
    88     unstripped = "     B    "
       
    89     unstripped.strip()
       
    90 
       
    91 We can see that strip removes all the whitespace around the sentence.
       
    92 
       
    93 {{{ Pause here and try out the following exercises }}}
       
    94 
       
    95 %% 2 %% What happens to the whitespace inside the sentence when it is stripped?
       
    96 
       
    97 {{{ continue from paused state }}}
       
    98 
       
    99 Type
       
   100 ::
       
   101 
       
   102     a_str = "         white      space            "
       
   103     a_str.strip()
       
   104 
       
   105 We see that only the whitespace around the sentence is removed and anything
       
   106 inside remains unaffected.
       
   107 
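The two strip examples can be put together to show why stripping matters for
the region code. This is a small sketch that uses ``repr`` only to make the
remaining whitespace visible in the output.

```python
unstripped = "     B    "
a_str = "         white      space            "

# strip() returns a new string; the original is left unchanged.
print(repr(unstripped.strip()))
print(repr(a_str.strip()))

# Before stripping, the padded region code does not compare equal to "B";
# after stripping, it does.
print(unstripped == "B")
print(unstripped.strip() == "B")
```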
       
   108 By now we know enough to separate fields from the record and to strip out any
       
   109 whitespace. The only roadblock we now have is conversion of string to float.
       
   110 
       
   111 The splitting and stripping operations are done on a string and their results
       
   112 are still strings. Hence the marks that we have are still strings and
       
   113 mathematical operations are not possible. We must convert them into integers or floats.
       
   114 
       
   115 We shall look at converting strings into floats. We define a float string
       
   116 first. Type
       
   117 ::
       
   118 
       
   119     mark_str = "1.25"
       
   120     mark = float(mark_str)
       
   121     type(mark_str)
       
   122     type(mark)
       
   123 
       
   124 We can see that the string is converted to a float. We can perform mathematical
       
   125 operations on it now.
       
   126 
       
   127 {{{ Pause here and try out the following exercises }}}
       
   128 
       
   129 %% 3 %% What happens if you do int("1.25")?
       
   130 
       
   131 {{{ continue from paused state }}}
       
   132 
       
   133 It raises an error since converting a float string into integer directly is
       
   134 not possible. It involves an intermediate step of converting to float.
       
   135 ::
       
   136 
       
   137     dcml_str = "1.25"
       
   138     flt = float(dcml_str)
       
   139     flt
       
   140     number = int(flt)
       
   141     number
       
   142 
       
   143 Using ``int`` it is also possible to convert floats into integers; the fractional part is discarded.
       
   144 
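The conversions discussed above can be collected in one short sketch. The
``try``/``except`` part goes beyond what the tutorial covers and is shown only
as one possible way to see the error without stopping the session.

```python
dcml_str = "1.25"

flt = float(dcml_str)   # string -> float
number = int(flt)       # float -> int, the fractional part is dropped
print(flt)
print(number)

# int() on a decimal string fails; the error can be caught if needed.
try:
    int(dcml_str)
except ValueError as err:
    print("could not convert: " + str(err))

# int() does accept strings that hold whole numbers, including the
# leading zeros seen in the marks fields of the sample record.
print(int("083"))
```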
       
   145 Now that we have all the machinery required to parse the file, let us solve the
       
   146 problem. We first read the file line by line and parse each record. We see if
       
   147 the region code is B and store the marks accordingly.
       
   148 ::
       
   149 
       
   150     math_marks_B = [] # an empty list to store the marks
       
   151     for line in open("/home/fossee/sslc1.txt"):
       
   152         fields = line.split(";")
       
   153 
       
   154         region_code = fields[0]
       
   155         region_code_stripped = region_code.strip()
       
   156 
       
   157         math_mark_str = fields[5]
       
   158         math_mark = float(math_mark_str)
       
   159 
       
   160         if region_code_stripped == "B":  # compare the stripped region code
       
   161             math_marks_B.append(math_mark)
       
   162 
       
   163 
       
   164 Now we have all the maths marks of region "B" in the list math_marks_B.
       
   165 To get the mean, we just have to sum the marks and divide by the length.
       
   166 ::
       
   167 
       
   168     math_marks_mean = sum(math_marks_B) / len(math_marks_B)
       
   169     math_marks_mean
       
   170 
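For trying the whole procedure without the data file, here is a minimal
sketch that runs on a few in-memory lines. Only the first record comes from
this tutorial. The other three are made up for illustration, and skipping
marks that are not whole numbers (such as "AA") is a choice of this sketch,
not something the tutorial prescribes.

```python
# Hypothetical sample records in the same layout as the file; only the
# first one appears in the tutorial, the rest are made up for illustration.
sample_lines = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n",
    "B;015164;STUDENT ONE;060;055;80;70;65;330;P;;\n",
    " B ;015165;STUDENT TWO;050;045;AA;60;55;210;;;\n",
    "B ;015166;STUDENT THREE;070;065;90;75;70;370;P;;\n",
]

math_marks_B = []  # an empty list to store the marks
for line in sample_lines:
    fields = line.split(";")
    region_code = fields[0].strip()
    math_mark_str = fields[5].strip()

    if region_code != "B":
        continue  # not the region we are interested in
    if not math_mark_str.isdigit():
        continue  # e.g. "AA"; assumes valid marks are whole numbers

    math_marks_B.append(float(math_mark_str))

math_marks_mean = sum(math_marks_B) / len(math_marks_B)
print(math_marks_mean)
```

Replacing ``sample_lines`` with ``open("/home/fossee/sslc1.txt")`` should apply
the same steps to the real file, assuming it follows the layout described
earlier.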
       
   171 {{{ Show summary slide }}}
       
   172 
       
   173 This brings us to the end of the tutorial.
       
   174 We have learnt
       
   175 
       
   176  * how to tokenize a string using various delimiters
       
   177  * how to get rid of extra whitespace around a string
       
   178  * how to convert from one type to another
       
   179  * how to parse input data and perform computations on it
       
   180 
       
   181 {{{ Show the "sponsored by FOSSEE" slide }}}
       
   182 
       
   183 #[Nishanth]: Will add this line after all of us fix on one.
       
   184 This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India
       
   185 
       
   186 Hope you have enjoyed and found it useful.
       
   187 Thank you
       
   188  
       
   189 .. Author              : Nishanth
       
   190    Internal Reviewer 1 : 
       
   191    Internal Reviewer 2 : 
       
   192    External Reviewer   :