parsing_data.rst
changeset 133 bc93dd9d22c5
child 134 543c1cc488ca
132:b8f7ee434b91 133:bc93dd9d22c5
       
     1 Hello friends and welcome to the tutorial on Parsing Data
       
     2 
       
     3 {{{ Show the slide containing title }}}
       
     4 
       
     5 {{{ Show the slide containing the outline slide }}}
       
     6 
       
     7 In this tutorial, we shall learn
       
     8 
       
     9  * what parsing data is
       
    10  * the string operations required for parsing data
       
    11  * datatype conversion
       
    12 
       
    13 Let us have a look at the problem.
       
    14 
       
    15 {{{ Show the slide containing problem statement. }}}
       
    16 
       
    17 There is an input file containing a huge number of records. Each record corresponds
       
    18 to a student.
       
    19 
       
    20 {{{ show the slide explaining record structure }}}
       
    21 As you can see, each record consists of fields separated by a ";". The first
       
    22 field is region code, then roll number, then name, marks of second language,
       
    23 first language, maths, science and social, total marks, pass/fail indicated by P
       
    24 or F and finally W if withheld and empty otherwise.
       
    25 
       
    26 Our job is to calculate the mean of all the maths marks in the region "B".
       
    27 
       
    28 #[Nishanth]: Please note that I am not telling anything about AA since they do
       
    29              not know about any if/else yet.
       
    30 
       
    31 
       
    32 Now, what is parsing data?
       
    33 
       
    34 From the input file, we can see that there is data in the form of text. Hence
       
    35 parsing data is all about reading the data and converting it into a form which
       
    36 can be used for computations. In our case, that is numbers.
       
    37 
       
    38 We can clearly see that the problem involves reading files and tokenizing.
       
    39 
       
    40 Let us learn about tokenizing strings. Let us define a string first. Type::
       
    41 
       
    42     line = "parse this           string"
       
    43 
       
    44 We are now going to split this string on whitespace::
       
    45 
       
    46     line.split()
       
    47 
       
    48 As you can see, we get a list of strings. This means that when split is called
       
    49 without any arguments, it splits on whitespace. In simple words, each run of
       
    50 consecutive spaces is treated as one big space.
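
As a quick check of the behaviour just described, here is a minimal sketch that splits the same string and shows the resulting list. The variable name ``tokens`` is only an illustrative choice.

```python
line = "parse this           string"

# With no argument, split() breaks the string on runs of whitespace,
# so the long gap between "this" and "string" produces no empty strings.
tokens = line.split()
print(tokens)  # ['parse', 'this', 'string']
```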
       
    51 
       
    52 split can also split on a string of our choice. This is achieved by passing
       
    53 that string as an argument. But first, let us define a sample record from the file::
       
    54 
       
    55     record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
       
    56     record.split(';')
       
    57 
       
    58 We can see that the string is split on ';' and we get each field separately.
       
    59 We can also observe that empty strings appear in the list wherever there is
       
    60 nothing between two semicolons or after the last one.
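
To make the result concrete, the sketch below splits the same sample record and looks at a few of the fields. The names ``fields`` and ``trailing`` are illustrative only.

```python
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"

# Splitting on ';' keeps every field, including the empty ones at the end.
fields = record.split(';')
print(fields)

# The maths mark is the sixth field (index 5) and is still a string.
print(fields[5])  # 47

# The three semicolons at the end leave three empty strings.
trailing = fields[-3:]
print(trailing)  # ['', '', '']
```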
       
    61 
       
    62 Hence split splits on whitespace if called without an argument and splits on
       
    63 the given argument if it is called with an argument.
       
    64 
       
    65 {{{ Pause here and try out the following exercises }}}
       
    66 
       
    67 %% 1 %% Split the variable line using a space as the argument. Is it the same as
       
    68         splitting without an argument?
       
    69 
       
    70 {{{ continue from paused state }}}
       
    71 
       
    72 We see that when we split on a space, multiple spaces are not clubbed as one
       
    73 and there is an empty string every time there are two consecutive spaces.
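
The difference between the two calls can be sketched as follows. The names ``on_space`` and ``on_whitespace`` are illustrative, and the last line only shows that dropping the empty strings gives back the no-argument result for this string.

```python
line = "parse this           string"

on_space = line.split(' ')    # every single space is a separator
on_whitespace = line.split()  # runs of whitespace act as one separator

print(on_space)        # contains empty strings between "this" and "string"
print(on_whitespace)   # ['parse', 'this', 'string']

# Filtering out the empty strings recovers the no-argument result here.
non_empty = [token for token in on_space if token != '']
print(non_empty == on_whitespace)  # True
```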
       
    74 
       
    75 Now that we know how to split a string, we can split the record and retrieve each
       
    76 field separately. But there is one problem. The region code "B" and a "B"
       
    77 surrounded by whitespace are treated as two different regions. We must find a
       
    78 way to remove all the whitespace around a string so that "B" and a "B" with
       
    79 whitespace around it are treated as the same.
       
    80 
       
    81 This is possible by using the =strip= method of strings. Let us define a
       
    82 string by typing::
       
    83 
       
    84     unstripped = "     B    "
       
    85     unstripped.strip()
       
    86 
       
    87 We can see that strip removes all the whitespace around the string.
       
    88 
       
    89 {{{ Pause here and try out the following exercises }}}
       
    90 
       
    91 %% 2 %% What happens to the whitespace inside the sentence when it is stripped?
       
    92 
       
    93 {{{ continue from paused state }}}
       
    94 
       
    95 Type::
       
    96 
       
    97     a_str = "         white      space            "
       
    98     a_str.strip()
       
    99 
       
   100 We see that only the whitespace around the sentence is removed and anything
       
   101 inside remains unaffected.
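
A small sketch of both observations: strip leaves the inner spaces alone, and a stripped region code compares equal to a plain "B". The name ``stripped`` is illustrative.

```python
a_str = "         white      space            "
stripped = a_str.strip()

# Only the leading and trailing whitespace is gone; the gap between
# the two words is untouched.
print(repr(stripped))

# After stripping, a padded region code matches the plain one.
unstripped = "     B    "
print(unstripped == "B")          # False
print(unstripped.strip() == "B")  # True
```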
       
   102 
       
   103 By now we know enough to separate fields from the record and to strip out any
       
   104 whitespace. The only roadblock we now have is the conversion of strings to floats.
       
   105 
       
   106 The splitting and stripping operations are done on a string and their results
       
   107 are also strings. Hence the marks that we have are still strings and mathematical
       
   108 operations are not possible. We must convert them into integers or floats.
       
   109 
       
   110 We shall look at converting strings into floats. We define a float string
       
   111 first. Type::
       
   112 
       
   113     mark_str = "1.25"
       
   114     mark = float(mark_str)
       
   115     mark_str
       
   116     mark
       
   117 
       
   118 We can see that the string is converted to a float. We can perform mathematical
       
   119 operations on it now.
       
   120 
       
   121 {{{ Pause here and try out the following exercises }}}
       
   122 
       
   123 %% 3 %% What happens if you do int("1.25")?
       
   124 
       
   125 {{{ continue from paused state }}}
       
   126 
       
   127 It raises an error since converting a float string into an integer directly is
       
   128 not possible. It involves an intermediate step of converting to float::
       
   129 
       
   130     dcml_str = "1.25"
       
   131     flt = float(dcml_str)
       
   132     flt
       
   133     number = int(flt)
       
   134     number
       
   135 
       
   136 Using =int= it is also possible to convert floats into integers; the fractional part is discarded.
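
The conversions above can be put side by side as in this sketch. The try/except part is not something the tutorial has introduced; it is used here only so that the failing call does not stop the script, and ``direct_ok`` is an illustrative name.

```python
dcml_str = "1.25"

flt = float(dcml_str)  # string -> float
number = int(flt)      # float -> int, fractional part discarded
print(flt)     # 1.25
print(number)  # 1

# Converting the float string straight to an integer raises ValueError.
try:
    int(dcml_str)
    direct_ok = True
except ValueError:
    direct_ok = False
print(direct_ok)  # False
```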
       
   137 
       
   138 Now that we have all the machinery required to parse the file, let us solve the
       
   139 problem. We first read the file line by line and parse each record. We check whether
       
   140 the region code is B and store the marks accordingly::
       
   141 
       
   142     math_marks_B = [] # an empty list to store the marks
       
   143     for line in open("/home/fossee/sslc1.txt"):
       
   144         fields = line.split(";")
       
   145 
       
   146         region_code = fields[0]
       
   147         region_code_stripped = region_code.strip()
       
   148 
       
   149         math_mark_str = fields[5]
       
   150         math_mark = float(math_mark_str)
       
   151 
       
   152         if region_code_stripped == "B":
       
   153             math_marks_B.append(math_mark)
       
   154 
       
   155 
       
   156 Now we have all the maths marks of region "B" in the list math_marks_B.
       
   157 To get the mean, we just have to sum the marks and divide by the length::
       
   158 
       
   159     math_marks_mean = sum(math_marks_B) / len(math_marks_B)
       
   160     math_marks_mean
       
   161 
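
The whole procedure can also be sketched as one function that works on any sequence of record lines, so it can be tried without the input file. This is only an illustration under stated assumptions: the function name ``mean_math_marks``, the made-up student records, and the choice to skip a non-numeric maths entry such as "AA" are all assumptions and not part of the tutorial; only the first sample record comes from the text above.

```python
def mean_math_marks(lines, region="B"):
    """Return the mean maths mark for one region, or None if there are no marks."""
    marks = []
    for line in lines:
        fields = line.split(";")
        # Strip the region code so that " B " and "B" are treated alike.
        if fields[0].strip() != region:
            continue
        math_mark_str = fields[5].strip()
        try:
            marks.append(float(math_mark_str))
        except ValueError:
            # Assumption: a non-numeric entry such as "AA" carries no mark.
            continue
    if not marks:
        return None
    return sum(marks) / len(marks)


# Hypothetical records in the same layout as the tutorial's file.
sample = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;",
    "B;000001;STUDENT ONE;050;060;70;65;55;300;P;;",
    " B ;000002;STUDENT TWO;040;050;80;45;35;250;P;;",
    "B;000003;STUDENT THREE;030;040;AA;35;25;130;F;;",
]

result = mean_math_marks(sample)
print(result)  # 75.0
```

An open file object yields lines in the same way, so under the same assumptions the function could be given the result of open on the tutorial's input file in place of ``sample``.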
       
   162 {{{ Show summary slide }}}
       
   163 
       
   164 This brings us to the end of the tutorial.
       
   165 We have learnt
       
   166  * how to tokenize a string using various delimiters
       
   167  * how to get rid of extra whitespace around a string
       
   168  * how to convert from one type to another
       
   169  * how to parse input data and perform computations on it
       
   170 
       
   171 {{{ Show the "sponsored by FOSSEE" slide }}}
       
   172 
       
   173 #[Nishanth]: Will add this line after all of us fix on one.
       
   174 This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India
       
   175 
       
   176 Hope you have enjoyed it and found it useful.
       
   177 Thank you
       
   178  
       
   179 .. Author              : Nishanth
       
   180    Internal Reviewer 1 : 
       
   181    Internal Reviewer 2 : 
       
   182    External Reviewer   :