parsing_data/script.rst
changeset 238 c507e9c413c6
child 332 b702c10e5919
237:6c203780bfbe 238:c507e9c413c6
       
     1 .. Objectives
       
     2 .. ----------
       
     3 
       
     4 .. A - Students and teachers from Science and engineering backgrounds
       
     5    B - 
       
     6    C - 
       
     7    D - 
       
     8 
       
     9 .. Prerequisites
       
    10 .. -------------
       
    11 
       
    12 ..   1. Getting started with lists
       
    13      
       
    14 .. Author              : Nishanth Amuluru
       
    15    Internal Reviewer   : 
       
    16    External Reviewer   :
       
    17    Checklist OK?       : <put date stamp here, if OK> [2010-10-05]
       
    18 
       
    19 Script
       
    20 ------
       
    21 
       
     22 Hello friends, and welcome to the tutorial on Parsing Data.
       
    23 
       
    24 {{{ Show the slide containing title }}}
       
    25 
       
    26 {{{ Show the slide containing the outline slide }}}
       
    27 
       
    28 In this tutorial, we shall learn
       
    29 
       
     30  * what we mean by parsing data
       
    31  * the string operations required for parsing data
       
    32  * datatype conversion
       
    33 
       
    34 #[Puneeth]: Changed a few things, here.  
       
    35 
       
    36 #[Puneeth]: I don't like the way the term "parsing data" has been used, all
       
    37 through the script. See if that can be changed.
       
    38 
       
     39 Let us have a look at the problem.
       
    40 
       
    41 {{{ Show the slide containing problem statement. }}}
       
    42 
       
     43 There is an input file containing a huge number of records. Each record corresponds
       
    44 to a student.
       
    45 
       
    46 {{{ show the slide explaining record structure }}}
       
     47 As you can see, each record consists of fields separated by a ";". The first
       
     48 field is the region code, then roll number, then name, marks of second language,
       
     49 first language, maths, science and social, total marks, pass/fail indicated by P
       
     50 or F and finally W if withheld and empty otherwise.
       
    51 
       
    52 Our job is to calculate the mean of all the maths marks in the region "B".
       
    53 
       
    54 #[Nishanth]: Please note that I am not telling anything about AA since they do
       
    55              not know about any if/else yet.
       
    56 
       
    57 #[Puneeth]: Should we talk pass/fail etc? I think we should make the problem
       
    58  simple and leave out all the columns after total marks. 
       
    59 
       
     60 Now, what is parsing data?
       
    61 
       
    62 From the input file, we can see that the data we have is in the form of
       
    63 text. Parsing this data is all about reading it and converting it into a form
       
     64 which can be used for computations -- in our case, a sequence of numbers.
       
    65 
       
    66 #[Puneeth]: should the word tokenizing, be used? Should it be defined before
       
    67  using it?
       
    68 
       
     69 We can clearly see that the problem involves reading files and tokenizing, that is, splitting each line into its fields.
       
    70 
       
    71 #[Puneeth]: the sentence above seems kinda redundant. 
       
    72 
       
    73 Let us learn about tokenizing strings. Let us define a string first. Type
       
    74 ::
       
    75 
       
    76     line = "parse this           string"
       
    77 
       
    78 We are now going to split this string on whitespace.
       
    79 ::
       
    80 
       
    81     line.split()
       
    82 
       
    83 As you can see, we get a list of strings. Which means, when ``split`` is called
       
    84 without any arguments, it splits on whitespace. In simple words, all the spaces
       
    85 are treated as one big space.
       
    86 
       
     87 ``split`` can also split on a string of our choice. This is achieved by passing
       
     88 that string as an argument. But first, let us define a sample record from the file.
       
    89 ::
       
    90 
       
    91     record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
       
    92     record.split(';')
       
    93 
       
     94 We can see that the string is split on ';' and we get each field separately.
       
     95 We can also observe that empty strings appear at the end of the list since there
       
     96 are consecutive semicolons without anything in between.
       
    97 
       
    98 To recap, ``split`` splits on whitespace if called without an argument and
       
    99 splits on the given argument if it is called with an argument.
       
   100 
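The recap above can be checked with a short sketch. It reuses the ``line`` and ``record`` strings defined earlier; the names ``words`` and ``fields`` are only illustrative, and the ``print`` calls assume Python 3.

```python
line = "parse this           string"
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"

# Without an argument, split() treats any run of whitespace as one separator.
words = line.split()
print(words)        # ['parse', 'this', 'string']

# With an argument, split() cuts at every occurrence of that string.
fields = record.split(';')
print(fields[0])    # A
print(fields[5])    # 47
print(len(fields))  # 12
```

The last three items of ``fields`` are empty strings, one for each gap between the trailing semicolons and one after the last semicolon.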
       
   101 {{{ Pause here and try out the following exercises }}}
       
   102 
       
    103 %% 1 %% Split the variable ``line`` using a space as the argument. Is it the same as
       
    104         splitting without an argument?
       
   105 
       
   106 {{{ continue from paused state }}}
       
   107 
       
    108 We see that when we split on a space, consecutive spaces are not clubbed as one
       
    109 and there is an empty string every time there are two consecutive spaces.
       
   110 
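A small sketch of this difference, using a shortened string with only three spaces between the last two words so that the result is easy to read. The name ``short_line`` is an assumption made for this illustration, and the ``print`` calls assume Python 3.

```python
short_line = "parse this   string"   # three spaces before "string"

# Splitting on a single space yields an empty string for each extra space.
on_space = short_line.split(' ')
print(on_space)        # ['parse', 'this', '', '', 'string']

# Splitting without an argument clubs the spaces together.
on_whitespace = short_line.split()
print(on_whitespace)   # ['parse', 'this', 'string']
```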
       
   111 Now that we know how to split a string, we can split the record and retrieve
       
    112 each field separately. But there is one problem. The region code "B" and a "B"
       
   113 surrounded by whitespace are treated as two different regions. We must find a
       
   114 way to remove all the whitespace around a string so that "B" and a "B" with
       
    115 whitespace are treated as the same.
       
   116 
       
   117 This is possible by using the ``strip`` method of strings. Let us define a
       
   118 string by typing
       
   119 ::
       
   120 
       
   121     unstripped = "     B    "
       
   122     unstripped.strip()
       
   123 
       
    124 We can see that ``strip`` removes all the whitespace around the string.
       
   125 
       
   126 {{{ Pause here and try out the following exercises }}}
       
   127 
       
    128 %% 2 %% What happens to the whitespace inside the string when it is stripped?
       
   129 
       
   130 {{{ continue from paused state }}}
       
   131 
       
   132 Type
       
   133 ::
       
   134 
       
   135     a_str = "         white      space            "
       
   136     a_str.strip()
       
   137 
       
    138 We see that only the whitespace around the string is removed and anything
       
    139 inside remains unaffected.
       
   140 
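The behaviour of ``strip`` can be sketched as follows. The string ``padded`` is a shortened stand-in chosen for this illustration, and the ``print`` calls assume Python 3.

```python
unstripped = "     B    "
padded = "   white   space   "   # three spaces on each side and in the middle

# strip() returns a new string without the leading and trailing whitespace.
region = unstripped.strip()
print(repr(region))     # 'B'

# Whitespace between the words is left alone.
stripped = padded.strip()
print(repr(stripped))   # 'white   space'

# After stripping, the padded region code compares equal to a plain "B".
print(region == "B")    # True
```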
       
    141 By now we know enough to separate fields from the record and to strip out any
       
    142 whitespace. The only roadblock we now have is the conversion of a string to a float.
       
   143 
       
    144 The splitting and stripping operations are done on strings and their results are
       
    145 also strings. Hence the marks that we have are still strings and mathematical
       
   146 operations are not possible on them. We must convert them into numbers
       
   147 (integers or floats), before we can perform mathematical operations on them. 
       
   148 
       
   149 We shall look at converting strings into floats. We define a float string
       
   150 first. Type 
       
   151 ::
       
   152 
       
   153     mark_str = "1.25"
       
    154     mark = float(mark_str)
       
   155     type(mark_str)
       
   156     type(mark)
       
   157 
       
    158 We can see that the string is converted to a float. We can perform mathematical
       
    159 operations on it now.
       
   160 
       
   161 {{{ Pause here and try out the following exercises }}}
       
   162 
       
    163 %% 3 %% What happens if you do int("1.25")?
       
   164 
       
   165 {{{ continue from paused state }}}
       
   166 
       
    167 It raises an error since converting a float string into an integer directly is
       
   168 not possible. It involves an intermediate step of converting to float.
       
   169 ::
       
   170 
       
   171     dcml_str = "1.25"
       
   172     flt = float(dcml_str)
       
   173     flt
       
   174     number = int(flt)
       
   175     number
       
   176 
       
    177 Using ``int``, it is also possible to convert floats into integers.
       
   178 
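The conversions discussed above can be put together in one short sketch. It assumes Python 3 for the ``print`` calls; the name ``error`` is only illustrative.

```python
mark_str = "1.25"

# float() accepts a decimal string directly.
mark = float(mark_str)
print(mark + 1)   # 2.25

# int() rejects a decimal string and raises a ValueError.
try:
    int(mark_str)
except ValueError as error:
    print("int() failed:", error)

# Going through float first works; int() then drops the fractional part.
number = int(float(mark_str))
print(number)     # 1
```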
       
   179 Now that we have all the machinery required to parse the file, let us solve the
       
   180 problem. We first read the file line by line and parse each record. We see if
       
   181 the region code is B and store the marks accordingly.
       
   182 ::
       
   183 
       
   184     math_marks_B = [] # an empty list to store the marks
       
   185     for line in open("/home/fossee/sslc1.txt"):
       
   186         fields = line.split(";")
       
   187 
       
   188         region_code = fields[0]
       
   189         region_code_stripped = region_code.strip()
       
   190 
       
   191         math_mark_str = fields[5]
       
   192         math_mark = float(math_mark_str)
       
   193 
       
    194         if region_code_stripped == "B":
       
   195             math_marks_B.append(math_mark)
       
   196 
       
   197 
       
   198 Now we have all the maths marks of region "B" in the list math_marks_B.
       
   199 To get the mean, we just have to sum the marks and divide by the length.
       
   200 ::
       
   201 
       
    202     math_marks_mean = sum(math_marks_B) / len(math_marks_B)
       
    203     math_marks_mean
       
   204 
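Since the input file may not be at hand, the whole procedure can be tried on a few records held in a list. This is only a sketch: ``sample_lines`` is a made-up stand-in for the lines of the file, and the second and third records are invented for illustration. Unlike the loop in the script, it converts the mark only after the region check, so a non-numeric entry such as "AA" in another region is never passed to ``float``.

```python
# Made-up records in the same ";"-separated layout as the input file.
sample_lines = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n",
    " B ;015164;STUDENT ONE;070;065;80;75;68;358;P;;\n",
    "B;015165;STUDENT TWO;060;055;60;58;62;295;P;;\n",
]

math_marks_B = []  # maths marks of region "B"
for line in sample_lines:
    fields = line.split(";")
    region_code_stripped = fields[0].strip()
    if region_code_stripped == "B":
        math_marks_B.append(float(fields[5]))

math_marks_mean = sum(math_marks_B) / len(math_marks_B)
print(math_marks_mean)   # 70.0
```

The record whose region code is written as " B " is counted along with the plain "B" record because the code is stripped before the comparison.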
       
   205 {{{ Show summary slide }}}
       
   206 
       
   207 This brings us to the end of the tutorial.
       
    208 We have learnt
       
   209 
       
   210  * how to tokenize a string using various delimiters
       
    211  * how to get rid of extra whitespace around a string
       
   212  * how to convert from one type to another
       
   213  * how to parse input data and perform computations on it
       
   214 
       
   215 {{{ Show the "sponsored by FOSSEE" slide }}}
       
   216 
       
   217 #[Nishanth]: Will add this line after all of us fix on one.
       
    218 This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India.
       
   219 
       
    220 We hope you have enjoyed it and found it useful.
       
   221 Thank you
       
   222  
       
   223