parsing_data.rst
     1 .. Author              : Nishanth
       
     2    Internal Reviewer 1 : 
       
     3    Internal Reviewer 2 : 
       
     4    External Reviewer   :
       
     5 
       
     6 Hello friends and welcome to the tutorial on Parsing Data
       
     7 
       
     8 {{{ Show the slide containing title }}}
       
     9 
       
    10 {{{ Show the slide containing the outline slide }}}
       
    11 
       
    12 In this tutorial, we shall learn
       
    13 
       
    14  * What we mean by parsing data
       
    15  * the string operations required for parsing data
       
    16  * datatype conversion
       
    17 
       
    18 #[Puneeth]: Changed a few things, here.  
       
    19 
       
    20 #[Puneeth]: I don't like the way the term "parsing data" has been used, all
       
    21 through the script. See if that can be changed.
       
    22 
       
    23 Let us have a look at the problem.
       
    24 
       
    25 {{{ Show the slide containing problem statement. }}}
       
    26 
       
    27 There is an input file containing a large number of records. Each record corresponds
       
    28 to a student.
       
    29 
       
    30 {{{ show the slide explaining record structure }}}
       
    31 As you can see, each record consists of fields separated by a ";". The first
       
    32 field is the region code, then roll number, then name, marks of second language,
       
    33 first language, maths, science and social, total marks, pass/fail indicated by P
       
    34 or F and finally W if withheld and empty otherwise.
       
    35 
       
    36 Our job is to calculate the mean of all the maths marks in the region "B".
       
    37 
       
    38 #[Nishanth]: Please note that I am not telling anything about AA since they do
       
    39              not know about any if/else yet.
       
    40 
       
    41 #[Puneeth]: Should we talk pass/fail etc? I think we should make the problem
       
    42  simple and leave out all the columns after total marks. 
       
    43 
       
    44 Now, what is parsing data?
       
    45 
       
    46 From the input file, we can see that the data we have is in the form of
       
    47 text. Parsing this data is all about reading it and converting it into a form
       
    48 which can be used for computations -- in our case, a sequence of numbers.
       
    49 
       
    50 #[Puneeth]: should the word tokenizing, be used? Should it be defined before
       
    51  using it?
       
    52 
       
    53 We can clearly see that the problem involves reading files and tokenizing.
       
    54 
       
    55 #[Puneeth]: the sentence above seems kinda redundant. 
       
    56 
       
    57 Let us learn about tokenizing strings. Let us define a string first. Type
       
    58 ::
       
    59 
       
    60     line = "parse this           string"
       
    61 
       
    62 We are now going to split this string on whitespace.
       
    63 ::
       
    64 
       
    65     line.split()
       
    66 
       
    67 As you can see, we get a list of strings. This means that when ``split`` is called
       
    68 without any arguments, it splits on whitespace. In simple words, all the spaces
       
    69 are treated as one big space.
       
    70 
       
    71 ``split`` can also split on a string of our choice. This is achieved by passing
       
    72 that string as an argument. But first, let us define a sample record from the file.
       
    73 ::
       
    74 
       
    75     record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
       
    76     record.split(';')
       
    77 
       
    78 We can see that the string is split on ';' and we get each field separately.
       
    79 We can also observe that empty strings appear in the list wherever there is
       
    80 nothing between two semicolons, or after the last one.
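The empty fields are easier to spot when each field is shown next to its position in the list. Here is a minimal sketch that reuses the sample record from above; the names ``fields`` and ``empty_count`` are only illustrative.

```python
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
fields = record.split(';')

# Show every field together with its index in the list.
# repr() makes the empty strings visible as ''.
for index, field in enumerate(fields):
    print("%d %r" % (index, field))

# Count how many of the fields are empty strings.
empty_count = fields.count('')
print(empty_count)
```

The maths mark sits at index 5, which is the index used later when we solve the problem.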
       
    81 
       
    82 To recap, ``split`` splits on whitespace if called without an argument and
       
    83 splits on the given argument if it is called with an argument.
       
    84 
       
    85 {{{ Pause here and try out the following exercises }}}
       
    86 
       
    87 %% 1 %% Split the variable ``line`` using a space as argument. Is it the same as
       
    88         splitting without an argument?
       
    89 
       
    90 {{{ continue from paused state }}}
       
    91 
       
    92 We see that when we split on a space, multiple spaces are not clubbed as one
       
    93 and there is an empty string every time there are two consecutive spaces.
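The difference between the two calls can be checked side by side. This is a small sketch using the same ``line`` as before; ``on_whitespace`` and ``on_space`` are names made up for this illustration.

```python
line = "parse this           string"

# No argument: any run of whitespace counts as a single separator.
on_whitespace = line.split()

# A single space as argument: every space is a separator, so each pair of
# consecutive spaces yields an empty string in the result.
on_space = line.split(' ')

print(on_whitespace)
print(on_space)
print(len(on_whitespace))
print(len(on_space))
```

The second list is longer only because of the empty strings; the words themselves are the same in both.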
       
    94 
       
    95 Now that we know how to split a string, we can split the record and retrieve
       
    96 each field separately. But there is one problem. The region code "B" and a "B"
       
    97 surrounded by whitespace are treated as two different regions. We must find a
       
    98 way to remove all the whitespace around a string so that "B" and a "B" with
       
    99 whitespace around it are treated as the same.
       
   100 
       
   101 This is possible by using the ``strip`` method of strings. Let us define a
       
   102 string by typing
       
   103 ::
       
   104 
       
   105     unstripped = "     B    "
       
   106     unstripped.strip()
       
   107 
       
   108 We can see that ``strip`` removes all the whitespace around the sentence.
       
   109 
       
   110 {{{ Pause here and try out the following exercises }}}
       
   111 
       
   112 %% 2 %% What happens to the whitespace inside the sentence when it is stripped?
       
   113 
       
   114 {{{ continue from paused state }}}
       
   115 
       
   116 Type
       
   117 ::
       
   118 
       
   119     a_str = "         white      space            "
       
   120     a_str.strip()
       
   121 
       
   122 We see that only the whitespace around the sentence is removed and anything
       
   123 inside remains unaffected.
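To make the effect visible, we can print the result with ``repr`` so that the quotes show exactly where the string begins and ends. The sketch below also shows one possible way to collapse the inner whitespace, by combining ``split`` with ``join``; that second step is an addition for illustration and is not needed for our problem.

```python
a_str = "         white      space            "

# strip() only touches the two ends of the string.
stripped = a_str.strip()
print(repr(stripped))

# One way to also squeeze the inner whitespace: split on whitespace,
# then join the pieces back with a single space.
collapsed = " ".join(a_str.split())
print(repr(collapsed))
```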
       
   124 
       
   125 By now we know enough to separate fields from the record and to strip out any
       
   126 whitespace. The only roadblock we now have is conversion of string to float.
       
   127 
       
   128 The splitting and stripping operations are done on a string and their result is
       
   129 also a string. Hence the marks that we have are still strings and mathematical
       
   130 operations are not possible on them. We must convert them into numbers
       
   131 (integers or floats), before we can perform mathematical operations on them. 
       
   132 
       
   133 We shall look at converting strings into floats. We define a float string
       
   134 first. Type 
       
   135 ::
       
   136 
       
   137     mark_str = "1.25"
       
   138     mark = float(mark_str)
       
   139     type(mark_str)
       
   140     type(mark)
       
   141 
       
   142 We can see that the string is converted to a float. We can perform mathematical
       
   143 operations on it now.
       
   144 
       
   145 {{{ Pause here and try out the following exercises }}}
       
   146 
       
   147 %% 3 %% What happens if you do int("1.25")
       
   148 
       
   149 {{{ continue from paused state }}}
       
   150 
       
   151 It raises an error since converting a float string into an integer directly is
       
   152 not possible. It involves an intermediate step of converting to float.
       
   153 ::
       
   154 
       
   155     dcml_str = "1.25"
       
   156     flt = float(dcml_str)
       
   157     flt
       
   158     number = int(flt)
       
   159     number
       
   160 
       
   161 Using ``int`` it is also possible to convert floats into integers.
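It is worth noting that ``int`` drops the fractional part instead of rounding. The short sketch below illustrates this with values chosen only for demonstration; ``round`` is a Python built-in that is not covered in this tutorial.

```python
flt = float("1.99")

# int() simply discards everything after the decimal point.
truncated = int(flt)

# To get the nearest integer instead, round first and then convert.
rounded = int(round(flt))

# For negative numbers the fractional part is discarded as well,
# so the result moves toward zero.
negative = int(float("-1.99"))

print(truncated)
print(rounded)
print(negative)
```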
       
   162 
       
   163 Now that we have all the machinery required to parse the file, let us solve the
       
   164 problem. We first read the file line by line and parse each record. We see if
       
   165 the region code is B and store the marks accordingly.
       
   166 ::
       
   167 
       
   168     math_marks_B = [] # an empty list to store the marks
       
   169     for line in open("/home/fossee/sslc1.txt"):
       
   170         fields = line.split(";")
       
   171 
       
   172         region_code = fields[0]
       
   173         region_code_stripped = region_code.strip()
       
   174 
       
   175         math_mark_str = fields[5]
       
   176         math_mark = float(math_mark_str)
       
   177 
       
   178         if region_code_stripped == "B":
       
   179             math_marks_B.append(math_mark)
       
   180 
       
   181 
       
   182 Now we have all the maths marks of region "B" in the list ``math_marks_B``.
       
   183 To get the mean, we just have to sum the marks and divide by the length.
       
   184 ::
       
   185 
       
   186     math_marks_mean = sum(math_marks_B) / len(math_marks_B)
       
   187     math_marks_mean
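The whole procedure can be tried without the input file by working on a few lines held in a list. The sketch below assumes made-up records: only the first line is the sample record from this tutorial, the other two (names and marks) are hypothetical, and ``sample_lines`` stands in for the opened file.

```python
# Stand-in for the file: the first record is the sample from the tutorial,
# the other two are hypothetical records invented for this illustration.
sample_lines = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n",
    " B ;015164;SAMPLE NAME ONE;060;055;80;61;70;326;P;;\n",
    "B;015165;SAMPLE NAME TWO;071;064;90;58;66;349;P;;\n",
]

math_marks_B = []  # an empty list to store the marks
for line in sample_lines:
    fields = line.split(";")

    # Strip the region code so that " B " and "B" are treated alike.
    region_code = fields[0].strip()

    if region_code == "B":
        # Convert only the marks we actually need.
        math_marks_B.append(float(fields[5].strip()))

math_marks_mean = sum(math_marks_B) / len(math_marks_B)
print(math_marks_mean)
```

Here the conversion to float is done inside the ``if`` block, so marks belonging to other regions are never converted. This is a design choice of the sketch, not something the original solution does.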
       
   188 
       
   189 {{{ Show summary slide }}}
       
   190 
       
   191 This brings us to the end of the tutorial.
       
   192 We have learnt
       
   193 
       
   194  * how to tokenize a string using various delimiters
       
   195  * how to get rid of extra whitespace around a string
       
   196  * how to convert from one type to another
       
   197  * how to parse input data and perform computations on it
       
   198 
       
   199 {{{ Show the "sponsored by FOSSEE" slide }}}
       
   200 
       
   201 #[Nishanth]: Will add this line after all of us fix on one.
       
   202 This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India.
       
   203 
       
   204 Hope you have enjoyed it and found it useful.
       
   205 Thank you
       
   206  
       
   207 Questions
       
   208 =========
       
   209 
       
   210  1. How do you split the string "Guido;Rossum;Python" to get the words?
       
   211 
       
   212    Answer: line.split(';')
       
   213 
       
   214  2. line.split() and line.split(' ') are same
       
   215 
       
   216    a. True
       
   217    #. False
       
   218 
       
   219    Answer: False
       
   220 
       
   221  3. What is the output of the following code::
       
   222 
       
   223       line = "Hello;;;World;;"
       
   224       sub_strs = line.split()
       
   225       print len(sub_strs)
       
   226 
       
   227     Answer: 1
       
   228 
       
   229  4. What is the output of "      Hello    World    ".strip()
       
   230 
       
   231    a. "Hello World"
       
   232    #. "Hello    World"
       
   233    #. "      Hello World"
       
   234    #. "Hello World     "
       
   235    
       
   236    Answer: "Hello    World"
       
   237 
       
   238  5. What does "It is a cold night".strip("It") produce
       
   239     Hint: Read the documentation of strip
       
   240 
       
   241    a. "is a cold night"
       
   242    #. " is a cold nigh" 
       
   243    #. "It is a cold nigh"
       
   244    #. "is a cold nigh"
       
   245 
       
   246    Answer: " is a cold nigh"
       
   247 
       
   248  6. What does int("20") produce
       
   249 
       
   250    a. "20"
       
   251    #. 20.0
       
   252    #. 20
       
   253    #. Error
       
   254 
       
   255    Answer: 20
       
   256 
       
   257  7. What does int("20.0") produce
       
   258 
       
   259    a. 20
       
   260    #. 20.0
       
   261    #. Error
       
   262    #. "20"
       
   263 
       
   264    Answer: Error
       
   265 
       
   266  8. What is the value of float(3/2)
       
   267 
       
   268    a. 1.0
       
   269    #. 1.5
       
   270    #. 1
       
   271    #. Error
       
   272 
       
   273    Answer: 1.0
       
   274 
       
   275  9. What does float("3/2") produce
       
   276 
       
   277    a. 1.0
       
   278    #. 1.5
       
   279    #. 1
       
   280    #. Error
       
   281 
       
   282    Answer: Error
       
   283    
       
   284  10. See if there is a function available in pylab to calculate the mean
       
   285      Hint: Use tab completion
       
   286 
       
   287