parsing_data.rst
changeset 203 846d71a4e915
parent 197 97d859b70f51
child 219 901b78003917
202:069d4e86207e 203:846d71a4e915
       
     1 .. Author              : Nishanth
       
     2    Internal Reviewer 1 : 
       
     3    Internal Reviewer 2 : 
       
     4    External Reviewer   :
       
     5 
     1 Hello friends and welcome to the tutorial on Parsing Data
     6 Hello friends and welcome to the tutorial on Parsing Data
     2 
     7 
     3 {{{ Show the slide containing title }}}
     8 {{{ Show the slide containing title }}}
     4 
     9 
     5 {{{ Show the slide containing the outline slide }}}
    10 {{{ Show the slide containing the outline slide }}}
     6 
    11 
     7 In this tutorial, we shall learn
    12 In this tutorial, we shall learn
     8 
    13 
     9  * What is parsing data
    14  * What we mean by parsing data
    10  * the string operations required for parsing data
    15  * the string operations required for parsing data
    11  * datatype conversion
    16  * datatype conversion
    12 
    17 
       
    18 #[Puneeth]: Changed a few things, here.  
       
    19 
       
    20 #[Puneeth]: I don't like the way the term "parsing data" has been used, all
       
    21 through the script. See if that can be changed.
       
    22 
    13  Let us have a look at the problem
    23  Let us have a look at the problem
    14 
    24 
    15 {{{ Show the slide containing problem statement. }}}
    25 {{{ Show the slide containing problem statement. }}}
    16 
    26 
    17 There is an input file containing huge no.of records. Each record corresponds
    27 There is an input file containing a huge number of records. Each record corresponds
    18 to a student.
    28 to a student.
    19 
    29 
    20 {{{ show the slide explaining record structure }}}
    30 {{{ show the slide explaining record structure }}}
    21 As you can see, each record consists of fields separated by a ";". The first
    31 As you can see, each record consists of fields separated by a ";". The first
    22 field is the region code, then roll number, then name, marks of second language,
    32 field is the region code, then roll number, then name, marks of second language,
    26 Our job is to calculate the mean of all the maths marks in the region "B".
    36 Our job is to calculate the mean of all the maths marks in the region "B".
    27 
    37 
    28 #[Nishanth]: Please note that I am not telling anything about AA since they do
    38 #[Nishanth]: Please note that I am not telling anything about AA since they do
    29              not know about any if/else yet.
    39              not know about any if/else yet.
    30 
    40 
    31 
    41 #[Puneeth]: Should we talk pass/fail etc? I think we should make the problem
    32 So what exactly is parsing data?
    42  simple and leave out all the columns after total marks. 
    33 
    43 
    34 
    44 Now, what is parsing data?
    35 Parsing data is all about reading the data and converting it into a form which
    45 
    36 can be used for computations. In our case, that is numbers.
    46 From the input file, we can see that the data we have is in the form of
       
    47 text. Parsing this data is all about reading it and converting it into a form
       
    48 which can be used for computations -- in our case, a sequence of numbers.
       
    49 
       
    50 #[Puneeth]: should the word tokenizing, be used? Should it be defined before
       
    51  using it?
    37 
    52 
    38 We can clearly see that the problem involves reading files and tokenizing.
    53 We can clearly see that the problem involves reading files and tokenizing.
    39 
    54 
    40 .. #[[Amit:Definition of Tokenizing here.]]
    55 #[Puneeth]: the sentence above seems kinda redundant. 
       
    56 
    41 Let us learn about tokenizing strings. Let us define a string first. Type
    57 Let us learn about tokenizing strings. Let us define a string first. Type
    42 ::
    58 ::
    43 
    59 
    44     line = "parse this           string"
    60     line = "parse this           string"
    45 
    61 
    46 We are now going to split this string on whitespace.
    62 We are now going to split this string on whitespace.
    47 ::
    63 ::
    48 
    64 
    49     line.split()
    65     line.split()
    50 
    66 
    51 As you can see, we get a list of strings. Which means, when split is called
    67 As you can see, we get a list of strings. Which means, when ``split`` is called
    52 without any arguments, it splits on whitespace. In simple words, all the spaces
    68 without any arguments, it splits on whitespace. In simple words, all the spaces
    53 are treated as one big space.
    69 are treated as one big space.
    54 
    70 
    55 split also can split on a string of our choice. This is achieved by passing
    71 ``split`` also can split on a string of our choice. This is achieved by passing
    56 that as an argument. But first let us define a sample record from the file.
    72 that as an argument. But first let us define a sample record from the file.
    57 ::
    73 ::
    58 
    74 
    59     record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
    75     record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
    60     record.split(';')
    76     record.split(';')
    61 
    77 
    62 We can see that the string is split on ';' and we get each field separately.
    78 We can see that the string is split on ';' and we get each field separately.
    63 We can also observe that empty strings appear in the list, since the record ends
    79 We can also observe that empty strings appear in the list, since the record ends
    64 with semicolons that have nothing between or after them.
    80 with semicolons that have nothing between or after them.
    65 
    81 
    66 Hence split splits on whitespace if called without an argument and splits on
    82 To recap, ``split`` splits on whitespace if called without an argument and
    67 the given argument if it is called with an argument.
    83 splits on the given argument if it is called with an argument.
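The recap above can be checked with a short sketch. The strings ``line`` and ``record`` are the ones defined earlier in the script; the names ``words`` and ``fields`` are introduced here only for illustration.

```python
# Sample strings taken from the script above.
line = "parse this           string"
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"

# Without an argument, split treats any run of whitespace as one separator.
words = line.split()

# With an argument, split cuts at every occurrence of that string,
# so semicolons with nothing between them produce empty strings.
fields = record.split(';')

print(words)
print(fields)
```

Printing both lists side by side makes the difference between the two calls easy to see.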
    68 
    84 
    69 {{{ Pause here and try out the following exercises }}}
    85 {{{ Pause here and try out the following exercises }}}
    70 
    86 
    71 %% 1 %% split the variable line using a space as argument. Is it the same as
    87 %% 1 %% split the variable line using a space as argument. Is it the same as
    72         splitting without an argument?
    88         splitting without an argument?
    74 {{{ continue from paused state }}}
    90 {{{ continue from paused state }}}
    75 
    91 
    76 We see that when we split on a space, consecutive spaces are not merged into one
    92 We see that when we split on a space, consecutive spaces are not merged into one
    77 and there is an empty string every time there are two consecutive spaces.
    93 and there is an empty string every time there are two consecutive spaces.
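A minimal sketch of the exercise, assuming the same ``line`` string as before. The names ``merged``, ``spaced`` and ``non_empty`` are made up for this illustration, and the list comprehension is only one possible way to drop the empty strings.

```python
line = "parse this           string"

merged = line.split()      # runs of whitespace act as a single separator
spaced = line.split(' ')   # every single space is a separator

print(merged)
print(spaced)

# Dropping the empty strings from the second result recovers the first.
non_empty = [piece for piece in spaced if piece != '']
print(non_empty == merged)
```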
    78 
    94 
    79 Now that we know how to split a string, we can split the record and retreive each
    95 Now that we know how to split a string, we can split the record and retrieve
    80 field separately. But there is one problem. The region code "B" and a "B"
    96 each field separately. But there is one problem. The region code "B" and a "B"
    81 surrounded by whitespace are treated as two different regions. We must find a
    97 surrounded by whitespace are treated as two different regions. We must find a
    82 way to remove all the whitespace around a string so that "B" and a "B" with
    98 way to remove all the whitespace around a string so that "B" and a "B" with
    83 whitespace around it are treated as the same.
    99 whitespace around it are treated as the same.
    84 
   100 
    85 This is possible by using the =strip= method of strings. Let us define a
   101 This is possible by using the ``strip`` method of strings. Let us define a
    86 string by typing
   102 string by typing
    87 ::
   103 ::
    88 
   104 
    89     unstripped = "     B    "
   105     unstripped = "     B    "
    90     unstripped.strip()
   106     unstripped.strip()
   108 
   124 
   109 By now we know enough to separate fields from the record and to strip out any
   125 By now we know enough to separate fields from the record and to strip out any
   110 whitespace. The only roadblock we now have is conversion of string to float.
   126 whitespace. The only roadblock we now have is conversion of string to float.
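As a small sketch of how splitting and stripping combine, consider a record whose region code is padded with spaces. The record below is made up for illustration and is not taken from the input file; ``raw_code`` is likewise a hypothetical name.

```python
# Hypothetical record: the region code "B" is surrounded by whitespace.
padded_record = "  B  ;015163;JOSEPH RAJ S;083;042;47"

fields = padded_record.split(';')
raw_code = fields[0]            # still carries the surrounding spaces
region_code = raw_code.strip()  # whitespace removed from both ends

print(repr(raw_code))
print(repr(region_code))
print(region_code == "B")
```

Using ``repr`` when printing makes the leading and trailing spaces visible.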
   111 
   127 
   112 The splitting and stripping operations are done on a string and their result is
   128 The splitting and stripping operations are done on a string and their result is
   113 also a string, hence the marks that we have are still strings and mathematical
   129 also a string. Hence the marks that we have are still strings and mathematical
   114 operations on them are not possible. We must convert them into integers or floats
   130 operations are not possible on them. We must convert them into numbers
   115 
   131 (integers or floats), before we can perform mathematical operations on them. 
   116 We shall look at converting strings into floats. We define an float string
   132 
   117 first. Type
   133 We shall look at converting strings into floats. We define a float string
       
   134 first. Type 
   118 ::
   135 ::
   119 
   136 
   120     mark_str = "1.25"
   137     mark_str = "1.25"
   121     mark = float(mark_str)
   138     mark = float(mark_str)
   122     type(mark_str)
   139     type(mark_str)
   123     type(mark)
   140     type(mark)
   124 
   141 
   125 We can see that the string is converted to a float. We can perform mathematical
   142 We can see that the string is converted to a float. We can perform mathematical
   126 operations on it now.
   143 operations on them now.
   127 
   144 
   128 {{{ Pause here and try out the following exercises }}}
   145 {{{ Pause here and try out the following exercises }}}
   129 
   146 
   130 %% 3 %% What happens if you do int("1.25")?
   147 %% 3 %% What happens if you do int("1.25")?
   131 
   148 
   132 {{{ continue from paused state }}}
   149 {{{ continue from paused state }}}
   133 
   150 
   134 .. #[[Amit:I think there should be some interaction first here about the
       
   135 problem before we conclude to talking about the result.]]
       
   136 It raises an error since converting a float string into integer directly is
   151 It raises an error since converting a float string into integer directly is
   137 not possible. It involves an intermediate step of converting to float.
   152 not possible. It involves an intermediate step of converting to float.
   138 ::
   153 ::
   139 
   154 
   140     dcml_str = "1.25"
   155     dcml_str = "1.25"
   141     flt = float(dcml_str)
   156     flt = float(dcml_str)
   142     flt
   157     flt
   143     number = int(flt)
   158     number = int(flt)
   144     number
   159     number
   145 
   160 
   146 Using =int= it is possible to convert floats into integers.
   161 Using ``int`` it is also possible to convert floats into integers.
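The two-step conversion can be put together in one sketch. It uses ``try``/``except``, which this tutorial has not introduced, purely so that the failed conversion does not stop the example; the flag ``direct_ok`` is a hypothetical name.

```python
dcml_str = "1.25"

# int() refuses a string that contains a decimal point.
try:
    int(dcml_str)
    direct_ok = True
except ValueError:
    direct_ok = False

# Going through float first works; int() then drops the fractional part.
flt = float(dcml_str)
number = int(flt)

print(direct_ok)
print(flt)
print(number)
```

Note that ``int`` discards the fractional part rather than rounding to the nearest integer.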
   147 
   162 
   148 Now that we have all the machinery required to parse the file, let us solve the
   163 Now that we have all the machinery required to parse the file, let us solve the
   149 problem. We first read the file line by line and parse each record. We see if
   164 problem. We first read the file line by line and parse each record. We see if
   150 the region code is B and store the marks accordingly.
   165 the region code is B and store the marks accordingly.
   151 ::
   166 ::
   160         math_mark_str = fields[5]
   175         math_mark_str = fields[5]
   161         math_mark = float(math_mark_str)
   176         math_mark = float(math_mark_str)
   162 
   177 
   163         if region_code == "B":
   178         if region_code == "B":
   164             math_marks_B.append(math_mark)
   179             math_marks_B.append(math_mark)
   165 .. #[[Amit:This intutively does not seem to be what you wanted]]
   180 
   166 
   181 
   167 Now we have all the maths marks of region "B" in the list math_marks_B.
   182 Now we have all the maths marks of region "B" in the list math_marks_B.
   168 To get the mean, we just have to sum the marks and divide by the length.
   183 To get the mean, we just have to sum the marks and divide by the length.
   169 ::
   184 ::
   170 
   185 
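The whole procedure described above can be sketched end to end. This is only an illustration under stated assumptions: the records are held in a list instead of being read from the input file, the second and third records are invented, and ``math_marks_mean`` is a hypothetical name. The position of the maths marks (``fields[5]``) and the list name ``math_marks_B`` follow the script.

```python
# Hypothetical records in the format described above; a real run would
# read these lines from the input file instead.
records = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;",
    "B;015164;SAMPLE STUDENT ONE;070;055;60;58;65;308;;;",
    " B ;015165;SAMPLE STUDENT TWO;081;049;80;62;71;343;;;",
]

math_marks_B = []
for record in records:
    fields = record.split(';')
    region_code = fields[0].strip()
    math_mark = float(fields[5].strip())
    if region_code == "B":
        math_marks_B.append(math_mark)

# The marks are floats, so this is true division in Python 2 as well.
math_marks_mean = sum(math_marks_B) / len(math_marks_B)
print(math_marks_mean)
```

The third record shows why the region code is stripped before the comparison: without ``strip``, its marks would be left out of the mean.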
   177 we have learnt
   192 we have learnt
   178 
   193 
   179  * how to tokenize a string using various delimiters
   194  * how to tokenize a string using various delimiters
   180  * how to get rid of extra whitespace around a string
   195  * how to get rid of extra whitespace around a string
   181  * how to convert from one type to another
   196  * how to convert from one type to another
   182 .. #[[Amit:one datatype to another may be better.]]
       
   183  * how to parse input data and perform computations on it
   197  * how to parse input data and perform computations on it
   184 
   198 
   185 {{{ Show the "sponsored by FOSSEE" slide }}}
   199 {{{ Show the "sponsored by FOSSEE" slide }}}
   186 
   200 
   187 #[Nishanth]: Will add this line after all of us fix on one.
   201 #[Nishanth]: Will add this line after all of us fix on one.
   188 This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India
   202 This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India
   189 
   203 
   190 Hope you have enjoyed it and found it useful.
   204 Hope you have enjoyed it and found it useful.
   191 Thankyou
   205 Thank you
   192  
   206  
   193 .. Author              : Nishanth
       
   194    Internal Reviewer 1 : Amit Sethi 
       
   195    Internal Reviewer 2 : 
       
   196    External Reviewer   :