parsing_data.rst
changeset 179 1d04b6c5ff44
parent 140 bc023595e167
child 197 97d859b70f51
178:4c7b906e0d21 179:1d04b6c5ff44
    27 
    27 
    28 #[Nishanth]: Please note that I am not telling anything about AA since they do
    28 #[Nishanth]: Please note that I am not telling anything about AA since they do
    29              not know about any if/else yet.
    29              not know about any if/else yet.
    30 
    30 
    31 
    31 
    32 Now what is parsing data.
    32 So what exactly is parsing data?
    33 
    33 
    34 From the input file, we can see that there is data in the form of text. Hence
    34 
    35 parsing data is all about reading the data and converting it into a form which
    35 Parsing data is all about reading the data and converting it into a form which
    36 can be used for computations. In our case, that is numbers.
    36 can be used for computations. In our case, that is numbers.
    37 
    37 
    38 We can clearly see that the problem involves reading files and tokenizing.
    38 We can clearly see that the problem involves reading files and tokenizing.
    39 
    39 
       
    40 .. #[[Amit:Definition of Tokenizing here.]]
    40 Let us learn about tokenizing strings. Let us define a string first. Type
    41 Let us learn about tokenizing strings. Let us define a string first. Type
    41 ::
    42 ::
    42 
    43 
    43     line = "parse this           string"
    44     line = "parse this           string"
    44 
    45 
    73 {{{ continue from paused state }}}
    74 {{{ continue from paused state }}}
    74 
    75 
    75 We see that when we split on space, multiple whitespaces are not clubbed as one
    76 We see that when we split on space, multiple whitespaces are not clubbed as one
    76 and there is an empty string every time there are two consecutive spaces.
    77 and there is an empty string every time there are two consecutive spaces.
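
A minimal sketch of this behaviour, reusing the ``line`` string defined earlier. The names ``by_space`` and ``by_whitespace`` are illustrative only, and the no-argument form of ``split`` is shown here as a contrast.

```python
line = "parse this           string"

# Splitting on a single space keeps an empty string between
# every pair of consecutive spaces.
by_space = line.split(' ')

# Calling split() with no argument treats any run of whitespace
# as one separator, so no empty strings appear.
by_whitespace = line.split()

print(by_space)
print(by_whitespace)
```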
    77 
    78 
    78 Now that we know splitting a string, we can split the record and retrieve each
    79 Now that we know how to split a string, we can split the record and retrieve each
    79 field separately. But there is one problem. The region code "B" and a "B"
    80 field separately. But there is one problem. The region code "B" and a "B"
    80 surrounded by whitespace are treated as two different regions. We must find a
    81 surrounded by whitespace are treated as two different regions. We must find a
    81 way to remove all the whitespace around a string so that "B" and a "B" with
    82 way to remove all the whitespace around a string so that "B" and a "B" with
    82 whitespace are treated as the same.
    83 whitespace are treated as the same.
    83 
    84 
   107 
   108 
   108 By now we know enough to separate fields from the record and to strip out any
   109 By now we know enough to separate fields from the record and to strip out any
   109 white space. The only roadblock we now have is conversion of string to float.
   110 white space. The only roadblock we now have is conversion of string to float.
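
The splitting and stripping steps can be sketched together as follows. The record below is made-up sample data, and the ``;`` delimiter and field layout are assumptions for illustration. Only the use of index 5 for the maths mark follows the code later in this script.

```python
# Hypothetical record: the ';' delimiter and the field layout are assumptions.
record = " B ;015163;SOME NAME;083;042;47"

# Split the record into its fields.
fields = record.split(';')

# Strip surrounding whitespace so that " B " and "B" compare equal.
region_code = fields[0].strip()
math_mark_str = fields[5].strip()

print(region_code)
print(math_mark_str)
```

Note that ``math_mark_str`` is still a string at this point.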
   110 
   111 
   111 The splitting and stripping operations are done on a string and their result is
   112 The splitting and stripping operations are done on a string and their result is
   112 also a string. hence the marks that we have are still strings and mathematical
   113 also a string, hence the marks that we have are still strings and mathematical
   113 operations are not possible. We must convert them into integers or floats.
   114 operations on them are not possible. We must convert them into integers or floats.
   114 
   115 
   115 We shall look at converting strings into floats. We define a float string
   116 We shall look at converting strings into floats. We define a float string
   116 first. Type
   117 first. Type
   117 ::
   118 ::
   118 
   119 
   119     mark_str = "1.25"
   120     mark_str = "1.25"
   120     mark = int(mark_str)
   121     mark = float(mark_str)
   121     type(mark_str)
   122     type(mark_str)
   122     type(mark)
   123     type(mark)
   123 
   124 
   124 We can see that the string is converted to a float. We can perform mathematical
   125 We can see that the string is converted to a float. We can perform mathematical
   125 operations on them now.
   126 operations on it now.
   126 
   127 
   127 {{{ Pause here and try out the following exercises }}}
   128 {{{ Pause here and try out the following exercises }}}
   128 
   129 
   129 %% 3 %% What happens if you do int("1.25")?
   130 %% 3 %% What happens if you do int("1.25")?
   130 
   131 
   131 {{{ continue from paused state }}}
   132 {{{ continue from paused state }}}
   132 
   133 
       
   134 .. #[[Amit:I think there should be some interaction first here about the
       
   135 problem before we conclude to talking about the result.]]
   133 It raises an error since converting a float string into integer directly is
   136 It raises an error since converting a float string into integer directly is
   134 not possible. It involves an intermediate step of converting to float.
   137 not possible. It involves an intermediate step of converting to float.
   135 ::
   138 ::
   136 
   139 
   137     dcml_str = "1.25"
   140     dcml_str = "1.25"
   138     flt = float(dcml_str)
   141     flt = float(dcml_str)
   139     flt
   142     flt
   140     number = int(flt)
   143     number = int(flt)
   141     number
   144     number
   142 
   145 
   143 Using ``int`` it is also possible to convert floats into integers.
   146 Using ``int`` it is possible to convert floats into integers.
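
The two conversions can be compared side by side in a short sketch. The ``try``/``except`` construct is not covered in this script and is used here only so that the example runs to completion. The name ``raised`` is illustrative.

```python
dcml_str = "1.25"

# float() accepts a decimal string directly.
flt = float(dcml_str)

# int() on a float drops the fractional part.
number = int(flt)

# int() on the decimal string itself raises a ValueError.
try:
    int(dcml_str)
    raised = False
except ValueError:
    raised = True

print(flt)
print(number)
print(raised)
```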
   144 
   147 
   145 Now that we have all the machinery required to parse the file, let us solve the
   148 Now that we have all the machinery required to parse the file, let us solve the
   146 problem. We first read the file line by line and parse each record. We see if
   149 problem. We first read the file line by line and parse each record. We see if
   147 the region code is B and store the marks accordingly.
   150 the region code is B and store the marks accordingly.
   148 ::
   151 ::
   157         math_mark_str = fields[5]
   160         math_mark_str = fields[5]
   158         math_mark = float(math_mark_str)
   161         math_mark = float(math_mark_str)
   159 
   162 
   160         if region_code == "AA":
   163         if region_code == "AA":
   161             math_marks_B.append(math_mark)
   164             math_marks_B.append(math_mark)
   162 
   165 .. #[[Amit:This intuitively does not seem to be what you wanted]]
   163 
   166 
   164 Now we have all the maths marks of region "B" in the list math_marks_B.
   167 Now we have all the maths marks of region "B" in the list math_marks_B.
   165 To get the mean, we just have to sum the marks and divide by the length.
   168 To get the mean, we just have to sum the marks and divide by the length.
   166 ::
   169 ::
   167 
   170 
   174 we have learnt
   177 we have learnt
   175 
   178 
   176  * how to tokenize a string using various delimiters
   179  * how to tokenize a string using various delimiters
   177  * how to get rid of extra white space around a string
   180  * how to get rid of extra white space around a string
   178  * how to convert from one type to another
   181  * how to convert from one type to another
       
   182 .. #[[Amit:one datatype to another may be better.]]
   179  * how to parse input data and perform computations on it
   183  * how to parse input data and perform computations on it
   180 
   184 
   181 {{{ Show the "sponsored by FOSSEE" slide }}}
   185 {{{ Show the "sponsored by FOSSEE" slide }}}
   182 
   186 
   183 #[Nishanth]: Will add this line after all of us fix on one.
   187 #[Nishanth]: Will add this line after all of us fix on one.
   185 
   189 
   186 Hope you have enjoyed and found it useful.
   190 Hope you have enjoyed and found it useful.
   187 Thank you
   191 Thank you
   188  
   192  
   189 .. Author              : Nishanth
   193 .. Author              : Nishanth
   190    Internal Reviewer 1 : 
   194    Internal Reviewer 1 : Amit Sethi 
   191    Internal Reviewer 2 : 
   195    Internal Reviewer 2 : 
   192    External Reviewer   :
   196    External Reviewer   :