parsing_data.rst
changeset 137 fc545d07b0ff
parent 134 543c1cc488ca
child 140 bc023595e167
136:7f8b6a9fb61d 137:fc545d07b0ff
    35 parsing data is all about reading the data and converting it into a form which
    35 parsing data is all about reading the data and converting it into a form which
    36 can be used for computations. In our case, that is numbers.
    36 can be used for computations. In our case, that is numbers.
    37 
    37 
    38 We can clearly see that the problem involves reading files and tokenizing.
    38 We can clearly see that the problem involves reading files and tokenizing.
    39 
    39 
    40 Let us learn about tokenizing strings. Let us define a string first. Type::
    40 Let us learn about tokenizing strings. Let us define a string first. Type
       
    41 ::
    41 
    42 
    42     line = "parse this           string"
    43     line = "parse this           string"
    43 
    44 
    44 We are now going to split this string on whitespace.::
    45 We are now going to split this string on whitespace.
       
    46 ::
    45 
    47 
    46     line.split()
    48     line.split()
    47 
    49 
    48 As you can see, we get a list of strings. This means that when split is called
    50 As you can see, we get a list of strings. This means that when split is called
    49 without any arguments, it splits on whitespace. In simple words, all the spaces
    51 without any arguments, it splits on whitespace. In simple words, all the spaces
    50 are treated as one big space.
    52 are treated as one big space.
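
As a minimal sketch of this behaviour (the names ``words`` and ``pieces`` are illustrative and not part of the original script), the snippet below contrasts calling split with no arguments against passing a single space as the separator:

```python
line = "parse this           string"

# No argument: any run of whitespace acts as one separator.
words = line.split()
print(words)

# Explicit single-space separator: every space counts on its own,
# so consecutive spaces produce empty strings in the result.
pieces = line.split(" ")
print("" in pieces)
```

With an explicit separator the empty strings would have to be filtered out by hand, which is why the argument-free form is convenient for whitespace-separated data.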
    51 
    53 
    52 split can also split on a string of our choice. This is achieved by passing
    54 split can also split on a string of our choice. This is achieved by passing
    53 that as an argument. But first let's define a sample record from the file.::
    55 that as an argument. But first let's define a sample record from the file.
       
    56 ::
    54 
    57 
    55     record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
    58     record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
    56     record.split(';')
    59     record.split(';')
    57 
    60 
    58 We can see that the string is split on ';' and we get each field separately.
    61 We can see that the string is split on ';' and we get each field separately.
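
The following sketch shows what the split on ';' returns for the sample record. The name ``fields`` is an assumption made for illustration, and no meaning is attached to the individual positions here:

```python
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"

# Split on the literal ';' character instead of whitespace.
fields = record.split(';')

print(fields[0])
print(fields[2])

# Unlike whitespace splitting, an explicit separator keeps empty
# fields, so the trailing ';;;' yields empty strings at the end.
print(fields[-1] == "")
```

Note that the spaces inside "JOSEPH RAJ S" are preserved, since only ';' is treated as a separator.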
    77 surrounded by whitespace are treated as two different regions. We must find a
    80 surrounded by whitespace are treated as two different regions. We must find a
    78 way to remove all the whitespace around a string so that "B" and a "B" with
    81 way to remove all the whitespace around a string so that "B" and a "B" with
    79 whitespace are treated as the same.
    82 whitespace are treated as the same.
    80 
    83 
    81 This is possible by using the ``strip`` method of strings. Let us define a
    84 This is possible by using the ``strip`` method of strings. Let us define a
    82 string by typing::
    85 string by typing
       
    86 ::
    83 
    87 
    84     unstripped = "     B    "
    88     unstripped = "     B    "
    85     unstripped.strip()
    89     unstripped.strip()
    86 
    90 
    87 We can see that strip removes all the whitespace around the sentence
    91 We can see that strip removes all the whitespace around the sentence
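
A small sketch of why stripping matters when comparing region codes; the variable names ``matches_raw``, ``matches_stripped`` and ``inner`` are hypothetical and used only for this illustration:

```python
unstripped = "     B    "

# A direct comparison fails because of the surrounding whitespace.
matches_raw = (unstripped == "B")

# After stripping, the comparison succeeds.
matches_stripped = (unstripped.strip() == "B")

# strip only touches the two ends; whitespace in the middle stays.
inner = "   white   space   ".strip()

print(matches_raw, matches_stripped)
print(repr(inner))
```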
    90 
    94 
    91 %% 2 %% What happens to the white space inside the sentence when it is stripped
    95 %% 2 %% What happens to the white space inside the sentence when it is stripped
    92 
    96 
    93 {{{ continue from paused state }}}
    97 {{{ continue from paused state }}}
    94 
    98 
    95 Type::
    99 Type
       
   100 ::
    96 
   101 
    97     a_str = "         white      space            "
   102     a_str = "         white      space            "
    98     a_str.strip()
   103     a_str.strip()
    99 
   104 
   100 We see that only the whitespace around the sentence is removed and anything
   105 We see that only the whitespace around the sentence is removed and anything
   106 The splitting and stripping operations are done on a string and their results are
   111 The splitting and stripping operations are done on a string and their results are
   107 also strings. Hence the marks that we have are still strings and mathematical
   112 also strings. Hence the marks that we have are still strings and mathematical
   108 operations are not possible. We must convert them into integers or floats.
   113 operations are not possible. We must convert them into integers or floats.
   109 
   114 
   110 We shall look at converting strings into floats. We define a float string
   115 We shall look at converting strings into floats. We define a float string
   111 first. Type::
   116 first. Type
       
   117 ::
   112 
   118 
   113     mark_str = "1.25"
   119     mark_str = "1.25"
   114     mark = float(mark_str)
   120     mark = float(mark_str)
   115     mark_str
   121     mark_str
   116     mark
   122     mark
   123 %% 3 %% What happens if you do int("1.25")
   129 %% 3 %% What happens if you do int("1.25")
   124 
   130 
   125 {{{ continue from paused state }}}
   131 {{{ continue from paused state }}}
   126 
   132 
   127 It raises an error since converting a float string into an integer directly is
   133 It raises an error since converting a float string into an integer directly is
   128 not possible. It involves an intermediate step of converting to float.::
   134 not possible. It involves an intermediate step of converting to float.
       
   135 ::
   129 
   136 
   130     dcml_str = "1.25"
   137     dcml_str = "1.25"
   131     flt = float(dcml_str)
   138     flt = float(dcml_str)
   132     flt
   139     flt
   133     number = int(flt)
   140     number = int(flt)
   135 
   142 
   136 Using ``int``, it is also possible to convert floats into integers.
   143 Using ``int``, it is also possible to convert floats into integers.
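
The conversions discussed above can be sketched as follows. The flag ``int_failed`` is a hypothetical helper added only to make the failure visible; it is not part of the original script:

```python
dcml_str = "1.25"

# int() rejects a string that contains a decimal point.
int_failed = False
try:
    int(dcml_str)
except ValueError as err:
    int_failed = True
    print("int() failed:", err)

# Going through float first works.
flt = float(dcml_str)

# int() on a float discards the fractional part.
number = int(flt)

print(flt, number)
```

Because ``int`` truncates rather than rounds, marks that carry a fractional part are probably better kept as floats.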
   137 
   144 
   138 Now that we have all the machinery required to parse the file, let us solve the
   145 Now that we have all the machinery required to parse the file, let us solve the
   139 problem. We first read the file line by line and parse each record. We see if
   146 problem. We first read the file line by line and parse each record. We see if
   140 the region code is B and store the marks accordingly.::
   147 the region code is B and store the marks accordingly.
       
   148 ::
   141 
   149 
   142     math_marks_B = [] # an empty list to store the marks
   150     math_marks_B = [] # an empty list to store the marks
   143     for line in open("/home/fossee/sslc1.txt"):
   151     for line in open("/home/fossee/sslc1.txt"):
   144         fields = line.split(";")
   152         fields = line.split(";")
   145 
   153 
   152         if region_code == "B":
   160         if region_code == "B":
   153             math_marks_B.append(math_mark)
   161             math_marks_B.append(math_mark)
   154 
   162 
   155 
   163 
   156 Now we have all the maths marks of region "B" in the list math_marks_B.
   164 Now we have all the maths marks of region "B" in the list math_marks_B.
   157 To get the mean, we just have to sum the marks and divide by the length.::
   165 To get the mean, we just have to sum the marks and divide by the length.
       
   166 ::
   158 
   167 
   159         math_marks_mean = sum(math_marks_B) / len(math_marks_B)
   168         math_marks_mean = sum(math_marks_B) / len(math_marks_B)
   160         math_marks_mean
   169         math_marks_mean
   161 
   170 
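
For readers without access to the data file, here is a self-contained sketch of the whole procedure on in-memory lines. It rests on several assumptions that the text shown here does not confirm: the region code is taken to be the first field and the maths mark the sixth (``REGION_INDEX`` and ``MARK_INDEX`` are hypothetical names), and the second and third sample lines are invented purely for illustration.

```python
# Only the first line comes from the text; the other two are made up.
sample_lines = [
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n",
    " B ;000001;EXAMPLE ONE;050;060;70;80;90;350;;;\n",
    "B;000002;EXAMPLE TWO;040;050;60;70;80;300;;;\n",
]

REGION_INDEX = 0  # assumption: region code is the first field
MARK_INDEX = 5    # assumption: maths mark is the sixth field

math_marks_B = []  # an empty list to store the marks
for line in sample_lines:
    fields = line.split(";")
    region_code = fields[REGION_INDEX].strip()
    mark_str = fields[MARK_INDEX].strip()
    # Keep only region "B" and skip records with an empty mark.
    if region_code == "B" and mark_str:
        math_marks_B.append(float(mark_str))

# Guard against an empty list before dividing.
if math_marks_B:
    math_marks_mean = sum(math_marks_B) / len(math_marks_B)
    print(math_marks_mean)
```

Stripping the region code before the comparison is what lets the record that starts with " B " be counted along with the one that starts with "B".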
   162 {{{ Show summary slide }}}
   171 {{{ Show summary slide }}}