--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/parsing_data.rst Wed Sep 15 18:50:17 2010 +0530
@@ -0,0 +1,182 @@
+Hello friends and welcome to the tutorial on Parsing Data
+
+{{{ Show the slide containing title }}}
+
+{{{ Show the outline slide }}}
+
+In this tutorial, we shall learn
+
+ * what parsing data means
+ * the string operations required for parsing data
+ * datatype conversion
+
+Let us have a look at the problem.
+
+{{{ Show the slide containing problem statement. }}}
+
+There is an input file containing a huge number of records. Each record
+corresponds to a student.
+
+{{{ show the slide explaining record structure }}}
+
+As you can see, each record consists of fields separated by a ";". The first
+field is the region code, then the roll number, then the name, the marks in
+second language, first language, maths, science and social, the total marks,
+pass/fail indicated by P or F, and finally W if the result is withheld and
+empty otherwise.
+
+Our job is to calculate the mean of all the maths marks in the region "B".
+
+#[Nishanth]: Please note that I am not telling anything about AA since they do
+ not know about any if/else yet.
+
+
+Now, what is parsing data?
+
+From the input file, we can see that there is data in the form of text. Hence
+parsing data is all about reading the data and converting it into a form which
+can be used for computations. In our case, that is numbers.
+
+We can clearly see that the problem involves reading files and tokenizing.
+
+Let us learn about tokenizing strings. Let us define a string first. Type::
+
+ line = "parse this string"
+
+We are now going to split this string on whitespace::
+
+ line.split()
+
+As you can see, we get a list of strings. This means that when split is called
+without any arguments, it splits on whitespace. In simple words, consecutive
+spaces are treated as one big space.
+
+split can also split on a string of our choice. This is achieved by passing
+that string as an argument. But first let us define a sample record from the
+file::
+
+ record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
+ record.split(';')
+
+We can see that the string is split on ';' and we get each field separately.
+We can also observe that empty strings appear in the list wherever there are
+two semicolons with nothing in between.
+
+Hence split splits on whitespace if called without an argument and splits on
+the given argument if it is called with an argument.
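+
+As a quick check of the above, here is a small sketch that splits the same
+sample record and looks at a few of the resulting fields. The variable name
+fields is only chosen for illustration.

```python
# The sample record used above
record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"

# Splitting on ';' gives one list element per field
fields = record.split(";")

print(len(fields))   # 12, since 11 semicolons separate 12 fields
print(fields[0])     # A    (region code)
print(fields[5])     # 47   (maths marks, still a string)
print(fields[-3:])   # ['', '', '']  (the empty trailing fields)
```
+
+Note that every element of the list, including the marks, is still a string.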
+
+{{{ Pause here and try out the following exercises }}}
+
+%% 1 %% Split the variable line using a space as the argument. Is it the same
+ as splitting without an argument?
+
+{{{ continue from paused state }}}
+
+We see that when we split on a space, multiple whitespaces are not clubbed as
+one and there is an empty string every time there are two consecutive spaces.
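+
+The difference only shows up when the string has consecutive spaces. A minimal
+sketch, assuming a hypothetical string spaced that has two spaces between its
+first two words:

```python
# Hypothetical sample string with two consecutive spaces after "parse"
spaced = "parse  this string"

# No argument: a run of whitespace acts as a single separator
no_arg = spaced.split()

# Explicit single space: every space is a separator, so an empty
# string appears between the two consecutive spaces
with_space = spaced.split(" ")

print(no_arg)      # ['parse', 'this', 'string']
print(with_space)  # ['parse', '', 'this', 'string']
```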
+
+Now that we know how to split a string, we can split the record and retrieve
+each field separately. But there is one problem. The region code "B" and a "B"
+surrounded by whitespace are treated as two different regions. We must find a
+way to remove all the whitespace around a string so that "B" and a "B" with
+whitespace around it are treated as the same.
+
+This is possible by using the =strip= method of strings. Let us define a
+string by typing::
+
+ unstripped = " B "
+ unstripped.strip()
+
+We can see that strip removes all the whitespace around the string.
+
+{{{ Pause here and try out the following exercises }}}
+
+%% 2 %% What happens to the whitespace inside the sentence when it is stripped?
+
+{{{ continue from paused state }}}
+
+Type::
+
+ a_str = " white space "
+ a_str.strip()
+
+We see that only the whitespace around the sentence is removed and anything
+inside remains unaffected.
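+
+To see why this matters for the region code, here is a small sketch. The names
+padded_code and inner are hypothetical and used only for this illustration.

```python
# Hypothetical region code with whitespace on both sides
padded_code = "  B  "

# Without stripping, the comparison fails
print(padded_code == "B")          # False

# After stripping, it matches the plain region code
print(padded_code.strip() == "B")  # True

# Whitespace between the words is left untouched
inner = "  white   space  "
print(repr(inner.strip()))         # 'white   space'
```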
+
+By now we know enough to separate fields from the record and to strip out any
+whitespace. The only road block we now have is conversion of string to float.
+
+The splitting and stripping operations are done on a string and their result is
+also a string. Hence the marks that we have are still strings and mathematical
+operations are not possible. We must convert them into integers or floats.
+
+We shall look at converting strings into floats. We define a float string
+first. Type::
+
+ mark_str = "1.25"
+ mark = float(mark_str)
+ mark_str
+ mark
+
+We can see that the string is converted to a float. We can now perform
+mathematical operations on it.
+
+{{{ Pause here and try out the following exercises }}}
+
+%% 3 %% What happens if you do int("1.25")?
+
+{{{ continue from paused state }}}
+
+It raises an error since converting a float string into an integer directly is
+not possible. It involves an intermediate step of converting to float::
+
+ dcml_str = "1.25"
+ flt = float(dcml_str)
+ flt
+ number = int(flt)
+ number
+
+Using =int= it is also possible to convert a float into an integer; the
+fractional part is dropped.
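+
+The two conversions can be put side by side in a short sketch. It uses
+try/except, which this tutorial has not introduced, only to capture the error
+instead of stopping the program; the names error_message, as_float and as_int
+are hypothetical.

```python
dcml_str = "1.25"

# Direct conversion of a float string to int raises ValueError
try:
    int(dcml_str)
    error_message = None
except ValueError as err:
    error_message = str(err)

# Going through float first works; int() then drops the fractional part
as_float = float(dcml_str)
as_int = int(as_float)

print(error_message)
print(as_float)  # 1.25
print(as_int)    # 1
```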
+
+Now that we have all the machinery required to parse the file, let us solve the
+problem. We first read the file line by line and parse each record. We check
+whether the region code is B and store the marks accordingly::
+
+ math_marks_B = [] # an empty list to store the marks
+ for line in open("/home/fossee/sslc1.txt"):
+     fields = line.split(";")
+
+     region_code = fields[0]
+     region_code_stripped = region_code.strip()
+
+     math_mark_str = fields[5]
+     math_mark = float(math_mark_str)
+
+     # compare the stripped code so that " B " also counts as region B
+     if region_code_stripped == "B":
+         math_marks_B.append(math_mark)
+
+
+Now we have all the maths marks of region "B" in the list math_marks_B.
+To get the mean, we just have to sum the marks and divide by the length::
+
+ math_marks_mean = sum(math_marks_B) / len(math_marks_B)
+ math_marks_mean
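+
+The whole computation can be tried without the input file. The sketch below is
+only an illustration under stated assumptions: the records are made up (apart
+from the first, which is the sample record shown earlier), io.StringIO stands
+in for the real file, and a maths field of "AA" is assumed to mean that no
+mark is available, so such records are skipped rather than converted.

```python
import io

# Hypothetical records standing in for the real input file
sample_file = io.StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n"
    " B ;015164;HYPOTHETICAL ONE;070;060;50;55;65;300;P;;\n"
    "B;015165;HYPOTHETICAL TWO;071;061;80;56;66;334;P;;\n"
    "B;015166;HYPOTHETICAL THREE;072;062;AA;57;67;258;;;\n"
)

math_marks_B = []  # maths marks of region B
for line in sample_file:
    fields = line.split(";")

    region_code = fields[0].strip()
    math_mark_str = fields[5].strip()

    # Assumption: "AA" means no mark is available, so skip the record
    if region_code == "B" and math_mark_str != "AA":
        math_marks_B.append(float(math_mark_str))

math_marks_mean = sum(math_marks_B) / len(math_marks_B)
print(math_marks_mean)  # 65.0
```
+
+Only the two region B records with numeric maths marks (50 and 80) contribute
+to the mean here.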
+
+{{{ Show summary slide }}}
+
+This brings us to the end of the tutorial.
+We have learnt
+
+ * how to tokenize a string using various delimiters
+ * how to get rid of extra whitespace around a string
+ * how to convert from one type to another
+ * how to parse input data and perform computations on it
+
+{{{ Show the "sponsored by FOSSEE" slide }}}
+
+#[Nishanth]: Will add this line after all of us fix on one.
+This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India
+
+Hope you have enjoyed it and found it useful.
+Thank you
+
+.. Author : Nishanth
+ Internal Reviewer 1 :
+ Internal Reviewer 2 :
+ External Reviewer :