# HG changeset patch
# User nishanth
# Date 1284556817 -19800
# Node ID bc93dd9d22c58da1509f2e6eee207598fc03cdf2
# Parent b8f7ee434b9171f9b640e471edd1f6fbb187ed31
initial commit parsing_data

diff -r b8f7ee434b91 -r bc93dd9d22c5 parsing_data.rst
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/parsing_data.rst	Wed Sep 15 18:50:17 2010 +0530
@@ -0,0 +1,182 @@
+Hello friends and welcome to the tutorial on Parsing Data
+
+{{{ Show the slide containing title }}}
+
+{{{ Show the slide containing the outline slide }}}
+
+In this tutorial, we shall learn
+
+ * what parsing data is
+ * the string operations required for parsing data
+ * datatype conversion
+
+Let us have a look at the problem
+
+{{{ Show the slide containing problem statement. }}}
+
+There is an input file containing a huge number of records. Each record
+corresponds to a student.
+
+{{{ show the slide explaining record structure }}}
+As you can see, each record consists of fields separated by a ";". The first
+field is the region code, then roll number, then name, marks of second language,
+first language, maths, science and social, total marks, pass/fail indicated by P
+or F and finally W if withheld and empty otherwise.
+
+Our job is to calculate the mean of all the maths marks in the region "B".
+
+#[Nishanth]: Please note that I am not telling anything about AA since they do
+             not know about any if/else yet.
+
+
+Now, what is parsing data?
+
+From the input file, we can see that there is data in the form of text. Hence
+parsing data is all about reading the data and converting it into a form which
+can be used for computations. In our case, that is numbers.
+
+We can clearly see that the problem involves reading files and tokenizing.
+
+Let us learn about tokenizing strings. Let us define a string first. Type::
+
+    line = "parse   this     string"
+
+We are now going to split this string on whitespace::
+
+    line.split()
+
+As you can see, we get a list of strings.
+This means that when split is called without any arguments, it splits on
+whitespace. In simple words, a run of spaces is treated as one big space.
+
+split can also split on a string of our choice. This is achieved by passing
+that string as an argument. But first let us define a sample record from the file::
+
+    record = "A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;"
+    record.split(';')
+
+We can see that the string is split on ';' and we get each field separately.
+We can also observe that empty strings appear in the list wherever there are
+two semicolons without anything in between.
+
+Hence split splits on whitespace if called without an argument and splits on
+the given argument if it is called with an argument.
+
+{{{ Pause here and try out the following exercises }}}
+
+%% 1 %% split the variable line using a space as argument. Is it the same as
+       splitting without an argument?
+
+{{{ continue from paused state }}}
+
+We see that when we split on space, multiple whitespaces are not clubbed as one
+and there is an empty string every time there are two consecutive spaces.
+
+Now that we know how to split a string, we can split the record and retrieve each
+field separately. But there is one problem. The region code "B" and a "B"
+surrounded by whitespace are treated as two different regions. We must find a
+way to remove all the whitespace around a string so that "B" and a "B" with
+white spaces are treated as the same.
+
+This is possible by using the =strip= method of strings. Let us define a
+string by typing::
+
+    unstripped = " B "
+    unstripped.strip()
+
+We can see that strip removes all the whitespace around the sentence.
+
+{{{ Pause here and try out the following exercises }}}
+
+%% 2 %% What happens to the white space inside the sentence when it is stripped?
+
+{{{ continue from paused state }}}
+
+Type::
+
+    a_str = " white space "
+    a_str.strip()
+
+We see that only the whitespace around the sentence is removed and anything
+inside remains unaffected.
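As a quick recap of =split= and =strip= working together, here is a minimal sketch. The record below is a shortened, made-up variant of the sample record, with spaces added around the region code, and the names =raw_record=, =raw_fields= and =clean_fields= are chosen only for illustration.

```python
# Hypothetical record: a shortened sample with spaces added around "B".
raw_record = " B ;015163;JOSEPH RAJ S;083;042;47"

# split(";") breaks the record into fields but keeps the surrounding spaces.
raw_fields = raw_record.split(";")

# strip() removes whitespace at both ends of each field, never inside it.
clean_fields = []
for field in raw_fields:
    clean_fields.append(field.strip())

print(raw_fields[0] == "B")    # the unstripped region code does not match
print(clean_fields[0] == "B")  # the stripped region code does
print(clean_fields[2])         # the spaces inside the name are untouched
```

Stripping every field right after splitting means the later comparisons do not have to care about stray spaces in the file.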
+
+By now we know enough to separate fields from the record and to strip out any
+white space. The only road block we now have is conversion of string to float.
+
+The splitting and stripping operations are done on a string and their result is
+also a string. Hence the marks that we have are still strings and mathematical
+operations are not possible. We must convert them into integers or floats.
+
+We shall look at converting strings into floats. We define a float string
+first. Type::
+
+    mark_str = "1.25"
+    mark = float(mark_str)
+    mark_str
+    mark
+
+We can see that the string is converted to a float. We can perform mathematical
+operations on it now.
+
+{{{ Pause here and try out the following exercises }}}
+
+%% 3 %% What happens if you do int("1.25")
+
+{{{ continue from paused state }}}
+
+It raises an error since converting a float string into an integer directly is
+not possible. It involves an intermediate step of converting to float::
+
+    dcml_str = "1.25"
+    flt = float(dcml_str)
+    flt
+    number = int(flt)
+    number
+
+Using =int= it is also possible to convert floats into integers.
+
+Now that we have all the machinery required to parse the file, let us solve the
+problem. We first read the file line by line and parse each record. We see if
+the region code is B and store the marks accordingly::
+
+    math_marks_B = [] # an empty list to store the marks
+    for line in open("/home/fossee/sslc1.txt"):
+        fields = line.split(";")
+
+        region_code = fields[0]
+        region_code_stripped = region_code.strip()
+
+        math_mark_str = fields[5] # maths is the sixth field
+        math_mark = float(math_mark_str)
+
+        if region_code_stripped == "B":
+            math_marks_B.append(math_mark)
+
+
+Now we have all the maths marks of region "B" in the list math_marks_B.
+To get the mean, we just have to sum the marks and divide by the length::
+
+    math_marks_mean = sum(math_marks_B) / len(math_marks_B)
+    math_marks_mean
+
+{{{ Show summary slide }}}
+
+This brings us to the end of the tutorial.
+We have learnt
+ * how to tokenize a string using various delimiters
+ * how to get rid of extra white space around a string
+ * how to convert from one type to another
+ * how to parse input data and perform computations on it
+
+{{{ Show the "sponsored by FOSSEE" slide }}}
+
+#[Nishanth]: Will add this line after all of us fix on one.
+This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India
+
+Hope you have enjoyed and found it useful.
+Thank you
+
+.. Author              : Nishanth
+   Internal Reviewer 1 :
+   Internal Reviewer 2 :
+   External Reviewer   :