Branches merged.
--- a/statistics.txt Wed Apr 14 12:20:26 2010 +0530
+++ b/statistics.txt Wed Apr 14 12:35:48 2010 +0530
@@ -1,16 +1,17 @@
Hello and welcome to the tutorial on handling large data files and processing them.
-Till now we have covered:
+Up until now we have covered:
* How to create plots.
* How to read data from files and process it.
-In this session, we will use these concepts and some new ones, to solve a problem/exercise.
+In this tutorial, we shall use these concepts, along with some new ones, to solve an exercise.
-We have a file named sslc.txt.
+We have a file named sslc.txt on our desktop.
It contains records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read it and process this data.
-We can see the content of file by opening with any text editor.
+We can view the contents of the file by double-clicking on it. It might take some time to open, since it is quite a large file.
Please don't edit the data.
This file has a particular structure. Each line in the file is a set of 11 fields separated by semicolons.
+Consider a sample line from this file.
A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
The following are the fields in any given line.
* Region Code which is 'A'
@@ -38,12 +39,20 @@
Dictionaries - we shall be introducing the concept of dictionaries here.
And finally plotting - which we have been doing all along.
+Since this file is on our Desktop, let's navigate there by typing
+
+cd Desktop
+
+Let's get started by opening the IPython prompt. To do so, type
+
+ipython -pylab
+
Let's first start off with dictionaries.
We earlier used lists briefly. Back then we just created lists and appended items into them.
x = [1, 4, 2, 7, 6]
In order to access any element in a list, we use its index number. Index starts from 0.
-For eg. x[0] will give 1 and x[3] will give 7.
+For example, x[0] gives 1 and x[3] gives 7.
But using integer indexes isn't always convenient. For example, consider a telephone directory. We give it a name and it should return the corresponding number. A list is not well suited for such problems. Python's dictionaries are a better fit. Dictionaries are just key-value pairs. For example:
@@ -51,6 +60,8 @@
'txt' : 'text',
'py' : 'python'}
+And that is how we create a dictionary. Dictionaries are created by typing the key-value pairs, separated by commas, within curly braces.
+
d
d is a dictionary. The first element in the pair is called the `key' and the second is called the `value'. The key has to be of an immutable type, such as a string or a number, while the value can be of any type.
@@ -69,6 +80,8 @@
'jpg' in d
False
+
+
Please note that keys, and not values, are searched.
'In a telephone directory one can search for a number based on a name, but not for a name based on a number'
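The point about keys and values can be sketched with a small telephone directory. The names and numbers below are made up for illustration and are not part of the tutorial's data. The 'in' operator looks only at the keys, so a value has to be searched for explicitly through values().

```python
# Hypothetical telephone directory; names and numbers are illustrative only.
directory = {'asha': '555-0101', 'ravi': '555-0102'}

name_found = 'asha' in directory                     # True: 'asha' is a key
number_found = '555-0101' in directory               # False: values are not searched
number_in_values = '555-0101' in directory.values()  # True: explicit search of the values

print(name_found, number_found, number_in_values)
```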
@@ -81,16 +94,17 @@
['python', 'text', 'image']
is used to obtain the list of all values in a dictionary
-Let's now see what the dictionary contains
-d
+One more thing to note about dictionaries, in this case for d,
-Please observe that dictionaries do not preserve the order in which the items were entered. The order of the elements in a dictionary should not be relied upon.
+d
+
+is that dictionaries do not preserve the order in which the items were entered. The order of the elements in a dictionary should not be relied upon.
------------------------------------------------------------------------------------------------------------------
Parsing and string processing
-As we saw previously we will be dealing with lines with content of the form
+As we saw previously, we shall be dealing with lines of the form
A;015162;JENIL T P;081;060;77;41;74;333;P;;
Here ';' is the delimiter; that is, ';' is used to separate the fields.
@@ -108,7 +122,7 @@
is a list containing all the fields separately.
a[0] is the region code, a[1] the roll no., a[2] the name and so on.
-Similarly, a[6] will give us the science marks of that particular region.
+Similarly, a[6] gives us the science marks of that particular student.
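As a minimal sketch of the indexing described above, assuming the list a is produced by calling the string method split on the sample line with ';' as the separator:

```python
# Sample record taken from the text above.
line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'

a = line.split(';')   # break the record at every ';'

region_code = a[0]    # 'A'
roll_no = a[1]        # '015162'
name = a[2]           # 'JENIL T P'
science_marks = a[6]  # '41', still a string at this point

# Marks have to be converted to a number before they can be compared with 90.
print(region_code, name, int(science_marks) > 90)
```

Note that split returns strings, so a[6] is '41' and not the number 41.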
So we create a dictionary that maps each region to the number of its students who scored more than 90 marks in science.
@@ -147,7 +161,7 @@
Hit return twice to exit the for loop
-by end of this loop we will have our desired output in the dictionary 'science'
+By the end of this loop, we shall have our desired output in the dictionary 'science'.
We can check the values by typing
science
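For reference, one way such a counting loop might look is sketched below. It is not necessarily the exact loop used in the tutorial. The records in sample_lines are made up in the layout of sslc.txt so that the sketch runs without the real file, and the check for non-numeric marks such as 'AA' is an assumption about how those entries should be treated.

```python
# A few made-up records in the sslc.txt layout; they stand in for the real file.
sample_lines = [
    'A;000001;STUDENT ONE;081;060;77;95;74;387;P;;',
    'A;000002;STUDENT TWO;070;055;60;41;50;276;P;;',
    'B;000003;STUDENT THREE;090;088;91;93;85;447;P;;',
    'B;000004;STUDENT FOUR;083;042;47;AA;72;244;;;',
]

science = {}  # region code -> number of students with science marks above 90

for record in sample_lines:
    fields = record.strip().split(';')
    region_code = fields[0]
    marks = fields[6]

    # Give every region an entry, even if its count stays at 0.
    if region_code not in science:
        science[region_code] = 0

    # Entries such as 'AA' are not numbers, so they are skipped (assumption).
    if marks.isdigit() and int(marks) > 90:
        science[region_code] += 1

print(science)
```

With the real data, the loop would iterate over open('sslc.txt') instead of sample_lines.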