statistics/script.rst
changeset 450 d49aee7ab1b9
parent 406 a534e9e79599
--- a/statistics/script.rst	Wed Nov 10 17:25:18 2010 +0530
+++ b/statistics/script.rst	Thu Nov 11 01:37:32 2010 +0530
@@ -13,6 +13,8 @@
 .. Getting started with IPython
 .. Loading Data from files
 .. Getting started with Lists
+.. Accessing Pieces of Arrays
+
      
 .. Author              : Amit Sethi
    Internal Reviewer   : Puneeth
@@ -28,8 +30,12 @@
 {{{ Show the slide containing the outline slide }}}
 
 In this tutorial, we shall learn
- * Doing simple statistical operations in Python  
- * Applying these to real world problems 
+ * How to do statistical operations in Python
+   * Summing a set of numbers
+   * Finding their mean
+   * Finding their median
+   * Finding their standard deviation
+   
 
 
 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you
@@ -45,88 +51,13 @@
 .. smaller data-set or something. Using lists doesn't seem natural.]
 
 
-We will first start with the most necessary statistical operation i.e
-finding mean.
-
-We have a list of ages of a random group of people ::
-   
-   age_list = [4,45,23,34,34,38,65,42,32,7]
-
-One way of getting the mean could be getting sum of all the ages and
-dividing by the number of people in the group. ::
-
-    sum_age_list = sum(age_list)
-
-sum function gives us the sum of the elements. Note that the
-``sum_age_list`` variable is an integer and the number of people or
-length of the list is also an integer. We will need to convert one of
-them to a float before carrying out the division. ::
-
-    mean_using_sum = float(sum_age_list)/len(age_list)
-
-This obviously gives the mean age but there is a simpler way to do
-this in Python - using the mean function::
-
-       mean(age_list)
-
-Mean can be used in more ways in case of 2 dimensional lists.  Take a
-two dimensional list ::
-     
-     two_dimension=[[1,5,6,8],[1,3,4,5]]
-
-The mean function by default gives the mean of the flattened sequence.
-A Flattened sequence means a list obtained by concatenating all the
-smaller lists into a large long list. In this case, the list obtained
-by writing the two lists one after the other. ::
-
-    mean(two_dimension)
-    flattened_seq=[1,5,6,8,1,3,4,5]
-    mean(flattened_seq)
-
-As you can see both the results are same. ``mean`` function can also
-give us the mean of each column, or the mean of corresponding elements
-in the smaller lists. ::
-   
-   mean(two_dimension, 0)
-   array([ 1. ,  4. ,  5. ,  6.5])
-
-we pass an extra argument 0 in that case.
-
-If we use an argument 1, we obtain the mean along the rows. ::
-   
-   mean(two_dimension, 1)
-   array([ 5.  ,  3.25])
-
-We can see more option of mean using ::
-   
-   mean?
-
-Similarly we can calculate median and stanard deviation of a list
-using the functions median and std::
-      
-      median(age_list)
-      std(age_list)
-
-Median and std can also be calculated for two dimensional arrays along
-columns and rows just like mean.
-
-For example ::
-       
-       median(two_dimension, 0)
-       std(two_dimension, 1)
-
-This gives us the median along the colums and standard devition along
-the rows.
-       
-Now lets apply this to a real world example 
-    
-We will a data file that is at the a path ``/home/fossee/sslc2.txt``.
-It contains record of students and their performance in one of the
-State Secondary Board Examination. It has 180, 000 lines of record. We
-are going to read it and process this data.  We can see the content of
-file by double clicking on it. It might take some time to open since
-it is quite a large file.  Please don't edit the data.  This file has
-a particular structure.
+For this tutorial we will use the data file at the path
+``/home/fossee/sslc2.txt``.  It contains records of students and their
+performance in one of the State Secondary Board Examinations. It has
+about 180,000 lines of records. We are going to read and process this
+data.  We can see the contents of the file by double clicking on it. It
+might take some time to open since it is quite a large file.  Please
+don't edit the data.  This file has a particular structure.
 
 We can do ::
    
@@ -134,6 +65,9 @@
 
 to check the contents of the file.
 
+
+{{{ Show the data structure on a slide }}}
+
 Each line in the file is a set of 11 fields separated 
 by semi-colons Consider a sample line from this file.  
 A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;; 
@@ -147,45 +81,97 @@
 * Total marks 244
 
 
-Now lets try and find the mean of English marks of all students.
-
-For this we do. ::
+Let's try to load this data as an array and then run various functions on
+it.
 
-     L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';')
+To get the data as an array we do. ::
+   
+     L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7,),delimiter=';')
      L
-     mean(L)
+     
 
 loadtxt function loads data from an external file.Delimiter specifies
-the kind of character are the fields of data seperated by. 
-usecols specifies  the columns to be used so (3,). The 'comma' is added
-because usecols is a sequence.
+the character that separates the fields of data.  usecols
+specifies the columns to be used, so (3,4,5,6,7) loads those five
+columns. usecols takes a sequence; the trailing comma is optional here.
 
-To get the median marks. ::
+As we can see L is an array. We can get the shape of this array using::
    
-    median(L)
+   L.shape
+   (185667, 5)
+
+Let's start applying statistical operations on these. We will start with
+the most basic, summing. How do you find the sum of marks of all
+subjects for the first student?
+
+As we know from accessing pieces of arrays, to access
+the first row we do ::
    
-Standard deviation. ::
-	
-    std(L)
+   L[0,:]
+
+Now to sum this we can say ::
 
+    totalmarks=sum(L[0,:]) 
+    totalmarks
 
-Now lets try and and get the mean for all the subjects ::
+To get the mean we can do ::
+
+   totalmarks/len(L[0,:])
+
+or simply ::
+
+   mean(L[0,:])
 
-     L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
-     mean(L,0)
-     array([ 73.55452504,  53.79828941,  62.83342759,  50.69806158,  63.17056881])
+But with such a large data set, calculating the mean of each student
+one by one is impractical. Is there a way to reduce the work?
+
+For this we will look into the documentation of mean by doing::
 
-As we can see from the result mean(L,0). The resultant sequence  
-is the mean marks of all students that gave the exam for the five subjects.
+    mean?
 
-and ::
-    
+As we know, L is a two dimensional array. We can calculate the mean
+along either axis of the array. Axis 0 runs down the rows and axis 1
+runs across the columns. So to get each student's mean across all
+five columns, we pass the extra parameter 1 for the axis ::
+
     mean(L,1)
 
-    
-is the average accumalative marks of individual students. Clearly, mean(L,0)
-was a row wise calcultaion while mean(L,1) was a column wise calculation.
+L here is the two dimensional array.
+
+Similarly, the average marks scored by all the students for each
+subject can be calculated using ::
+
+   mean(L,0)
+
+Next, let's calculate the median of English marks for all the students.
+We can access the English marks of all students using ::
+
+   L[:,0]
+   
+To get the median we will do ::
+
+   median(L[:,0])
 
+For all the subjects we can use the same syntax as mean and calculate
+the median of each column, taken over all rows, using ::
+
+       median(L,0)
+  
+
+Similarly, to calculate the standard deviation for English we can do::
+
+    std(L[:,0])
+
+and for every subject, taken over all rows::
+
+    std(L,0)
+
+The following is an exercise that you must do.
+
+%% %% In the given file football.txt at path /home/fossee/football.txt, the first column is the player name, the second is goals at home and the third is goals away.
+   1. Find the total goals for each player
+   2. Find the mean of home and away goals
+   3. Find the standard deviation of home and away goals
 
 {{{ Show summary slide }}}
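
The axis conventions used in the revised script can be checked on a tiny array before running them on the full file. The following is only a sketch under stated assumptions: it imports NumPy explicitly (the script itself relies on the bare ``sum``, ``mean``, ``median`` and ``std`` names available in its IPython session), and the ``marks`` array is illustrative. Its first row mirrors the five subject marks of the sample line, and the other two rows are made up.

```python
import numpy as np

# Illustrative data: 3 students (rows) x 5 subjects (columns).
# Row 0 mirrors the sample line's marks; rows 1 and 2 are invented.
marks = np.array([
    [83.0, 42.0, 47.0, 0.0, 72.0],
    [60.0, 55.0, 70.0, 65.0, 50.0],
    [90.0, 80.0, 85.0, 75.0, 95.0],
])

# Total marks of the first student (sum over the first row).
first_total = marks[0, :].sum()

# axis=1 collapses the columns: one mean per student.
per_student_mean = marks.mean(axis=1)

# axis=0 collapses the rows: one value per subject.
per_subject_mean = marks.mean(axis=0)
per_subject_median = np.median(marks, axis=0)
per_subject_std = marks.std(axis=0)  # population standard deviation (ddof=0)

print(first_total)
print(per_student_mean)
print(per_subject_mean)
print(per_subject_median)
print(per_subject_std)
```

The first total comes out as 244, matching the total marks field in the sample line. Results computed with ``axis=1`` have one entry per student, while those computed with ``axis=0`` have one entry per subject.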