statistics/script.rst
changeset 457 68813d8d80fb
parent 450 d49aee7ab1b9
equal deleted inserted replaced
456:be96dc6c9743 457:68813d8d80fb
    11 .. -------------
    11 .. -------------
    12 
    12 
    13 .. Getting started with IPython
    13 .. Getting started with IPython
    14 .. Loading Data from files
    14 .. Loading Data from files
    15 .. Getting started with Lists
    15 .. Getting started with Lists
       
    16 .. Accessing Pieces of Arrays
       
    17 
    16      
    18      
    17 .. Author              : Puneeth 
    19 .. Author              : Amit Sethi
    18    Internal Reviewer   : Anoop Jacob Thomas<anoop@fossee.in>
    20    Internal Reviewer   : Puneeth
    19    External Reviewer   :
    21    External Reviewer   :
    20    Checklist OK?       : <put date stamp here, if OK> [2010-10-05]
    22    Checklist OK?       : <put date stamp here, if OK> [2010-10-05]
    21 
    23 
    22 .. #[punch; add slides, exercises!]
    24 .. #[punch; add slides, exercises!]
    23 
    25 
    26 {{{ Show the slide containing title }}}
    28 {{{ Show the slide containing title }}}
    27 
    29 
    28 {{{ Show the slide containing the outline slide }}}
    30 {{{ Show the slide containing the outline slide }}}
    29 
    31 
    30 In this tutorial, we shall learn
    32 In this tutorial, we shall learn
    31  * Doing simple statistical operations in Python  
    33  * Doing statistical operations in Python  
    32  * Applying these to real world problems 
    34    * Summing a set of numbers
       
    35    * Finding their mean
       
    36    * Finding their median
       
    37    * Finding their standard deviation
       
    38    
    33 
    39 
    34 .. #[punch: the prerequisites part may be skipped in the tutorial. It
       
    35 .. will be provided separately.]
       
    36 
       
    37 You will need Ipython with pylab running on your computer to use this
       
    38 tutorial.
       
    39 
       
    40 Also you will need to know about loading data using loadtxt to be able
       
    41 to follow the real world application.
       
    42 
    40 
    43 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you
    41 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you
    44 .. to use a data file and load data from that. that is good, since you
    42 .. to use a data file and load data from that. that is good, since you
    45 .. would get to deal with arrays, instead of lists. 
    43 .. would get to deal with arrays, instead of lists. 
    46 
    44 
    51 .. The idea of separating the main problem and giving toy examples
    49 .. The idea of separating the main problem and giving toy examples
    52 .. doesn't sound good. Use the same problem to explain stuff. Or use a
    50 .. doesn't sound good. Use the same problem to explain stuff. Or use a
    53 .. smaller data-set or something. Using lists doesn't seem natural.]
    51 .. smaller data-set or something. Using lists doesn't seem natural.]
    54 
    52 
    55 
    53 
    56 We will first start with the most necessary statistical operation i.e
    54 For this tutorial we will use a data file located at the path
    57 finding mean.
    55 ``/home/fossee/sslc2.txt``.  It contains records of students and their
    58 
    56 performance in one of the State Secondary Board Examinations. It has
    59 We have a list of ages of a random group of people ::
    57 about 180,000 lines of records. We are going to read and process this
    60    
    58 data.  We can see the contents of the file by double-clicking on it. It
    61    age_list = [4,45,23,34,34,38,65,42,32,7]
    59 might take some time to open since it is quite a large file.  Please
    62 
    60 don't edit the data.  This file has a particular structure.
    63 One way of getting the mean could be getting sum of all the ages and
       
    64 dividing by the number of people in the group. ::
       
    65 
       
    66     sum_age_list = sum(age_list)
       
    67 
       
    68 sum function gives us the sum of the elements. Note that the
       
    69 ``sum_age_list`` variable is an integer and the number of people or
       
    70 length of the list is also an integer. We will need to convert one of
       
    71 them to a float before carrying out the division. ::
       
    72 
       
    73     mean_using_sum = float(sum_age_list)/len(age_list)
       
    74 
       
    75 This obviously gives the mean age but there is a simpler way to do
       
    76 this in Python - using the mean function::
       
    77 
       
    78        mean(age_list)
       
    79 
       
    80 Mean can be used in more ways in case of 2 dimensional lists.  Take a
       
    81 two dimensional list ::
       
    82      
       
    83      two_dimension=[[1,5,6,8],[1,3,4,5]]
       
    84 
       
    85 The mean function by default gives the mean of the flattened sequence.
       
    86 A Flattened sequence means a list obtained by concatenating all the
       
    87 smaller lists into a large long list. In this case, the list obtained
       
    88 by writing the two lists one after the other. ::
       
    89 
       
    90     mean(two_dimension)
       
    91     flattened_seq=[1,5,6,8,1,3,4,5]
       
    92     mean(flattened_seq)
       
    93 
       
    94 As you can see both the results are same. ``mean`` function can also
       
    95 give us the mean of each column, or the mean of corresponding elements
       
    96 in the smaller lists. ::
       
    97    
       
    98    mean(two_dimension, 0)
       
    99    array([ 1. ,  4. ,  5. ,  6.5])
       
   100 
       
   101 we pass an extra argument 0 in that case.
       
   102 
       
   103 If we use an argument 1, we obtain the mean along the rows. ::
       
   104    
       
   105    mean(two_dimension, 1)
       
   106    array([ 5.  ,  3.25])
       
   107 
       
   108 We can see more option of mean using ::
       
   109    
       
   110    mean?
       
   111 
       
   112 Similarly we can calculate median and standard deviation of a list
       
   113 using the functions median and std::
       
   114       
       
   115       median(age_list)
       
   116       std(age_list)
       
   117 
       
   118 Median and std can also be calculated for two dimensional arrays along
       
   119 columns and rows just like mean.
       
   120 
       
   121 For example ::
       
   122        
       
   123        median(two_dimension, 0)
       
   124        std(two_dimension, 1)
       
   125 
       
   126 This gives us the median along the columns and standard deviation along
       
   127 the rows.
       
   128        
       
   129 Now lets apply this to a real world example 
       
   130     
       
   131 We will use a data file that is at the path ``/home/fossee/sslc2.txt``.
       
   132 It contains record of students and their performance in one of the
       
   133 State Secondary Board Examination. It has 180, 000 lines of record. We
       
   134 are going to read it and process this data.  We can see the content of
       
   135 file by double clicking on it. It might take some time to open since
       
   136 it is quite a large file.  Please don't edit the data.  This file has
       
   137 a particular structure.
       
   138 
    61 
   139 We can do ::
    62 We can do ::
   140    
    63    
   141    cat /home/fossee/sslc2.txt
    64    cat /home/fossee/sslc2.txt
   142 
    65 
   143 to check the contents of the file.
    66 to check the contents of the file.
       
    67 
       
    68 
       
    69 {{{ Show the data structure on a slide }}}
   144 
    70 
   145 Each line in the file is a set of 11 fields separated 
    71 Each line in the file is a set of 11 fields separated 
   146 by semicolons. Consider a sample line from this file.  
    72 by semicolons. Consider a sample line from this file.  
   147 A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;; 
    73 A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;; 
   148 
    74 
   153 * Marks of 5 subjects: ** English 083 ** Hindi 042 ** Maths 47 **
    79 * Marks of 5 subjects: ** English 083 ** Hindi 042 ** Maths 47 **
   154 Science 00 ** Social 72
    80 Science 00 ** Social 72
   155 * Total marks 244
    81 * Total marks 244
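
To make the record structure concrete, here is a minimal sketch in plain Python that splits the sample line shown above on semicolons. The variable names ``line``, ``fields``, ``marks`` and ``total`` are chosen only for this illustration and are not part of the tutorial script.

```python
# Sample record copied from the tutorial text.
line = "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;"

# Split the record into its semicolon-separated fields.
fields = line.split(";")

# Fields 3 to 7 hold the marks of the five subjects, field 8 the total.
marks = [int(m) for m in fields[3:8]]
total = int(fields[8])

print(marks)                # [83, 42, 47, 0, 72]
print(sum(marks) == total)  # True
```

The five marks add up to the total recorded in the line, which is a quick way to check that the right fields were picked.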
   156 
    82 
   157 
    83 
   158 Now lets try and find the mean of English marks of all students.
    84 Let's try to load this data as an array and then run various functions on
       
    85 it.
   159 
    86 
   160 For this we do. ::
    87 To get the data as an array we do. ::
   161 
    88    
   162      L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';')
    89      L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
   163      L
    90      L
   164      mean(L)
    91      
   165 
    92 
   166 The loadtxt function loads data from an external file. delimiter specifies
    93 The loadtxt function loads data from an external file. delimiter specifies
   167 the character that separates the fields of data. 
    94 the character that separates the fields of data.  usecols
   168 usecols specifies the columns to be used, so (3,). The comma is added
    95 specifies the columns to be used, so (3,4,5,6,7) loads those
   169 because usecols is a sequence.
    96 columns. usecols is a sequence, which is why a single column is written with a trailing comma, as in (3,).
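
The following is a small sketch of how ``delimiter`` and ``usecols`` work together. It assumes NumPy is imported explicitly as ``np`` (the tutorial gets ``loadtxt`` through pylab) and reads from an in-memory ``io.StringIO`` object instead of the real file. The first record is the sample line from the text; the second record is made up purely for illustration.

```python
import io

import numpy as np

# Two records in the same semicolon-separated layout as sslc2.txt.
# The second record is hypothetical and exists only for this example.
sample = io.StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "A;015164;HYPOTHETICAL NAME;070;055;60;48;66;299;;;\n"
)

# Load only the five marks columns (3 to 7); the text columns are skipped.
marks = np.loadtxt(sample, delimiter=";", usecols=(3, 4, 5, 6, 7))

print(marks.shape)  # (2, 5)
print(marks[0])
```

Each row of the resulting array is one student and each column is one subject, which is the layout the rest of the tutorial relies on.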
   170 
    97 
   171 To get the median marks. ::
    98 As we can see L is an array. We can get the shape of this array using::
   172    
    99    
   173     median(L)
   100    L.shape
       
   101    (185667, 5)
       
   102 
       
   103 Let's start applying statistical operations to this data. We will start with
       
   104 the most basic, summing. How do you find the sum of marks of all
       
   105 subjects for the first student?
       
   106 
       
   107 As we know from accessing pieces of arrays, to access
       
   108 the first row we do ::
   174    
   109    
   175 Standard deviation. ::
   110    L[0,:]
   176 	
       
   177     std(L)
       
   178 
   111 
       
   112 Now to sum this we can say ::
   179 
   113 
   180 Now lets try and and get the mean for all the subjects ::
   114     totalmarks=sum(L[0,:]) 
       
   115     totalmarks
   181 
   116 
   182      L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
   117 To get the mean we can do ::
   183      mean(L,0)
       
   184      array([ 73.55452504,  53.79828941,  62.83342759,  50.69806158,  63.17056881])
       
   185 
   118 
   186 As we can see from the result mean(L,0). The resultant sequence  
   119    totalmarks/len(L[0,:])
   187 is the mean marks of all students that gave the exam for the five subjects.
       
   188 
   120 
   189 and ::
   121 or simply ::
   190     
   122 
       
   123    mean(L[0,:])
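
As a quick check that both approaches agree, here is a sketch on a tiny stand-in array. The two rows below are illustrative values, not the contents of the real file, and NumPy is assumed to be imported as ``np``.

```python
import numpy as np

# Stand-in for the loaded data: two students, five subjects each.
L = np.array([[83., 42., 47., 0., 72.],
              [70., 55., 60., 48., 66.]])

first = L[0, :]        # marks of the first student
totalmarks = first.sum()

print(totalmarks)               # 244.0
print(totalmarks / len(first))  # mean computed by hand
print(np.mean(first))           # mean computed by NumPy
```

Because ``loadtxt`` returns floating point values by default, the division here does not need an explicit conversion to float.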
       
   124 
       
   125 But with such a large data set, calculating the mean for each student
       
   126 one by one is impractical. Is there a way to reduce the work?
       
   127 
       
   128 For this we will look into the documentation of mean by doing::
       
   129 
       
   130     mean?
       
   131 
       
   132 As we know, L is a two-dimensional array. We can calculate the mean
       
   133 along each axis of the array. The axis of rows is referred to by the
       
   134 number 0 and the axis of columns by 1. So to calculate the mean across all
       
   135 columns, that is, for each student, we pass the extra parameter 1 for the axis ::
       
   136 
   191     mean(L,1)
   137     mean(L,1)
   192 
   138 
   193     
   139 L here is the two dimensional array.
   194 is the average cumulative marks of individual students. Clearly, mean(L,0)
       
   195 was a row wise calculation while mean(L,1) was a column wise calculation.
       
   196 
   140 
       
   141 Similarly, the average marks scored by all the students in each
       
   142 subject can be calculated using ::
       
   143 
       
   144    mean(L,0)
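
The effect of the axis argument is easiest to see from the shapes of the results. This sketch again uses a small illustrative array in place of the real data and assumes NumPy is imported as ``np``; the names ``per_student`` and ``per_subject`` are chosen only for this example.

```python
import numpy as np

# Stand-in for the loaded data: two students (rows), five subjects (columns).
L = np.array([[83., 42., 47., 0., 72.],
              [70., 55., 60., 48., 66.]])

per_student = np.mean(L, 1)  # one mean per row
per_subject = np.mean(L, 0)  # one mean per column

print(per_student.shape)  # (2,)
print(per_subject.shape)  # (5,)
```

With axis 1 there is one value for every student, and with axis 0 there is one value for every subject.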
       
   145 
       
   146 Next, let's calculate the median of English marks for all the students.
       
   147 We can access the English marks of all students using ::
       
   148 
       
   149    L[:,0]
       
   150    
       
   151 To get the median we will do ::
       
   152 
       
   153    median(L[:,0])
       
   154 
       
   155 For all the subjects we can use the same syntax as mean and calculate
       
   156 the median across all rows using ::
       
   157 
       
   158        median(L,0)
       
   159   
       
   160 
       
   161 Similarly, to calculate the standard deviation for English we can do::
       
   162 
       
   163 	  std(L[:,0])
       
   164 
       
   165 and for all rows::
       
   166 
       
   167     std(L,0)
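
Median and standard deviation can be tried out the same way. The sketch below uses three illustrative rows rather than the real file and assumes NumPy is imported as ``np``. Note that NumPy's ``std`` computes the population standard deviation by default (``ddof=0``); passing ``ddof=1`` gives the sample standard deviation instead.

```python
import numpy as np

# Stand-in for the loaded data: three students, five subjects each.
L = np.array([[83., 42., 47., 0., 72.],
              [70., 55., 60., 48., 66.],
              [90., 61., 58., 75., 80.]])

english = L[:, 0]  # first column: English marks of all students

print(np.median(english))       # 83.0
print(np.median(L, 0))          # median of every subject
print(np.std(english))          # population standard deviation (ddof=0)
print(np.std(english, ddof=1))  # sample standard deviation
print(np.std(L, 0).shape)       # (5,)
```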
       
   168 
       
   169 Following is an exercise that you must do. 
       
   170 
       
   171 %% %% In the given file football.txt at path /home/fossee/football.txt, the first column is the player name, the second is goals at home and the third is goals away.
       
   172    1. Find the total goals for each player
       
   173    2. Find the mean of home and away goals
       
   174    3. Find the standard deviation of home and away goals
   197 
   175 
   198 {{{ Show summary slide }}}
   176 {{{ Show summary slide }}}
   199 
   177 
   200 This brings us to the end of the tutorial.
   178 This brings us to the end of the tutorial.
   201 We have learnt
   179 We have learnt