statistics/script.rst
changeset 450 d49aee7ab1b9
parent 406 a534e9e79599
443:79a7ca3073d4 450:d49aee7ab1b9
    11 .. -------------
    11 .. -------------
    12 
    12 
    13 .. Getting started with IPython
    13 .. Getting started with IPython
    14 .. Loading Data from files
    14 .. Loading Data from files
    15 .. Getting started with Lists
    15 .. Getting started with Lists
       
    16 .. Accessing Pieces of Arrays
       
    17 
    16      
    18      
    17 .. Author              : Amit Sethi
    19 .. Author              : Amit Sethi
    18    Internal Reviewer   : Puneeth
    20    Internal Reviewer   : Puneeth
    19    External Reviewer   :
    21    External Reviewer   :
    20    Checklist OK?       : <put date stamp here, if OK> [2010-10-05]
    22    Checklist OK?       : <put date stamp here, if OK> [2010-10-05]
    26 {{{ Show the slide containing title }}}
    28 {{{ Show the slide containing title }}}
    27 
    29 
    28 {{{ Show the slide containing the outline slide }}}
    30 {{{ Show the slide containing the outline slide }}}
    29 
    31 
    30 In this tutorial, we shall learn
    32 In this tutorial, we shall learn
    31  * Doing simple statistical operations in Python  
    33  * Doing statistical operations in Python  
    32  * Applying these to real world problems 
    34    * Summing a set of numbers
       
    35    * Finding their mean
       
    36    * Finding their median
       
    37    * Finding their standard deviation
       
    38    
    33 
    39 
    34 
    40 
    35 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you
    41 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you
    36 .. to use a data file and load data from that. that is good, since you
    42 .. to use a data file and load data from that. that is good, since you
    37 .. would get to deal with arrays, instead of lists. 
    43 .. would get to deal with arrays, instead of lists. 
    43 .. The idea of separating the main problem and giving toy examples
    49 .. The idea of separating the main problem and giving toy examples
    44 .. doesn't sound good. Use the same problem to explain stuff. Or use a
    50 .. doesn't sound good. Use the same problem to explain stuff. Or use a
    45 .. smaller data-set or something. Using lists doesn't seem natural.]
    51 .. smaller data-set or something. Using lists doesn't seem natural.]
    46 
    52 
    47 
    53 
    48 We will first start with the most necessary statistical operation i.e
    54 For this tutorial we will use a data file located at the path
    49 finding mean.
    55 ``/home/fossee/sslc2.txt``.  It contains records of students and their
    50 
    56 performance in one of the State Secondary Board Examinations. It has
    51 We have a list of ages of a random group of people ::
    57 about 180,000 lines of records. We are going to read and process this
    52    
    58 data.  We can see the contents of the file by double clicking on it. It
    53    age_list = [4,45,23,34,34,38,65,42,32,7]
    59 might take some time to open since it is quite a large file.  Please
    54 
    60 don't edit the data.  This file has a particular structure.
    55 One way of getting the mean could be getting sum of all the ages and
       
    56 dividing by the number of people in the group. ::
       
    57 
       
    58     sum_age_list = sum(age_list)
       
    59 
       
    60 sum function gives us the sum of the elements. Note that the
       
    61 ``sum_age_list`` variable is an integer and the number of people or
       
    62 length of the list is also an integer. We will need to convert one of
       
    63 them to a float before carrying out the division. ::
       
    64 
       
    65     mean_using_sum = float(sum_age_list)/len(age_list)
       
    66 
       
    67 This obviously gives the mean age but there is a simpler way to do
       
    68 this in Python - using the mean function::
       
    69 
       
    70        mean(age_list)
       
    71 
       
    72 Mean can be used in more ways in case of 2 dimensional lists.  Take a
       
    73 two dimensional list ::
       
    74      
       
    75      two_dimension=[[1,5,6,8],[1,3,4,5]]
       
    76 
       
    77 The mean function by default gives the mean of the flattened sequence.
       
    78 A Flattened sequence means a list obtained by concatenating all the
       
    79 smaller lists into a large long list. In this case, the list obtained
       
    80 by writing the two lists one after the other. ::
       
    81 
       
    82     mean(two_dimension)
       
    83     flattened_seq=[1,5,6,8,1,3,4,5]
       
    84     mean(flattened_seq)
       
    85 
       
    86 As you can see both the results are same. ``mean`` function can also
       
    87 give us the mean of each column, or the mean of corresponding elements
       
    88 in the smaller lists. ::
       
    89    
       
    90    mean(two_dimension, 0)
       
    91    array([ 1. ,  4. ,  5. ,  6.5])
       
    92 
       
    93 we pass an extra argument 0 in that case.
       
    94 
       
    95 If we use an argument 1, we obtain the mean along the rows. ::
       
    96    
       
    97    mean(two_dimension, 1)
       
    98    array([ 5.  ,  3.25])
       
    99 
       
   100 We can see more options of mean using ::
       
   101    
       
   102    mean?
       
   103 
       
   104 Similarly we can calculate median and standard deviation of a list
       
   105 using the functions median and std::
       
   106       
       
   107       median(age_list)
       
   108       std(age_list)
       
   109 
       
   110 Median and std can also be calculated for two dimensional arrays along
       
   111 columns and rows just like mean.
       
   112 
       
   113 For example ::
       
   114        
       
   115        median(two_dimension, 0)
       
   116        std(two_dimension, 1)
       
   117 
       
   118 This gives us the median along the columns and standard deviation along
       
   119 the rows.
       
   120        
       
   121 Now let's apply this to a real world example 
       
   122     
       
   123 We will use a data file that is at the path ``/home/fossee/sslc2.txt``.
       
   124 It contains record of students and their performance in one of the
       
   125 State Secondary Board Examination. It has 180,000 lines of records. We
       
   126 are going to read it and process this data.  We can see the content of
       
   127 file by double clicking on it. It might take some time to open since
       
   128 it is quite a large file.  Please don't edit the data.  This file has
       
   129 a particular structure.
       
   130 
    61 
   131 We can do ::
    62 We can do ::
   132    
    63    
   133    cat /home/fossee/sslc2.txt
    64    cat /home/fossee/sslc2.txt
   134 
    65 
   135 to check the contents of the file.
    66 to check the contents of the file.
       
    67 
       
    68 
       
    69 {{{ Show the data structure on a slide }}}
   136 
    70 
   137 Each line in the file is a set of 11 fields separated 
    71 Each line in the file is a set of 11 fields separated 
   138 by semi-colons. Consider a sample line from this file.  
    72 by semi-colons. Consider a sample line from this file.  
   139 A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;; 
    73 A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;; 
   140 
    74 
   145 * Marks of 5 subjects: ** English 083 ** Hindi 042 ** Maths 47 **
    79 * Marks of 5 subjects: ** English 083 ** Hindi 042 ** Maths 47 **
   146 Science 00 ** Social 72
    80 Science 00 ** Social 72
   147 * Total marks 244
    81 * Total marks 244
   148 
    82 
   149 
    83 
   150 Now let's try to find the mean of English marks of all students.
    84 Let's try to load this data as an array and then run various functions on
       
    85 it.
   151 
    86 
   152 For this we do. ::
    87 To get the data as an array we do. ::
   153 
    88    
   154      L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';')
    89      L=loadtxt('/home/amit/sslc2.txt',usecols=(3,4,5,6,7,),delimiter=';')
   155      L
    90      L
   156      mean(L)
    91      
   157 
    92 
   158 The loadtxt function loads data from an external file. delimiter specifies
    93 The loadtxt function loads data from an external file. delimiter specifies
   159 the character that separates the fields of data.
    94 the character that separates the fields of data.  usecols
   160 usecols specifies the columns to be used, so (3,). The comma is added
    95 specifies the columns to be used, so (3,4,5,6,7) loads those
   161 because usecols is a sequence.
    96 columns. The comma is added because usecols is a sequence.
   162 
    97 
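The ``loadtxt`` call described above can be tried without the real data file. The sketch below is only an illustration: it reads two records from an in-memory string instead of ``/home/fossee/sslc2.txt``, the second record is invented, and the name ``marks`` and the explicit ``np.`` prefix (a plain ``import numpy`` rather than the pylab names used in this script) are assumptions.

```python
from io import StringIO

import numpy as np

# Two records in the same semicolon-separated layout as the data file.
# The first is the sample line quoted in the script; the second is invented.
sample = StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "A;000002;SAMPLE STUDENT;060;055;70;65;50;300;;;\n"
)

# Columns 3 to 7 hold the marks of the five subjects.
marks = np.loadtxt(sample, usecols=(3, 4, 5, 6, 7), delimiter=";")

print(marks.shape)  # one row per student, one column per subject
print(marks)
```

Only the selected columns are converted to numbers, so the text fields (region code, name) and the empty trailing fields do not get in the way.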
   163 To get the median marks. ::
    98 As we can see L is an array. We can get the shape of this array using::
   164    
    99    
   165     median(L)
   100    L.shape
       
   101    (185667, 5)
       
   102 
       
   103 Let's start applying statistical operations on these. We will start with
       
   104 the most basic, summing. How do you find the sum of marks of all
       
   105 subjects for the first student?
       
   106 
       
   107 From our knowledge of accessing pieces of arrays, to access
       
   108 the first row we will do ::
   166    
   109    
   167 Standard deviation. ::
   110    L[0,:]
   168 	
       
   169     std(L)
       
   170 
   111 
       
   112 Now to sum this we can say ::
   171 
   113 
   172 Now let's try and get the mean for all the subjects ::
   114     totalmarks=sum(L[0,:]) 
       
   115     totalmarks
   173 
   116 
   174      L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
   117 To get the mean we can do ::
   175      mean(L,0)
       
   176      array([ 73.55452504,  53.79828941,  62.83342759,  50.69806158,  63.17056881])
       
   177 
   118 
   178 As we can see from the result mean(L,0). The resultant sequence  
   119    totalmarks/len(L[0,:])
   179 is the mean marks of all students that gave the exam for the five subjects.
       
   180 
   120 
   181 and ::
   121 or simply ::
   182     
   122 
       
   123    mean(L[0,:])
       
   124 
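The three steps above (take the first row, sum it, divide by the number of subjects) can be checked against ``mean`` on a small array. This is a minimal sketch with invented marks for three students; the array ``marks`` stands in for ``L`` and the ``np.`` prefix assumes a plain ``import numpy``.

```python
import numpy as np

# Hypothetical marks of three students in five subjects (stand-in for L).
marks = np.array([[83, 42, 47, 0, 72],
                  [60, 55, 70, 65, 50],
                  [90, 80, 70, 60, 50]], dtype=float)

first = marks[0, :]                 # marks of the first student
total = first.sum()                 # sum over the five subjects
mean_by_hand = total / len(first)   # divide by the number of subjects

# The built-in mean gives the same value as the manual calculation.
assert np.isclose(mean_by_hand, np.mean(first))
print(total, mean_by_hand)
```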
       
   125 But with such a large data set, calculating the mean for each student
       
   126 one by one is impractical. Is there a way to reduce the work?
       
   127 
       
   128 For this we will look into the documentation of mean by doing::
       
   129 
       
   130     mean?
       
   131 
       
   132 As we know, L is a two dimensional array. We can calculate the mean
       
   133 along each axis of the array. Axis 0 runs down the rows and axis 1
       
   134 runs across the columns. So to calculate the mean across the columns of
       
   135 each row, that is per student, we pass the extra parameter 1 for the axis ::
       
   136 
   183     mean(L,1)
   137     mean(L,1)
   184 
   138 
   185     
   139 L here is the two dimensional array.
   186 is the average cumulative marks of individual students. Clearly, mean(L,0)
       
   187 was a row wise calculation while mean(L,1) was a column wise calculation.
       
   188 
   140 
       
   141 Similarly, the average marks scored by all the students for each
       
   142 subject can be calculated using ::
       
   143 
       
   144    mean(L,0)
       
   145 
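The effect of the axis argument is easiest to see from the shapes of the results. The following sketch uses an invented 3 by 5 array in place of ``L``; the names ``per_student`` and ``per_subject`` are illustrative only.

```python
import numpy as np

# Hypothetical marks: 3 students (rows) by 5 subjects (columns).
marks = np.array([[83, 42, 47, 0, 72],
                  [60, 55, 70, 65, 50],
                  [90, 80, 70, 60, 50]], dtype=float)

# Axis 1: average across the columns of each row, one value per student.
per_student = np.mean(marks, 1)

# Axis 0: average down the rows of each column, one value per subject.
per_subject = np.mean(marks, 0)

print(per_student.shape)  # (3,)
print(per_subject.shape)  # (5,)
```

With the full data set the same calls would give one mean per student and five subject means respectively.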
       
   146 Next, let's calculate the median of English marks for all the students.
       
   147 We can access English marks of all students using ::
       
   148 
       
   149    L[:,0]
       
   150    
       
   151 To get the median we will do ::
       
   152 
       
   153    median(L[:,0])
       
   154 
       
   155 For all the subjects we can use the same syntax as mean and calculate
       
   156 median across all rows using ::
       
   157 
       
   158        median(L,0)
       
   159   
       
   160 
       
   161 Similarly to calculate standard deviation for English we can do::
       
   162 
       
   163 	  std(L[:,0])
       
   164 
       
   165 and for all rows::
       
   166 
       
   167     std(L,0)
       
   168 
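Median and standard deviation follow the same slicing and axis pattern as the mean. The sketch below uses an invented array in place of ``L`` and assumes a plain ``import numpy``. Note that ``np.std`` divides by the number of values by default; passing ``ddof=1`` gives the sample estimate instead.

```python
import numpy as np

# Hypothetical marks: 3 students (rows) by 5 subjects (columns).
marks = np.array([[83, 42, 47, 0, 72],
                  [60, 55, 70, 65, 50],
                  [90, 80, 70, 60, 50]], dtype=float)

english = marks[:, 0]                   # first column: English marks

median_english = np.median(english)     # median for one subject
median_all = np.median(marks, 0)        # median for every subject

std_english = np.std(english)           # population standard deviation
std_all = np.std(marks, 0)              # one value per subject
std_sample = np.std(english, ddof=1)    # sample standard deviation

print(median_english, median_all)
print(std_english, std_all)
```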
       
   169 Following is an exercise that you must do. 
       
   170 
       
   171 %% %% In the given file football.txt at path /home/fossee/football.txt, one column is the player name, the second is goals at home and the third is goals away.
       
   172    1. Find the total goals for each player
       
   173    2. Find the mean of home and away goals
       
   174    3. Find the standard deviation of home and away goals
   189 
   175 
   190 {{{ Show summary slide }}}
   176 {{{ Show summary slide }}}
   191 
   177 
   192 This brings us to the end of the tutorial.
   178 This brings us to the end of the tutorial.
   193 we have learnt
   179 we have learnt