statistics/script.rst
changeset 382 aa8ea9119476
parent 362 a77a27916f81
child 383 4a6d548d4369
381:5415cb1bb4af 382:aa8ea9119476
    17 .. Author              : Puneeth 
    17 .. Author              : Puneeth 
    18    Internal Reviewer   : Anoop Jacob Thomas<anoop@fossee.in>
    18    Internal Reviewer   : Anoop Jacob Thomas<anoop@fossee.in>
    19    External Reviewer   :
    19    External Reviewer   :
    20    Checklist OK?       : <put date stamp here, if OK> [2010-10-05]
    20    Checklist OK?       : <put date stamp here, if OK> [2010-10-05]
    21 
    21 
    22 Hello friends and welcome to the tutorial on statistics using Python
    22 Hello friends and welcome to the tutorial on Statistics using Python
    23 
    23 
    24 {{{ Show the slide containing title }}}
    24 {{{ Show the slide containing title }}}
    25 
    25 
    26 {{{ Show the slide containing the outline slide }}}
    26 {{{ Show the slide containing the outline slide }}}
    27 
    27 
    28 In this tutorial, we shall learn
    28 In this tutorial, we shall learn
    29  * Doing simple statistical operations in Python  
    29  * Doing simple statistical operations in Python  
    30  * Applying these to real world problems 
    30  * Applying these to real world problems 
    31 
    31 
    32 You will need IPython with pylab running on your computer
    32 .. #[punch: the prerequisites part may be skipped in the tutorial. It
    33 to use this tutorial.
    33 .. will be provided separately.]
    34 
    34 
    35 Also you will need to know about loading data using loadtxt to be 
    35 You will need IPython with pylab running on your computer to use this
    36 able to follow the real world application.
    36 tutorial.
    37 
    37 
    38 We will first start with the most necessary statistical 
    38 Also you will need to know about loading data using loadtxt to be able
    39 operation, i.e., finding the mean.
    39 to follow the real world application.
       
    40 
       
    41 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you
       
    42 .. to use a data file and load data from that. that is good, since you
       
    43 .. would get to deal with arrays, instead of lists. 
       
    44 
       
    45 .. Talking of rows and columns of 2-D lists etc is confusing. Also,
       
    46 .. converting to float can be avoided. The tutorial will feel more
       
    47 .. natural, is what I think. 
       
    48 
       
    49 .. The idea of separating the main problem and giving toy examples
       
    50 .. doesn't sound good. Use the same problem to explain stuff. Or use a
       
    51 .. smaller data-set or something. Using lists doesn't seem natural.]
       
    52 
       
    53 
       
    54 We will first start with the most necessary statistical operation, i.e.,
       
    55 finding the mean.
    40 
    56 
    41 We have a list of ages of a random group of people ::
    57 We have a list of ages of a random group of people ::
    42    
    58    
    43    age_list=[4,45,23,34,34,38,65,42,32,7]
    59    age_list = [4,45,23,34,34,38,65,42,32,7]
    44 
    60 
    45 One way of getting the mean could be getting the sum of 
    61 One way of getting the mean could be getting the sum of all the ages and
    46 all the elements and dividing by length of the list.::
    62 dividing by the number of people in the group. ::
    47 
    63 
    48     sum_age_list =sum(age_list)
    64     sum_age_list = sum(age_list)
    49 
    65 
    50 The sum function gives us the sum of the elements.::
    66 The sum function gives us the sum of the elements. Note that the
    51 
    67 ``sum_age_list`` variable is an integer and the number of people or
    52     mean_using_sum=float(sum_age_list)/len(age_list)
    68 length of the list is also an integer. We will need to convert one of
    53 
    69 them to a float before carrying out the division. ::
    54 This obviously gives the mean age but python has another 
    70 
    55 method for getting the mean. This is the mean function::
    71     mean_using_sum = float(sum_age_list)/len(age_list)
       
    72 
       
    73 This obviously gives the mean age but there is a simpler way to do
       
    74 this in Python - using the mean function::
    56 
    75 
    57        mean(age_list)
    76        mean(age_list)
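
The two approaches above can be compared in a standalone script. This is a minimal sketch, assuming that the ``mean`` available under pylab is NumPy's ``mean``; the explicit ``import numpy as np`` and the name ``mean_using_numpy`` are additions for illustration and are not part of the tutorial session.

```python
import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Manual mean. The float() call matters on Python 2, where int / int
# truncates; on Python 3 the / operator already returns a float.
sum_age_list = sum(age_list)
mean_using_sum = float(sum_age_list) / len(age_list)

# Library mean (assumed to be the function pylab exposes as mean()).
mean_using_numpy = np.mean(age_list)

print(mean_using_sum, mean_using_numpy)
```

Both values should agree up to floating-point rounding.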
    58 
    77 
    59 Mean can be used in more ways in case of 2 dimensional lists.
    78 Mean can be used in more ways in case of 2 dimensional lists.  Take a
    60 Take a two dimensional list ::
    79 two dimensional list ::
    61      
    80      
    62      two_dimension=[[1,5,6,8],[1,3,4,5]]
    81      two_dimension=[[1,5,6,8],[1,3,4,5]]
    63 
    82 
    64 the mean function used in default manner will give the mean of the 
    83 The mean function by default gives the mean of the flattened sequence.
    65 flattened sequence. Flattened sequence means the two lists taken 
    84 A flattened sequence means a list obtained by concatenating all the
    66 as if it was a single list of elements ::
    85 smaller lists into one long list. In this case, it is the list obtained
       
    86 by writing the two lists one after the other. ::
    67 
    87 
    68     mean(two_dimension)
    88     mean(two_dimension)
    69     flattened_seq=[1,5,6,8,1,3,4,5]
    89     flattened_seq=[1,5,6,8,1,3,4,5]
    70     mean(flattened_seq)
    90     mean(flattened_seq)
    71 
    91 
    72 As you can see, both results are the same. The other way is the mean
    92 As you can see, both results are the same. The ``mean`` function can also
    73 of each column.::
    93 give us the mean of each column, or the mean of corresponding elements
    74    
    94 in the smaller lists. ::
    75    mean(two_dimension,0)
    95    
       
    96    mean(two_dimension, 0)
    76    array([ 1. ,  4. ,  5. ,  6.5])
    97    array([ 1. ,  4. ,  5. ,  6.5])
    77 
    98 
    78 We pass an extra argument 0 in that case.
    99 We pass an extra argument 0 in that case.
    79 
   100 
    80 In case of getting mean along the rows the argument is 1::
   101 If we use an argument 1, we obtain the mean along the rows. ::
    81    
   102    
    82    mean(two_dimension,1)
   103    mean(two_dimension, 1)
    83    array([ 5.  ,  3.25])
   104    array([ 5.  ,  3.25])
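
The three uses of ``mean`` on a two-dimensional list can be put side by side. This is a sketch under the assumption that NumPy is imported explicitly; the keyword form ``axis=0`` is equivalent to passing 0 as the second positional argument, and the variable names are illustrative only.

```python
import numpy as np

two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

# No axis: mean of the flattened sequence [1, 5, 6, 8, 1, 3, 4, 5].
overall_mean = np.mean(two_dimension)

# axis=0: mean of each column (corresponding elements of the inner lists).
column_means = np.mean(two_dimension, axis=0)

# axis=1: mean of each row (each inner list).
row_means = np.mean(two_dimension, axis=1)

print(overall_mean)
print(column_means)
print(row_means)
```

The column result has one entry per column (four here), and the row result has one entry per inner list (two here).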
    84 
   105 
    85 We can see more options of mean using ::
   106 We can see more options of mean using ::
    86    
   107    
    87    mean?
   108    mean?
    90 using the functions median and std::
   111 using the functions median and std::
    91       
   112       
    92       median(age_list)
   113       median(age_list)
    93       std(age_list)
   114       std(age_list)
    94 
   115 
    95 Median and std can also be calculated for two dimensional arrays along columns and rows just like mean.
   116 Median and std can also be calculated for two dimensional arrays along
    96 
   117 columns and rows just like mean.
    97        For example ::
   118 
       
   119 For example ::
    98        
   120        
    99        median(two_dimension,0)
   121        median(two_dimension, 0)
   100        std(two_dimension,1)
   122        std(two_dimension, 1)
   101 
   123 
   102 This gives us the median along the columns and standard deviation along the rows.
   124 This gives us the median along the columns and standard deviation along
       
   125 the rows.
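
The same axis convention can be checked for ``median`` and ``std``. This is a minimal sketch assuming NumPy's ``median`` and ``std`` are what pylab provides. Note that NumPy's ``std`` computes the population standard deviation by default (it divides by N, not N - 1).

```python
import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]
two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

# Median and standard deviation of a flat list.
median_age = np.median(age_list)
std_age = np.std(age_list)

# Median of each column and standard deviation of each row.
column_medians = np.median(two_dimension, axis=0)
row_stds = np.std(two_dimension, axis=1)

print(median_age, std_age)
print(column_medians)
print(row_stds)
```

With only two rows, the median of each column equals the mean of that column.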
   103        
   126        
   104 Now let's apply this to a real-world example.
   127 Now let's apply this to a real-world example.
   105     
   128     
   106 We will use a data file located at the path
   129 We will use a data file located at the path ``/home/fossee/sslc2.txt``.
   107 ``/home/fossee/sslc2.txt``. It contains records of students and their
   130 It contains records of students and their performance in one of the
   108 performance in one of the State Secondary Board Examinations. It has
   131 State Secondary Board Examinations. It has 180,000 lines of records. We
   109 180,000 lines of records. We are going to read it and process this
   132 are going to read it and process this data.  We can see the content of
   110 data.  We can see the content of file by double clicking on it. It
   133 file by double clicking on it. It might take some time to open since
   111 might take some time to open since it is quite a large file.  Please
   134 it is quite a large file.  Please don't edit the data.  This file has
   112 don't edit the data.  This file has a particular structure.
   135 a particular structure.
   113 
   136 
   114 We can do ::
   137 We can do ::
   115    
   138    
   116    cat /home/fossee/sslc2.txt
   139    cat /home/fossee/sslc2.txt
   117 
   140 
   126 * Roll Number 015163
   149 * Roll Number 015163
   127 * Name JOSEPH RAJ S
   150 * Name JOSEPH RAJ S
   128 * Marks of 5 subjects: ** English 083 ** Hindi 042 ** Maths 47 **
   151 * Marks of 5 subjects: ** English 083 ** Hindi 042 ** Maths 47 **
   129 Science 35 ** Social 72
   152 Science 35 ** Social 72
   130 * Total marks 244
   153 * Total marks 244
   131 *
   154 
   132 
   155 
   133 Now let's try to find the mean of English marks of all students.
   156 Now let's try to find the mean of English marks of all students.
   134 
   157 
   135 For this we do. ::
   158 For this we do. ::
   136 
   159 
   143 usecols specifies the columns to be used, here (3,). The trailing comma is
   166 usecols specifies the columns to be used, here (3,). The trailing comma is
   144 needed because usecols is a sequence, and (3,) is a one-element tuple.
   167 needed because usecols is a sequence, and (3,) is a one-element tuple.
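
To try the ``usecols`` and ``delimiter`` arguments without the full data file, a few records can be read from memory. This is only a sketch: the three rows below are made-up sample data, and their layout (semicolon-separated fields with the five subject marks in columns 3 to 7) is an assumption based on the ``usecols`` and ``delimiter`` values used in this script, not a copy of ``sslc2.txt``.

```python
from io import StringIO

import numpy as np

# Hypothetical records: code;roll number;name;five marks;total
sample_text = (
    "X;000001;STUDENT ONE;80;40;50;30;70;270\n"
    "X;000002;STUDENT TWO;60;50;70;40;80;300\n"
    "X;000003;STUDENT THREE;70;60;90;50;60;330\n"
)

# One column: usecols must be a sequence, hence the one-element tuple (3,).
english = np.loadtxt(StringIO(sample_text), usecols=(3,), delimiter=';')

# Five columns at once give a 2-D array with one row per student.
marks = np.loadtxt(StringIO(sample_text), usecols=(3, 4, 5, 6, 7),
                   delimiter=';')

# axis=0 averages over students, giving one mean per subject.
subject_means = np.mean(marks, axis=0)

print(np.mean(english), np.median(english), np.std(english))
print(subject_means)
```

Only the selected columns are converted to numbers, so the non-numeric name field does not cause an error here.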
   145 
   168 
   146 To get the median marks. ::
   169 To get the median marks. ::
   147    
   170    
   148    median(L)
   171     median(L)
   149    
   172    
   150 Standard deviation. ::
   173 Standard deviation. ::
   151 	
   174 	
   152 	std(L)
   175     std(L)
   153 
   176 
   154 
   177 
   155 Now let's try to get the mean for all the subjects ::
   178 Now let's try to get the mean for all the subjects ::
   156 
   179 
   157      L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
   180      L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
   185 
   208 
   186 This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India
   209 This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India
   187 
   210 
   188 Hope you have enjoyed it and found it useful.
   211 Hope you have enjoyed it and found it useful.
   189 
   212 
   190 Thankyou
   213 Thank you!
   191 
   214 
   192 .. Author              : Amit Sethi
       
   193    Internal Reviewer 1 : 
       
   194    Internal Reviewer 2 : 
       
   195    External Reviewer   :
       
   196