statistics/script.rst
author Puneeth Chaganti <punchagan@fossee.in>
Tue, 19 Oct 2010 14:26:02 +0530
changeset 337 c65d0d9fc0c8
parent 321 2e49b1b72996
child 349 9ced58c5c3b6
permissions -rw-r--r--
Reviewed Basic datatypes LO.
321
2e49b1b72996 adding questions for all other LO needs to be cleaned
amit
parents:
diff changeset
Hello friends and welcome to the tutorial on statistics using Python.

{{{ Show the slide containing title }}}

{{{ Show the outline slide }}}

In this tutorial, we shall learn how to

 * Do simple statistical operations in Python
 * Apply these to real-world problems

You will need IPython with pylab running on your computer
to use this tutorial.

You will also need to know how to load data using loadtxt to be
able to follow the real-world application.

We will start with the most basic statistical
operation, i.e. finding the mean.

We have a list of the ages of a random group of people ::

    age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

One way of getting the mean is to take the sum of
all the elements and divide it by the length of the list ::

    sum_age_list = sum(age_list)

The sum function gives us the sum of the elements ::

    mean_using_sum = float(sum_age_list) / len(age_list)

This gives the mean age, but pylab also provides a function
that computes the mean directly. This is the mean function ::

    mean(age_list)

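As a quick check, here is a minimal sketch showing that the two approaches agree. It assumes NumPy is installed and calls ``np.mean`` explicitly, which is the function pylab makes available as mean; the name ``mean_using_numpy`` is ours and is not part of the tutorial.

```python
import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Mean computed by hand: sum of the elements divided by their count
mean_using_sum = float(sum(age_list)) / len(age_list)

# Mean computed by NumPy's mean function
mean_using_numpy = np.mean(age_list)

print(mean_using_sum)    # 32.4
print(mean_using_numpy)  # 32.4
```

Outside an IPython session started with pylab, the ``np.`` prefix is needed because mean is not a Python builtin.
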
The mean function can be used in more ways with two-dimensional lists.
Take a two-dimensional list ::

    two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

Used in its default manner, the mean function gives the mean of the
flattened sequence. The flattened sequence is the two lists taken
as if they were a single list of elements ::

    mean(two_dimension)
    flattened_seq = [1, 5, 6, 8, 1, 3, 4, 5]
    mean(flattened_seq)

As you can see, both results are the same. The other way is to take
the mean of each column ::

    mean(two_dimension, 0)
    array([ 1. ,  4. ,  5. ,  6.5])

We pass an extra argument, 0, in that case.

To get the mean along the rows, the argument is 1 ::

    mean(two_dimension, 1)
    array([ 5.  ,  3.25])

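The three ways of calling mean on a two-dimensional list can be compared side by side. This is a minimal sketch assuming NumPy is installed; it passes the extra argument by its keyword name ``axis``, which is what NumPy calls it, and the names ``overall``, ``per_column`` and ``per_row`` are ours.

```python
import numpy as np

two_dimension = np.array([[1, 5, 6, 8], [1, 3, 4, 5]])

# No axis: mean of all eight elements, as if the list were flattened
overall = np.mean(two_dimension)

# axis=0: one mean per column (the rows are averaged together)
per_column = np.mean(two_dimension, axis=0)

# axis=1: one mean per row (the columns are averaged together)
per_row = np.mean(two_dimension, axis=1)

print(overall)  # 4.125
```

A handy way to remember the argument: the result has one value for every position along the axis that was not averaged away, so ``axis=0`` leaves four values here and ``axis=1`` leaves two.
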
We can see the other options that mean accepts by using ::

    mean?

Similarly, we can calculate the median and standard deviation of a list
using the functions median and std ::

    median(age_list)
    std(age_list)

Median and std can also be calculated for two-dimensional arrays along
columns and rows, just like mean.

For example ::

    median(two_dimension, 0)
    std(two_dimension, 1)

This gives us the median of each column and the standard deviation of each row.

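Here is a small sketch of median and std on the same data, assuming NumPy is installed. It may help to know that ``np.std`` divides by the number of elements by default (the population form); the ``ddof=1`` call is shown only for comparison and is not used in the tutorial. All variable names on the left-hand side are ours.

```python
import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]
two_dimension = np.array([[1, 5, 6, 8], [1, 3, 4, 5]])

median_age = np.median(age_list)           # middle value of the sorted ages
std_age = np.std(age_list)                 # divides by N (population form)
sample_std_age = np.std(age_list, ddof=1)  # divides by N - 1 (sample form)

column_medians = np.median(two_dimension, axis=0)  # one value per column
row_stds = np.std(two_dimension, axis=1)           # one value per row

print(median_age)  # 34.0
```

With an even number of elements, as in ``age_list``, the median is the average of the two middle values.
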
       
Now let's apply this to a real-world example.

We will use a data file located at the path
``/home/fossee/sslc2.txt``. It contains records of students and their
performance in one of the State Secondary Board Examinations. It has
180,000 lines of records. We are going to read and process this
data. We can see the contents of the file by double-clicking on it. It
might take some time to open since it is quite a large file. Please
don't edit the data. This file has a particular structure.

We can do ::

    cat /home/fossee/sslc2.txt

to check the contents of the file.

Each line in the file is a set of 11 fields separated
by semicolons. Consider a sample line from this file ::

    A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;

The following are the fields in any given line.

* Region Code, which is 'A'
* Roll Number, 015163
* Name, JOSEPH RAJ S
* Marks of 5 subjects:

  * English 083
  * Hindi 042
  * Maths 47
  * Science 00
  * Social 72

* Total marks, 244
* Two remaining fields, which are empty in this line

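To make the field layout concrete, the sample line can be taken apart with plain Python string methods. This is only an illustrative sketch; the tutorial itself reads the file with loadtxt, and the variable names here are ours.

```python
sample_line = "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;"

# The line ends with a semicolon, so split() yields one extra empty
# string at the end; dropping it leaves the 11 fields.
fields = sample_line.split(";")[:-1]

region = fields[0]
roll_number = fields[1]
name = fields[2]
marks = [int(m) for m in fields[3:8]]  # English, Hindi, Maths, Science, Social
total = int(fields[8])

print(len(fields))  # 11
print(marks)        # [83, 42, 47, 0, 72]
print(total)        # 244
```

Note that the five marks add up to the total field, which is a simple sanity check on the layout described above.
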
Now let's try to find the mean of the English marks of all students.

For this we do ::

    L = loadtxt('/home/fossee/sslc2.txt', usecols=(3,), delimiter=';')
    L
    mean(L)

The loadtxt function loads data from an external file. The delimiter
argument specifies the character that separates the fields of data.
usecols specifies the columns to be used. Columns are counted from
zero, so the English marks are in column 3 and we pass (3,). The comma
is needed because usecols expects a sequence, and (3,) is a
one-element tuple.

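If the data file is not at hand, the same loadtxt call can be tried on a few lines held in memory. This is a sketch under stated assumptions: NumPy is installed, the first line is the sample line from this tutorial, and the other two lines are made up for illustration in the same layout. It relies on loadtxt accepting a file-like object in place of a path.

```python
import io
import numpy as np

# First line: the tutorial's sample. Other two lines: made-up records.
sample_text = (
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "A;000001;STUDENT ONE;060;050;40;30;70;250;;;\n"
    "B;000002;STUDENT TWO;070;061;55;45;62;293;;;\n"
)

# Read only column 3 (English); columns are counted from zero.
english = np.loadtxt(io.StringIO(sample_text), usecols=(3,), delimiter=';')

print(english)           # [83. 60. 70.]
print(np.mean(english))  # 71.0
```

Only the selected column is converted to numbers, which is why the name column does not get in the way.
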
To get the median marks ::

    median(L)

And for the standard deviation ::

    std(L)

Now let's try to get the mean for all the subjects ::

    L = loadtxt('/home/fossee/sslc2.txt', usecols=(3, 4, 5, 6, 7), delimiter=';')
    mean(L, 0)
    array([ 73.55452504,  53.79828941,  62.83342759,  50.69806158,  63.17056881])

As we can see, the result of mean(L,0) is a sequence holding the mean
marks of all the students who took the exam, for each of the five subjects,

and ::

    mean(L, 1)

is the average mark of each individual student across the five subjects.
Clearly, mean(L,0) was a column-wise calculation while mean(L,1) was a
row-wise calculation.

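The difference between the two calls is easiest to see from the shape of the result. The sketch below assumes NumPy is installed and uses a tiny hypothetical marks table: the first row comes from the sample line in this tutorial and the other two rows are made up. The names ``subject_means`` and ``student_means`` are ours.

```python
import numpy as np

# One row per student, one column per subject
# (English, Hindi, Maths, Science, Social). Rows 2 and 3 are made up.
marks = np.array([
    [83, 42, 47, 0, 72],
    [60, 50, 40, 30, 70],
    [70, 61, 55, 45, 62],
])

subject_means = np.mean(marks, axis=0)  # averages down each column
student_means = np.mean(marks, axis=1)  # averages across each row

print(subject_means.shape)  # (5,)
print(student_means.shape)  # (3,)
```

With the full data file, the first result would still have five entries, one per subject, while the second would have one entry for every student in the file.
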
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

 * How to do the standard statistical operations sum, mean,
   median and standard deviation in Python.
 * How to combine text loading and the statistical operations to solve
   real-world problems.

{{{ Show the "sponsored by FOSSEE" slide }}}

This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India.

Hope you have enjoyed it and found it useful.
Thank you

.. Author              : Amit Sethi
   Internal Reviewer 1 :
   Internal Reviewer 2 :
   External Reviewer   :