statistics.rst
author Puneeth Chaganti <punchagan@fossee.in>
Thu, 07 Oct 2010 12:28:12 +0530
Hello friends and welcome to the tutorial on statistics using Python.

{{{ Show the slide containing title }}}

{{{ Show the slide containing the outline slide }}}

In this tutorial, we shall learn

 * How to do simple statistical operations in Python
 * How to apply them to real-world problems

You will need IPython with pylab running on your computer
to use this tutorial.

You will also need to know how to load data using loadtxt to be
able to follow the real-world application.

We will start with the most basic statistical
operation, i.e. finding the mean.

We have a list of ages of a random group of people ::
   
    age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

One way of getting the mean is to take the sum of
all the elements and divide it by the length of the list. ::

    sum_age_list = sum(age_list)

The sum function gives us the sum of the elements. Dividing it by the
number of elements gives the mean; float keeps the division from being
truncated to an integer ::

    mean_using_sum = sum_age_list / float(len(age_list))

This gives the mean age, but pylab provides another
way of getting the mean. This is the mean function ::

    mean(age_list)
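As a minimal sketch, the two approaches above can be compared in a plain Python script. It assumes NumPy is installed and imported as ``np`` (an assumption; inside pylab the same function is available simply as ``mean``).

```python
import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Mean computed by hand: total divided by the number of elements.
# float() guards against integer division on Python 2.
mean_using_sum = sum(age_list) / float(len(age_list))

# The same value from NumPy's mean function.
mean_using_numpy = np.mean(age_list)

print(mean_using_sum)
print(mean_using_numpy)
```

Both values should agree, which is a quick way to check the hand-written version.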
   
The mean function can be used in more ways with two-dimensional lists.
Take a two-dimensional list ::

    two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

Used in its default manner, the mean function gives the mean of the
flattened sequence. A flattened sequence is the two lists taken
as if they were a single list of elements ::

    mean(two_dimension)
    flattened_seq = [1, 5, 6, 8, 1, 3, 4, 5]
    mean(flattened_seq)

As you can see, both results are the same. The other option is the mean
of each column ::

    mean(two_dimension, 0)
    array([ 1. ,  4. ,  5. ,  6.5])

or the mean along each of the two rows separately ::

    mean(two_dimension, 1)
    array([ 5.  ,  3.25])
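The three calls above can be gathered into one short script. This is only a sketch, assuming NumPy is imported as ``np``; the second positional argument of ``mean`` is written out by its keyword name ``axis`` to make the intent visible.

```python
import numpy as np

two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

# No axis: the mean of all eight elements taken as one flat sequence.
flat_mean = np.mean(two_dimension)

# axis=0: one mean per column.
column_means = np.mean(two_dimension, axis=0)

# axis=1: one mean per row.
row_means = np.mean(two_dimension, axis=1)

print(flat_mean)
print(column_means)
print(row_means)
```

The column result has four entries (one per column) and the row result has two (one per row), which is an easy way to remember which axis is which.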
   
We can see more options of mean using ::

    mean?

Similarly, we can calculate the median and standard deviation of a list
using the functions median and std ::

    median(age_list)
    std(age_list)
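The sketch below checks these two functions against a hand computation. It assumes NumPy is imported as ``np``. Note that ``np.std`` computes the population standard deviation by default (``ddof=0``); passing ``ddof=1`` gives the sample standard deviation instead.

```python
import math

import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Median: the middle value of the sorted data (average of the two
# middle values when the number of elements is even).
median_age = np.median(age_list)

# Population standard deviation (NumPy's default, ddof=0).
population_std = np.std(age_list)

# The same quantity computed by hand from its definition.
mean_age = sum(age_list) / float(len(age_list))
variance = sum((age - mean_age) ** 2 for age in age_list) / len(age_list)
manual_std = math.sqrt(variance)

# Sample standard deviation divides by (n - 1) instead of n.
sample_std = np.std(age_list, ddof=1)

print(median_age)
print(population_std)
print(sample_std)
```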
    
Now let's apply this to a real-world example.

We will use a data file located at
``/home/fossee/sslc2.txt``. It contains records of students and their
performance in one of the State Secondary Board Examinations. It has
180,000 lines of records. We are going to read and process this
data. We can see the contents of the file by double-clicking on it. It
might take some time to open since it is quite a large file. Please
don't edit the data. This file has a particular structure.

We can do ::

    cat /home/fossee/sslc2.txt

to check the contents of the file.

Each line in the file is a set of 11 fields separated
by semicolons. Consider a sample line from this file ::

    A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;

The following are the fields in any given line.

* Region Code, which is 'A'
* Roll Number 015163
* Name JOSEPH RAJ S
* Marks of 5 subjects:

  * English 083
  * Hindi 042
  * Maths 47
  * Science AA (Absent)
  * Social 72

* Total marks 244
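To make the field layout concrete, here is a small sketch that splits the sample line by hand, using only built-in string methods. The variable names are illustrative and not part of the tutorial. It assumes every mark is numeric, as in the sample line; a non-numeric marker such as ``AA`` would make ``int`` raise an error and would need separate handling.

```python
sample_line = "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;"

# Split the record on the semicolon delimiter.
fields = sample_line.split(";")

region_code = fields[0]
roll_number = fields[1]
name = fields[2]

# Fields 3 to 7 hold the marks of the five subjects, in order.
subject_marks = [int(mark) for mark in fields[3:8]]

# Field 8 holds the total.
total = int(fields[8])

print(name)
print(subject_marks)
print(sum(subject_marks) == total)
```

The position of each field in this list is the same zero-based column index that loadtxt expects in its usecols argument.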
   
Now let's try to find the mean of the English marks of all students.

For this we do ::

    L = loadtxt('/home/fossee/sslc2.txt', usecols=(3,), delimiter=';')
    L
    mean(L)

The loadtxt function loads data from an external file. delimiter specifies
the character that separates the fields of data.
usecols specifies the columns to be used, here (3,). Columns are counted
from 0, so column 3 is the English marks. The comma is added
because usecols is a sequence.
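The same call can be tried without the large data file. The sketch below assumes NumPy is imported as ``np`` and feeds loadtxt an in-memory file built with ``StringIO``. The first record is the sample line from this tutorial; the other two records are made up purely for illustration.

```python
from io import StringIO

import numpy as np

# First record is the tutorial's sample line; the other two are hypothetical.
records = (
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "A;015164;STUDENT TWO;070;050;60;55;65;300;;;\n"
    "B;015165;STUDENT THREE;090;060;75;80;85;390;;;\n"
)

# usecols=(3,) picks only the English marks; the text columns are never
# converted to numbers, so they cause no trouble.
english = np.loadtxt(StringIO(records), usecols=(3,), delimiter=";")

english_mean = np.mean(english)

print(english)
print(english_mean)
```

Because only one column is requested, the result is a one-dimensional array with one entry per record.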
   
To get the median marks ::

    median(L)

Standard deviation ::

    std(L)

Now let's try to get the mean for all the subjects ::

    L = loadtxt('/home/fossee/sslc2.txt', usecols=(3,4,5,6,7), delimiter=';')
    mean(L, 0)
    array([ 73.55452504,  53.79828941,  62.83342759,  50.69806158,  63.17056881])

As we can see, the result of mean(L,0) is a sequence holding, for each of
the five subjects, the mean marks of all students who took the exam.

and ::

    mean(L, 1)

is the average marks of each individual student across the five subjects.
Clearly, mean(L,0) averages down the rows, giving one value per column
(subject), while mean(L,1) averages across the columns, giving one value
per row (student).
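What the two axis values compute can also be written out in plain Python, without any library. This is only an illustrative sketch: the first row holds the marks from the sample line and the second row is made up.

```python
# One row per student, one column per subject.
# First row is from the sample line; the second row is hypothetical.
marks = [
    [83, 42, 47, 0, 72],
    [70, 50, 60, 55, 65],
]

# Equivalent of mean(marks, 0): average each column (subject) over all rows.
per_subject = [sum(column) / float(len(column)) for column in zip(*marks)]

# Equivalent of mean(marks, 1): average each row (student) over all columns.
per_student = [sum(row) / float(len(row)) for row in marks]

print(per_subject)
print(per_student)
```

Here ``per_subject`` has five entries (one per subject) and ``per_student`` has two (one per student), matching the shapes that mean(L,0) and mean(L,1) return.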
   
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

 * How to do the standard statistical operations sum, mean,
   median and standard deviation in Python.
 * How to combine text loading and the statistical operations to solve
   real-world problems.

{{{ Show the "sponsored by FOSSEE" slide }}}

This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India.

Hope you have enjoyed it and found it useful.
Thank you

.. Author              : Amit Sethi
   Internal Reviewer 1 :
   Internal Reviewer 2 :
   External Reviewer   :