author | anand |
Mon, 08 Nov 2010 01:36:47 +0530 | |
changeset 455 | f5b7d0b693d9 |
parent 383 | 4a6d548d4369 |
child 406 | a534e9e79599 |
permissions | -rw-r--r-- |
.. Objectives
.. ----------

.. By the end of this tutorial you will --

.. 1. Get to know simple statistics functions like mean, std etc. (Remembering)
.. #. Apply them on a real world example. (Applying)

.. Prerequisites
.. -------------

.. Getting started with IPython
.. Loading Data from files
.. Getting started with Lists

.. Author              : Puneeth
   Internal Reviewer   : Anoop Jacob Thomas<anoop@fossee.in>
   External Reviewer   :
   Checklist OK?       : <put date stamp here, if OK> [2010-10-05]

383
4a6d548d4369
Minor comments on Statistics.
Puneeth Chaganti <punchagan@fossee.in>
parents:
382
diff
changeset
|
.. #[punch; add slides, exercises!]

382
aa8ea9119476
Reviewed statistics script.
Puneeth Chaganti <punchagan@fossee.in>
parents:
362
diff
changeset
|
Hello friends, and welcome to the tutorial on Statistics using Python.

{{{ Show the slide containing title }}}

{{{ Show the slide containing the outline slide }}}

In this tutorial, we shall learn

 * Doing simple statistical operations in Python
 * Applying these to real world problems

.. #[punch: the prerequisites part may be skipped in the tutorial. It
.. will be provided separately.]

You will need IPython with pylab running on your computer to use this
tutorial.

You will also need to know how to load data using ``loadtxt`` to be
able to follow the real world application.

.. #[punch: since loadtxt is anyway a pre-req, I would recommend you
.. to use a data file and load data from that. that is good, since you
.. would get to deal with arrays, instead of lists.

.. Talking of rows and columns of 2-D lists etc is confusing. Also,
.. converting to float can be avoided. The tutorial will feel more
.. natural, is what I think.

.. The idea of separating the main problem and giving toy examples
.. doesn't sound good. Use the same problem to explain stuff. Or use a
.. smaller data-set or something. Using lists doesn't seem natural.]


We will start with the most basic statistical operation, finding the
mean.

We have a list of ages of a random group of people ::

    age_list = [4,45,23,34,34,38,65,42,32,7]

One way of getting the mean is to take the sum of all the ages and
divide it by the number of people in the group. ::

    sum_age_list = sum(age_list)

The ``sum`` function gives us the sum of the elements. Note that the
``sum_age_list`` variable is an integer and the number of people, or
length of the list, is also an integer. In Python 2, dividing one
integer by another discards the fractional part, so we need to convert
one of them to a float before carrying out the division. ::

    mean_using_sum = float(sum_age_list)/len(age_list)

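As a quick check of the two steps above, here is a minimal sketch in plain Python that does not need pylab. It uses the built-in ``sum`` rather than the one pylab provides, which is an assumption made only to keep the example self-contained. The explicit ``float`` conversion matters under Python 2; under Python 3 the ``/`` operator already returns a float, so the conversion is harmless there.

```python
# Ages taken from the tutorial's age_list.
age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Step 1: add up all the ages (built-in sum, not pylab's).
sum_age_list = sum(age_list)

# Step 2: divide by the number of people, forcing float division.
mean_using_sum = float(sum_age_list) / len(age_list)

print(sum_age_list)    # 324
print(mean_using_sum)  # 32.4
```
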
This gives the mean age, but there is a simpler way to do this in
IPython with pylab, using the ``mean`` function ::

    mean(age_list)

The ``mean`` function can be used in more ways with two dimensional
lists. Take a two dimensional list ::

    two_dimension=[[1,5,6,8],[1,3,4,5]]

By default, the ``mean`` function gives the mean of the flattened
sequence. A flattened sequence is the list obtained by concatenating
all the smaller lists into one long list. In this case, it is the list
obtained by writing the two lists one after the other. ::

    mean(two_dimension)
    flattened_seq=[1,5,6,8,1,3,4,5]
    mean(flattened_seq)

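The flattening described above can be sketched outside a pylab session as follows. This assumes the ``mean`` that pylab exposes is NumPy's ``numpy.mean``, imported here explicitly as ``np``; the list comprehension used to flatten the list is only for illustration.

```python
import numpy as np

two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

# Build the flattened sequence by writing the rows one after the other.
flattened_seq = [x for row in two_dimension for x in row]

# With no axis argument, the mean is taken over every element.
mean_2d = np.mean(two_dimension)
mean_flat = np.mean(flattened_seq)

print(flattened_seq)       # [1, 5, 6, 8, 1, 3, 4, 5]
print(mean_2d, mean_flat)  # 4.125 4.125
```
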
As you can see, both results are the same. The ``mean`` function can
also give us the mean of each column, that is, the mean of the
corresponding elements in the smaller lists. To get this, we pass an
extra argument 0. ::

    mean(two_dimension, 0)
    array([ 1. ,  4. ,  5. ,  6.5])

If we pass 1 as the argument instead, we obtain the mean of each
row. ::

    mean(two_dimension, 1)
    array([ 5.  ,  3.25])

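To make the meaning of the second argument concrete, here is a small sketch, again assuming pylab's ``mean`` is ``numpy.mean``. The keyword name ``axis`` is how NumPy names that second argument; the variable names ``column_means`` and ``row_means`` are illustrative.

```python
import numpy as np

# Two rows, four columns.
two_dimension = np.array([[1, 5, 6, 8],
                          [1, 3, 4, 5]])

# axis=0 averages down the rows, giving one value per column.
column_means = np.mean(two_dimension, axis=0)

# axis=1 averages across the columns, giving one value per row.
row_means = np.mean(two_dimension, axis=1)

print(two_dimension.shape)  # (2, 4)
print(column_means.shape)   # (4,)
print(row_means.shape)      # (2,)
```

The shape of each result is a handy reminder: the axis that is passed is the one that disappears.
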
We can see more options of ``mean`` using ::

    mean?

Similarly, we can calculate the median and standard deviation of a
list using the functions ``median`` and ``std`` ::

    median(age_list)
    std(age_list)

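A minimal sketch of these two calls outside pylab, assuming ``median`` and ``std`` are NumPy's functions of the same name. One detail worth knowing is that ``numpy.std`` divides by the number of elements by default (the population standard deviation); passing ``ddof=1`` gives the sample standard deviation instead.

```python
import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Middle value of the sorted ages (average of the two middle ones here).
median_age = np.median(age_list)

# Population standard deviation: divides by N (the default, ddof=0).
std_population = np.std(age_list)

# Sample standard deviation: divides by N - 1.
std_sample = np.std(age_list, ddof=1)

print(median_age)  # 34.0
print(std_population < std_sample)  # True
```
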
``median`` and ``std`` can also be calculated for two dimensional
arrays along columns and rows, just like ``mean``.

For example ::

    median(two_dimension, 0)
    std(two_dimension, 1)

This gives us the median of each column and the standard deviation of
each row.

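The same two calls can be tried on the small list used earlier. This is a sketch assuming NumPy's ``median`` and ``std``; the result names are illustrative.

```python
import numpy as np

two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

# One median per column (each column has only two values here,
# so the median equals their average).
column_medians = np.median(two_dimension, axis=0)

# One standard deviation per row.
row_stds = np.std(two_dimension, axis=1)

print(column_medians.shape)  # (4,)
print(row_stds.shape)        # (2,)
```
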
Now let's apply this to a real world example.

We will use a data file that is at the path
``/home/fossee/sslc2.txt``. It contains records of students and their
performance in one of the State Secondary Board Examinations. It has
180,000 lines of records. We are going to read it and process this
data. We can see the contents of the file by double clicking on it. It
might take some time to open since it is quite a large file. Please
don't edit the data. This file has a particular structure.

We can do ::

    cat /home/fossee/sslc2.txt

to check the contents of the file.

Each line in the file is a set of 11 fields separated by
semi-colons. Consider a sample line from this file. ::

    A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;

The following are the fields in any given line.

 * Region Code, which is 'A'
 * Roll Number 015163
 * Name JOSEPH RAJ S
 * Marks of 5 subjects:

   * English 083
   * Hindi 042
   * Maths 47
   * Science 00
   * Social 72

 * Total marks 244


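To see how the fields map to positions, the sample line can be split by hand. This is only an illustrative sketch in plain Python; the variable names are hypothetical and the tutorial itself reads the file with ``loadtxt`` instead.

```python
# The sample line quoted in the tutorial.
sample_line = "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;"

# Split on the semi-colon delimiter; positions are counted from 0.
fields = sample_line.split(";")

region, roll_number, name = fields[0], fields[1], fields[2]

# Positions 3 to 7 hold the marks of the five subjects.
marks = [int(m) for m in fields[3:8]]

# Position 8 holds the total.
total = int(fields[8])

print(name)                 # JOSEPH RAJ S
print(marks)                # [83, 42, 47, 0, 72]
print(sum(marks) == total)  # True
```

Counting positions this way shows why the English marks are selected later with column index 3.
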
Now let's try to find the mean of the English marks of all students.

For this we do ::

    L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';')
    L
    mean(L)

The ``loadtxt`` function loads data from an external file.
``delimiter`` specifies the character that separates the fields of
data. ``usecols`` specifies the columns to be used, here ``(3,)``,
since columns are counted from 0 and the English marks are in the
fourth field. The comma is added because ``usecols`` is a sequence,
and a one-element tuple needs a trailing comma.

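If the data file is not at hand, the same call can be tried on a few in-memory lines. In this sketch the first record is the sample line from the tutorial, while the other two records are made up purely for illustration; ``io.StringIO`` stands in for the real file, and ``np.loadtxt`` is assumed to be the function pylab exposes as ``loadtxt``.

```python
import io
import numpy as np

# First line is the tutorial's sample; the other two are hypothetical.
records = "\n".join([
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;",
    "A;015164;STUDENT TWO;060;050;55;40;65;270;;;",
    "B;015165;STUDENT THREE;070;045;60;50;75;300;;;",
])

# Read only column 3 (English marks); other columns are never parsed,
# so the text in the name column does not cause trouble.
L = np.loadtxt(io.StringIO(records), usecols=(3,), delimiter=";")

english_mean = np.mean(L)

print(L.shape)       # (3,)
print(english_mean)  # 71.0
```
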
To get the median marks ::

    median(L)

And the standard deviation ::

    std(L)


Now let's try to get the mean for all the subjects ::

    L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
    mean(L,0)
    array([ 73.55452504,  53.79828941,  62.83342759,  50.69806158,  63.17056881])

The result of ``mean(L,0)`` is a sequence containing, for each of the
five subjects, the mean marks of all the students who took the exam.

Similarly ::

    mean(L,1)

gives the average marks of each individual student across the five
subjects. ``mean(L,0)`` is a column wise calculation, giving one value
per subject, while ``mean(L,1)`` is a row wise calculation, giving one
value per student.


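The difference between the two calls is easiest to see from the shapes of the results. The sketch below uses three in-memory records, of which only the first comes from the tutorial and the other two are invented for illustration; ``np.loadtxt`` and ``np.mean`` are assumed to be the functions pylab provides.

```python
import io
import numpy as np

# First line is the tutorial's sample; the other two are hypothetical.
records = "\n".join([
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;",
    "A;015164;STUDENT TWO;060;050;55;40;65;270;;;",
    "B;015165;STUDENT THREE;070;045;60;50;75;300;;;",
])

# Columns 3 to 7 hold the marks of the five subjects.
L = np.loadtxt(io.StringIO(records), usecols=(3, 4, 5, 6, 7), delimiter=";")

per_subject = np.mean(L, 0)  # one value per column (subject)
per_student = np.mean(L, 1)  # one value per row (student)

print(L.shape)            # (3, 5)
print(per_subject.shape)  # (5,)
print(per_student.shape)  # (3,)
```

With three students and five subjects, ``mean(L,0)`` returns five numbers and ``mean(L,1)`` returns three.
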
{{{ Show summary slide }}}

This brings us to the end of the tutorial. We have learnt

 * How to do the standard statistical operations sum, mean,
   median and standard deviation in Python.
 * How to combine text loading and the statistical operations to solve
   real world problems.

{{{ Show the "sponsored by FOSSEE" slide }}}

This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India.

Hope you have enjoyed it and found it useful.

Thank you!