author | Puneeth Chaganti <punchagan@fossee.in> |
Thu, 11 Nov 2010 02:55:20 +0530 | |
changeset 459 | 68c324a9981c |
parent 406 | a534e9e79599 |
child 450 | d49aee7ab1b9 |
permissions | -rw-r--r-- |
.. Objectives
.. ----------

.. By the end of this tutorial you will --

.. 1. Get to know simple statistics functions like mean, std, etc. (Remembering)
.. #. Apply them on a real world example. (Applying)


.. Prerequisites
.. -------------

.. Getting started with IPython
.. Loading Data from files
.. Getting started with Lists

.. Author              : Amit Sethi
   Internal Reviewer   : Puneeth
   External Reviewer   :
   Checklist OK?       : <put date stamp here, if OK> [2010-10-05]

.. #[punch; add slides, exercises!]

Hello friends and welcome to the tutorial on Statistics using Python.

{{{ Show the slide containing title }}}

{{{ Show the slide containing the outline slide }}}

In this tutorial, we shall learn to

* Do simple statistical operations in Python
* Apply these to real world problems

.. #[punch: since loadtxt is anyway a pre-req, I would recommend you
.. to use a data file and load data from that. that is good, since you
.. would get to deal with arrays, instead of lists.

.. Talking of rows and columns of 2-D lists etc is confusing. Also,
.. converting to float can be avoided. The tutorial will feel more
.. natural, is what I think.

.. The idea of separating the main problem and giving toy examples
.. doesn't sound good. Use the same problem to explain stuff. Or use a
.. smaller data-set or something. Using lists doesn't seem natural.]


We will start with the most basic statistical operation, finding
the mean.

We have a list of ages of a random group of people ::

    age_list = [4,45,23,34,34,38,65,42,32,7]

One way of getting the mean is to add up all the ages and divide
the sum by the number of people in the group. ::

    sum_age_list = sum(age_list)

The ``sum`` function gives us the sum of the elements. Note that
``sum_age_list`` is an integer, and the number of people, that is the
length of the list, is also an integer. In Python 2, dividing one
integer by another discards the fractional part, so we need to convert
one of them to a float before carrying out the division. ::

    mean_using_sum = float(sum_age_list)/len(age_list)

This gives the mean age, but there is a simpler way to do
this in Python, using the ``mean`` function ::

    mean(age_list)

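The two approaches above can be compared in a standalone script. This is only a minimal sketch: it assumes NumPy is installed and imports it explicitly as ``np``, because the bare name ``mean`` is available only inside a pylab-style IPython session.

```python
import numpy as np  # assumption: NumPy is installed; the tutorial gets mean from pylab

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Mean by hand: total of the ages divided by the number of people.
# float() keeps the fractional part even where int / int would truncate.
mean_using_sum = float(sum(age_list)) / len(age_list)

# Mean using NumPy's mean function.
mean_using_numpy = np.mean(age_list)

print(mean_using_sum)
print(mean_using_numpy)
```

Both lines should print the same value, since ``np.mean`` performs the same sum-and-divide calculation.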
The ``mean`` function can be used in more ways with two dimensional
lists. Take a two dimensional list ::

    two_dimension=[[1,5,6,8],[1,3,4,5]]

By default, the ``mean`` function gives the mean of the flattened
sequence. A flattened sequence is the list obtained by concatenating
all the smaller lists into one long list. In this case, it is the list
obtained by writing the two lists one after the other. ::

    mean(two_dimension)
    flattened_seq=[1,5,6,8,1,3,4,5]
    mean(flattened_seq)

As you can see, both the results are the same. The ``mean`` function can
also give us the mean of each column, that is the mean of corresponding
elements in the smaller lists. ::

    mean(two_dimension, 0)
    array([ 1. , 4. , 5. , 6.5])

We pass an extra argument 0 in that case.

If we use an argument 1, we obtain the mean along the rows. ::

    mean(two_dimension, 1)
    array([ 5. , 3.25])

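The three ways of calling ``mean`` on a two dimensional list can be sketched together as follows. The sketch assumes NumPy imported as ``np`` and spells the extra argument out with the keyword ``axis``, which is the name of the second argument of ``np.mean``; the variable names are chosen only for this illustration.

```python
import numpy as np  # assumption: NumPy is installed and imported explicitly

two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

overall = np.mean(two_dimension)             # mean of the flattened sequence
per_column = np.mean(two_dimension, axis=0)  # same as mean(two_dimension, 0)
per_row = np.mean(two_dimension, axis=1)     # same as mean(two_dimension, 1)

print(overall)
print(per_column)
print(per_row)
```

With ``axis=0`` the result has one entry per column (four here), and with ``axis=1`` it has one entry per row (two here).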
We can see more options of ``mean`` using ::

    mean?

Similarly, we can calculate the median and standard deviation of a list
using the functions ``median`` and ``std`` ::

    median(age_list)
    std(age_list)

Median and std can also be calculated for two dimensional arrays along
columns and rows, just like mean.

For example ::

    median(two_dimension, 0)
    std(two_dimension, 1)

This gives us the median along the columns and the standard deviation
along the rows.

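The ``median`` and ``std`` calls above can be tried in a standalone script like the one below. It is a sketch under the assumption that NumPy is imported as ``np``. One point worth knowing is that ``np.std`` by default computes the population standard deviation, dividing by the number of elements N rather than N - 1.

```python
import numpy as np  # assumption: NumPy is installed and imported explicitly

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]
two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

median_age = np.median(age_list)
std_age = np.std(age_list)  # population standard deviation (divides by N)

column_medians = np.median(two_dimension, axis=0)  # one median per column
row_stds = np.std(two_dimension, axis=1)           # one deviation per row

print(median_age, std_age)
print(column_medians)
print(row_stds)
```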
Now let's apply this to a real world example.

We will use a data file located at ``/home/fossee/sslc2.txt``.
It contains records of students and their performance in one of the
State Secondary Board Examinations. It has 180,000 lines of records. We
are going to read this data and process it. We can see the contents of
the file by double clicking on it. It might take some time to open since
it is quite a large file. Please don't edit the data. This file has
a particular structure.

We can do ::

    cat /home/fossee/sslc2.txt

to check the contents of the file.

Each line in the file is a set of 11 fields separated
by semi-colons. Consider a sample line from this file.

``A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;``

The following are the fields in any given line.

* Region Code which is 'A'
* Roll Number 015163
* Name JOSEPH RAJ S
* Marks of 5 subjects:

  * English 083
  * Hindi 042
  * Maths 47
  * Science 00
  * Social 72

* Total marks 244

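To make the field layout concrete, the sample line can be taken apart with plain Python. This is only an illustrative sketch and not part of the tutorial's own commands; the variable names are chosen for this example.

```python
sample_line = "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;"

# Split the line on the semi-colon delimiter.
fields = sample_line.split(";")

region_code = fields[0]
roll_number = fields[1]
name = fields[2]
# Positions 3 to 7 hold the marks of the five subjects.
marks = [int(mark) for mark in fields[3:8]]
total = int(fields[8])

print(region_code, roll_number, name)
print(marks, total)
```

The five subject marks add up to the total stored in the line, which is a quick way to check that the positions have been read correctly.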
Now let's try and find the mean of English marks of all students.

For this we do ::

    L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';')
    L
    mean(L)

The ``loadtxt`` function loads data from an external file. ``delimiter``
specifies the character that separates the fields of data.
``usecols`` specifies the columns to be used, here (3,). The comma is
added because usecols is a sequence.

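The same ``loadtxt`` call can be tried without the large file by reading from an in-memory text buffer. This is a sketch under stated assumptions: NumPy is imported as ``np``, and only the first line below is the sample line from the file, while the other two lines are invented for this illustration.

```python
import io
import numpy as np  # assumption: NumPy is installed and imported explicitly

# Three lines in the layout described above. Only the first is the sample
# line from the file; the other two are made up for this example.
sample_data = io.StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "A;000001;STUDENT ONE;060;050;40;30;20;200;;;\n"
    "B;000002;STUDENT TWO;070;080;90;60;50;350;;;\n"
)

# usecols=(3,) picks only the fourth field, the English marks, so the
# text columns such as the name are never converted to numbers.
english = np.loadtxt(sample_data, usecols=(3,), delimiter=";")

print(english)
print(np.mean(english))
```

Replacing ``sample_data`` with the path of the real file gives the call used in the tutorial.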
To get the median marks ::

    median(L)

Standard deviation ::

    std(L)


Now let's try and get the mean for all the subjects ::

    L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
    mean(L,0)
    array([ 73.55452504, 53.79828941, 62.83342759, 50.69806158, 63.17056881])

The result of ``mean(L,0)`` is the mean marks, for each of the five
subjects, of all the students who took the exam.

and ::

    mean(L,1)

is the average marks of each individual student across the five subjects.
Here ``mean(L,0)`` was calculated down each column, giving one value per
subject, while ``mean(L,1)`` was calculated along each row, giving one
value per student.


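The difference between the two calls is easiest to see from the shape of the results. The sketch below assumes NumPy imported as ``np`` and uses a small hypothetical table of marks: the first row matches the sample line shown earlier, and the other two rows are invented for this illustration.

```python
import numpy as np  # assumption: NumPy is installed and imported explicitly

# Hypothetical marks: 3 students (rows) x 5 subjects (columns).
marks = np.array([
    [83, 42, 47, 0, 72],
    [60, 50, 40, 30, 20],
    [70, 80, 90, 60, 50],
])

subject_means = np.mean(marks, axis=0)  # one value per subject
student_means = np.mean(marks, axis=1)  # one value per student

print(subject_means.shape)
print(student_means.shape)
print(student_means)
```

With the real file, the first result would still have five entries, while the second would have one entry for every student in the file.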
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

* How to do the standard statistical operations sum, mean,
  median and standard deviation in Python.
* How to combine text loading and the statistical operations to solve
  real world problems.

{{{ Show the "sponsored by FOSSEE" slide }}}


This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India

Hope you have enjoyed and found it useful.

Thank you!