author | bhanu |
Mon, 15 Nov 2010 14:40:49 +0530 | |
changeset 499 | fff4a90b2310 |
parent 450 | d49aee7ab1b9 |
permissions | -rw-r--r-- |
.. Objectives
.. ----------

.. By the end of this tutorial you will --

.. 1. Get to know simple statistics functions like mean, std, etc. (Remembering)
.. #. Apply them on a real world example. (Applying)


.. Prerequisites
.. -------------

.. Getting started with IPython
.. Loading Data from files
.. Getting started with Lists
450
d49aee7ab1b9
Rewrite of statistics script as suggested by punch and change in slides accordingly
Amit Sethi
parents:
406
diff
changeset
.. Accessing Pieces of Arrays

406
a534e9e79599
Completed basic data type based on review and improved on slides
Amit Sethi
parents:
383
diff
changeset
.. Author              : Amit Sethi
   Internal Reviewer   : Puneeth
   External Reviewer   :
   Checklist OK?       : <put date stamp here, if OK> [2010-10-05]

383
4a6d548d4369
Minor comments on Statistics.
Puneeth Chaganti <punchagan@fossee.in>
parents:
382
diff
changeset
.. #[punch; add slides, exercises!]

382
aa8ea9119476
Reviewed statistics script.
Puneeth Chaganti <punchagan@fossee.in>
parents:
362
diff
changeset
Hello friends and welcome to the tutorial on Statistics using Python.

{{{ Show the slide containing title }}}

{{{ Show the outline slide }}}

In this tutorial, we shall learn

 * Doing statistical operations in Python

   * Summing a set of numbers
   * Finding their mean
   * Finding their median
   * Finding their standard deviation

.. #[punch: since loadtxt is anyway a pre-req, I would recommend you
.. to use a data file and load data from that. that is good, since you
.. would get to deal with arrays, instead of lists.

.. Talking of rows and columns of 2-D lists etc is confusing. Also,
.. converting to float can be avoided. The tutorial will feel more
.. natural, is what I think.

.. The idea of separating the main problem and giving toy examples
.. doesn't sound good. Use the same problem to explain stuff. Or use a
.. smaller data-set or something. Using lists doesn't seem natural.]

For this tutorial we will use the data file at the path
``/home/fossee/sslc2.txt``. It contains records of students and their
performance in one of the State Secondary Board Examinations. It has
about 180,000 lines of records. We are going to read and process this
data. We can see the contents of the file by double-clicking on it. It
might take some time to open since it is quite a large file. Please
don't edit the data. This file has a particular structure.

We can do ::

    cat /home/fossee/sslc2.txt

to check the contents of the file.

{{{ Show the data structure on a slide }}}

Each line in the file is a set of 11 fields separated
by semi-colons. Consider a sample line from this file. ::

    A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;

The following are the fields in any given line.

 * Region Code, which is 'A'
 * Roll Number 015163
 * Name JOSEPH RAJ S
 * Marks of 5 subjects:

   * English 083
   * Hindi 042
   * Maths 47
   * Science 00
   * Social 72

 * Total marks 244

Let's try to load this data as an array and then run various functions on
it.

To get the data as an array we do ::

    L = loadtxt('/home/fossee/sslc2.txt', usecols=(3,4,5,6,7,), delimiter=';')
    L

The ``loadtxt`` function loads data from an external file. ``delimiter``
specifies the character that separates the fields of data. ``usecols``
specifies the columns to be used, so ``(3,4,5,6,7)`` loads those
columns, which hold the marks in the five subjects. The trailing comma
in the tuple is optional here; Python requires it only for a
single-element tuple such as ``(3,)``.

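The loading step can be tried without the full data file. The following is a minimal sketch, assuming plain NumPy (``np.loadtxt``) instead of the pylab session used in the tutorial. Only the first record comes from the tutorial; the other two records and the name ``marks`` are made up for illustration.

```python
from io import StringIO

import numpy as np

# Hypothetical three-line sample in the same layout as sslc2.txt.
# Only the first record is from the tutorial; the other two are invented.
sample = StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "A;015164;STUDENT TWO;050;060;70;80;90;350;;;\n"
    "B;015165;STUDENT THREE;010;020;30;40;50;150;;;\n"
)

# Columns 3 to 7 hold the marks in the five subjects.
marks = np.loadtxt(sample, usecols=(3, 4, 5, 6, 7), delimiter=";")

print(marks.shape)  # (3, 5): three students, five subjects
print(marks[0])     # marks of the first student as floats
```

Because ``usecols`` skips the region code, roll number and name, the text fields never need to be converted to numbers.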
As we can see, L is an array. We can get the shape of this array using ::

    L.shape
    (185667, 5)

Let's start applying statistical operations to this data. We will start with
the most basic one, summing. How do we find the sum of the marks in all
subjects for the first student?

As we know from accessing pieces of arrays, to access
the first row we do ::

    L[0,:]

Now to sum this we can say ::

    totalmarks = sum(L[0,:])
    totalmarks

To get the mean we can do ::

    totalmarks/len(L[0,:])

or simply ::

    mean(L[0,:])

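As a quick check of the two approaches, here is a small sketch using the marks from the sample record. It assumes plain NumPy rather than the pylab session, and the names ``first_row``, ``total`` and ``by_hand`` are illustrative only.

```python
import numpy as np

# Marks of the sample student: English, Hindi, Maths, Science, Social.
first_row = np.array([83.0, 42.0, 47.0, 0.0, 72.0])

total = first_row.sum()            # sum of the five marks
by_hand = total / len(first_row)   # mean computed manually
built_in = np.mean(first_row)      # mean computed by NumPy

print(total)              # 244.0, matching the total in the record
print(by_hand, built_in)  # both give the same mean
```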
But with such a large data set, calculating the mean for each student
one by one is impractical. Is there a way to reduce the work?

For this we will look at the documentation of mean by doing ::

    mean?

As we know, L is a two-dimensional array. We can calculate the mean
along either axis of the array. Axis 0 runs down the rows and axis 1
runs across the columns. So to calculate the mean across all the columns
of each row, that is, the mean marks of each student, we pass the extra
parameter 1 for the axis ::

    mean(L,1)

L here is the two-dimensional array.

Similarly, the average marks scored by all the students in each
subject can be calculated using ::

    mean(L,0)

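The effect of the axis argument is easiest to see on a small array. The sketch below assumes plain NumPy and uses made-up marks for three students; only the first row matches the sample record.

```python
import numpy as np

# Made-up marks: 3 students (rows) x 5 subjects (columns).
marks = np.array([
    [83.0, 42.0, 47.0,  0.0, 72.0],
    [50.0, 60.0, 70.0, 80.0, 90.0],
    [10.0, 20.0, 30.0, 40.0, 50.0],
])

per_student = np.mean(marks, axis=1)  # one mean per row (student)
per_subject = np.mean(marks, axis=0)  # one mean per column (subject)

print(per_student.shape)  # (3,)
print(per_subject.shape)  # (5,)
```

The result has one entry for every index along the axis that was not averaged over, which is a handy way to check that the right axis was chosen.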
Next, let's calculate the median of the English marks for all the students.
We can access the English marks of all the students using ::

    L[:,0]

To get the median we do ::

    median(L[:,0])

For all the subjects, we can use the same syntax as for mean and calculate
the median across all rows using ::

    median(L,0)


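The same pattern can be sketched on a small made-up array, assuming plain NumPy. The names ``english_median`` and ``subject_medians`` are illustrative.

```python
import numpy as np

# Made-up marks: 3 students (rows) x 5 subjects (columns).
marks = np.array([
    [83.0, 42.0, 47.0,  0.0, 72.0],
    [50.0, 60.0, 70.0, 80.0, 90.0],
    [10.0, 20.0, 30.0, 40.0, 50.0],
])

english_median = np.median(marks[:, 0])    # middle value of 83, 50, 10
subject_medians = np.median(marks, axis=0)  # one median per subject

print(english_median)   # 50.0
print(subject_medians)  # [50. 42. 47. 40. 72.]
```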
Similarly, to calculate the standard deviation for English we can do ::

    std(L[:,0])

and for all the subjects, taken across all rows ::

    std(L,0)

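It can help to see what ``std`` computes. The sketch below assumes plain NumPy and made-up English marks. By default NumPy's ``std`` divides by the number of values (the population standard deviation); passing ``ddof=1`` divides by one less, which gives the sample standard deviation.

```python
import numpy as np

english = np.array([83.0, 50.0, 10.0])  # made-up marks

# By hand: square root of the mean squared deviation from the mean.
by_hand = np.sqrt(np.mean((english - english.mean()) ** 2))

population_std = np.std(english)      # default, divides by N
sample_std = np.std(english, ddof=1)  # divides by N - 1

print(np.isclose(by_hand, population_std))  # True
print(sample_std > population_std)          # True
```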
The following is an exercise that you must do.

%% %% In the given file football.txt at the path /home/fossee/football.txt,
the first column is the player name, the second is goals at home and the
third is goals away.

     1. Find the total goals for each player
     2. Find the mean of home and away goals
     3. Find the standard deviation of home and away goals

{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

 * How to do the standard statistical operations sum, mean,
   median and standard deviation in Python.
 * How to combine text loading and the statistical operations to solve
   real world problems.

{{{ Show the "sponsored by FOSSEE" slide }}}


This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India.

Hope you have enjoyed it and found it useful.

Thank you!