26 {{{ Show the slide containing title }}} |
28 {{{ Show the slide containing title }}} |
27 |
29 |
28 {{{ Show the slide containing the outline slide }}} |
30 {{{ Show the slide containing the outline slide }}} |
29 |
31 |
30 In this tutorial, we shall learn |
32 In this tutorial, we shall learn |
31 * Doing simple statistical operations in Python |
33 * Doing statistical operations in Python |
32 * Applying these to real world problems |
34 * Summing a set of numbers |
|
35 * Finding their mean |
|
36 * Finding their median |
|
37 * Finding their standard deviation |
|
38 |
33 |
39 |
34 .. #[punch: the prerequisites part may be skipped in the tutorial. It |
|
35 .. will be provided separately.] |
|
36 |
|
37 You will need Ipython with pylab running on your computer to use this |
|
38 tutorial. |
|
39 |
|
40 Also you will need to know about loading data using loadtxt to be able |
|
41 to follow the real world application. |
|
42 |
40 |
43 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you |
41 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you |
44 .. to use a data file and load data from that. that is good, since you |
42 .. to use a data file and load data from that. that is good, since you |
45 .. would get to deal with arrays, instead of lists. |
43 .. would get to deal with arrays, instead of lists. |
46 |
44 |
51 .. The idea of separating the main problem and giving toy examples |
49 .. The idea of separating the main problem and giving toy examples |
52 .. doesn't sound good. Use the same problem to explain stuff. Or use a |
50 .. doesn't sound good. Use the same problem to explain stuff. Or use a |
53 .. smaller data-set or something. Using lists doesn't seem natural.] |
51 .. smaller data-set or something. Using lists doesn't seem natural.] |
54 |
52 |
55 |
53 |
56 We will first start with the most necessary statistical operation i.e |
54 For this tutorial we will use a data file located at the path |
57 finding mean. |
55 ``/home/fossee/sslc2.txt``. It contains records of students and their |
58 |
56 performance in one of the State Secondary Board Examinations. It has |
59 We have a list of ages of a random group of people :: |
57 180,000 lines of record. We are going to read it and process this |
60 |
58 data. We can see the contents of the file by double clicking on it. It |
61 age_list = [4,45,23,34,34,38,65,42,32,7] |
59 might take some time to open since it is quite a large file. Please |
62 |
60 don't edit the data. This file has a particular structure. |
63 One way of getting the mean could be getting sum of all the ages and |
|
64 dividing by the number of people in the group. :: |
|
65 |
|
66 sum_age_list = sum(age_list) |
|
67 |
|
68 sum function gives us the sum of the elements. Note that the |
|
69 ``sum_age_list`` variable is an integer and the number of people or |
|
70 length of the list is also an integer. We will need to convert one of |
|
71 them to a float before carrying out the division. :: |
|
72 |
|
73 mean_using_sum = float(sum_age_list)/len(age_list) |
|
74 |
|
75 This obviously gives the mean age but there is a simpler way to do |
|
76 this in Python - using the mean function:: |
|
77 |
|
78 mean(age_list) |
|
79 |
|
80 Mean can be used in more ways in case of 2 dimensional lists. Take a |
|
81 two dimensional list :: |
|
82 |
|
83 two_dimension=[[1,5,6,8],[1,3,4,5]] |
|
84 |
|
85 The mean function by default gives the mean of the flattened sequence. |
|
86 A Flattened sequence means a list obtained by concatenating all the |
|
87 smaller lists into a large long list. In this case, the list obtained |
|
88 by writing the two lists one after the other. :: |
|
89 |
|
90 mean(two_dimension) |
|
91 flattened_seq=[1,5,6,8,1,3,4,5] |
|
92 mean(flattened_seq) |
|
93 |
|
94 As you can see both the results are same. ``mean`` function can also |
|
95 give us the mean of each column, or the mean of corresponding elements |
|
96 in the smaller lists. :: |
|
97 |
|
98 mean(two_dimension, 0) |
|
99 array([ 1. , 4. , 5. , 6.5]) |
|
100 |
|
101 we pass an extra argument 0 in that case. |
|
102 |
|
103 If we use an argument 1, we obtain the mean along the rows. :: |
|
104 |
|
105 mean(two_dimension, 1) |
|
106 array([ 5. , 3.25]) |
|
107 |
|
108 We can see more option of mean using :: |
|
109 |
|
110 mean? |
|
111 |
|
112 Similarly we can calculate the median and standard deviation of a list |
|
113 using the functions median and std:: |
|
114 |
|
115 median(age_list) |
|
116 std(age_list) |
|
117 |
|
118 Median and std can also be calculated for two dimensional arrays along |
|
119 columns and rows just like mean. |
|
120 |
|
121 For example :: |
|
122 |
|
123 median(two_dimension, 0) |
|
124 std(two_dimension, 1) |
|
125 |
|
126 This gives us the median along the columns and standard deviation along |
|
127 the rows. |
|
128 |
|
129 Now let's apply this to a real-world example. |
|
130 |
|
131 We will use a data file that is at the path ``/home/fossee/sslc2.txt``. |
|
132 It contains record of students and their performance in one of the |
|
133 State Secondary Board Examination. It has 180,000 lines of record. We |
|
134 are going to read it and process this data. We can see the content of |
|
135 file by double clicking on it. It might take some time to open since |
|
136 it is quite a large file. Please don't edit the data. This file has |
|
137 a particular structure. |
|
138 |
61 |
139 We can do :: |
62 We can do :: |
140 |
63 |
141 cat /home/fossee/sslc2.txt |
64 cat /home/fossee/sslc2.txt |
142 |
65 |
143 to check the contents of the file. |
66 to check the contents of the file. |
|
67 |
|
68 |
|
69 {{{ Show the data structure on a slide }}} |
144 |
70 |
145 Each line in the file is a set of 11 fields separated |
71 Each line in the file is a set of 11 fields separated |
146 by semicolons. Consider a sample line from this file. |
72 by semicolons. Consider a sample line from this file. |
147 A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;; |
73 A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;; |
148 |
74 |
153 * Marks of 5 subjects: ** English 083 ** Hindi 042 ** Maths 47 ** |
79 * Marks of 5 subjects: ** English 083 ** Hindi 042 ** Maths 47 ** |
154 Science 00 ** Social 72 |
80 Science 00 ** Social 72 |
155 * Total marks 244 |
81 * Total marks 244 |
156 |
82 |
157 |
83 |
158 Now let's try to find the mean of English marks of all students. |
84 Let's try to load this data as an array and then run various functions on |
|
85 it. |
159 |
86 |
160 For this we do. :: |
87 To get the data as an array we do. :: |
161 |
88 |
162 L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';') |
89 L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7,),delimiter=';') |
163 L |
90 L |
164 mean(L) |
91 |
165 |
92 |
166 The loadtxt function loads data from an external file. delimiter specifies |
93 The loadtxt function loads data from an external file. delimiter specifies |
167 the character by which the fields of data are separated. |
94 the character by which the fields of data are separated. usecols |
168 usecols specifies the columns to be used, so (3,). The trailing comma is added |
95 specifies the columns to be used, so (3,4,5,6,7) loads those |
169 because usecols is a sequence. |
96 columns. The trailing comma is added because usecols is a sequence. |
170 |
97 |
171 To get the median marks. :: |
98 As we can see L is an array. We can get the shape of this array using:: |
172 |
99 |
173 median(L) |
100 L.shape |
|
101 (185667, 5) |
|
102 |
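As a minimal sketch of how ``delimiter`` and ``usecols`` work together, the snippet below loads a few lines in the same format from memory instead of from the real file. It assumes plain NumPy imported as ``np`` (inside pylab the same function is available as ``loadtxt``). The first line is the sample line quoted above; the other two lines are hypothetical and exist only for illustration.

```python
import io
import numpy as np

# Three lines in the file's format. The first is the sample line quoted
# in the tutorial; the other two are invented purely for illustration.
sample = io.StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "A;000001;HYPOTHETICAL ONE;060;050;70;40;80;300;;;\n"
    "A;000002;HYPOTHETICAL TWO;071;046;59;20;76;272;;;\n"
)

# Fields are separated by ';' and columns 3 to 7 hold the five subject marks.
# Only the columns named in usecols are converted to numbers, so the text
# fields (region code, name) do not get in the way.
L = np.loadtxt(sample, usecols=(3, 4, 5, 6, 7), delimiter=";")

print(L.shape)  # one row per student, one column per subject
print(L[0])     # marks of the first student
```

With the real file the call is the same, only the first argument becomes the path to the file.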
|
103 Let's start applying statistical operations on these. We will start with |
|
104 the most basic, summing. How do you find the sum of marks of all |
|
105 subjects for the first student? |
|
106 |
|
107 As we know from accessing pieces of arrays, to access |
|
108 the first row we will do :: |
174 |
109 |
175 Standard deviation. :: |
110 L[0,:] |
176 |
|
177 std(L) |
|
178 |
111 |
|
112 Now to sum this we can say :: |
179 |
113 |
180 Now lets try and and get the mean for all the subjects :: |
114 totalmarks=sum(L[0,:]) |
|
115 totalmarks |
181 |
116 |
182 L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';') |
117 To get the mean we can do :: |
183 mean(L,0) |
|
184 array([ 73.55452504, 53.79828941, 62.83342759, 50.69806158, 63.17056881]) |
|
185 |
118 |
186 As we can see from the result mean(L,0). The resultant sequence |
119 totalmarks/len(L[0,:]) |
187 is the mean marks of all students that gave the exam for the five subjects. |
|
188 |
120 |
189 and :: |
121 or simply :: |
190 |
122 |
|
123 mean(L[0,:]) |
|
124 |
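The three steps above (sum, divide by the count, or call ``mean`` directly) can be checked on the marks from the sample line shown earlier. This is only a sketch: the array ``first`` is typed in by hand here rather than read from the file, and NumPy is imported explicitly as ``np`` instead of relying on pylab.

```python
import numpy as np

# Marks of the student in the sample line: English, Hindi, Maths,
# Science, Social. Typed in by hand for this illustration.
first = np.array([83.0, 42.0, 47.0, 0.0, 72.0])

# Sum of the five marks; this matches the total field (244) in the sample line.
total = np.sum(first)

# Mean computed by hand and with the mean function; both give the same value.
mean_by_hand = total / len(first)
mean_direct = np.mean(first)

print(total, mean_by_hand, mean_direct)
```

Since the array already holds floats, no explicit conversion is needed before the division.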
|
125 But with such a large data set, calculating the mean of each student |
|
126 one by one is impractical. Is there a way to reduce the work? |
|
127 |
|
128 For this we will look into the documentation of mean by doing:: |
|
129 |
|
130 mean? |
|
131 |
|
132 As we know L is a two dimensional array. We can calculate the mean |
|
133 along each axis of the array. The axis of rows is referred to by |
|
134 number 0 and that of columns by 1. So to calculate the mean across all columns we |
|
135 will pass the extra parameter 1 for the axis:: |
|
136 |
191 mean(L,1) |
137 mean(L,1) |
192 |
138 |
193 |
139 L here is the two dimensional array. |
194 is the average cumulative marks of individual students. Clearly, mean(L,0) |
|
195 was a row wise calculation while mean(L,1) was a column wise calculation. |
|
196 |
140 |
|
141 Similarly, the average marks scored by all the students for each |
|
142 subject can be calculated using :: |
|
143 |
|
144 mean(L,0) |
|
145 |
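To make the effect of the axis argument concrete, here is a small sketch on a two-student array. The first row is the sample line's marks and the second row is hypothetical; NumPy is imported explicitly as ``np``, which is an assumption of this example (in pylab, ``mean`` is the same function).

```python
import numpy as np

# Two students (rows) and five subjects (columns). The second row is
# made up for illustration.
L = np.array([
    [83.0, 42.0, 47.0, 0.0, 72.0],
    [60.0, 50.0, 70.0, 40.0, 80.0],
])

# axis=1 collapses the columns: one mean per row, i.e. per student.
per_student = np.mean(L, 1)

# axis=0 collapses the rows: one mean per column, i.e. per subject.
per_subject = np.mean(L, 0)

print(per_student.shape)  # (2,)
print(per_subject.shape)  # (5,)
```

A quick way to remember it: the axis you pass is the one that disappears from the shape of the result.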
|
146 Next, let's calculate the median of English marks for all the students. |
|
147 We can access the English marks of all students using :: |
|
148 |
|
149 L[:,0] |
|
150 |
|
151 To get the median we will do :: |
|
152 |
|
153 median(L[:,0]) |
|
154 |
|
155 For all the subjects we can use the same syntax as mean and calculate |
|
156 median across all rows using :: |
|
157 |
|
158 median(L,0) |
|
159 |
|
160 |
|
161 Similarly to calculate standard deviation for English we can do:: |
|
162 |
|
163 std(L[:,0]) |
|
164 |
|
165 and for all rows:: |
|
166 |
|
167 std(L,0) |
|
168 |
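The same column slicing and axis argument can be tried for ``median`` and ``std`` on a small array. This is a sketch under stated assumptions: NumPy is imported as ``np``, the first row is the sample line's marks, and the other two rows are hypothetical.

```python
import numpy as np

# Three students (rows), five subjects (columns). Rows two and three
# are invented for illustration.
L = np.array([
    [83.0, 42.0, 47.0, 0.0, 72.0],
    [60.0, 50.0, 70.0, 40.0, 80.0],
    [71.0, 46.0, 59.0, 20.0, 76.0],
])

english = L[:, 0]                      # first column: English marks

english_median = np.median(english)    # middle value of the sorted marks
english_std = np.std(english)          # spread of the English marks

# One value per subject, computed down the rows.
median_per_subject = np.median(L, 0)
std_per_subject = np.std(L, 0)

print(english_median, english_std)
print(median_per_subject)
print(std_per_subject)
```

Note that ``std`` by default divides by the number of values (the population standard deviation); passing ``ddof=1`` gives the sample standard deviation instead.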
|
169 Following is an exercise that you must do. |
|
170 |
|
171 %% %% In the given file football.txt at path /home/fossee/football.txt, one column is the player name, the second is goals at home and the third is goals away. |
|
172 1. Find the total goals for each player |
|
173 2. Mean home and away goals |
|
174 3. Standard deviation of home and away goals |
197 |
175 |
198 {{{ Show summary slide }}} |
176 {{{ Show summary slide }}} |
199 |
177 |
200 This brings us to the end of the tutorial. |
178 This brings us to the end of the tutorial. |
201 We have learnt |
179 We have learnt |