26 {{{ Show the slide containing title }}} |
28 {{{ Show the slide containing title }}} |
27 |
29 |
28 {{{ Show the slide containing the outline slide }}} |
30 {{{ Show the slide containing the outline slide }}} |
29 |
31 |
30 In this tutorial, we shall learn |
32 In this tutorial, we shall learn |
31 * Doing simple statistical operations in Python |
33 * Doing statistical operations in Python |
32 * Applying these to real world problems |
34 * Summing a set of numbers |
|
35 * Finding their mean |
|
36 * Finding their median |
|
37 * Finding their standard deviation |
|
38 |
33 |
39 |
34 |
40 |
35 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you |
41 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you |
36 .. to use a data file and load data from that. that is good, since you |
42 .. to use a data file and load data from that. that is good, since you |
37 .. would get to deal with arrays, instead of lists. |
43 .. would get to deal with arrays, instead of lists. |
43 .. The idea of separating the main problem and giving toy examples |
49 .. The idea of separating the main problem and giving toy examples |
44 .. doesn't sound good. Use the same problem to explain stuff. Or use a |
50 .. doesn't sound good. Use the same problem to explain stuff. Or use a |
45 .. smaller data-set or something. Using lists doesn't seem natural.] |
51 .. smaller data-set or something. Using lists doesn't seem natural.] |
46 |
52 |
47 |
53 |
48 We will first start with the most necessary statistical operation i.e |
54 For this tutorial we will use a data file located at the path |
49 finding mean. |
55 ``/home/fossee/sslc2.txt``. It contains records of students and their |
50 |
56 performance in one of the State Secondary Board Examinations. It has |
51 We have a list of ages of a random group of people :: |
57 about 180,000 lines of records. We are going to read it and process this |
52 |
58 data. We can see the contents of the file by double-clicking on it. It |
53 age_list = [4,45,23,34,34,38,65,42,32,7] |
59 might take some time to open since it is quite a large file. Please |
54 |
60 don't edit the data. This file has a particular structure. |
55 One way of getting the mean could be getting sum of all the ages and |
|
56 dividing by the number of people in the group. :: |
|
57 |
|
58 sum_age_list = sum(age_list) |
|
59 |
|
60 sum function gives us the sum of the elements. Note that the |
|
61 ``sum_age_list`` variable is an integer and the number of people or |
|
62 length of the list is also an integer. We will need to convert one of |
|
63 them to a float before carrying out the division. :: |
|
64 |
|
65 mean_using_sum = float(sum_age_list)/len(age_list) |
|
66 |
|
67 This obviously gives the mean age but there is a simpler way to do |
|
68 this in Python - using the mean function:: |
|
69 |
|
70 mean(age_list) |
|
71 |
|
72 Mean can be used in more ways in case of 2 dimensional lists. Take a |
|
73 two dimensional list :: |
|
74 |
|
75 two_dimension=[[1,5,6,8],[1,3,4,5]] |
|
76 |
|
77 The mean function by default gives the mean of the flattened sequence. |
|
78 A Flattened sequence means a list obtained by concatenating all the |
|
79 smaller lists into a large long list. In this case, the list obtained |
|
80 by writing the two lists one after the other. :: |
|
81 |
|
82 mean(two_dimension) |
|
83 flattened_seq=[1,5,6,8,1,3,4,5] |
|
84 mean(flattened_seq) |
|
85 |
|
86 As you can see, both results are the same. The ``mean`` function can also |
|
87 give us the mean of each column, or the mean of corresponding elements |
|
88 in the smaller lists. :: |
|
89 |
|
90 mean(two_dimension, 0) |
|
91 array([ 1. , 4. , 5. , 6.5]) |
|
92 |
|
93 we pass an extra argument 0 in that case. |
|
94 |
|
95 If we use an argument 1, we obtain the mean along the rows. :: |
|
96 |
|
97 mean(two_dimension, 1) |
|
98 array([ 5. , 3.25]) |
|
99 |
|
100 We can see more options of mean using :: |
|
101 |
|
102 mean? |
|
103 |
|
104 Similarly we can calculate the median and standard deviation of a list |
|
105 using the functions median and std:: |
|
106 |
|
107 median(age_list) |
|
108 std(age_list) |
|
109 |
|
110 Median and std can also be calculated for two dimensional arrays along |
|
111 columns and rows just like mean. |
|
112 |
|
113 For example :: |
|
114 |
|
115 median(two_dimension, 0) |
|
116 std(two_dimension, 1) |
|
117 |
|
118 This gives us the median along the columns and standard deviation along |
|
119 the rows. |
|
120 |
|
121 Now let's apply this to a real-world example. |
|
122 |
|
123 We will use a data file that is at the path ``/home/fossee/sslc2.txt``. |
|
124 It contains records of students and their performance in one of the |
|
125 State Secondary Board Examinations. It has 180,000 lines of records. We |
|
126 are going to read it and process this data. We can see the contents of the |
|
127 file by double-clicking on it. It might take some time to open since |
|
128 it is quite a large file. Please don't edit the data. This file has |
|
129 a particular structure. |
|
130 |
61 |
131 We can do :: |
62 We can do :: |
132 |
63 |
133 cat /home/fossee/sslc2.txt |
64 cat /home/fossee/sslc2.txt |
134 |
65 |
135 to check the contents of the file. |
66 to check the contents of the file. |
|
67 |
|
68 |
|
69 {{{ Show the data structure on a slide }}} |
136 |
70 |
137 Each line in the file is a set of 11 fields separated |
71 Each line in the file is a set of 11 fields separated |
138 by semicolons. Consider a sample line from this file. |
72 by semicolons. Consider a sample line from this file. |
139 A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;; |
73 A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;; |
140 |
74 |
145 * Marks of 5 subjects: ** English 083 ** Hindi 042 ** Maths 47 ** |
79 * Marks of 5 subjects: ** English 083 ** Hindi 042 ** Maths 47 ** |
146 Science 00 ** Social 72 |
80 Science 00 ** Social 72 |
147 * Total marks 244 |
81 * Total marks 244 |
148 |
82 |
149 |
83 |
150 Now let's try and find the mean of English marks of all students. |
84 Let's try to load this data as an array and then run various functions on |
|
85 it. |
151 |
86 |
152 For this we do. :: |
87 To get the data as an array we do. :: |
153 |
88 |
154 L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';') |
89 L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7,),delimiter=';') |
155 L |
90 L |
156 mean(L) |
91 |
157 |
92 |
158 The ``loadtxt`` function loads data from an external file. ``delimiter`` specifies |
93 The ``loadtxt`` function loads data from an external file. ``delimiter`` specifies |
159 the character by which the fields of data are separated. |
94 the character by which the fields of data are separated. ``usecols`` |
160 ``usecols`` specifies the columns to be used, so (3,). The trailing comma is added |
95 specifies the columns to be used, so (3,4,5,6,7) loads those |
161 because usecols is a sequence. |
96 columns. A single column is written (3,), with a comma, because usecols is a sequence. |
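As a minimal sketch of how ``delimiter`` and ``usecols`` work together, the following loads the five marks columns from two in-memory lines instead of the real file. The first line is the sample record shown earlier; the second line and the name ``sample`` are made up for illustration, and ``numpy`` is assumed to be installed.

```python
from io import StringIO

import numpy as np

# First record is the sample line from the tutorial; the second is invented.
sample = StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "B;000001;HYPOTHETICAL STUDENT;050;060;70;80;90;350;;;\n"
)

# Split each line on ';' and keep only fields 3 to 7 (the five subject marks).
marks = np.loadtxt(sample, delimiter=";", usecols=(3, 4, 5, 6, 7))

print(marks.shape)  # (number of rows, number of columns kept)
print(marks)
```

Only the selected columns are converted to numbers, so the name field and the trailing empty fields do not get in the way.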
162 |
97 |
163 To get the median marks. :: |
98 As we can see L is an array. We can get the shape of this array using:: |
164 |
99 |
165 median(L) |
100 L.shape |
|
101 (185667, 5) |
|
102 |
|
103 Let's start applying statistical operations on this data. We will start with |
|
104 the most basic, summing. How do you find the sum of marks of all |
|
105 subjects for the first student? |
|
106 |
|
107 From our knowledge of accessing pieces of arrays, to access |
|
108 the first row we will do :: |
166 |
109 |
167 Standard deviation. :: |
110 L[0,:] |
168 |
|
169 std(L) |
|
170 |
111 |
|
112 Now to sum this we can say :: |
171 |
113 |
172 Now let's try and get the mean for all the subjects :: |
114 totalmarks=sum(L[0,:]) |
|
115 totalmarks |
173 |
116 |
174 L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';') |
117 To get the mean we can do :: |
175 mean(L,0) |
|
176 array([ 73.55452504, 53.79828941, 62.83342759, 50.69806158, 63.17056881]) |
|
177 |
118 |
178 As we can see from the result of mean(L,0), the resultant sequence |
119 totalmarks/len(L[0,:]) |
179 is the mean marks of all students who took the exam for the five subjects. |
|
180 |
120 |
181 and :: |
121 or simply :: |
182 |
122 |
|
123 mean(L[0,:]) |
|
124 |
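As a small worked instance of these steps, the sketch below uses the five marks from the sample record shown earlier in place of ``L[0,:]``. The name ``first_row`` is made up for illustration, and ``numpy`` is assumed to be installed.

```python
import numpy as np

# Marks of the sample record: English, Hindi, Maths, Science, Social.
first_row = np.array([83.0, 42.0, 47.0, 0.0, 72.0])

totalmarks = sum(first_row)               # adds up the five marks
mean_by_hand = totalmarks / len(first_row)
mean_direct = np.mean(first_row)          # same value in one call

print(totalmarks)    # 244.0, matching the total in the record
print(mean_by_hand)
print(mean_direct)
```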
|
125 But with such a large data set, calculating the mean of |
|
126 each student one by one is impractical. Is there a way to reduce the work? |
|
127 |
|
128 For this we will look into the documentation of mean by doing:: |
|
129 |
|
130 mean? |
|
131 |
|
132 As we know, L is a two dimensional array. We can calculate the mean |
|
133 along each axis of the array. The axis of rows is referred to by |
|
134 number 0 and that of columns by 1. So to calculate the mean across all columns we |
|
135 will pass the extra parameter 1 for the axis :: |
|
136 |
183 mean(L,1) |
137 mean(L,1) |
184 |
138 |
185 |
139 L here is the two dimensional array. |
186 is the average cumulative marks of individual students. Clearly, mean(L,0) |
|
187 was a row wise calculation while mean(L,1) was a column wise calculation. |
|
188 |
140 |
|
141 Similarly, the average marks scored by all the students for each |
|
142 subject can be calculated using :: |
|
143 |
|
144 mean(L,0) |
|
145 |
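To make the two axis values concrete, here is a small sketch on a made-up 2 x 3 array (the name ``marks`` and its values are illustrative, not taken from the data file, and ``numpy`` is assumed to be installed). Passing 1 averages across the columns and gives one value per row, that is, per student. Passing 0 averages down the rows and gives one value per column, that is, per subject.

```python
import numpy as np

# Two hypothetical students (rows), three hypothetical subjects (columns).
marks = np.array([[10.0, 20.0, 30.0],
                  [40.0, 50.0, 60.0]])

per_student = np.mean(marks, 1)  # one mean per row
per_subject = np.mean(marks, 0)  # one mean per column

print(per_student)  # [20. 50.]
print(per_subject)  # [25. 35. 45.]
```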
|
146 Next, let's calculate the median of English marks for all the students. |
|
147 We can access the English marks of all students using :: |
|
148 |
|
149 L[:,0] |
|
150 |
|
151 To get the median we will do :: |
|
152 |
|
153 median(L[:,0]) |
|
154 |
|
155 For all the subjects we can use the same syntax as mean and calculate |
|
156 median across all rows using :: |
|
157 |
|
158 median(L,0) |
|
159 |
|
160 |
|
161 Similarly to calculate standard deviation for English we can do:: |
|
162 |
|
163 std(L[:,0]) |
|
164 |
|
165 and for all rows:: |
|
166 |
|
167 std(L,0) |
|
168 |
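As a quick check of these calls, the sketch below uses a tiny made-up array whose results are easy to verify by hand (the name ``marks`` and its values are assumptions for illustration, and ``numpy`` is assumed to be installed). Note that ``std`` computes the population standard deviation by default, dividing by the number of values.

```python
import numpy as np

# Two hypothetical students (rows), two hypothetical subjects (columns).
marks = np.array([[1.0, 10.0],
                  [3.0, 14.0]])

first_subject_median = np.median(marks[:, 0])  # median of the first column only
medians = np.median(marks, 0)                  # one median per column
deviations = np.std(marks, 0)                  # one standard deviation per column

print(first_subject_median)  # 2.0
print(medians)               # [ 2. 12.]
print(deviations)            # [1. 2.]
```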
|
169 Following is an exercise that you must do. |
|
170 |
|
171 %% %% In the given file football.txt at path /home/fossee/football.txt, the first column is the player name, the second is goals at home and the third is goals away. |
|
172 1. Find the total goals for each player |
|
173 2. Find the mean home and away goals |
|
174 3. Find the standard deviation of home and away goals |
189 |
175 |
190 {{{ Show summary slide }}} |
176 {{{ Show summary slide }}} |
191 |
177 |
192 This brings us to the end of the tutorial. |
178 This brings us to the end of the tutorial. |
193 we have learnt |
179 we have learnt |