17 .. Author : Puneeth |
17 .. Author : Puneeth |
18 Internal Reviewer : Anoop Jacob Thomas<anoop@fossee.in> |
18 Internal Reviewer : Anoop Jacob Thomas<anoop@fossee.in> |
19 External Reviewer : |
19 External Reviewer : |
20 Checklist OK? : <put date stamp here, if OK> [2010-10-05] |
20 Checklist OK? : <put date stamp here, if OK> [2010-10-05] |
21 |
21 |
22 Hello friends and welcome to the tutorial on statistics using Python |
22 Hello friends and welcome to the tutorial on Statistics using Python |
23 |
23 |
24 {{{ Show the slide containing title }}} |
24 {{{ Show the slide containing title }}} |
25 |
25 |
26 {{{ Show the slide containing the outline slide }}} |
26 {{{ Show the slide containing the outline slide }}} |
27 |
27 |
28 In this tutorial, we shall learn |
28 In this tutorial, we shall learn |
29 * Doing simple statistical operations in Python |
29 * Doing simple statistical operations in Python |
30 * Applying these to real world problems |
30 * Applying these to real world problems |
31 |
31 |
32 You will need IPython with pylab running on your computer |
32 .. #[punch: the prerequisites part may be skipped in the tutorial. It |
33 to use this tutorial. |
33 .. will be provided separately.] |
34 |
34 |
35 Also you will need to know about loading data using loadtxt to be |
35 You will need IPython with pylab running on your computer to use this |
36 able to follow the real world application. |
36 tutorial. |
37 |
37 |
38 We will first start with the most necessary statistical |
38 Also you will need to know about loading data using loadtxt to be able |
39 operation, i.e. finding the mean. |
39 to follow the real world application. |
|
40 |
|
41 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you |
|
42 .. to use a data file and load data from that. that is good, since you |
|
43 .. would get to deal with arrays, instead of lists. |
|
44 |
|
45 .. Talking of rows and columns of 2-D lists etc is confusing. Also, |
|
46 .. converting to float can be avoided. The tutorial will feel more |
|
47 .. natural, is what I think. |
|
48 |
|
49 .. The idea of separating the main problem and giving toy examples |
|
50 .. doesn't sound good. Use the same problem to explain stuff. Or use a |
|
51 .. smaller data-set or something. Using lists doesn't seem natural.] |
|
52 |
|
53 |
|
54 We will first start with the most necessary statistical operation, i.e. |
|
55 finding the mean. |
40 |
56 |
41 We have a list of ages of a random group of people :: |
57 We have a list of ages of a random group of people :: |
42 |
58 |
43 age_list=[4,45,23,34,34,38,65,42,32,7] |
59 age_list = [4,45,23,34,34,38,65,42,32,7] |
44 |
60 |
45 One way of getting the mean could be getting the sum of |
61 One way of getting the mean could be getting the sum of all the ages and |
46 all the elements and dividing by the length of the list.:: |
62 dividing by the number of people in the group. :: |
47 |
63 |
48 sum_age_list =sum(age_list) |
64 sum_age_list = sum(age_list) |
49 |
65 |
50 The ``sum`` function gives us the sum of the elements.:: |
66 The ``sum`` function gives us the sum of the elements. Note that the |
51 |
67 ``sum_age_list`` variable is an integer and the number of people or |
52 mean_using_sum=float(sum_age_list)/len(age_list) |
68 length of the list is also an integer. We will need to convert one of |
53 |
69 them to a float before carrying out the division. :: |
54 This obviously gives the mean age but Python has another |
70 |
55 method for getting the mean. This is the mean function:: |
71 mean_using_sum = float(sum_age_list)/len(age_list) |
|
72 |
|
73 This obviously gives the mean age but there is a simpler way to do |
|
74 this in Python - using the mean function:: |
56 |
75 |
57 mean(age_list) |
76 mean(age_list) |
58 |
77 |
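As a minimal sketch of the two approaches above, the snippet below computes the mean age by hand and then cross-checks it. It assumes plain Python without pylab, so the standard library's ``statistics.mean`` stands in for pylab's ``mean``; the name ``mean_using_statistics`` is introduced only for this illustration.

```python
import statistics

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Manual mean: total of the ages divided by the number of people.
# float() keeps the division from truncating on Python 2.
mean_using_sum = float(sum(age_list)) / len(age_list)

# Cross-check with the standard library (a stand-in for pylab's mean).
mean_using_statistics = statistics.mean(age_list)

print(mean_using_sum)
print(mean_using_statistics)
```

Both values should agree; inside an IPython session started with pylab, ``mean(age_list)`` is expected to give the same number.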
59 Mean can be used in more ways in case of 2 dimensional lists. |
78 Mean can be used in more ways in case of 2 dimensional lists. Take a |
60 Take a two dimensional list :: |
79 two dimensional list :: |
61 |
80 |
62 two_dimension=[[1,5,6,8],[1,3,4,5]] |
81 two_dimension=[[1,5,6,8],[1,3,4,5]] |
63 |
82 |
64 The mean function used in the default manner will give the mean of the |
83 The mean function by default gives the mean of the flattened sequence. |
65 flattened sequence. Flattened sequence means the two lists taken |
84 A flattened sequence means a list obtained by concatenating all the |
66 as if it was a single list of elements :: |
85 smaller lists into a large long list. In this case, the list obtained |
|
86 by writing the two lists one after the other. :: |
67 |
87 |
68 mean(two_dimension) |
88 mean(two_dimension) |
69 flattened_seq=[1,5,6,8,1,3,4,5] |
89 flattened_seq=[1,5,6,8,1,3,4,5] |
70 mean(flattened_seq) |
90 mean(flattened_seq) |
71 |
91 |
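To make the idea of a flattened sequence concrete, here is a small plain-Python sketch that builds the flattened list from ``two_dimension`` instead of typing it out. The variable ``mean_of_flattened`` is a name chosen for this illustration, and the snippet does not depend on pylab.

```python
two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

# Flatten: write the smaller lists one after the other.
flattened_seq = [value for row in two_dimension for value in row]

# Mean of the flattened sequence, computed by hand.
mean_of_flattened = float(sum(flattened_seq)) / len(flattened_seq)

print(flattened_seq)
print(mean_of_flattened)
```

This is the value that ``mean(two_dimension)`` is expected to return when no extra argument is passed.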
72 As you can see, both results are the same. The other way is the mean |
92 As you can see, both results are the same. The ``mean`` function can also |
73 of each column.:: |
93 give us the mean of each column, or the mean of corresponding elements |
74 |
94 in the smaller lists. :: |
75 mean(two_dimension,0) |
95 |
|
96 mean(two_dimension, 0) |
76 array([ 1. , 4. , 5. , 6.5]) |
97 array([ 1. , 4. , 5. , 6.5]) |
77 |
98 |
78 We pass an extra argument 0 in that case. |
99 We pass an extra argument 0 in that case. |
79 |
100 |
80 In case of getting mean along the rows the argument is 1:: |
101 If we use an argument 1, we obtain the mean along the rows. :: |
81 |
102 |
82 mean(two_dimension,1) |
103 mean(two_dimension, 1) |
83 array([ 5. , 3.25]) |
104 array([ 5. , 3.25]) |
84 |
105 |
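The meaning of the arguments 0 and 1 can be checked with a plain-Python sketch. It is only an illustration of what the two results represent, not how pylab computes them; the helper ``mean_of`` and the names ``column_means`` and ``row_means`` are assumptions made for this example.

```python
two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]


def mean_of(values):
    """Return the arithmetic mean of an iterable of numbers."""
    values = list(values)
    return float(sum(values)) / len(values)


# Argument 0: average the corresponding elements of the smaller lists.
# zip(*two_dimension) yields one tuple per column.
column_means = [mean_of(column) for column in zip(*two_dimension)]

# Argument 1: average each smaller list (row) on its own.
row_means = [mean_of(row) for row in two_dimension]

print(column_means)
print(row_means)
```

The two printed lists should match the arrays shown above for ``mean(two_dimension, 0)`` and ``mean(two_dimension, 1)``.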
85 We can see more options of mean using :: |
106 We can see more options of mean using :: |
86 |
107 |
87 mean? |
108 mean? |
90 using the functions median and std:: |
111 using the functions median and std:: |
91 |
112 |
92 median(age_list) |
113 median(age_list) |
93 std(age_list) |
114 std(age_list) |
94 |
115 |
95 Median and std can also be calculated for two dimensional arrays along columns and rows just like mean. |
116 Median and std can also be calculated for two dimensional arrays along |
96 |
117 columns and rows just like mean. |
97 For example :: |
118 |
|
119 For example :: |
98 |
120 |
99 median(two_dimension,0) |
121 median(two_dimension, 0) |
100 std(two_dimension,1) |
122 std(two_dimension, 1) |
101 |
123 |
102 This gives us the median along the columns and standard deviation along the rows. |
124 This gives us the median along the columns and standard deviation along |
|
125 the rows. |
103 |
126 |
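For readers who want to verify the median and standard deviation of ``age_list`` without pylab, the following sketch uses the standard library. It assumes that pylab's ``std`` is NumPy's, which by default divides by the number of elements (the population standard deviation), so ``statistics.pstdev`` is the matching call; ``statistics.stdev`` divides by n - 1 and would give a slightly larger value. The names ``median_age``, ``std_age`` and ``manual_std`` are introduced only for this illustration.

```python
import math
import statistics

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Median: middle value of the sorted ages (average of the two middle
# values here, because the list has an even number of elements).
median_age = statistics.median(age_list)

# Population standard deviation (divides by n).
std_age = statistics.pstdev(age_list)

# The same quantity written out by hand.
mean_age = float(sum(age_list)) / len(age_list)
manual_std = math.sqrt(
    sum((age - mean_age) ** 2 for age in age_list) / len(age_list)
)

print(median_age)
print(std_age)
print(manual_std)
```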
104 Now let's apply this to a real world example |
127 Now let's apply this to a real world example |
105 |
128 |
106 We will use a data file that is at the path |
129 We will use a data file that is at the path ``/home/fossee/sslc2.txt``. |
107 ``/home/fossee/sslc2.txt``. It contains records of students and their |
130 It contains records of students and their performance in one of the |
108 performance in one of the State Secondary Board Examinations. It has |
131 State Secondary Board Examinations. It has 180,000 lines of records. We |
109 180,000 lines of records. We are going to read it and process this |
132 are going to read it and process this data. We can see the content of the |
110 data. We can see the content of the file by double clicking on it. It |
133 file by double clicking on it. It might take some time to open since |
111 might take some time to open since it is quite a large file. Please |
134 it is quite a large file. Please don't edit the data. This file has |
112 don't edit the data. This file has a particular structure. |
135 a particular structure. |
113 |
136 |
114 We can do :: |
137 We can do :: |
115 |
138 |
116 cat /home/fossee/sslc2.txt |
139 cat /home/fossee/sslc2.txt |
117 |
140 |