Hello friends, and welcome to the tutorial on statistics using Python.

{{{ Show the slide containing the title }}}

{{{ Show the outline slide }}}

In this tutorial, we shall learn to

* Do simple statistical operations in Python
* Apply these to real world problems

You will need IPython with pylab running on your computer
to use this tutorial.

You will also need to know how to load data using loadtxt to be
able to follow the real world application.

We will start with the most basic statistical
operation, i.e., finding the mean.

We have a list of ages of a random group of people ::

    age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

One way of getting the mean is to take the sum of
all the elements and divide it by the length of the list. ::

    sum_age_list = sum(age_list)

The sum function gives us the sum of the elements. We divide it by
the length, converted to a float so that Python 2 does not truncate
the result to an integer. ::

    mean_using_sum = sum_age_list / float(len(age_list))

This gives the mean age, but pylab provides a more direct
way of getting the mean. This is the mean function ::

    mean(age_list)

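
The bare name mean works here only because pylab puts NumPy's functions in the IPython namespace. As a minimal sketch of the same comparison in a plain Python script, assuming NumPy is installed and imported as ``np`` (an assumption of this sketch, not something the tutorial sets up):

```python
import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Mean by hand: total divided by the number of elements.
# float() keeps the division from truncating on Python 2.
mean_using_sum = sum(age_list) / float(len(age_list))

# The same value from NumPy's mean function.
mean_using_numpy = np.mean(age_list)

print(mean_using_sum)    # 32.4
print(mean_using_numpy)  # 32.4
```

Both approaches agree; the mean function simply saves us the explicit sum and division.
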
The mean function can be used in more ways with two dimensional lists.
Take a two dimensional list ::

    two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

Used in its default manner, the mean function gives the mean of the
flattened sequence. A flattened sequence means the two lists taken
as if they were a single list of elements ::

    mean(two_dimension)
    flattened_seq = [1, 5, 6, 8, 1, 3, 4, 5]
    mean(flattened_seq)

As you can see, both results are the same. Another option is the mean
of each column. ::

    mean(two_dimension,0)
    array([ 1. ,  4. ,  5. ,  6.5])

Or we can take the mean along each of the two rows separately. ::

    mean(two_dimension,1)
    array([ 5.  ,  3.25])

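
The second argument of mean is the axis. A short sketch of the three cases, assuming NumPy imported as ``np`` and using the keyword form ``axis=`` to make the intent explicit (the variable names are only illustrative):

```python
import numpy as np

two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

# No axis: mean over every element, as if the lists were flattened.
overall = np.mean(two_dimension)

# axis=0: average down each column, one result per column.
column_means = np.mean(two_dimension, axis=0)

# axis=1: average along each row, one result per row.
row_means = np.mean(two_dimension, axis=1)

print(overall)       # 4.125
print(column_means)  # values 1.0, 4.0, 5.0, 6.5
print(row_means)     # values 5.0, 3.25
```

A handy way to remember this: the axis we name is the one that disappears from the result.
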
We can see more options of mean using ::

    mean?

Similarly, we can calculate the median and standard deviation of a list
using the functions median and std ::

    median(age_list)
    std(age_list)

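
As a small sketch of what these two functions compute, again assuming NumPy imported as ``np``. One detail worth knowing is that NumPy's std gives the population standard deviation unless we ask otherwise; the variable names below are only illustrative:

```python
import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Median: middle of the sorted values; with an even count it is
# the average of the two middle values (34 and 34 here).
median_age = np.median(age_list)
print(median_age)  # 34.0

# np.std defaults to the population standard deviation (ddof=0).
population_std = np.std(age_list)

# ddof=1 gives the sample standard deviation instead.
sample_std = np.std(age_list, ddof=1)

print(population_std, sample_std)
```
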
Now let's apply this to a real world example.

We will use a data file located at the path
``/home/fossee/sslc2.txt``. It contains records of students and their
performance in one of the State Secondary Board Examinations. It has
180,000 lines of records. We are going to read it and process this
data. We can see the contents of the file by double clicking on it. It
might take some time to open since it is quite a large file. Please
don't edit the data. This file has a particular structure.

We can do ::

    cat /home/fossee/sslc2.txt

to check the contents of the file.

Each line in the file is a set of 11 fields separated
by semi-colons. Consider a sample line from this file. ::

    A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;

The following are the fields in any given line.

* Region Code, which is 'A'
* Roll Number 015163
* Name JOSEPH RAJ S
* Marks of 5 subjects:

  * English 083
  * Hindi 042
  * Maths 47
  * Science 00
  * Social 72

* Total marks 244
* Remaining fields, which are empty in this sample line

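
To make the layout concrete, here is a minimal sketch that splits the sample line by hand using plain Python. The variable names are only illustrative; the field positions follow the list above.

```python
sample_line = "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;"

# Split the record on the semi-colon delimiter.
fields = sample_line.split(";")

region_code = fields[0]
roll_number = fields[1]
name = fields[2]
# Fields 3 to 7 are the five subject marks, in the order listed above.
marks = [int(value) for value in fields[3:8]]
total = int(fields[8])

print(region_code, roll_number, name)
print(marks)  # [83, 42, 47, 0, 72]
print(total)  # 244
```

Note that counting starts from zero, so the English marks sit at index 3. The five marks also add up to the total field, which is a quick sanity check on the layout.
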
Now let's find the mean of the English marks of all students.

For this we do ::

    L = loadtxt('/home/fossee/sslc2.txt', usecols=(3,), delimiter=';')
    L
    mean(L)

The loadtxt function loads data from an external file. The delimiter
argument specifies the character that separates the fields of data.
usecols specifies the columns to be read, here (3,). Columns are counted
from zero, so index 3 is the English marks. The comma is needed
because usecols is a sequence: (3,) is a one-element tuple, whereas (3)
is just the number 3.

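
If the data file is not at hand, the same call can be tried on a few in-memory lines. This is only a sketch: it assumes NumPy imported as ``np``, and apart from the first record (the sample line from this tutorial) the records are made up for illustration.

```python
import io
import numpy as np

# Hypothetical records in the same layout as the tutorial's file;
# only the first line is taken from the tutorial.
sample_data = io.StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "A;015164;STUDENT TWO;060;050;70;65;55;300;;;\n"
    "B;015165;STUDENT THREE;091;071;88;79;81;410;;;\n"
)

# Read only column 3 (English); the other fields are never converted,
# so the text in the name column does not matter.
english = np.loadtxt(sample_data, usecols=(3,), delimiter=";")

print(english)           # English marks as floats
print(np.mean(english))  # 78.0
```

With a single column selected, loadtxt returns a one dimensional array holding one value per line of the file.
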
To get the median marks ::

    median(L)

And the standard deviation ::

    std(L)

Now let's get the mean for all the subjects ::

    L = loadtxt('/home/fossee/sslc2.txt', usecols=(3,4,5,6,7), delimiter=';')
    mean(L,0)
    array([ 73.55452504,  53.79828941,  62.83342759,  50.69806158,  63.17056881])

The result of ``mean(L,0)`` is a sequence holding the mean marks, over all
students who took the exam, for each of the five subjects. Similarly ::

    mean(L,1)

gives the average marks of each individual student across the five
subjects. Here ``mean(L,0)`` averages down each column, giving one value
per subject, while ``mean(L,1)`` averages along each row, giving one value
per student.

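
The shapes of the two results make the difference easy to check. A minimal sketch on a small table of made-up marks, assuming NumPy imported as ``np`` (the numbers and names are hypothetical, not taken from the data file):

```python
import numpy as np

# Hypothetical marks: one row per student, one column per subject
# (English, Hindi, Maths, Science, Social).
marks = np.array([
    [83, 42, 47, 0, 72],
    [60, 50, 70, 65, 55],
    [91, 71, 88, 79, 81],
])

subject_means = np.mean(marks, axis=0)  # one value per subject
student_means = np.mean(marks, axis=1)  # one value per student

print(subject_means.shape)  # (5,)
print(student_means.shape)  # (3,)
```

With three students and five subjects we get five subject means and three student means, so the length of the result tells us which way the average was taken.
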
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

* How to do the standard statistical operations sum, mean,
  median and standard deviation in Python.
* How to combine text loading and the statistical operations to solve
  real world problems.

{{{ Show the "sponsored by FOSSEE" slide }}}

This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India.

Hope you have enjoyed it and found it useful.
Thank you

.. Author              : Amit Sethi
   Internal Reviewer 1 :
   Internal Reviewer 2 :
   External Reviewer   :