Hello friends and welcome to the tutorial on statistics using Python.

{{{ Show the slide containing title }}}

{{{ Show the slide containing the outline slide }}}

In this tutorial, we shall learn how to

* do simple statistical operations in Python
* apply these to real world problems

You will need IPython with pylab running on your computer
to use this tutorial.

You will also need to know about loading data using loadtxt to be
able to follow the real world application.

We will start with the most basic statistical
operation, i.e. finding the mean.

We have a list of ages of a random group of people ::

    age_list=[4,45,23,34,34,38,65,42,32,7]

One way of getting the mean is to take the sum of
all the elements and divide it by the length of the list ::

    sum_age_list=sum(age_list)

The sum function gives us the sum of the elements. ::

    mean_using_sum=sum_age_list/float(len(age_list))

The length is converted to a float so that the result is not truncated
by integer division on Python 2.

This gives the mean age, but pylab provides a more direct
way of getting the mean. This is the mean function ::

    mean(age_list)

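As a quick check, the two approaches can be compared outside of pylab. The sketch below assumes NumPy is installed and imports it explicitly as ``np``, which the tutorial itself does not do. The variable names are only illustrative.

```python
import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Mean by hand: total of the elements divided by how many there are.
# float() keeps the division from truncating on Python 2.
mean_using_sum = sum(age_list) / float(len(age_list))

# Mean using NumPy's mean function (the same function pylab exposes as mean).
mean_using_numpy = np.mean(age_list)

print(mean_using_sum)
print(mean_using_numpy)
```

Both values should agree, since the elements add up to 324 and there are 10 of them.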
The mean function can be used in more ways with two dimensional lists.
Take a two dimensional list ::

    two_dimension=[[1,5,6,8],[1,3,4,5]]

Used in its default manner, the mean function gives the mean of the
flattened sequence. A flattened sequence is the two lists taken
as if they were a single list of elements ::

    mean(two_dimension)
    flattened_seq=[1,5,6,8,1,3,4,5]
    mean(flattened_seq)

As you can see, both the results are the same. The other option is to
take the mean of each column. ::

    mean(two_dimension,0)
    array([ 1. , 4. , 5. , 6.5])

or along the two rows separately. ::

    mean(two_dimension,1)
    array([ 5. , 3.25])

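The three ways of calling mean on a two dimensional list can be put side by side. This is a minimal sketch that assumes NumPy is imported explicitly as ``np`` and passes the second argument by its keyword name ``axis``, which is equivalent to the positional form used above.

```python
import numpy as np

two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

# No axis given: the mean of all eight elements taken as one flat sequence.
overall_mean = np.mean(two_dimension)

# axis=0: average down the rows, giving one mean per column.
column_means = np.mean(two_dimension, axis=0)

# axis=1: average across the columns, giving one mean per row.
row_means = np.mean(two_dimension, axis=1)

print(overall_mean)
print(column_means)
print(row_means)
```

The result for ``axis=0`` has four entries, one per column, and the result for ``axis=1`` has two entries, one per row.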
We can see more options of mean using ::

    mean?

Similarly, we can calculate the median and standard deviation of a list
using the functions median and std ::

    median(age_list)
    std(age_list)

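One detail worth knowing is which standard deviation std returns. The sketch below assumes NumPy is imported explicitly as ``np``. By default NumPy's std divides by the number of elements (the population form), and the ``ddof=1`` argument switches to the sample form that divides by one less.

```python
import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

# Median: the middle value of the sorted list (average of the two
# middle values when the length is even).
median_age = np.median(age_list)

# Default standard deviation: divides by N (population form, ddof=0).
population_std = np.std(age_list)

# Sample standard deviation: divides by N - 1.
sample_std = np.std(age_list, ddof=1)

print(median_age)
print(population_std)
print(sample_std)
```

The sample form is always slightly larger than the population form for the same data.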
Now let's apply this to a real world example.

We will use a data file located at the path
``/home/fossee/sslc2.txt``. It contains records of students and their
performance in one of the State Secondary Board Examinations. It has
180,000 lines of records. We are going to read and process this
data. We can see the contents of the file by double clicking on it. It
might take some time to open since it is quite a large file. Please
don't edit the data. This file has a particular structure.

We can do ::

    cat /home/fossee/sslc2.txt

to check the contents of the file.

Each line in the file is a set of 11 fields separated
by semi-colons. Consider a sample line from this file. ::

    A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;

The following are the fields in any given line.

* Region Code, which is 'A'
* Roll Number 015163
* Name JOSEPH RAJ S
* Marks of 5 subjects:

  * English 083
  * Hindi 042
  * Maths 47
  * Science 00
  * Social 72

* Total marks 244

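To see how the fields line up with their positions, the sample line can be split by hand. This is only an illustrative sketch using plain Python string methods. The variable names are made up here, and the column positions are read off the sample line above.

```python
sample_line = "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;"

# Split the record on the semi-colon delimiter.
fields = sample_line.split(";")

region_code = fields[0]
roll_number = fields[1]
name = fields[2]

# Columns 3 to 7 hold the marks of the five subjects.
marks = [int(m) for m in fields[3:8]]

# Column 8 holds the total marks.
total = int(fields[8])

print(region_code, roll_number, name)
print(marks)
print(total)
```

Counting from zero, English is column 3, which is why the loadtxt call that follows uses that column number.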
Now let's try and find the mean of English marks of all students.

For this we do ::

    L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';')
    L
    mean(L)

The loadtxt function loads data from an external file. delimiter specifies
the character that separates the fields of data.
usecols specifies the columns to be used, here (3,). The trailing comma is
needed because usecols expects a sequence, and (3,) is a one-element tuple.

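The same call can be tried without the data file by feeding loadtxt a few records from memory. This is a sketch under stated assumptions: NumPy is imported explicitly as ``np``, the first record is the sample line from the file, and the other two records are invented purely for illustration.

```python
import io
import numpy as np

# First record is the sample line; the other two are hypothetical.
records = [
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;",
    "A;000001;STUDENT ONE;070;050;60;55;65;300;;;",
    "B;000002;STUDENT TWO;090;060;80;75;85;390;;;",
]
sample_file = io.StringIO("\n".join(records))

# Read only column 3 (English marks), splitting each line on ';'.
english = np.loadtxt(sample_file, usecols=(3,), delimiter=';')

print(english)
print(np.mean(english))
```

Only the selected column is converted to numbers, so the text in the name and region columns does not get in the way.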
To get the median marks ::

    median(L)

and the standard deviation ::

    std(L)

Now let's try and get the mean for all the subjects ::

    L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
    mean(L,0)
    array([ 73.55452504, 53.79828941, 62.83342759, 50.69806158, 63.17056881])

The result of mean(L,0) is the mean marks, over all students who took
the exam, for each of the five subjects.

Similarly ::

    mean(L,1)

gives the average marks of each individual student across the five subjects.
Note that mean(L,0) computes the mean of each column, one value per subject,
while mean(L,1) computes the mean of each row, one value per student.

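The difference between the two axes is easier to see on a small table. The sketch below assumes NumPy is imported explicitly as ``np`` and uses three in-memory records, of which only the first comes from the data file. The other two are invented for illustration.

```python
import io
import numpy as np

# First record is the sample line; the other two are hypothetical.
records = [
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;",
    "A;000001;STUDENT ONE;070;050;60;55;65;300;;;",
    "B;000002;STUDENT TWO;090;060;80;75;85;390;;;",
]
sample_file = io.StringIO("\n".join(records))

# Columns 3 to 7 are the five subjects; each row is one student.
marks = np.loadtxt(sample_file, usecols=(3, 4, 5, 6, 7), delimiter=';')

# One mean per subject (averaging over the students).
subject_means = np.mean(marks, 0)

# One mean per student (averaging over the subjects).
student_means = np.mean(marks, 1)

print(marks.shape)
print(subject_means)
print(student_means)
```

With three students and five subjects, the first result has five entries and the second has three.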
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

* How to do the standard statistical operations sum, mean,
  median and standard deviation in Python.
* How to combine text loading and the statistical operations to solve
  real world problems.

{{{ Show the "sponsored by FOSSEE" slide }}}

This tutorial was created as a part of FOSSEE project, NME ICT, MHRD India

Hope you have enjoyed and found it useful.
Thank you

.. Author : Amit Sethi
   Internal Reviewer 1 :
   Internal Reviewer 2 :
   External Reviewer :