1 Hello, and welcome to the tutorial on statistics and dictionaries in Python. |
1 Hello, and welcome to the tutorial on statistics and dictionaries in Python. |
2 |
2 |
3 In the previous tutorial we saw the `for' loop and lists. Here we shall look into |
3 So far we have covered: |
4 calculating mean for the same pendulum experiment and then move on to calculate |
4 * How to create plots. |
5 the mean, median and standard deviation for a very large data set. |
5 * How to read data from file and process it. |
6 |
6 |
7 Let's start with calculating the mean acceleration due to gravity based on the data from pendulum.txt. |
7 In this session, we will use them and some new concepts to solve a problem/exercise. |
8 |
8 |
9 We first create an empty list `g_list' to which we shall append the values of `g'. |
9 We have a file named sslc1.txt. |
10 In []: g_list = [] |
10 It contains records of students and their performance in a State Secondary Board Examination. |
|
11 We can view the contents of the file by opening it in any text editor. |

12 Please don't edit the data. |

13 It is arranged in a particular format. |

14 One particular line is: |

15 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;; |

16 It has the following fields: |
|
17 * Region Code which is 'A' |
|
18 * Roll Number 015163 |
|
19 * Name JOSEPH RAJ S |
|
20 * Marks of 5 subjects: |
|
21 ** English 083 |
|
22 ** Hindi 042 |
|
23 ** Maths 47 |
|
24 ** Science AA (Absent) |
|
25 ** Social 72 |
|
26 * Total marks 244 |
|
27 * Pass/Fail Blank, because the student was absent in one exam; otherwise it would be P or F |

28 * Withheld Blank in this case (otherwise W) |
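The field layout listed above can be sketched as a small parser. This is only an illustration written for Python 3: the function name `parse_record` and the field labels in `FIELD_NAMES` are hypothetical names chosen here, not part of the data file or of the session that follows.

```python
# Hypothetical labels for the fields described above, in file order.
FIELD_NAMES = ['region', 'roll_number', 'name', 'english', 'hindi',
               'maths', 'science', 'social', 'total', 'pass_fail', 'withheld']

def parse_record(line):
    """Split one ';'-separated record into a dictionary of named fields."""
    values = [field.strip() for field in line.strip().split(';')]
    # zip() stops at the shorter sequence, so the empty piece that
    # follows the final ';' is ignored.
    return dict(zip(FIELD_NAMES, values))

record = parse_record('A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;')
print(record['name'])      # JOSEPH RAJ S
print(record['science'])   # AA
```

All values stay as strings at this point, so a mark such as 'AA' can be checked before any conversion to a number is attempted.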
11 |
29 |
12 For each pair of `L' and `t' values in the file `pendulum.txt' we calculate the |
30 So the problem we are going to solve is: |
13 value of `g' and append it to the list `g_list' |
31 Draw a pie chart representing the proportion of students in each region who scored more than 90% in Science. |
14 In []: for line in open('pendulum.txt'): |
|
15 .... point = line.split() |
|
16 .... L = float(point[0]) |
|
17 .... t = float(point[1]) |
|
18 .... g = 4 * pi * pi * L / (t * t) |
|
19 .... g_list.append(g) |
|
20 |
32 |
21 We proceed to calculate the mean of the value of `g' from the list `g_list'. |
33 The result would be something like this: |
22 Here we shall show three ways of calculating the mean. |
34 slide of result. |
23 Firstly, we calculate the sum `total' of the values in `g_list'. |
|
24 In []: total = 0 |
|
25 In []: for g in g_list: |
|
26 ....: total += g |
|
27 ....: |
|
28 |
35 |
29 Once we have the total, we calculate the mean by dividing `total' by the length of `g_list' |
36 We will be using the following machinery: |
|
37 File Reading(done already) |
|
38 parsing (done partly) |
|
39 Dictionaries (new) |
|
40 Arrays |
|
41 Plot (done already) |
30 |
42 |
31 In []: g_mean = total / len(g_list) |
43 Dictionaries |
32 In []: print 'Mean: ', g_mean |
|
33 |
44 |
34 The second method is slightly simpler. Python provides a built-in function called "sum()" that computes the sum of all the elements in a list. |
45 Earlier we used lists: we just created them and appended items to them. |
35 In []: g_mean = sum(g_list) / len(g_list) |
46 x = [1, 4, 2, 7, 6] |
36 In []: print 'Mean: ', g_mean |
47 To access an element we use its index number, which starts from 0, so |

48 x[0] will give |

49 1 and |

50 x[3] will give |

51 7 |
37 |
52 |
38 The third method is the simplest. The pylab environment (through NumPy) provides a function `mean' that |
53 At times we don't have an index to relate things. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not the best data structure for such problems, and hence Python provides support for dictionaries. Dictionaries are key-value pairs. Lists are indexed by integers while dictionaries are indexed by keys, which are most often strings. For example: |
39 calculates the mean of all the elements in a list. |
|
40 In []: g_mean = mean(g_list) |
|
41 In []: print 'Mean: ', g_mean |
|
42 |
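The calculation of `g' and the different ways of taking the mean can be tried without the data file. The sketch below is written for Python 3 and makes two assumptions: the (L, t) pairs are made-up illustrative values, not the contents of pendulum.txt, and the standard-library `statistics.mean` stands in for the `mean' function of the pylab session so that the example has no external dependencies.

```python
from math import pi
from statistics import mean  # stand-in for pylab's mean (assumption)

# Made-up (L, t) pairs: length in metres, period in seconds.
points = [(0.1, 0.63), (0.2, 0.90), (0.3, 1.10)]

# g = 4 * pi^2 * L / t^2 for each measurement
g_list = []
for L, t in points:
    g_list.append(4 * pi * pi * L / (t * t))

# Way 1: accumulate the total in an explicit loop.
total = 0.0
for g in g_list:
    total += g
mean_loop = total / len(g_list)

# Way 2: let sum() do the accumulation.
mean_sum = sum(g_list) / len(g_list)

# Way 3: call a ready-made mean function.
mean_func = mean(g_list)

print(mean_loop, mean_sum, mean_func)
```

The three results agree up to floating-point rounding; the choice between them is mostly a matter of how much of the arithmetic one wants to spell out.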
54 |
43 Python provides support for dictionaries. Dictionaries are key-value pairs. Lists are indexed by integers while dictionaries are indexed by keys, which are most often strings. For example: |
|
44 In []: d = {'png' : 'image', |
55 In []: d = {'png' : 'image', |
45 'txt' : 'text', |
56 'txt' : 'text', |
46 'py' : 'python'} |
57 'py' : 'python'} |
47 is a dictionary. The first element in the pair is called the `key' and the second |
58 d is a dictionary. The first element in the pair is called the `key' and the second is called the `value'. The key must be a hashable object, most commonly a string, while the value can be of any type. |
48 is called the `value'. The key must be a hashable object, most commonly a string, while the value can be |

49 of any type. |
|
50 |
59 |
51 Dictionaries are indexed using their keys as shown |
60 Dictionaries are indexed using their keys as shown |
52 In []: d['txt'] |
61 In []: d['txt'] |
53 Out[]: 'text' |
62 Out[]: 'text' |
54 |
63 |
55 In []: d['png'] |
64 In []: d['png'] |
56 Out[]: 'image' |
65 Out[]: 'image' |
57 |
66 |
58 A dictionary can be searched for the presence of a certain key by typing |
67 A dictionary can be searched for the presence of a certain key by typing |
59 In []: 'py' in d |
68 'py' in d |
60 Out[]: True |
69 True |
61 |
70 |
62 In []: 'jpg' in d |
71 'jpg' in d |
63 Out[]: False |
72 False |
64 Please note that `in' checks only the keys; values cannot be searched this way. |
73 Please note that `in' checks only the keys; values cannot be searched this way. |
65 |
74 |
66 In []: d.keys() |
75 d.keys() |
67 Out[]: ['py', 'txt', 'png'] |
76 ['py', 'txt', 'png'] |
68 is used to obtain the list of all keys in a dictionary |
77 is used to obtain the list of all keys in a dictionary |
69 |
78 |
70 In []: d.values() |
79 d.values() |
71 Out[]: ['python', 'text', 'image'] |
80 ['python', 'text', 'image'] |
72 is used to obtain the list of all values in a dictionary |
81 is used to obtain the list of all values in a dictionary |
73 |
82 |
74 In []: d |
83 d |
75 Out[]: {'png': 'image', 'py': 'python', 'txt': 'text'} |
84 |
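76 Please observe that, in the Python version used here, dictionaries do not preserve the order in which the items |
85 Please observe that, in the Python version used here, dictionaries do not preserve the order in which the items were entered (Python 3.7 and later do keep insertion order). In this session the order of the elements in a dictionary should not be relied upon. |
77 were entered (Python 3.7 and later do keep insertion order). In this session the order of the elements in a dictionary should not be relied upon. |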
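The dictionary operations covered so far can be collected into one short example built on the telephone directory idea. This is a sketch for Python 3; the names and numbers in `phone` are made up for illustration.

```python
# A small telephone directory; names and numbers are made up.
phone = {'anita': '555-0101', 'bala': '555-0102'}

phone['chitra'] = '555-0103'          # add a new key-value pair
print(phone['bala'])                  # 555-0102

print('anita' in phone)               # True: 'in' tests the keys
print('555-0101' in phone)            # False: values are not checked
print('555-0101' in phone.values())   # True: look among the values explicitly

# sorted() gives a predictable order, whatever order the dictionary
# itself happens to store its items in.
for name in sorted(phone):
    print(name, phone[name])
```

Going through `sorted()` is a simple way to get repeatable output when the order of the items matters, for example when printing a report.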
86 |
|
87 ------------------------------------------------------------------------------------------------------------------ |
|
88 |
|
89 Parsing and string processing |
|
90 |
|
91 As we saw previously, we will be dealing with lines such as |

92 A;015162;JENIL T P;081;060;77;41;74;333;P;; |

93 so ';' is the delimiter we have to look for. |

94 We will create a string variable to see how we can process it to get the desired output. |
|
95 |
|
96 line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;' |
|
97 a = line.split(';') |
|
98 We used split earlier to split on whitespace; here we split on ';' instead. |

99 a |

100 |

101 is a list with all the fields separated. |

102 a[0] is the region we want. |

103 and a[6] will give us the science marks of that particular student. |

104 So we create a dictionary that maps each region to the number of its students scoring more than 90 marks. |

105 Something like |
|
106 d = {'A': 729, 'C': 764, 'B': 1120,'E': 414, 'D': 603, 'F': 500} |
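The splitting step can be wrapped in a small helper to see what comes out for a single line. This is a sketch for Python 3; the function name `science_marks` is hypothetical, and the two records are the sample lines quoted in this tutorial.

```python
def science_marks(line):
    """Return (region, marks) for one record; marks is None if absent."""
    a = line.strip().split(';')
    region = a[0].strip()
    score = a[6].strip()
    if score == 'AA':          # 'AA' marks an absent student
        return region, None
    return region, int(score)

print(science_marks('A;015162;JENIL T P;081;060;77;41;74;333;P;;'))    # ('A', 41)
print(science_marks('A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;'))  # ('A', None)
```

Checking for 'AA' before calling `int()` matters, because `int('AA')` would raise a ValueError.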
|
107 |
|
108 ------------------------------------------------------------------------------------------------------------------ |
|
109 |
|
110 code |
|
111 |
|
112 We first create an empty dictionary |
|
113 |
|
114 science = {} |
|
115 now we read the record data one by one |
|
116 |
|
117 for record in open('sslc1.txt'): |

118 |

119 we split the record on ';' and store the list in 'fields' |

120     fields = record.split(';') |

121 |

122 now we strip the first field of leading and trailing whitespace to get the region code |

123     region_code = fields[0].strip() |

124 |

125 now we check whether the region code is already there in the dictionary by writing an 'if' statement |

126     if region_code not in science: |

127 when this condition is true, we add a new entry to the dictionary with |

128         science[region_code] = 0 |

129 |

130 we strip the science score string in the same way |

131     score_str = fields[6].strip() |

132 |

133 we check that the student was not absent |

134     if score_str != 'AA': |

135 then we check whether the marks are above 90 |

136         if int(score_str) > 90: |

137             science[region_code] += 1 |
|
138 |
|
139 Hit return twice |
|
140 |
|
141 By the end of this loop we will have our desired output in the dictionary 'science'. |

142 We can check the values by typing |
|
143 science |
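The loop above can also be packaged as a function, which makes it easy to try on a few records before running it on the whole file. This is a sketch for Python 3: the function name `count_high_scorers` and its `column` and `cutoff` parameters are hypothetical, and the two 'B' records are made up for illustration.

```python
def count_high_scorers(records, column=6, cutoff=90):
    """Count, per region, the students whose marks exceed the cutoff."""
    counts = {}
    for record in records:
        fields = record.split(';')
        region_code = fields[0].strip()
        # Make sure every region seen gets an entry, even if its count stays 0.
        if region_code not in counts:
            counts[region_code] = 0
        score_str = fields[column].strip()
        # Skip absent students ('AA') before converting to an integer.
        if score_str != 'AA' and int(score_str) > cutoff:
            counts[region_code] += 1
    return counts

sample = [
    'A;015162;JENIL T P;081;060;77;41;74;333;P;;',     # from the text
    'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;',   # from the text
    'B;000001;STUDENT ONE;070;065;80;95;88;398;P;;',   # made up
    'B;000002;STUDENT TWO;060;055;70;90;75;350;P;;',   # made up
]
science = count_high_scorers(sample)
print(science)
```

Because a file object yields its lines one by one, `count_high_scorers(open('sslc1.txt'))` would process the real data in the same way. Note that a score of exactly 90 is not counted, since the comparison is strictly greater than the cutoff.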
|
144 |
|
145 now to create a pie chart we use |
|
146 |
|
147 pie(science.values(), labels=science.keys()) |

148 title('Students scoring more than 90% in science by region') |
|
149 savefig('science.png') |
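A pie chart draws each wedge in proportion to its value divided by the sum of all values, so it can help to look at those shares directly. The sketch below, written for Python 3, uses the sample counts quoted earlier as "something like" the result; the variable names `labels`, `sizes` and `total` are chosen here for illustration.

```python
# Sample counts quoted earlier in the text; not computed from the file here.
d = {'A': 729, 'C': 764, 'B': 1120, 'E': 414, 'D': 603, 'F': 500}

# Sort the regions so that labels and sizes line up in a predictable order.
labels = sorted(d)
sizes = [d[region] for region in labels]

total = sum(sizes)
for region, size in zip(labels, sizes):
    # Share of the pie taken by this region, as a percentage.
    print('%s: %5.1f%%' % (region, 100.0 * size / total))
```

The two lists could then be passed to the plotting call as `pie(sizes, labels=labels)`, which keeps the wedges in alphabetical order of region.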
|
150 |