1 Hello welcome to the tutorial on statistics and dictionaries in Python. |
1 Hello and welcome to the tutorial on handling large data files and processing them to get desired results. |
2 |
2 |
3 Till now we have covered: |
3 Till now we have covered: |
4 * How to create plots. |
4 * How to create plots. |
5 * How to read data from file and process it. |
5 * How to read data from file and process it. |
6 |
6 |
7 In this session, we will use them and some new concepts to solve a problem/exercise. |
7 In this session, we will use them and some new concepts to solve a problem/exercise. |
8 |
8 |
9 We have a file named sslc1.txt. |
9 We have a file named sslc1.txt. |
10 It contains records of students and their performance in one of the State Secondary Board Examinations. |
10 It contains records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read and process this data. |
11 We can view the contents of the file by opening it in any text editor. |
11 We can view the contents of the file by opening it in any text editor. |
12 Please don't edit the data. |
12 Please don't edit the data. |
13 It is arranged in a particular format. |
13 It is arranged in a particular format. |
14 One particular line being: |
14 One particular line being: |
15 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;; |
15 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;; |
40 Arrays |
40 Arrays |
41 Plot (done already) |
41 Plot (done already) |
42 |
42 |
43 Dictionaries |
43 Dictionaries |
44 |
44 |
45 We used lists earlier; we just created them and appended items to them. |
45 We used lists earlier; back then we just created them and appended items to them. |
46 x = [1, 4, 2, 7, 6] |
46 x = [1, 4, 2, 7, 6] |
47 To access an element we use its index number, which starts from 0, so |
47 To access an element we use its index number, which starts from 0, so |
48 x[0] will give |
48 x[0] will give |
49 1 and |
49 1 and |
50 x[3] will give |
50 x[3] will give |
51 7 |
51 7 |
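The list indexing described above can be tried out directly. This is a minimal sketch using the same list x; the negative index and the len call are additions for illustration and are not part of the original script.

```python
x = [1, 4, 2, 7, 6]

first = x[0]    # indexing starts at 0, so this is the first element
fourth = x[3]   # the element at index 3 is the fourth one
last = x[-1]    # negative indices count from the end of the list

x.append(9)     # append adds an item to the end of the list
print(first, fourth, last, len(x))
```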
52 |
52 |
53 At times we don't have an index to relate things. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not the best data structure for such problems, and hence Python provides dictionaries. Dictionaries store key-value pairs. Lists are indexed by integer positions, while dictionaries are indexed by keys, which are often strings. For example: |
53 At times we don't have an index to relate things. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not the best data structure for such problems, and hence Python provides dictionaries. Dictionaries store key-value pairs. Lists are indexed by integer positions, while dictionaries are indexed by keys, which are often strings. For example: |
54 |
54 |
55 In []: d = {'png' : 'image', |
55 d = {'png' : 'image', |
56 'txt' : 'text', |
56 'txt' : 'text', |
57 'py' : 'python'} |
57 'py' : 'python'} |
|
58 |
|
59 d |
|
60 |
58 d is a dictionary. The first element in the pair is called the 'key' and the second is called the 'value'. The key can be any immutable (hashable) object, such as a string or a number, while the value can be of any type. |
61 d is a dictionary. The first element in the pair is called the 'key' and the second is called the 'value'. The key can be any immutable (hashable) object, such as a string or a number, while the value can be of any type. |
59 |
62 |
60 Dictionaries are indexed using their keys as shown |
63 Dictionaries are indexed using their keys as shown |
61 In []: d['txt'] |
64 In []: d['txt'] |
62 Out[]: 'text' |
65 Out[]: 'text' |
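The telephone directory mentioned earlier could be sketched as a dictionary like this. The names and numbers are made up for illustration, and phone_book is a hypothetical variable name that the script itself does not use.

```python
# Hypothetical names and numbers, made up for illustration.
phone_book = {'alice': '555-0101', 'bob': '555-0102'}

number = phone_book['alice']                  # look up a value by its key
phone_book['carol'] = '555-0103'              # assigning to a new key adds an entry
has_dave = 'dave' in phone_book               # 'in' tests whether a key is present
fallback = phone_book.get('dave', 'unknown')  # get returns a default instead of raising KeyError

print(number, has_dave, fallback, sorted(phone_book))
```

Looking up a missing key with phone_book['dave'] would raise a KeyError, which is why the membership test with 'in' matters when a dictionary is filled step by step.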
89 Parsing and string processing |
93 Parsing and string processing |
90 |
94 |
91 As we saw previously we will be dealing with lines with such content |
95 As we saw previously we will be dealing with lines with such content |
92 A;015162;JENIL T P;081;060;77;41;74;333;P;; |
96 A;015162;JENIL T P;081;060;77;41;74;333;P;; |
93 so ';' is the delimiter we have to look for. |
97 so ';' is the delimiter we have to look for. |
|
98 |
94 We will create a string variable to see how we can process it to get the desired output. |
99 We will create a string variable to see how we can process it to get the desired output. |
95 |
100 |
96 line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;' |
101 line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;' |
97 a = line.split(';') |
102 a = line.split(';') |
98 we have used split earlier to split on whitespace. |
103 we have used split earlier to split on whitespace, but in this case we will split the line at each ';' |
|
104 |
99 a |
105 a |
100 |
106 |
101 is a list with all elements separated. |
107 is a list containing all the fields separately. |
102 a[0] is the region we want. |
108 |
103 and a[6] will give us the science marks of a particular student. |
109 a[0] is the region code. |

110 and a[6] will give us the science marks of the student in that record. |
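The splitting step can be checked on the sample line from the text. This is a small sketch; the names fields, region_code and score_str follow the ones used later in the script.

```python
line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'

# Split on ';' instead of the default whitespace.
fields = line.split(';')

region_code = fields[0].strip()   # first field: the region code
score_str = fields[6].strip()     # seventh field: the science marks, still a string

print(len(fields), region_code, score_str)
```

The two trailing semicolons produce two empty strings at the end of the list, so the fields after the last value are present but empty.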
|
111 |
104 So we create a dictionary that maps each region to the number of its students scoring more than 90 marks. |
112 So we create a dictionary that maps each region to the number of its students scoring more than 90 marks. |
105 Something like |
113 # Something like |
106 d = {'A': 729, 'C': 764, 'B': 1120,'E': 414, 'D': 603, 'F': 500} |
114 # d = {'A': 729, 'C': 764, 'B': 1120,'E': 414, 'D': 603, 'F': 500} |
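Once a dictionary of this shape exists, it can be inspected with ordinary dictionary operations. The sketch below reuses the illustrative counts shown above rather than values computed from the file; total and top_region are names introduced here for the example.

```python
d = {'A': 729, 'C': 764, 'B': 1120, 'E': 414, 'D': 603, 'F': 500}

# Print the regions in alphabetical order with their counts.
for region in sorted(d):
    print(region, d[region])

total = sum(d.values())          # students above 90 across all regions
top_region = max(d, key=d.get)   # region with the highest count
print(total, top_region)
```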
107 |
115 |
108 ------------------------------------------------------------------------------------------------------------------ |
116 ------------------------------------------------------------------------------------------------------------------ |
109 |
117 |
110 code |
118 code |
111 |
119 |
112 We first create an empty dictionary |
120 We first create an empty dictionary |
113 |
121 |
114 science = {} |
122 science = {} |
115 now we read the record data one by one |
123 now we read the record data one by one |
116 |
124 |
117 for record in open('sslc1.txt'): |
125 for record in open('sslc1.txt'): |
118 |
126 |
119 we split the record on ';' and store the list in 'fields' |
127 we split the record on ';' and store the list as fields equals record.split(';') |
120 fields = record.split(';') |
128 # fields = record.split(';') |
121 |
129 |
122 now we strip this string for leading and trailing white spaces |
130 now we get the region code of the entry with region_code equal to fields[0].strip(). strip will remove all leading and trailing whitespace from the string |
123 region_code = fields[0].strip() |
131 # region_code = fields[0].strip() |
124 |
132 |
125 now we check if the region code is already there in the dictionary by writing an 'if' statement |
133 now we check if the region code is already there in the dictionary by writing an 'if' statement, |
126 if region_code not in science: |
134 if region_code not in science: |
127 when this statement is true, we add a new entry to the dictionary with |
135 when this statement is true, we add a new entry to the dictionary with initial value 0 and the key being the region code. |
128 science[region_code] = 0 |
136 science[region_code] = 0 |
|
137 |
|
138 Note that this if statement is inside the for loop, so the body of the if block needs additional indentation. |
129 |
139 |
130 we again strip (stripping is good) the string |
140 we come back to the for loop's indentation level, strip the string again (stripping is good) and get the science marks by |
131 score_str = fields[6].strip() |
141 score_str = fields[6].strip() |
132 |
142 |
133 we check if student was not absent |
143 we check if student was not absent |
134 if score_str != 'AA': |
144 if score_str != 'AA': |
135 then we check if his marks are above 90 or not |
145 then we check if his marks are above 90 or not |
136 if int(score_str) > 90: |
146 if int(score_str) > 90: |
|
147 if true, we add one to the dictionary value for that region by |
137 science[region_code] += 1 |
148 science[region_code] += 1 |
138 |
149 |
139 Hit return twice |
150 Hit return twice |
140 |
151 |
141 By the end of this loop we will have our desired output in the dictionary 'science' |
152 By the end of this loop we will have our desired output in the dictionary 'science' |
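The steps narrated above can be assembled into one piece with the indentation restored. This is a sketch under some assumptions: the function count_high_scores and its threshold parameter are hypothetical names introduced here, and the loop runs over a small in-memory list instead of the real file so that it can be tried without the data. The first two records come from the text; the last two are made up for illustration.

```python
def count_high_scores(records, threshold=90):
    """Count, per region, the students whose science marks exceed threshold."""
    science = {}
    for record in records:
        fields = record.split(';')
        region_code = fields[0].strip()

        # Make sure every region seen has an entry, starting at 0.
        if region_code not in science:
            science[region_code] = 0

        score_str = fields[6].strip()
        # 'AA' marks an absent student, so there is no number to convert.
        if score_str != 'AA':
            if int(score_str) > threshold:
                science[region_code] += 1
    return science


sample = [
    'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;',  # from the text: absent
    'A;015162;JENIL T P;081;060;77;41;74;333;P;;',    # from the text: 41 marks
    'B;000001;EXAMPLE ONE;080;070;60;95;80;385;P;;',  # made up for illustration
    'B;000002;EXAMPLE TWO;080;070;60;91;80;381;P;;',  # made up for illustration
]

result = count_high_scores(sample)
print(result)
```

Since a file object yields its lines one by one, the same function could presumably be given open('sslc1.txt') in place of the list. A blank or malformed line with fewer than seven fields would raise an IndexError in this sketch, so real data may need an extra check.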