Hello and welcome to the tutorial on handling large data files and processing them.

So far we have covered:
* How to create plots.
* How to read data from files and process it.

In this session, we will use these concepts, along with some new ones, to solve an exercise.

We have a file named sslc.txt.
It contains records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read this data and process it.
We can see the contents of the file by opening it with any text editor.
Please don't edit the data.
This file has a particular structure. Each line in the file is a set of 11 fields separated by semicolons:
A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
The following are the fields in any given line.
* Region Code, which is 'A'
* Roll Number, 015163
* Name, JOSEPH RAJ S
Let's first start off with dictionaries.

We used lists briefly earlier. Back then we just created lists and appended items to them.
x = [1, 4, 2, 7, 6]
To access any element in a list, we use its index number. Indexing starts from 0.
For example, x[0] will give 1 and x[3] will give 7.
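
To make the indexing concrete, here is a minimal sketch using the same list x. The negative index and the append call are additions for illustration and are not part of the original session.

```python
x = [1, 4, 2, 7, 6]

first = x[0]     # index 0 is the first element
fourth = x[3]    # index 3 is the fourth element
last = x[-1]     # a negative index counts from the end of the list

x.append(9)      # appending grows the list by one element
print(first, fourth, last, len(x))
```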

But using integer indexes isn't always convenient. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not well suited to such problems; Python's dictionaries are a better fit. Dictionaries are just key-value pairs. For example:

d = {'png': 'image',
     'txt': 'text',
     'py': 'python'}

d

d is a dictionary. The first element in each pair is called the 'key' and the second is called the 'value'. The key must be a hashable object, such as a string or a number, while the value can be of any type.
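
In Python, any hashable object can serve as a dictionary key, not only a string. The sketch below illustrates this; the dictionary named mixed and its contents are made up for the example.

```python
d = {'png': 'image', 'txt': 'text', 'py': 'python'}

# Keys may be strings, numbers or tuples: anything hashable.
mixed = {1: 'one', (2, 3): 'a tuple key', 'name': 'a string key'}

# A list is mutable and therefore unhashable, so it cannot be a key.
try:
    bad = {[1, 2]: 'a list key'}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

print(sorted(d.keys()))
print(mixed[(2, 3)], list_key_allowed)
```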

Lists are indexed by integers, while dictionaries are indexed by their keys (strings, in this example), as shown:
In []: d['txt']
Out[]: 'text'

In []: d['png']
Out[]: 'image'
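
The telephone directory mentioned earlier can be sketched the same way. The names and numbers below are hypothetical, and the use of get() and the in operator is an addition for illustration.

```python
# Hypothetical directory entries, made up for this example.
directory = {'asha': '555-0101', 'ravi': '555-0102'}

number = directory['asha']                       # look up a value by its key
missing = directory.get('kiran', 'not listed')   # a default instead of a KeyError
has_ravi = 'ravi' in directory                   # 'in' tests the keys

directory['kiran'] = '555-0103'                  # assignment adds a new pair
print(number, missing, has_ravi, len(directory))
```

Looking up a key that is absent with square brackets raises a KeyError, which is why the membership test matters when we start filling a dictionary from a file.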
Let's now start off with the code.

We first create an empty dictionary:

science = {}
Now we read the records, one by one, from the file sslc.txt:

for record in open('sslc.txt'):

We split each record on ';' and store the pieces in a list: fields = record.split(';')
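
Here is what the split produces for the sample record shown earlier. The trailing newline is an assumption about how the line looks when it is read from the file.

```python
# The sample record from the file, with the newline a file read would include.
record = 'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n'

fields = record.split(';')

# Eleven semicolons give twelve pieces; the last piece is just the newline.
print(len(fields))
print(fields[0], fields[2], fields[6])
```

Every piece is still a string, so a score has to be converted to a number before it can be compared with one.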

Now we get the region code of a particular entry with region_code = fields[0].strip().
strip() removes all leading and trailing whitespace from a given string.
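
A small sketch of what strip() does and does not remove. The padded strings are made up for the example.

```python
padded = '  A \n'
region_code = padded.strip()       # spaces and the newline at both ends are removed

name = ' JOSEPH RAJ S '.strip()    # spaces inside the string are kept

print(repr(region_code), repr(name))
```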

Now we check whether the region code is already in the dictionary by typing
When this condition is true, we add a new entry to the dictionary, with the region code as the key and an initial value of 0.
science[region_code] = 0

Note that this if statement is inside the for loop, so the if block needs one more level of indentation.
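
The check itself is not shown in the lines above. One common way to write it, assumed here rather than taken from the script, is a not in test on the dictionary. The list of region codes stands in for the values read from the file.

```python
science = {}

# Hypothetical region codes standing in for fields[0].strip() of each record.
for region_code in ['A', 'B', 'A']:
    # Assumed form of the check: add the key only on its first appearance.
    if region_code not in science:
        science[region_code] = 0

print(science)
```

The second 'A' finds its key already present, so the existing entry is left untouched.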

We come back to the 'for' loop's indentation level and get the science marks by
score_str = fields[6].strip()

We check that the student was not absent:
if score_str != 'AA':
Then we check whether the student's marks are above 90.
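
Putting the steps so far together gives roughly the following sketch. It runs on a small in-memory stand-in for sslc.txt: the first record is the sample line from the file, while the other two are hypothetical. The membership test and the increment for scores above 90 are assumptions about the parts of the code not shown here.

```python
import io

# Stand-in for open('sslc.txt'). The first record is the sample line from
# the text; the other two are hypothetical and exist only for this sketch.
sample = io.StringIO(
    'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n'
    'A;000001;STUDENT ONE;080;075;88;95;90;428;;;\n'
    'B;000002;STUDENT TWO;070;065;77;60;81;353;;;\n'
)

science = {}
for record in sample:
    fields = record.split(';')
    region_code = fields[0].strip()
    if region_code not in science:       # assumed form of the check
        science[region_code] = 0
    score_str = fields[6].strip()
    if score_str != 'AA':                # 'AA' marks an absent student
        if int(score_str) > 90:          # assumed: count scores above 90
            science[region_code] += 1

print(science)
```

With the real file, the io.StringIO object would be replaced by open('sslc.txt'); the loop body stays the same.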