author | Puneeth Chaganti <punchagan@gmail.com> |
Tue, 13 Apr 2010 14:32:38 +0530 | |
changeset 53 | 3d2c2c0bc3e2 |
parent 52 | 53700ad0e71e |
child 58 | 2c4e318741cf |
permissions | -rw-r--r-- |
Hello and welcome to the tutorial on handling large data files and processing them.

Till now we have covered:
* How to create plots.
* How to read data from files and process it.

In this session, we will use these concepts, and some new ones, to solve an exercise.

We have a file named sslc.txt.
It contains the records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read this data and process it.
We can see the content of the file by opening it with any text editor.
Please don't edit the data.
This file has a particular structure. Each line in the file is a set of 11 fields separated by semicolons.
A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
The following are the fields in any given line.
* Region Code, which is 'A'
* Roll Number 015163
* Name JOSEPH RAJ S
* Marks of 5 subjects:
** English 083
** Hindi 042
** Maths 47
** Science AA (Absent)
** Social 72
* Total marks 244
* Pass/Fail (P/F) - This field is blank here because this candidate was absent for an exam; otherwise it would have been either P or F.
* Withheld (W) - Again blank in this case.

Let us now look at the problem we wish to solve:
Draw a pie chart representing each region's share of the students who scored more than 90% in Science.

This is the result we expect:
#slide of result.

In order to solve this problem, we need the following machinery:
File reading - which we have already looked at.
Parsing - which we have looked at partially.
Dictionaries - we shall be introducing the concept of dictionaries here.
And finally plotting - which we have been doing all along.

Let's first start off with dictionaries.

We earlier used lists briefly. Back then we just created lists and appended items to them.
x = [1, 4, 2, 7, 6]
In order to access any element in a list, we use its index number. Indexing starts from 0.
For example, x[0] will give 1 and x[3] will give 7.

But using integer indexes isn't always convenient. For example, consider a telephone directory. We give it a name and it should return the corresponding number. A list is not well suited for such problems. Python's dictionaries are better suited for them. Dictionaries are just collections of key-value pairs. For example:

d = {'png' : 'image',
     'txt' : 'text',
     'py' : 'python'}

d

d is a dictionary. The first element in each pair is called the `key' and the second is called the `value'. A key has to be of an immutable (hashable) type, such as a string or a number, while the value can be of any type.
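As a small illustration of what can serve as a key, here is a sketch written for Python 3. The names `ages`, `grid` and `bad_key` are made up for this example and are not part of the tutorial's data. It uses a string, an integer and a tuple as keys, and shows that a list is rejected.

```python
# Keys need not be strings: any hashable (immutable) object works.
ages = {'ravi': 31, 42: 'answer'}         # a string key and an integer key
grid = {(0, 0): 'origin', (1, 2): 'p'}    # tuples are hashable, so they work too

print(ages[42])        # looks up the integer key 42
print(grid[(1, 2)])    # looks up the tuple key (1, 2)

# A list is mutable and therefore cannot be used as a key.
bad_key = [1, 2]
try:
    {bad_key: 'value'}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

print(list_key_allowed)
```

In this session we only need string keys (the region codes), so the other key types are shown just for completeness.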
While lists are indexed by integers, dictionaries are indexed by their keys, as shown
In []: d['txt']
Out[]: 'text'

In []: d['png']
Out[]: 'image'

We can check whether a dictionary contains a certain key by typing
'py' in d
True

'jpg' in d
False

Please note that keys, and not values, are searched.
'In a telephone directory one can search for a number based on a name, but not for a name based on a number'
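The difference between searching keys and searching values can be tried out with the sketch below. It reuses the dictionary d from above; the use of d.values() for searching and of get() with the default 'unknown' are additions for illustration, not steps of the tutorial.

```python
d = {'png': 'image', 'txt': 'text', 'py': 'python'}

# `in` looks only at the keys.
key_found = 'py' in d             # 'py' is a key
value_as_key = 'python' in d      # 'python' is a value, not a key

# To look among the values, search d.values() explicitly.
value_found = 'python' in d.values()

# d[key] raises KeyError for a missing key; get() returns a default instead.
missing = d.get('jpg', 'unknown')

print(key_found, value_as_key, value_found, missing)
```

Searching the values this way has to look at every entry, whereas a key lookup does not, which is one more reason to choose the keys according to what we want to search for.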
To obtain the list of all keys in a dictionary, type
d.keys()
['py', 'txt', 'png']

Similarly,
d.values()
['python', 'text', 'image']
is used to obtain the list of all values in a dictionary.

Let's now see what the dictionary contains
d

Please observe that the dictionary has not preserved the order in which the items were entered. In the version of Python used here, the order of the elements in a dictionary should not be relied upon. (From Python 3.7 onwards, dictionaries do preserve insertion order.)
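When a predictable order is needed, one option is to sort the keys explicitly instead of depending on the dictionary's own order. The following is a minimal sketch written for Python 3; wrapping keys() and values() in list() is an assumption made so that real lists are obtained on newer Python versions as well.

```python
d = {'png': 'image', 'txt': 'text', 'py': 'python'}

# list() turns the keys and values into ordinary lists.
keys = list(d.keys())
values = list(d.values())

# Sorting the keys gives an order that does not depend on the dictionary.
for key in sorted(d):
    print(key, d[key])

# items() yields (key, value) pairs, which can be sorted in the same way.
pairs = sorted(d.items())
print(pairs)
```

The same idea is useful later for plotting, if the regions should always appear in the same order.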
------------------------------------------------------------------------------------------------------------------

Parsing and string processing

As we saw previously, we will be dealing with lines of the form
A;015162;JENIL T P;081;060;77;41;74;333;P;;
Here ';' is the delimiter, that is, ';' is used to separate the fields.

We shall create a string variable to see how we can process it to get the desired output.

line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'

Previously we saw how to split on spaces when we processed the pendulum.txt file. Let us now look at how to split a string into a list of fields based on a delimiter other than space.
a = line.split(';')

Let's now check what 'a' contains.
a

'a' is a list containing all the fields separately.
a[0] is the region code, a[1] the roll number, a[2] the name and so on.
Similarly, a[6] will give us the Science marks of that particular student.
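The steps above can be put together as a short sketch. The variable names other than line and a are made up for illustration. Note that every field is still a string after splitting, so the marks have to be converted with int() before they can be compared with a number, and that the trailing ';' produces one extra empty string at the end of the list.

```python
line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'

a = line.split(';')

region_code = a[0]
name = a[2]
science_str = a[6]   # still a string at this point

# 'AA' marks an absent student, so convert only when the field is not 'AA'.
if science_str != 'AA':
    science_marks = int(science_str)
else:
    science_marks = None

# int() reads the digits as a decimal number, so '081' becomes 81.
english_marks = int(a[3])

print(len(a))
print(region_code, name, science_marks, english_marks)
```

Here len(a) is 12: the 11 fields of the record plus the empty string after the last semicolon.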
So we create a dictionary that maps each region to the number of its students who scored more than 90 marks.

------------------------------------------------------------------------------------------------------------------

Let's now start off with the code

We first create an empty dictionary

science = {}
Now we read the records, one by one, from the file sslc.txt

for record in open('sslc.txt'):

We split each record on ';' and store it in a list by: fields equals record.split(';')

Now we get the region code of a particular entry by: region_code equals fields[0].strip()
strip() is used to remove all leading and trailing whitespace from a given string
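To see what strip() does, consider the sketch below. The record is a made-up variant of the sample line, padded with spaces around the region code and ending in a newline, as lines read from a file do; the names raw_region, last_field and cleaned_last exist only in this example.

```python
# Hypothetical record with extra spaces and a trailing newline.
record = ' A ;015162;JENIL T P;081;060;77;41;74;333;P;;\n'

fields = record.split(';')

raw_region = fields[0]              # ' A ', spaces included
region_code = raw_region.strip()    # 'A'

# The newline at the end of the line ends up in the last field;
# strip() removes it, since a newline also counts as whitespace.
last_field = fields[-1]             # '\n'
cleaned_last = last_field.strip()   # ''

print(repr(raw_region), repr(region_code))
print(repr(last_field), repr(cleaned_last))
```

Without strip(), ' A ' and 'A' would be treated as two different keys of the dictionary.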
Now we check if the region code is already there in the dictionary by typing
    if region_code not in science:
When this condition is true, we add a new entry to the dictionary, with the region code as the key and 0 as the initial value.
        science[region_code] = 0

Note that this if statement is inside the for loop, so the if block needs an additional level of indentation.

We again come back to the for loop's indentation level and get the science marks by
    score_str = fields[6].strip()

We check that the student was not absent
    if score_str != 'AA':
Then we check whether the marks are above 90
        if int(score_str) > 90:
If yes, we add 1 to the value in the dictionary for that region by
            science[region_code] += 1

Hit return twice to exit the for loop

By the end of this loop we will have our desired output in the dictionary 'science'
We can check the values by
science
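For reference, the whole loop is collected below in one piece, as a sketch for Python 3. So that it can be run without the data file, it loops over a small list of made-up records in the same layout as sslc.txt; these records are an assumption for illustration and are not real data. With the real file, the for line would read from open('sslc.txt') instead.

```python
# Made-up sample records in the same layout as sslc.txt (not real data).
records = [
    'A;015162;STUDENT ONE;081;060;77;95;74;387;P;;\n',
    'A;015163;STUDENT TWO;083;042;47;AA;72;244;;;\n',
    'B;020001;STUDENT THREE;070;065;80;91;60;366;P;;\n',
    'B;020002;STUDENT FOUR;055;048;62;90;58;313;P;;\n',
    'A;015164;STUDENT FIVE;090;088;93;99;91;461;P;;\n',
]

science = {}

for record in records:   # with the real file: for record in open('sslc.txt'):
    fields = record.split(';')
    region_code = fields[0].strip()

    # Make sure every region has an entry, starting from zero.
    if region_code not in science:
        science[region_code] = 0

    score_str = fields[6].strip()

    # Skip absent students, then count scores strictly above 90.
    if score_str != 'AA':
        if int(score_str) > 90:
            science[region_code] += 1

print(science)
```

For these sample records, region A ends up with a count of 2 and region B with a count of 1: the absent student and the student with exactly 90 marks are not counted.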
Now, to create a pie chart, we use

pie(science.values(), labels = science.keys())

The first argument to the pie function is the values to be plotted. The second is an optional argument which is used to label the regions.

title('Students scoring more than 90% in science by region')
savefig('science.png')
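Outside the interactive session used in this tutorial, the same plot can be produced from a script. The sketch below is one possible way of doing it, assuming matplotlib is installed; the helper names pie_inputs and draw_pie and the sample counts are made up for illustration. It sorts the regions so that the labels and values are guaranteed to line up, and passes plain lists to pie.

```python
def pie_inputs(counts):
    """Return (labels, values) lists with the regions in sorted order."""
    labels = sorted(counts)
    values = [counts[label] for label in labels]
    return labels, values


def draw_pie(counts, filename):
    """Draw the pie chart for `counts` and save it to `filename`."""
    # Imported here so the rest of the sketch works without matplotlib.
    import matplotlib
    matplotlib.use('Agg')            # render to a file, no window needed
    import matplotlib.pyplot as plt

    labels, values = pie_inputs(counts)
    plt.figure()
    plt.pie(values, labels=labels)
    plt.title('Students scoring more than 90% in science by region')
    plt.savefig(filename)
    plt.close()


# Made-up counts standing in for the 'science' dictionary built earlier.
science = {'B': 1, 'A': 2, 'C': 4}
labels, values = pie_inputs(science)
print(labels, values)
```

Calling draw_pie(science, 'science.png') would then write the chart to a file. The pie function scales the values itself, so the counts can be passed as they are, without converting them to percentages first.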
That brings us to the end of this tutorial. We have learnt about dictionaries, some basic string parsing and plotting a pie chart. Hope you have enjoyed it. Thank you.
#slide of summary.