author | Shantanu <shantanu@fossee.in> |
Tue, 13 Apr 2010 00:07:35 +0530 | |
changeset 47 | 501e3fb21e3c |
parent 46 | 34df59770550 |
child 50 | 9d60720b16b0 |
permissions | -rw-r--r-- |
Hello and welcome to the tutorial on handling large data files and processing them to get the desired results.

Till now we have covered:
* How to create plots.
* How to read data from a file and process it.

In this session, we will use these, along with some new concepts, to solve a problem.

We have a file named sslc1.txt.
It contains records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read it and process this data.
We can see the content of the file by opening it with any text editor.
Please don't edit the data.
It is arranged in a particular format.
One particular line being:
A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
It has the following fields:
* Region Code, which is 'A'
* Roll Number 015163
* Name JOSEPH RAJ S
* Marks of 5 subjects:
** English 083
** Hindi 042
** Maths 47
** Science AA (Absent)
** Social 72
* Total marks 244
* Pass/Fail: blank, because he was absent in one exam; otherwise it will be P or F
* Withheld: blank in this case; otherwise it will be W

So the problem we are going to solve is:
Draw a pie chart representing the proportion of students who scored more than 90% in Science in each region.

The result would be something like this:
slide of result.

We will be using the following machinery:
* File reading (done already)
* Parsing (done partly)
* Dictionaries (new)
* Arrays
* Plot (done already)

Dictionaries

Earlier we used lists; back then we just created them and appended items to them.
x = [1, 4, 2, 7, 6]
To access an element we use its index number, and indexing starts from 0, so
x[0] will give
1 and
x[3] will give
7

At times we don't have an index to relate things. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not the best kind of data structure for such problems, and hence Python provides support for dictionaries. Dictionaries are collections of key-value pairs. Lists are indexed by integers, while dictionaries are indexed by keys, which are very often strings. For example:

d = {'png' : 'image',
     'txt' : 'text',
     'py' : 'python'}

d

d is a dictionary. The first element in each pair is called the `key' and the second is called the `value'. The key can be of any hashable type, such as a string, a number or a tuple, while the value can be of any type.

Dictionaries are indexed using their keys, as shown:
In []: d['txt']
Out[]: 'text'

In []: d['png']
Out[]: 'image'

A dictionary can be searched for the presence of a certain key by typing
'py' in d
True

Please note that this test looks only at the keys; the values of a dictionary are not searched this way. A key that is not present gives False:
'jpg' in d
False
It is like a telephone directory: we look up a name to find a number, not the other way round.

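The dictionary operations described above can be tried out in a short script. This is only a sketch: the `phone_book` and `marks_by_roll` dictionaries, along with the names and numbers in them, are made up for illustration and are not part of the tutorial data.

```python
# Hypothetical telephone directory: names are keys, numbers are values.
phone_book = {'anita': '2576-4111', 'ravi': '2576-4222'}

# Look up a value by its key.
ravi_number = phone_book['ravi']

# 'in' tests keys only; a value is not found this way.
has_anita = 'anita' in phone_book
has_number = '2576-4111' in phone_book

# Values can still be searched explicitly through values().
number_listed = '2576-4111' in phone_book.values()

# Keys need not be strings; any hashable object such as an int works.
marks_by_roll = {15163: 244, 15162: 333}

# get() returns a default instead of raising KeyError for a missing key.
missing = phone_book.get('kiran', 'not listed')

print(ravi_number, has_anita, has_number, number_listed, missing)
```

Searching through values() walks over every value in turn, so it is slower than a key lookup on a large dictionary.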
To obtain the list of all keys in a dictionary, we use
d.keys()
['py', 'txt', 'png']

Similarly,
d.values()
['python', 'text', 'image']
is used to obtain the list of all values in a dictionary.

d

Please observe that dictionaries, in the Python version used here, do not preserve the order in which the items were entered, so the order of the elements in a dictionary should not be relied upon. (Python 3.7 and later do preserve insertion order.)

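As a sketch of the calls above, assuming Python 3: there keys() and values() return view objects rather than lists, so they are wrapped in list() here. The variable names are chosen for this example only.

```python
d = {'png': 'image', 'txt': 'text', 'py': 'python'}

# In Python 3, keys() and values() return views; list() turns them into lists.
keys = list(d.keys())
values = list(d.values())

# sorted() gives a predictable order regardless of how the dictionary stores items.
sorted_keys = sorted(d)

# items() yields (key, value) pairs, which is handy in a for loop.
for extension, kind in d.items():
    print(extension, '->', kind)
```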
------------------------------------------------------------------------------------------------------------------

Parsing and string processing

As we saw previously, we will be dealing with lines like this:
A;015162;JENIL T P;081;060;77;41;74;333;P;;
so ';' is the delimiter we have to look for.

We will create a string variable to see how we can process it to get the desired output.

line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'
a = line.split(';')
We have used split earlier to split on whitespace, but in this case we split the line at each ';'.

a

is a list containing all the fields separately.

a[0] is the region code
and a[6] will give us the science marks of that particular student.

So we create a dictionary of all the regions, with the number of students in each region having more than 90 marks.
# Something like
# d = {'A': 729, 'C': 764, 'B': 1120,'E': 414, 'D': 603, 'F': 500}

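The parsing steps can be sketched as a small function. The function name `parse_science` and the choice to return None for an absent student are assumptions made for this example, not something the tutorial prescribes.

```python
def parse_science(line):
    """Return (region_code, science_marks) for one ';'-separated record.

    science_marks is None when the student was absent ('AA').
    """
    fields = line.split(';')
    region_code = fields[0].strip()
    score_str = fields[6].strip()
    if score_str == 'AA':
        return region_code, None
    return region_code, int(score_str)


# The two sample lines quoted in this script.
present = parse_science('A;015162;JENIL T P;081;060;77;41;74;333;P;;')
absent = parse_science('A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;')
print(present, absent)
```

Checking for 'AA' before calling int() matters, because int('AA') raises a ValueError.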
------------------------------------------------------------------------------------------------------------------

code

We first create an empty dictionary

science = {}
Now we read the records one by one

for record in open('sslc.txt'):

We split the record on ';' and store the list as fields equals record.split(';')
    fields = record.split(';')

Now we get the region code of this particular entry by region_code equals fields[0].strip(). strip will remove all leading and trailing whitespace from the string.
    region_code = fields[0].strip()

Now we check whether the region code is already there in the dictionary by writing an 'if' statement,
    if region_code not in science:
When this condition is true, we add a new entry to the dictionary, with the region code as the key and 0 as the initial value.
        science[region_code] = 0

Note that this if statement is inside the for loop, so the if block needs an additional level of indentation.

We come back to the for loop's indentation level, again strip the string (stripping is good practice) and get the science marks by
    score_str = fields[6].strip()

We check that the student was not absent
    if score_str != 'AA':
then we check whether the marks are above 90 or not
        if int(score_str) > 90:
If so, we add one to the dictionary value for that region by
            science[region_code] += 1

Hit return twice

By the end of this loop we will have our desired output in the dictionary 'science'.
We can check the values by
science

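Put together, the loop looks roughly like the sketch below. To keep it self-contained, it runs over a small in-memory list called `sample_records` instead of the real file; the two records with science marks above 90 are invented for illustration and do not come from the data.

```python
# The first two records are the sample lines from this script; the last two are made up.
sample_records = [
    'A;015162;JENIL T P;081;060;77;41;74;333;P;;\n',
    'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n',
    'A;000001;STUDENT ONE;090;080;95;93;88;446;P;;\n',
    'B;000002;STUDENT TWO;070;065;80;97;75;387;P;;\n',
]

science = {}
for record in sample_records:  # with the real data, loop over the open file instead
    fields = record.split(';')
    region_code = fields[0].strip()
    # Start every region at zero the first time it is seen.
    if region_code not in science:
        science[region_code] = 0
    score_str = fields[6].strip()
    # 'AA' marks an absent student and cannot be converted to int.
    if score_str != 'AA':
        if int(score_str) > 90:
            science[region_code] += 1

print(science)
```

Because the entry is created before the marks are checked, a region in which nobody scored above 90 still appears in the dictionary, with a count of 0.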
Now, to create a pie chart, we use

pie(list(science.values()), labels=list(science.keys()))
title('Students scoring above 90% in science by region')
savefig('science.png')
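pie sizes each wedge by its value divided by the sum of all values. The sketch below computes those shares by hand, using the illustrative counts from the 'Something like' comment earlier in this script, so that the chart can be cross-checked; the names `total` and `shares` are assumptions for this example.

```python
# Illustrative counts from the 'Something like' comment earlier in this script.
science = {'A': 729, 'C': 764, 'B': 1120, 'E': 414, 'D': 603, 'F': 500}

total = sum(science.values())

# Share of each region, in percent, in alphabetical order of region code.
shares = {}
for region_code in sorted(science):
    shares[region_code] = 100.0 * science[region_code] / total
    print(region_code, round(shares[region_code], 1))
```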