statistics.txt
changeset 58 2c4e318741cf
parent 57 8eb98721a5af
parent 53 3d2c2c0bc3e2
child 59 b62177acce71
57:8eb98721a5af 58:2c4e318741cf
     1 Hello and welcome to the tutorial on handling large data files and processing them to get desired results.
     1 Hello and welcome to the tutorial on handling large data files and processing them.
     2 
     2 
     3 Till now we have covered:
     3 Till now we have covered:
     4 * How to create plots.
     4 * How to create plots.
     5 * How to read data from file and process it.
     5 * How to read data from files and process it.
     6 
     6 
     7 In this session, we will use them and some new concepts to solve a problem/exercise. 
     7 In this session, we will use these concepts and some new ones, to solve a problem/exercise. 
     8 
     8 
     9 We have a file named sslc.txt. 
     9 We have a file named sslc.txt. 
    10 It contains records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read this data and process it.
    10 It contains records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read this data and process it.
    11 We can see the contents of the file by opening it with any text editor.
    11 We can see the contents of the file by opening it with any text editor.
    12 Please don't edit the data.
    12 Please don't edit the data.
    13 This file has a particular structure. Each line in the file is a set of 11 fields:
    13 This file has a particular structure. Each line in the file is a set of 11 fields separated by semi-colons
    14 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
    14 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
    15 The following are the fields in any given line.
    15 The following are the fields in any given line.
    16 * Region Code which is 'A'
    16 * Region Code which is 'A'
    17 * Roll Number 015163
    17 * Roll Number 015163
    18 * Name JOSEPH RAJ S
    18 * Name JOSEPH RAJ S
    45 Let's first start off with dictionaries.
    45 Let's first start off with dictionaries.
    46 
    46 
    47 We earlier used lists briefly. Back then we just created lists and appended items into them. 
    47 We earlier used lists briefly. Back then we just created lists and appended items into them. 
    48 x = [1, 4, 2, 7, 6]
    48 x = [1, 4, 2, 7, 6]
    49 In order to access any element in a list, we use its index number. Index starts from 0.
    49 In order to access any element in a list, we use its index number. Index starts from 0.
    50 For eg. x[0] will give 1 and x[3] will 7.
    50 For eg. x[0] will give 1 and x[3] will give 7.
    51 
    51 
    52 There are times when we can't access things through integer indexes. For example consider a telephone directory, we give it a name and it should return back corresponding number. List is not the best kind of data structure for such problems, and hence Python provides support for dictionaries. Dictionaries are key value pairs. Lists are indexed by integers while dictionaries are indexed by strings. For example:
    52 But using integer indexes isn't always convenient. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not well suited to such problems; Python's dictionaries are a better fit. Dictionaries are just key-value pairs. For example:
    53 
    53 
    54 d = {'png' : 'image',
    54 d = {'png' : 'image',
    55       'txt' : 'text', 
    55       'txt' : 'text', 
    56       'py' : 'python'} 
    56       'py' : 'python'} 
    57 
    57 
    58 d
    58 d
    59 
    59 
    60 d is a dictionary. The first element in the pair is called the `key' and the second is called the `value'. The key must be an immutable (hashable) value, most commonly a string, while the value can be of any type.
    60 d is a dictionary. The first element in the pair is called the `key' and the second is called the `value'. The key must be an immutable (hashable) value, most commonly a string, while the value can be of any type.
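
The keys used in this tutorial are all strings. More generally, Python accepts any hashable value as a key. The sketch below illustrates this; the names extensions, squares and grid are made up for the example and are not part of the tutorial's code.

```python
# String keys, as used throughout this tutorial.
extensions = {'png': 'image', 'txt': 'text', 'py': 'python'}

# Other hashable values work as keys too.
squares = {1: 1, 2: 4, 3: 9}                # integer keys
grid = {(0, 0): 'origin', (1, 0): 'east'}   # tuple keys

# A list is mutable and therefore unhashable, so it cannot be a key.
try:
    bad = {[1, 2]: 'oops'}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
```

Values, in contrast, have no such restriction: a value may be a list, another dictionary, or any other object.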
    61 
    61 
    62 Dictionaries are indexed using their keys as shown
    62 Lists are indexed by integers, while dictionaries are indexed by their keys, which are strings in this example, as shown:
    63 In []: d['txt']
    63 In []: d['txt']
    64 Out[]: 'text'
    64 Out[]: 'text'
    65 
    65 
    66 In []: d['png']
    66 In []: d['png']
    67 Out[]: 'image'
    67 Out[]: 'image'
    71 True
    71 True
    72 
    72 
    73 'jpg' in d
    73 'jpg' in d
    74 False
    74 False
    75 
    75 
    76 Please note the values cannot be searched in a dictionaries.
    76 Please note that keys, and not values, are searched. 
    77 'In a telephone directory one can search for a number based on a name, but not for a name based on a number'
    77 'In a telephone directory one can search for a number based on a name, but not for a name based on a number'
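
A minimal sketch of the telephone directory analogy follows. The entries and the helper names_for are hypothetical, chosen only for illustration. It shows that `in` looks at keys only, and that finding a name from a number means scanning every pair.

```python
# Made-up directory entries: name -> number.
directory = {'asha': '555-0101', 'ravi': '555-0102'}

has_name = 'asha' in directory          # True: 'asha' is a key
has_number = '555-0101' in directory    # False: values are not searched


def names_for(number, book):
    """Return every name whose number matches, by scanning all pairs."""
    return [name for name, num in book.items() if num == number]


owners = names_for('555-0102', directory)
```

The reverse lookup returns a list because several names may share one number, whereas each key appears only once in a dictionary.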
    78 
    78 
    79 to obtain the list of all keys in a dictionary type
    79 to obtain the list of all keys in a dictionary, type
    80 d.keys()
    80 d.keys()
    81 ['py', 'txt', 'png']
    81 ['py', 'txt', 'png']
    82 
    82 
    83 Similarly,
    83 Similarly,
    84 d.values()
    84 d.values()
   121 Let's now start off with the code
   121 Let's now start off with the code
   122 
   122 
   123 We first create an empty dictionary
   123 We first create an empty dictionary
   124 
   124 
   125 science = {}
   125 science = {}
   126 now we read the record data one by one from the file sslc.txt
   126 now we read the records, one by one from the file sslc.txt
   127 
   127 
   128 for record in open('sslc.txt'):
   128 for record in open('sslc.txt'):
   129 
   129 
   130     we split the record on ';' and store them in a list by: fields equals record.split(';')
   130     we split each record on ';' and store it in a list by: fields equals record.split(';')
   131 
   131 
   132     now we get the region code of a particular entry by region_code equal to fields[0].strip().
   132     now we get the region code of a particular entry by region_code equal to fields[0].strip().
   133 strip() removes all leading and trailing whitespace from a given string.
   133 strip() removes all leading and trailing whitespace from a given string.
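
As an illustration, the two steps can be tried on the sample record shown earlier. This is a small sketch; the variable padded is added only to show what strip() does to surrounding whitespace.

```python
# The sample record from the file, with its trailing newline.
record = 'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n'

fields = record.split(';')         # list of strings, one per field
region_code = fields[0].strip()    # region code of this record
score_str = fields[6].strip()      # science marks, or 'AA' if absent

# strip() removes whitespace (spaces, tabs, newlines) at both ends only.
padded = '  083 \n'.strip()
```

For this record, region_code is 'A' and score_str is 'AA'.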
   134 
   134 
   135     now we check if the region code is already there in dictionary by typing
   135     now we check if the region code is already there in dictionary by typing
   137        when this statement is true, we add new entry to dictionary with initial value 0 and key being the region code.
   137        when this statement is true, we add new entry to dictionary with initial value 0 and key being the region code.
   138        science[region_code] = 0
   138        science[region_code] = 0
   139        
   139        
   140     Note that this if statement is inside the for loop so for the if block we will have to give additional indentation.
   140     Note that this if statement is inside the for loop so for the if block we will have to give additional indentation.
   141 
   141 
   142     we again come back to the older 'for' loop indentation and we again strip the string and to get the science marks by
   142     we come back to the 'for' loop's indentation level and get the science marks by
   143     score_str = fields[6].strip()
   143     score_str = fields[6].strip()
   144 
   144 
   145     we check if student was not absent
   145     we check if student was not absent
   146     if score_str != 'AA':
   146     if score_str != 'AA':
   147        then we check if his marks are above 90 or not
   147        then we check if his marks are above 90 or not
   163 
   163 
   164 title('Students scoring 90% and above in science by region')
   164 title('Students scoring 90% and above in science by region')
   165 savefig('science.png')
   165 savefig('science.png')
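
The counting steps described above can be sketched as a single function. This is only a sketch under stated assumptions, not the tutorial's exact code: the function name count_high_scorers, the threshold parameter, the choice of "90 and above" (taken from the plot title) and all sample records except the first are made up for illustration.

```python
def count_high_scorers(lines, threshold=90):
    """Count, per region code, the students whose science marks reach threshold."""
    science = {}
    for record in lines:
        fields = record.split(';')
        if len(fields) < 7:
            continue                      # skip lines too short to hold the marks
        region_code = fields[0].strip()
        if region_code not in science:
            science[region_code] = 0      # first time we see this region
        score_str = fields[6].strip()
        # 'AA' marks an absent student; isdigit() also guards against blanks.
        if score_str != 'AA' and score_str.isdigit():
            if int(score_str) >= threshold:
                science[region_code] += 1
    return science


# First record is from the file; the rest are hypothetical.
sample = [
    'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;\n',
    'A;000001;STUDENT ONE;080;075;70;95;88;408;;;\n',
    'B;000002;STUDENT TWO;060;055;50;90;61;316;;;\n',
    'B;000003;STUDENT THREE;060;055;50;45;61;271;;;\n',
]
counts = count_high_scorers(sample)
```

Because the function accepts any iterable of lines, the same call should also work on the real data as count_high_scorers(open('sslc.txt')).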
   166 
   166 
   167 That brings us to the end of this tutorial. We have learnt about dictionaries, some basic string parsing and plotting a pie chart in this tutorial. Hope you have enjoyed it. Thank you.
   167 That brings us to the end of this tutorial. We have learnt about dictionaries, some basic string parsing and plotting a pie chart in this tutorial. Hope you have enjoyed it. Thank you.
       
   168 #slide of summary.