statistics.txt
changeset 47 501e3fb21e3c
parent 46 34df59770550
child 50 9d60720b16b0
equal deleted inserted replaced
46:34df59770550 47:501e3fb21e3c
     1 Hello welcome to the tutorial on statistics and dictionaries in Python.
     1 Hello and welcome to the tutorial on handling large data files and processing them to get desired results.
     2 
     2 
     3 Till now we have covered:
     3 Till now we have covered:
     4 * How to create plots.
     4 * How to create plots.
     5 * How to read data from file and process it.
     5 * How to read data from file and process it.
     6 
     6 
     7 In this session, we will use them and some new concepts to solve a problem/exercise. 
     7 In this session, we will use them and some new concepts to solve a problem/exercise. 
     8 
     8 
     9 We have a file named sslc1.txt.
     9 We have a file named sslc1.txt. 
    10 It contains records of students and their performance in one of the State Secondary Board Examinations.
    10 It contains records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read it and process this data.
    11 We can see the content of the file by opening it with any text editor.
    11 We can see the content of the file by opening it with any text editor.
    12 Please don't edit the data.
    12 Please don't edit the data.
    13 It is arranged in a particular format.
    13 It is arranged in a particular format.
    14 One particular line being:
    14 One particular line being:
    15 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
    15 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
    40 Arrays
    40 Arrays
    41 Plot (done already)
    41 Plot (done already)
    42 
    42 
    43 Dictionaries
    43 Dictionaries
    44 
    44 
    45 We used lists earlier; we just created them and appended items to them.
    45 We used lists earlier; back then we just created them and appended items to them.
    46 x = [1, 4, 2, 7, 6]
    46 x = [1, 4, 2, 7, 6]
    47 to access the first element we use index number, and it starts from 0 so
    47 to access the first element we use index number, and it starts from 0 so
    48 x[0] will give
    48 x[0] will give
    49 1 and
    49 1 and
    50 x[3] will give
    50 x[3] will give
    51 7
    51 7
    52 
    52 
    53 At times we don't have an index to relate things. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not the best data structure for such problems, and hence Python provides dictionaries. Dictionaries are collections of key-value pairs. Lists are indexed by integers, while dictionaries are indexed by keys, which are often strings. For example:
    53 At times we don't have an index to relate things. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not the best data structure for such problems, and hence Python provides dictionaries. Dictionaries are collections of key-value pairs. Lists are indexed by integers, while dictionaries are indexed by keys, which are often strings. For example:
    54 
    54 
    55 In []: d = {'png' : 'image',
    55 d = {'png' : 'image',
    56       'txt' : 'text', 
    56       'txt' : 'text', 
    57       'py' : 'python'} 
    57       'py' : 'python'} 
       
    58 
       
    59 d
       
    60 
    58 d is a dictionary. The first element in the pair is called the `key' and the second is called the `value'. Here the keys are strings, but a key can be any immutable (hashable) value such as a number or a tuple, while the value can be of any type.
    61 d is a dictionary. The first element in the pair is called the `key' and the second is called the `value'. Here the keys are strings, but a key can be any immutable (hashable) value such as a number or a tuple, while the value can be of any type.
    59 
    62 
    60 Dictionaries are indexed using their keys as shown
    63 Dictionaries are indexed using their keys as shown
    61 In []: d['txt']
    64 In []: d['txt']
    62 Out[]: 'text'
    65 Out[]: 'text'
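The lookup above can be sketched as a small standalone script. This is only an illustration written for Python 3: the dictionary d is the one from this section, while the 'jpg' lookup with get() and the point_names dictionary are made-up additions to show a missing key and non-string keys.

```python
# The dictionary from the section: file extensions mapped to file types.
d = {'png': 'image',
     'txt': 'text',
     'py': 'python'}

# Index by key to get the corresponding value.
txt_type = d['txt']

# d['jpg'] would raise KeyError; get() returns a default value instead.
jpg_type = d.get('jpg', 'unknown')

# Keys need not be strings: any hashable value works (illustrative only).
point_names = {(0, 0): 'origin', 1: 'one'}

print(txt_type)             # text
print(jpg_type)             # unknown
print(point_names[(0, 0)])  # origin
```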
    66 
    69 
    67 The dictionaries can be searched for the presence of a certain key by typing
    70 The dictionaries can be searched for the presence of a certain key by typing
    68 'py' in d
    71 'py' in d
    69 True
    72 True
    70 
    73 
       
    74 Please note that 'in' checks only the keys of a dictionary; it does not search the values.
    71 'jpg' in d
    75 'jpg' in d
    72 False
    76 False
    73 Please note that 'in' checks only the keys of a dictionary; it does not search the values.
    77 'In a telephone directory, searching by number is not an option'
    74 
    78 
       
    79 to obtain the list of all keys in a dictionary
    75 d.keys()
    80 d.keys()
    76 ['py', 'txt', 'png']
    81 ['py', 'txt', 'png']
    77 is used to obtain the list of all keys in a dictionary
       
    78 
    82 
    79 d.values()
    83 d.values()
    80 ['python', 'text', 'image']
    84 ['python', 'text', 'image']
    81 is used to obtain the list of all values in a dictionary
    85 is used to obtain the list of all values in a dictionary
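A short sketch of the membership test and of keys() and values(), assuming Python 3. The outputs shown above are plain lists, which is what Python 2 returns; in Python 3 both methods return view objects, so the sketch wraps them in sorted() to get lists in a predictable order. The variable names are illustrative.

```python
d = {'png': 'image', 'txt': 'text', 'py': 'python'}

# 'in' looks only at the keys.
has_py = 'py' in d            # True: 'py' is a key
has_python = 'python' in d    # False: 'python' is a value, not a key

# To look for a value, test against d.values() explicitly.
value_found = 'python' in d.values()   # True

# keys() and values() return views in Python 3; sorted() turns them into lists.
keys = sorted(d.keys())
values = sorted(d.values())

print(keys)    # ['png', 'py', 'txt']
print(values)  # ['image', 'python', 'text']
```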
    82 
    86 
    89 Parsing and string processing
    93 Parsing and string processing
    90 
    94 
    91 As we saw previously we will be dealing with lines with such content
    95 As we saw previously we will be dealing with lines with such content
    92 A;015162;JENIL T P;081;060;77;41;74;333;P;;
    96 A;015162;JENIL T P;081;060;77;41;74;333;P;;
    93 so ';' is the delimiter we have to look for.
    97 so ';' is the delimiter we have to look for.
       
    98 
    94 We will create one string variable to see how we can process it to get the desired output.
    99 We will create one string variable to see how we can process it to get the desired output.
    95 
   100 
    96 line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'
   101 line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'
    97 a = line.split(';')
   102 a = line.split(';')
    98 we have used split earlier to split on whitespace.
   103 we have used split earlier to split on whitespace, but in this case we will split the line at each ';'
       
   104 
    99 a 
   105 a 
   100 
   106 
   101 is a list with all elements separated.
   107 is a list containing all the fields separately.
   102 a[0] is the region we want.
   108 
   103 and a[6] will give us the science marks of a particular student.
   109 a[0] is the region code.
       
   110 and a[6] will give us the science marks of the student in that record.
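The splitting step can be tried on its own with the sample line from this section. This is a minimal sketch in Python 3; the names region_code and score_str are borrowed from the code further down.

```python
# The sample record from the section, with ';' as the field delimiter.
line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'

# split(';') returns a list of strings, one per field.
a = line.split(';')

# Field 0 is the region code, field 6 holds the science marks (as a string).
region_code = a[0].strip()
score_str = a[6].strip()

print(region_code)  # A
print(score_str)    # 41
print(len(a))       # 12, the two trailing ';' produce empty strings
```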
       
   111 
   104 So we create a dictionary of all the regions with the number of students having more than 90 marks in science.
   112 So we create a dictionary of all the regions with the number of students having more than 90 marks in science.
   105 Something like 
   113 # Something like 
   106 d = {'A': 729, 'C': 764, 'B': 1120,'E': 414, 'D': 603, 'F': 500}
   114 # d = {'A': 729, 'C': 764, 'B': 1120,'E': 414, 'D': 603, 'F': 500}
   107 
   115 
   108 ------------------------------------------------------------------------------------------------------------------
   116 ------------------------------------------------------------------------------------------------------------------
   109 
   117 
   110 code
   118 code
   111 
   119 
   112 We first create an empty dictionary
   120 We first create an empty dictionary
   113 
   121 
   114 science = {}
   122 science = {}
   115 now we read the record data one by one
   123 now we read the record data one by one
   116 
   124 
   117 for record in open('sslc1.txt'):
   125 for record in open('sslc.txt'):
   118 
   126 
   119     we split the record on ';' and store the list in 'fields'
   127     we split the record on ';' and store the list as fields equals record.split(';')
   120     fields = record.split(';')
   128 #    fields = record.split(';')
   121 
   129 
   122     now we strip this string for leading and trailing white spaces
   130     now get the region code of a particular entry by region_code equal to fields[0].strip. strip will remove all leading and trailing white spaces from the string
   123     region_code = fields[0].strip()
   131 #    region_code = fields[0].strip()
   124 
   132 
   125     now we check if the region code is already there in the dictionary by writing an 'if' statement
   133     now we check if the region code is already there in the dictionary by writing an 'if' statement, 
   126     if region_code not in science:    
   134     if region_code not in science:    
   127        when this statement is true, we add a new entry to the dictionary with 
   135        when this statement is true, we add a new entry to the dictionary with initial value 0 and the key being the region code.
   128        science[region_code] = 0
   136        science[region_code] = 0
       
   137        
       
   138     Note that this if statement is inside the for loop, so for the if block we will have to give additional indentation.
   129 
   139 
   130     we again strip(ing is good) the string
   140     we come back to the earlier for loop indentation, and we again strip(ing is good) the string and get the science marks by
   131     score_str = fields[6].strip()
   141     score_str = fields[6].strip()
   132 
   142 
   133     we check if student was not absent
   143     we check if student was not absent
   134     if score_str != 'AA':
   144     if score_str != 'AA':
   135        then we check if his marks are above 90 or not
   145        then we check if his marks are above 90 or not
   136        if int(score_str) > 90:
   146        if int(score_str) > 90:
       
   147           if true, we add one to the value in the dictionary for that region by
   137        	  science[region_code] += 1
   148        	  science[region_code] += 1
   138 
   149 
   139     Hit return twice
   150     Hit return twice
   140 
   151 
   141 by the end of this loop we will have our desired output in the dictionary 'science'
   152 by the end of this loop we will have our desired output in the dictionary 'science'
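Put together without the narration, the loop above might look like the sketch below (Python 3). Wrapping it in a function named count_high_scorers and feeding it a short list of records instead of the data file are assumptions made here so that the example runs on its own; the two 'B' records are made up for illustration.

```python
def count_high_scorers(records, threshold=90):
    """Count, per region code, the students whose science marks exceed threshold."""
    science = {}
    for record in records:
        fields = record.split(';')
        region_code = fields[0].strip()

        # First time we see a region: start its count at 0.
        if region_code not in science:
            science[region_code] = 0

        score_str = fields[6].strip()

        # 'AA' marks an absent student, so there is no score to convert.
        if score_str != 'AA':
            if int(score_str) > threshold:
                science[region_code] += 1
    return science


sample_records = [
    'A;015162;JENIL T P;081;060;77;41;74;333;P;;',
    'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;',
    'B;000001;STUDENT ONE;090;085;88;95;91;449;P;;',  # made-up record
    'B;000002;STUDENT TWO;070;065;60;90;71;356;P;;',  # made-up record
]

science = count_high_scorers(sample_records)
print(science)  # {'A': 0, 'B': 1}
```

To run it on the real data, the open file object can be passed in place of the list, since iterating over a file also yields one line at a time.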
   145 now to create a pie chart we use
   156 now to create a pie chart we use
   146 
   157 
   147 pie(science.values(),labels = science.keys())
   158 pie(science.values(),labels = science.keys())
   148 title('Students scoring above 90 in science by region')
   159 title('Students scoring above 90 in science by region')
   149 savefig('science.png')
   160 savefig('science.png')
   150