statistics.txt
changeset 50 9d60720b16b0
parent 47 501e3fb21e3c
child 51 32d854e62be9
49:90c2d777fb0e 50:9d60720b16b0
     8 
     8 
     9 We have a file named sslc1.txt. 
     9 We have a file named sslc1.txt. 
     10 It contains records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read it and process this data.
     10 It contains records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read it and process this data.
    11 We can see the content of file by opening with any text editor.
    11 We can see the content of file by opening with any text editor.
    12 Please don't edit the data.
    12 Please don't edit the data.
    13 It is arranged in a particular format.
    13 This file has a particular structure. Each line in the file is a set of 11 fields:
    14 One particular line being:
       
    15 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
    14 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
    16 It has following fields:
    15 The following are the fields in any given line.
    17 * Region Code which is 'A'
    16 * Region Code which is 'A'
    18 * Roll Number 015163
    17 * Roll Number 015163
    19 * Name JOSEPH RAJ S
    18 * Name JOSEPH RAJ S
    20 * Marks of 5 subjects: 
    19 * Marks of 5 subjects: 
    21   ** English 083
    20   ** English 083
    22   ** Hindi 042
    21   ** Hindi 042
    23   ** Maths 47
    22   ** Maths 47
    24   ** Science AA (Absent)
    23   ** Science AA (Absent)
    25   ** Social 72
    24   ** Social 72
    26 * Total marks 244
    25 * Total marks 244
    27 * Pass/Fail Blank cause he was absent in one exam or else it will be(P/F)
     26 * Pass/Fail - This field is blank here because this candidate was absent for an exam; otherwise it would have been one of (P/F)
    28 * Withheld Blank in this case(W)
     27 * Withheld - Again blank in this case; otherwise it would be (W)
    29 
    28 
    30 So problem we are going to solve is:
    29 Let us now look at the problem we wish to solve:
    31 Draw a pie chart representing proportion of students who scored more than 90% in each region in Science.
    30 Draw a pie chart representing the proportion of students who scored more than 90% in each region in Science.
    32 
    31 
    33 The result would be something like this:
    32 This is the result we expect:
    34 slide of result.
    33 #slide of result.
    35 
    34 
    36 We would be using following machinery:
    35 In order to solve this problem, we need the following machinery:
    37 File Reading(done already)
    36 File Reading - which we have already looked at.
    38 parsing (done partly)
    37 parsing  - which we have looked at partially.
    39 Dictionaries (new)
    38 Dictionaries - we shall be introducing the concept of dictionaries here.
    40 Arrays
    39 And finally plotting - which we have been doing all along.
    41 Plot (done already)
       
    42 
    40 
    43 Dictionaries
    41 Let's first start off with dictionaries.
    44 
    42 
    45 We earlier used lists, back then we just created them and appended items to list. 
    43 We earlier used lists briefly. Back then we just created lists and appended items into them. 
    46 x = [1, 4, 2, 7, 6]
    44 x = [1, 4, 2, 7, 6]
    47 to access the first element we use index number, and it starts from 0 so
    45 In order to access any element in a list, we use its index number. Index starts from 0.
    48 x[0] will give
     46 For example, x[0] will give 1 and x[3] will give 7.
    49 1 and
       
    50 x[3] will
       
    51 7
       
    52 
    47 
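The indexing described above can be tried out with a short sketch. It assumes Python 3 (for the print call) and reuses the list x from the text; the names first and fourth are only illustrative.

```python
# The list used in the tutorial text
x = [1, 4, 2, 7, 6]

first = x[0]    # index 0 refers to the first element
fourth = x[3]   # index 3 refers to the fourth element

print(first, fourth)
```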
    53 At times we don't have index to relate things. For example consider a telephone directory, we give it a name and it should return back corresponding number. List is not the best kind of data structure for such problems, and hence Python provides support for dictionaries. Dictionaries are key value pairs. Lists are indexed by integers while dictionaries are indexed by strings. For example:
     48 There are times when we can't access things through integer indexes. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not the best kind of data structure for such problems, and hence Python provides support for dictionaries. Dictionaries are collections of key-value pairs. Lists are indexed by integers, while dictionaries are indexed by keys, which are often strings. For example:
    54 
    49 
    55 d = {'png' : 'image',
    50 d = {'png' : 'image',
    56       'txt' : 'text', 
    51       'txt' : 'text', 
    57       'py' : 'python'} 
    52       'py' : 'python'} 
    58 
    53 
    69 
    64 
    70 The dictionaries can be searched for the presence of a certain key by typing
    65 The dictionaries can be searched for the presence of a certain key by typing
    71 'py' in d
    66 'py' in d
    72 True
    67 True
    73 
    68 
    74 Please note the values cannot be searched in a dictionaries.
       
    75 'jpg' in d
    69 'jpg' in d
    76 False
    70 False
    77 'In telephone directory searching number is not a option'
       
    78 
    71 
    79 to obtain the list of all keys in a dictionary
     72 Please note that 'in' checks only the keys of a dictionary; the values cannot be searched this way.
       
    73 'In a telephone directory one can search for a number based on a name, but not for a name based on a number'
       
    74 
       
     75 To obtain the list of all keys in a dictionary, type
    80 d.keys()
    76 d.keys()
    81 ['py', 'txt', 'png']
    77 ['py', 'txt', 'png']
    82 
    78 
       
    79 Similarly,
    83 d.values()
    80 d.values()
    84 ['python', 'text', 'image']
    81 ['python', 'text', 'image']
    85 is used to obtain the list of all values in a dictionary
    82 is used to obtain the list of all values in a dictionary
    86 
    83 
    87 d
    84 Let's now see what the dictionary contains
       
    85 d 
    88 
    86 
    89 Please observe that dictionaries do not preserve the order in which the items were entered. The order of the elements in a dictionary should not be relied upon.
    87 Please observe that dictionaries do not preserve the order in which the items were entered. The order of the elements in a dictionary should not be relied upon.
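The dictionary operations covered so far can be collected into one small sketch. It assumes Python 3 and reuses the dictionary d from the text; the other variable names are illustrative. The keys and values are sorted before printing so that the result does not depend on the order in which the dictionary stores its items.

```python
# Dictionary from the tutorial: file extension -> kind of file
d = {'png': 'image',
     'txt': 'text',
     'py': 'python'}

# Look up a value by its key
kind = d['png']

# The 'in' operator tests keys only, never values
has_py = 'py' in d          # 'py' is a key
has_jpg = 'jpg' in d        # there is no such key
has_image = 'image' in d    # 'image' is a value, not a key

# sorted() gives a predictable order for display
keys = sorted(d.keys())
values = sorted(d.values())

print(kind, has_py, has_jpg, has_image)
print(keys)
print(values)
```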
    90 
    88 
    91 ------------------------------------------------------------------------------------------------------------------
    89 ------------------------------------------------------------------------------------------------------------------
    92 
    90 
    93 Parsing and string processing
    91 Parsing and string processing
    94 
    92 
    95 As we saw previously we will be dealing with lines with such content
    93 As we saw previously we will be dealing with lines with content of the form
    96 A;015162;JENIL T P;081;060;77;41;74;333;P;;
    94 A;015162;JENIL T P;081;060;77;41;74;333;P;;
    97 so ';' is delimiter we have to look for.
     95 Here ';' is the delimiter, that is, ';' is used to separate the fields.
    98 
    96 
    99 We will create one string variable to see how can we process it get the desired output.
     97 We shall create one string variable to see how we can process it to get the desired output.
   100 
    98 
   101 line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'
    99 line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'
       
   100 
       
   101 Previously we saw how to split on spaces when we processed the pendulum.txt file. Let us now look at how to split a string into a list of fields based on a delimiter other than space.
   102 a = line.split(';')
   102 a = line.split(';')
   103 we have used split earlier to split on empty spaces, but in this case we will split line for each ';'
       
   104 
   103 
   105 a 
   104 Let's now check what 'a' contains.
       
   105 
       
   106 a
   106 
   107 
    107 is a list containing all the fields separately.
    108 is a list containing all the fields separately.
   108 
   109 
   109 a[0] is the region code.
   110 a[0] is the region code, a[1] the roll no., a[2] the name and so on.
   110 and a[6] will give us the science marks of that particular region.
    111 Similarly, a[6] will give us the science marks of that particular student.
   111 
   112 
   112 So we create a dictionary of all the regions with number of students having more then 90 marks.
   113 So we create a dictionary of all the regions with number of students having more than 90 marks.
   113 # Something like 
       
   114 # d = {'A': 729, 'C': 764, 'B': 1120,'E': 414, 'D': 603, 'F': 500}
       
   115 
   114 
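The splitting step can be sketched on the sample line from the text. This assumes Python 3; the names roll_number, name, science_str and science_marks are illustrative and not part of the original script. The marks are converted to an integer only when the student was not absent.

```python
line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'

# Split the record on the ';' delimiter
a = line.split(';')

region_code = a[0].strip()
roll_number = a[1].strip()
name = a[2].strip()
science_str = a[6].strip()   # 'AA' would mean the student was absent

# Convert to an integer only when the field holds a number
science_marks = None if science_str == 'AA' else int(science_str)

print(region_code, roll_number, name, science_marks)
```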
   116 ------------------------------------------------------------------------------------------------------------------
   115 ------------------------------------------------------------------------------------------------------------------
   117 
   116 
   118 code
   117 Let's now start off with the code
   119 
   118 
   120 We first create an empty dictionary
   119 We first create an empty dictionary
   121 
   120 
   122 science = {}
   121 science = {}
   123 now we read the record data one by one
   122 now we read the record data one by one from the file sslc1.txt
   124 
   123 
    125 for record in open('sslc1.txt'):
    124 for record in open('sslc1.txt'):
   126 
   125 
   127     we split the record on ';' and store the list as fields equals record.split(';')
   126     we split the record on ';' and store them in a list by: fields equals record.split(';')
   128 #    fields = record.split(';')
       
   129 
   127 
   130     now get region code of particular entry by region_code equal to fields[0].strip. strip with remove all leading and trailing white spaces from the string
    128     now we get the region code of a particular entry by region_code equal to fields[0].strip().
   131 #    region_code = fields[0].strip()
    129 strip() removes all leading and trailing white spaces from a given string
   132 
   130 
   133     now we check if the region code is always there in dictionary by writing 'if' statement, 
   131     now we check if the region code is already there in dictionary by typing
   134     if region_code not in science:    
   132     if region_code not in science:    
   135        when this statement is true, we add new entry to dictionary with initial value 0 and key being the region code.
   133        when this statement is true, we add new entry to dictionary with initial value 0 and key being the region code.
   136        science[region_code] = 0
   134        science[region_code] = 0
   137        
   135        
   138     Note that this if statement is inside the for loop so for if block we will have to give additional indentation.
   136     Note that this if statement is inside the for loop so for the if block we will have to give additional indentation.
   139 
   137 
   140     we again come back to older for loop indentation and we again strip(ing is good) the string and get science marks by
    138     we again come back to the older 'for' loop indentation, strip the string again, and get the science marks by
   141     score_str = fields[6].strip()
   139     score_str = fields[6].strip()
   142 
   140 
   143     we check if student was not absent
   141     we check if student was not absent
   144     if score_str != 'AA':
   142     if score_str != 'AA':
   145        then we check if his marks are above 90 or not
   143        then we check if his marks are above 90 or not
   146        if int(score_str) > 90:
   144        if int(score_str) > 90:
   147        	  if true we add it to the value of dictionary for that region by
   145        	  if yes we add 1 to the value of dictionary for that region by
   148        	  science[region_code] += 1
   146        	  science[region_code] += 1
   149 
   147 
   150     Hit return twice
   148     Hit return twice to exit the for loop
   151 
   149 
   152 by end of this loop we will have our desired output in the dictionary 'science'
   150 by end of this loop we will have our desired output in the dictionary 'science'
   153 we can check the values by
   151 we can check the values by
   154 science
   152 science
   155 
   153 
   156 now to create a pie chart we use
   154 now to create a pie chart we use
   157 
   155 
   158 pie(science.values(),labels = science.keys())
   156 pie(science.values(),labels = science.keys())
       
   157 
       
   158 the first argument to the pie function is the values to be plotted. The second is an optional argument which is used to label the regions.
       
   159 
    159 title('Students scoring more than 90% in science by region')
    160 title('Students scoring more than 90% in science by region')
   160 savefig('science.png')
   161 savefig('science.png')
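The counting loop above can also be written as a small function and tried on a handful of records before running it on the full file. This is only a sketch, assuming Python 3: the function name count_high_scorers and its threshold parameter are illustrative, the first two sample records come from the text, and the last three are made up for the example.

```python
def count_high_scorers(records, threshold=90):
    """Count, per region, the students whose science marks exceed threshold."""
    science = {}
    for record in records:
        fields = record.split(';')
        region_code = fields[0].strip()
        # Make sure every region seen has an entry, starting at 0
        if region_code not in science:
            science[region_code] = 0
        score_str = fields[6].strip()
        # Skip absent students ('AA'), count marks strictly above the threshold
        if score_str != 'AA' and int(score_str) > threshold:
            science[region_code] += 1
    return science


# First two records are from the tutorial; the last three are made-up samples
sample = [
    'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;',
    'A;015162;JENIL T P;081;060;77;41;74;333;P;;',
    'A;000001;STUDENT ONE;080;070;75;95;80;400;P;;',
    'B;000002;STUDENT TWO;080;070;75;91;80;396;P;;',
    'B;000003;STUDENT THREE;080;070;75;90;80;395;P;;',
]

science = count_high_scorers(sample)
print(science)
```

With the real data, the same function could be given the open file instead of the sample list. When plotting under Python 3, the keys and values may need to be wrapped in list() before being passed to pie.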
       
   162 
       
    163 That brings us to the end of this tutorial. We have learnt about dictionaries, some basic string parsing and plotting a pie chart in this tutorial. Hope you have enjoyed it. Thank you.