statistics.txt
Hello and welcome to the tutorial on handling large data files and processing them.

Up until now we have covered:
* How to create plots.
* How to read data from files and process it.

In this tutorial, we shall use these concepts, and some new ones, to solve a problem.
       
We have a file named sslc.txt on our desktop.
It contains the records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read it and process this data.
We can see the contents of the file by double-clicking on it. It might take some time to open since it is quite a large file.
Please don't edit the data.
This file has a particular structure. Each line in the file is a set of 11 fields separated by semicolons.
Consider a sample line from this file.
A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
       
The following are the fields in any given line.
* Region Code, which is 'A'
* Roll Number 015163
* Name JOSEPH RAJ S
* Marks of 5 subjects:
  ** English 083
  ** Hindi 042
  ** Maths 47
  ** Science AA (Absent)
  ** Social 72
* Total marks 244
* Pass/Fail - This field is blank here because this candidate was absent for an exam; otherwise it would have been one of P or F.
* Withheld - Again blank in this case; otherwise it would have been W.
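The field layout above can be checked with a short sketch. The field names in FIELD_NAMES are our own labels, chosen only for illustration; they do not appear in sslc.txt.

```python
# Sample record copied from the tutorial text.
sample = 'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;'

# Hypothetical field names, chosen here only for illustration.
FIELD_NAMES = ['region', 'roll_no', 'name', 'english', 'hindi', 'maths',
               'science', 'social', 'total', 'pass_fail', 'withheld']

# split(';') also returns one empty string after the final ';',
# so parts has one more item than there are fields.
parts = sample.split(';')

# zip() stops at the shorter sequence, so the extra trailing piece is ignored.
for field_name, value in zip(FIELD_NAMES, parts):
    print(field_name, '=', repr(value))
```

Note that every value is still a string at this point, including the marks and the 'AA' that stands for an absent candidate.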
       
Let us now look at the problem we wish to solve:
Draw a pie chart representing the proportion of students who scored more than 90% in Science in each region.

This is the result we expect:
#slide of result.

In order to solve this problem, we need the following machinery:
File reading - which we have already looked at.
Parsing - which we have looked at partially.
Dictionaries - we shall be introducing the concept of dictionaries here.
And finally plotting - which we have been doing all along.
       
Since this file is on our Desktop, let's navigate there by typing

cd Desktop

Let's get started by opening the IPython prompt. Type

ipython -pylab
       
Let's first start off with dictionaries.

We earlier used lists briefly. Back then we just created lists and appended items to them.
x = [1, 4, 2, 7, 6]
In order to access any element in a list, we use its index number. Indices start from 0.
For example, x[0] gives 1 and x[3] gives 7.
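As a quick check of list indexing, here is a small sketch using the same list x. The out-of-range lookup at the end is our own addition, included only to show where the valid indices stop.

```python
x = [1, 4, 2, 7, 6]

print(x[0])    # first element
print(x[3])    # fourth element
print(len(x))  # number of elements in the list

# Valid indices run from 0 to len(x) - 1; going past that raises IndexError.
try:
    x[5]
except IndexError as error:
    print('IndexError:', error)
```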
       
But using integer indices isn't always convenient. For example, consider a telephone directory. We give it a name and it should return the corresponding number. A list is not well suited to such problems; Python's dictionaries are better. Dictionaries are just key-value pairs. For example:

d = {'png' : 'image',
      'txt' : 'text',
      'py' : 'python'}

And that is how we create a dictionary. Dictionaries are created by typing the key-value pairs within curly braces.

d

d is a dictionary. The first element in each pair is called the `key' and the second is called the `value'. The keys here are strings, but a key may be any immutable (hashable) object, such as a string, a number or a tuple of these; the value can be of any type.
       
Lists are indexed by integers while dictionaries are indexed by their keys, as shown:
In []: d['txt']
Out[]: 'text'

In []: d['png']
Out[]: 'image'
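It may help to see what happens when we index with a key that is not present. The sketch below uses the same dictionary d; the 'jpg' entry and the 'unknown' default are made up for illustration.

```python
d = {'png': 'image', 'txt': 'text', 'py': 'python'}

print(d['txt'])

# Indexing with a key that is absent raises KeyError.
try:
    d['jpg']
except KeyError as error:
    print('KeyError:', error)

# get() returns a default value instead of raising an error.
kind = d.get('jpg', 'unknown')
print(kind)

# Assigning to a new key adds a pair; assigning to an existing key overwrites it.
d['jpg'] = 'image'
print(d['jpg'])
```

This is why the code later in this tutorial checks whether a key is present before updating its value.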
       
A dictionary can be searched for the presence of a certain key by typing
'py' in d
True

'jpg' in d
False

Please note that keys, and not values, are searched.
'In a telephone directory one can search for a number based on a name, but not for a name based on a number'
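To make the difference between keys and values concrete, here is a small sketch with the same dictionary d. The reverse lookup at the end is our own illustration of what it takes to search by value; it has to look at every pair.

```python
d = {'png': 'image', 'txt': 'text', 'py': 'python'}

key_found = 'py' in d                 # 'in' looks at the keys
value_as_key = 'python' in d          # 'python' is a value, not a key
value_found = 'python' in d.values()  # explicit search among the values

# Reverse lookup: collect every key whose value is 'image'.
keys_for_image = []
for key, value in d.items():
    if value == 'image':
        keys_for_image.append(key)

print(key_found, value_as_key, value_found, keys_for_image)
```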
       
To obtain the list of all keys in a dictionary, type
d.keys()
['py', 'txt', 'png']

Similarly,
d.values()
['python', 'text', 'image']
is used to obtain the list of all values in a dictionary.
       
One more thing to note about dictionaries, in this case for d,

d

is that, in the version of Python used here, dictionaries do not preserve the order in which the items were entered (Python 3.7 and later do). The order of the elements in a dictionary should not be relied upon unless you know that it is guaranteed.
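When a predictable order is needed, one option is to sort the keys before using them. This is a minimal sketch with the same dictionary d; sorting alphabetically is just one possible choice of order.

```python
d = {'png': 'image', 'txt': 'text', 'py': 'python'}

# sorted() on a dictionary returns its keys as a sorted list.
ordered_keys = sorted(d)

for key in ordered_keys:
    print(key, '->', d[key])
```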
       
------------------------------------------------------------------------------------------------------------------

Parsing and string processing

As we saw previously, we shall be dealing with lines of the form
A;015162;JENIL T P;081;060;77;41;74;333;P;;
Here ';' is the delimiter, that is, ';' is used to separate the fields.
       
We shall create a string variable to see how we can process it to get the desired output.

line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'

Previously we saw how to split on spaces when we processed the pendulum.txt file. Let us now look at how to split a string into a list of fields based on a delimiter other than space.
a = line.split(';')

Let's now check what 'a' contains.

a

It is a list containing all the fields separately.

a[0] is the region code, a[1] the roll no., a[2] the name and so on.
Similarly, a[6] gives us the science marks of that particular student.
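The fields we get from split are strings, so the marks have to be converted with int() before we can do arithmetic on them. The sketch below does this for the same line and checks that the five subject marks add up to the total field. The slice a[3:8] is our own shorthand for the items a[3] to a[7].

```python
line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'
a = line.split(';')

# Items 3 to 7 hold the five subject marks, still as strings.
marks = []
for mark_string in a[3:8]:
    marks.append(int(mark_string))  # int('081') gives 81

total = int(a[8])

print(marks)
print(sum(marks) == total)
```

This works here because this student was present for every exam; a field containing 'AA' cannot be converted with int() and has to be checked for first.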
       
So we create a dictionary that maps each region to the number of its students who scored more than 90 marks in science.
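The counting idea can be tried on a small example before we touch the file. This is only a sketch: the list of region codes below is made up and is not taken from sslc.txt.

```python
# Made-up region codes, one per qualifying student (not from sslc.txt).
regions = ['A', 'B', 'A', 'C', 'A', 'B']

counts = {}
for region_code in regions:
    # The first time we see a region, start its count at 0.
    if region_code not in counts:
        counts[region_code] = 0
    counts[region_code] += 1

print(counts)
```

Each key is a region code and each value is the number of times that code was seen.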
       
------------------------------------------------------------------------------------------------------------------

Let's now start off with the code.

We first create an empty dictionary

science = {}
       
Now we read the records, one by one, from the file sslc.txt

for record in open('sslc.txt'):

    We split each record on ';' and store the result in a list:
    fields = record.split(';')

    Now we get the region code of the entry:
    region_code = fields[0].strip()
    strip() removes all leading and trailing whitespace from a string.

    Now we check whether the region code is already in the dictionary by typing
    if region_code not in science:
        When this condition is true, we add a new entry to the dictionary, with the region code as the key and 0 as the initial value.
        science[region_code] = 0

    Note that this if statement is inside the for loop, so the if block needs additional indentation.

    We come back to the indentation of the 'for' loop and get the science marks by
    score_str = fields[6].strip()

    We check that the student was not absent
    if score_str != 'AA':
        Then we check whether the marks are above 90
        if int(score_str) > 90:
            If so, we add 1 to the dictionary value for that region by
            science[region_code] += 1

    Hit return twice to exit the for loop.
       
By the end of this loop we shall have our desired output in the dictionary 'science'.
We can check the values by
science
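For reference, the steps above can be collected into one function. This is a sketch rather than the exact code typed in the session: the function name count_high_scorers, the cutoff parameter and the sample records are our own, and the two nested if statements are merged into one condition.

```python
def count_high_scorers(records, cutoff=90):
    """Count, per region, the students whose science marks exceed cutoff."""
    science = {}
    for record in records:
        fields = record.split(';')
        region_code = fields[0].strip()
        # Make sure every region has an entry, even if nobody qualifies.
        if region_code not in science:
            science[region_code] = 0
        score_str = fields[6].strip()
        # 'AA' marks an absent candidate and cannot be converted to int.
        if score_str != 'AA' and int(score_str) > cutoff:
            science[region_code] += 1
    return science


# Made-up records in the sslc.txt layout, used only to exercise the function.
sample_records = [
    'A;000001;STUDENT ONE;080;070;90;95;85;420;P;;\n',
    'A;000002;STUDENT TWO;060;050;40;AA;55;205;;;\n',
    'B;000003;STUDENT THREE;075;065;88;91;70;389;P;;\n',
    'B;000004;STUDENT FOUR;055;045;60;90;50;300;P;;\n',
]
result = count_high_scorers(sample_records)
print(result)
```

With the real file, the same function could be called as count_high_scorers(open('sslc.txt')). A student with exactly 90 marks is not counted, since the condition is 'more than 90'.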
       
Now to create a pie chart we use

pie(science.values(), labels=science.keys())

The first argument to the pie function is the values to be plotted. The second is an optional argument which is used to label the regions.
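Each wedge of the pie is drawn in proportion to its value's share of the sum of all values. The sketch below works out those shares by hand. The counts are made up and merely stand in for the 'science' dictionary; they are not results from sslc.txt.

```python
# Made-up counts standing in for the 'science' dictionary.
science = {'A': 30, 'B': 50, 'C': 20}

total = sum(science.values())

# Percentage of the whole pie taken up by each region.
shares = {}
for region_code in science:
    shares[region_code] = 100.0 * science[region_code] / total

for region_code in sorted(shares):
    print(region_code, shares[region_code])
```

In newer Python versions, values() and keys() return views rather than lists, so wrapping them in list() before passing them to pie may be needed.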
       
title('Students scoring more than 90% in science by region')
savefig('science.png')

That brings us to the end of this tutorial. We have learnt about dictionaries, some basic string parsing and plotting pie charts. Hope you have enjoyed it. Thank you.
#slide of summary.