statistics.txt
changeset 59 b62177acce71
parent 58 2c4e318741cf
equal deleted inserted replaced
58:2c4e318741cf 59:b62177acce71
     1 Hello and welcome to the tutorial on handling large data files and processing them.
     1 Hello and welcome to the tutorial on handling large data files and processing them.
     2 
     2 
     3 Till now we have covered:
     3 Up until now we have covered:
     4 * How to create plots.
     4 * How to create plots.
     5 * How to read data from files and process it.
     5 * How to read data from files and process it.
     6 
     6 
     7 In this session, we will use these concepts and some new ones, to solve a problem/exercise. 
     7 In this tutorial, we shall use these concepts and some new ones, to solve a problem/exercise. 
     8 
     8 
     9 We have a file named sslc.txt. 
     9 We have a file named sslc.txt on our desktop.
     10 It contains records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read this data and process it.
     10 It contains records of students and their performance in one of the State Secondary Board Examinations. It has 180,000 lines of records. We are going to read this data and process it.
     11 We can see the contents of the file by opening it with any text editor.
     11 We can see the contents of the file by double clicking on it. It might take some time to open since it is quite a large file.
    12 Please don't edit the data.
    12 Please don't edit the data.
    13 This file has a particular structure. Each line in the file is a set of 11 fields separated by semi-colons
    13 This file has a particular structure. Each line in the file is a set of 11 fields separated by semi-colons
       
    14 Consider a sample line from this file.
    14 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
    15 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
    15 The following are the fields in any given line.
    16 The following are the fields in any given line.
    16 * Region Code which is 'A'
    17 * Region Code which is 'A'
    17 * Roll Number 015163
    18 * Roll Number 015163
    18 * Name JOSEPH RAJ S
    19 * Name JOSEPH RAJ S
    36 File Reading - which we have already looked at.
    37 File Reading - which we have already looked at.
    37 parsing  - which we have looked at partially.
    38 parsing  - which we have looked at partially.
    38 Dictionaries - we shall be introducing the concept of dictionaries here.
    39 Dictionaries - we shall be introducing the concept of dictionaries here.
    39 And finally plotting - which we have been doing all along.
    40 And finally plotting - which we have been doing all along.
    40 
    41 
       
    42 Since this file is on our Desktop, let's navigate by typing 
       
    43 
       
    44 cd Desktop
       
    45 
    41 Let's get started, by opening the IPython prompt by typing, 
    46 Let's get started, by opening the IPython prompt by typing, 
    42 
    47 
    43 ipython -pylab
    48 ipython -pylab
    44 
    49 
    45 Let's first start off with dictionaries.
    50 Let's first start off with dictionaries.
    46 
    51 
    47 We earlier used lists briefly. Back then we just created lists and appended items into them. 
    52 We earlier used lists briefly. Back then we just created lists and appended items into them. 
    48 x = [1, 4, 2, 7, 6]
    53 x = [1, 4, 2, 7, 6]
    49 In order to access any element in a list, we use its index number. Index starts from 0.
    54 In order to access any element in a list, we use its index number. Index starts from 0.
     50 For example, x[0] will give 1 and x[3] will give 7.
     55 For example, x[0] gives 1 and x[3] gives 7.
    51 
    56 
     52 But using integer indexes isn't always convenient. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not well suited to such problems; Python's dictionaries are a better fit. Dictionaries are just collections of key-value pairs. For example:
     57 But using integer indexes isn't always convenient. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not well suited to such problems; Python's dictionaries are a better fit. Dictionaries are just collections of key-value pairs. For example:
    53 
    58 
    54 d = {'png' : 'image',
    59 d = {'png' : 'image',
    55       'txt' : 'text', 
    60       'txt' : 'text', 
    56       'py' : 'python'} 
    61       'py' : 'python'} 
       
    62 
       
     63 And that is how we create a dictionary. Dictionaries are created by typing the key-value pairs within curly braces (flower brackets).
    57 
    64 
    58 d
    65 d
    59 
    66 
     60 d is a dictionary. The first element in the pair is called the `key' and the second is called the `value'. The key has to be an immutable (hashable) object such as a string or a number, while the value can be of any type.
     67 d is a dictionary. The first element in the pair is called the `key' and the second is called the `value'. The key has to be an immutable (hashable) object such as a string or a number, while the value can be of any type.
    61 
    68 
    71 True
    78 True
    72 
    79 
    73 'jpg' in d
    80 'jpg' in d
    74 False
    81 False
    75 
    82 
       
    83 
       
    84 
    76 Please note that keys, and not values, are searched. 
    85 Please note that keys, and not values, are searched. 
    77 'In a telephone directory one can search for a number based on a name, but not for a name based on a number'
    86 'In a telephone directory one can search for a number based on a name, but not for a name based on a number'
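As a minimal sketch of the point above, the following reuses the dictionary d defined earlier and shows that `in` looks at keys only. The names found_key, found_value and value_present are only illustrative and are not part of the tutorial.

```python
# Dictionary mapping file extensions to file types, as defined earlier.
d = {'png': 'image', 'txt': 'text', 'py': 'python'}

# The `in` operator checks keys only.
found_key = 'png' in d        # 'png' is a key
found_value = 'image' in d    # 'image' is a value, not a key

# To test against values, search the values explicitly.
value_present = 'image' in d.values()

print(found_key)
print(found_value)
print(value_present)
```

Searching the values this way walks through every value, whereas a key lookup goes directly to the entry.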
    78 
    87 
    79 to obtain the list of all keys in a dictionary, type
    88 to obtain the list of all keys in a dictionary, type
    80 d.keys()
    89 d.keys()
    83 Similarly,
    92 Similarly,
    84 d.values()
    93 d.values()
    85 ['python', 'text', 'image']
    94 ['python', 'text', 'image']
    86 is used to obtain the list of all values in a dictionary
    95 is used to obtain the list of all values in a dictionary
    87 
    96 
    88 Let's now see what the dictionary contains
    97 one more thing to note about dictionaries, in this case for d, 
    89 d 
       
    90 
    98 
    91 Please observe that dictionaries do not preserve the order in which the items were entered. The order of the elements in a dictionary should not be relied upon.
    99 d  
       
   100 
       
   101 is that dictionaries do not preserve the order in which the items were entered. The order of the elements in a dictionary should not be relied upon.
    92 
   102 
    93 ------------------------------------------------------------------------------------------------------------------
   103 ------------------------------------------------------------------------------------------------------------------
    94 
   104 
    95 Parsing and string processing
   105 Parsing and string processing
    96 
   106 
    97 As we saw previously we will be dealing with lines with content of the form
   107 As we saw previously we shall be dealing with lines with content of the form
    98 A;015162;JENIL T P;081;060;77;41;74;333;P;;
   108 A;015162;JENIL T P;081;060;77;41;74;333;P;;
     99 Here ';' is the delimiter, that is, ';' is used to separate the fields.
    109 Here ';' is the delimiter, that is, ';' is used to separate the fields.
   100 
   110 
    101 We shall create one string variable to see how we can process it to get the desired output.
    111 We shall create one string variable to see how we can process it to get the desired output.
   102 
   112 
   110 a
   120 a
   111 
   121 
    112 is a list containing all the fields separately.
    122 is a list containing all the fields separately.
   113 
   123 
   114 a[0] is the region code, a[1] the roll no., a[2] the name and so on.
   124 a[0] is the region code, a[1] the roll no., a[2] the name and so on.
    115 Similarly, a[6] will give us the science marks of that particular student.
    125 Similarly, a[6] gives us the science marks of that particular student.
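The field access described above can be sketched as follows, assuming the line is broken up with the string method split(';'). The sample record is the one quoted in this section; the names line, region_code, roll_number, name and science_marks are illustrative.

```python
# Sample record from this section; fields are separated by ';'.
line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'

# split(';') breaks the string at every delimiter and returns a list.
a = line.split(';')

region_code = a[0]    # region code
roll_number = a[1]    # roll number, kept as a string to preserve leading zeros
name = a[2]           # student name

# Every field is a string, so convert the marks before comparing numbers.
science_marks = int(a[6])

print(region_code)
print(roll_number)
print(name)
print(science_marks)
```

Note that the trailing delimiters produce empty strings at the end of the list, so the list is longer than the number of filled-in fields.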
   116 
   126 
    117 So we create a dictionary that maps each region code to the number of students in that region who scored more than 90 marks.
    127 So we create a dictionary that maps each region code to the number of students in that region who scored more than 90 marks.
   118 
   128 
   119 ------------------------------------------------------------------------------------------------------------------
   129 ------------------------------------------------------------------------------------------------------------------
   120 
   130 
   149        	  if yes we add 1 to the value of dictionary for that region by
   159        	  if yes we add 1 to the value of dictionary for that region by
   150        	  science[region_code] += 1
   160        	  science[region_code] += 1
   151 
   161 
   152     Hit return twice to exit the for loop
   162     Hit return twice to exit the for loop
   153 
   163 
    154 By the end of this loop we will have our desired output in the dictionary 'science'.
    164 By the end of this loop we shall have our desired output in the dictionary 'science'.
   155 we can check the values by
   165 we can check the values by
   156 science
   166 science
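The counting step can be sketched on a handful of records held in memory, so that it runs without sslc.txt. This is only an illustration under stated assumptions: the records in sample_lines are made up in the tutorial's format, the science marks are taken to be at index 6, and non-numeric entries such as 'AA' are assumed to be skipped.

```python
# Made-up records in the tutorial's format (not taken from sslc.txt).
sample_lines = [
    'A;000001;STUDENT ONE;081;060;77;95;74;387;P;;',
    'A;000002;STUDENT TWO;083;042;47;AA;72;244;;;',
    'B;000003;STUDENT THREE;090;091;92;93;94;460;P;;',
    'B;000004;STUDENT FOUR;050;051;52;41;54;248;P;;',
    'A;000005;STUDENT FIVE;070;071;72;91;74;378;P;;',
]

science = {}
for record in sample_lines:
    fields = record.split(';')
    region_code = fields[0].strip()
    score_str = fields[6].strip()

    # Make sure every region we meet has an entry, starting at 0.
    if region_code not in science:
        science[region_code] = 0

    # Assumption: non-numeric entries such as 'AA' are skipped.
    if score_str.isdigit() and int(score_str) > 90:
        science[region_code] += 1

print(science)
```

To run the same loop on the real data, sample_lines could be replaced by the opened file, for example open('sslc.txt'), since iterating over a file object yields one line at a time.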
   157 
   167 
   158 now to create a pie chart we use
   168 now to create a pie chart we use
   159 
   169