statistics.txt
changeset 46 34df59770550
parent 7 9794cc414498
child 47 501e3fb21e3c
45:9d61db7bf2f4 46:34df59770550
     1 Hello, welcome to the tutorial on statistics and dictionaries in Python.
     1 Hello, welcome to the tutorial on statistics and dictionaries in Python.
     2 
     2 
     3 In the previous tutorial we saw the `for' loop and lists. Here we shall look into
     3 Till now we have covered:
     4 calculating mean for the same pendulum experiment and then move on to calculate
     4 * How to create plots.
     5 the mean, median and standard deviation for a very large data set.
     5 * How to read data from file and process it.
     6 
     6 
     7 Let's start with calculating the mean acceleration due to gravity based on the data from pendulum.txt.
     7 In this session, we will use them and some new concepts to solve a problem/exercise. 
     8 
     8 
     9 We first create an empty list `g_list' to which we shall append the values of `g'.
     9 We have a file named sslc1.txt.
    10 In []: g_list = []
    10 It contains records of students and their performance in one of the State Secondary Board Examinations.
       
    11 We can see the content of file by opening with any text editor.
       
    12 Please don't edit the data.
       
    13 It is arranged in a particular format.
       
    14 One particular line being:
       
    15 A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;
       
    16 It has following fields:
       
    17 * Region Code which is 'A'
       
    18 * Roll Number 015163
       
    19 * Name JOSEPH RAJ S
       
    20 * Marks of 5 subjects: 
       
    21   ** English 083
       
    22   ** Hindi 042
       
    23   ** Maths 47
       
    24   ** Science AA (Absent)
       
    25   ** Social 72
       
    26 * Total marks 244
       
    27 * Pass/Fail (P/F): blank here because he was absent in one exam
       
    28 * Withheld (W): blank in this case
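As a rough sketch of how the fields listed above line up with the sample record, the line can be split on ';' and paired with field names. The names in `field_names` are labels chosen here for illustration; they do not appear in the file itself.

```python
# The sample record quoted in the tutorial.
line = 'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;'

# Hypothetical labels for the fields, in the order described above.
field_names = ['region', 'roll_number', 'name', 'english', 'hindi',
               'maths', 'science', 'social', 'total', 'pass_fail', 'withheld']

# zip() stops at the shorter sequence, so the extra empty string that
# split() produces after the last ';' is simply ignored.
record = dict(zip(field_names, line.split(';')))

print(record['region'])     # A
print(record['science'])    # AA (absent)
print(record['total'])      # 244
```

Every value is still a string at this point, including the marks; converting them to numbers is a separate step.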
    11 
    29 
    12 For each pair of `L' and `t' values in the file `pendulum.txt' we calculate the 
    30 So the problem we are going to solve is:
    13 value of `g' and append it to the list `g_list'
    31 Draw a pie chart representing the proportion of students in each region who scored more than 90% in Science.
    14 In []: for line in open('pendulum.txt'):
       
    15   ....     point = line.split()
       
    16   ....     L = float(point[0])
       
    17   ....     t = float(point[1])
       
    18   ....     g = 4 * pi * pi * L / (t * t)
       
    19   ....     g_list.append(g)
       
    20 
    32 
    21 We proceed to calculate the mean of the value of `g' from the list `g_list'. 
    33 The result would be something like this:
    22 Here we shall show three ways of calculating the mean. 
    34 slide of result.
    23 Firstly, we calculate the sum `total' of the values in `g_list'.
       
    24 In []: total = 0
       
    25 In []: for g in g_list:
       
    26  ....:     total += g
       
    27  ....:
       
    28 
    35 
    29 Once we have the total we calculate by dividing the `total' by the length of `g_list'
    36 We will be using the following machinery:
       
    37 File reading (done already)
       
    38 Parsing (done partly)
       
    39 Dictionaries (new)
       
    40 Arrays
       
    41 Plot (done already)
    30 
    42 
    31 In []: g_mean = total / len(g_list)
    43 Dictionaries
    32 In []: print 'Mean: ', g_mean
       
    33 
    44 
    34 The second method is slightly simpler. Python provides a built-in function called "sum()" that computes the sum of all the elements in a list. 
    45 Earlier we used lists; we simply created them and appended items to them.
    35 In []: g_mean = sum(g_list) / len(g_list)
    46 x = [1, 4, 2, 7, 6]
    36 In []: print 'Mean: ', g_mean
    47 To access an element we use its index number, which starts from 0, so
       
    48 x[0] will give
       
    49 1 and
       
    50 x[3] will give
       
    51 7
    37 
    52 
    38 The third method is the simplest. Python provides a built-in function `mean' that
    53 At times we don't have an index to relate things. For example, consider a telephone directory: we give it a name and it should return the corresponding number. A list is not the best data structure for such problems, and hence Python provides support for dictionaries. Dictionaries are collections of key-value pairs. Lists are indexed by integers, while dictionaries are indexed by keys, which are very often strings. For example:
    39 calculates the mean of all the elements in a list.
       
    40 In []: g_mean = mean(g_list)
       
    41 In []: print 'Mean: ', g_mean
       
    42 
    54 
    43 Python provides support for dictionaries. Dictionaries are key value pairs. Lists are indexed by integers while dictionaries are indexed by strings. For example:
       
    44 In []: d = {'png' : 'image',
    55 In []: d = {'png' : 'image',
    45       'txt' : 'text', 
    56       'txt' : 'text', 
    46       'py' : 'python'} 
    57       'py' : 'python'} 
    47 is a dictionary. The first element in the pair is called the `key' and the second 
    58 d is a dictionary. The first element in each pair is called the `key' and the second is called the `value'. The key can be of any immutable (hashable) type, such as a string or a number, while the value can be of any type.
    48 is called the `value'. The key always has to be a string while the value can be 
       
    49 of any type.
       
    50 
    59 
    51 Dictionaries are indexed using their keys as shown
    60 Dictionaries are indexed using their keys as shown
    52 In []: d['txt']
    61 In []: d['txt']
    53 Out[]: 'text'
    62 Out[]: 'text'
    54 
    63 
    55 In []: d['png']
    64 In []: d['png']
    56 Out[]: 'image'
    65 Out[]: 'image'
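A minimal sketch of the telephone-directory idea, written for Python 3. The names and numbers are made up, and `directory` and `roll_to_name` are illustrative variable names.

```python
# A small telephone directory; the entries are invented for this example.
directory = {'asha': '555-0101', 'ravi': '555-0102'}

# Look up a value by its key.
print(directory['asha'])                     # 555-0101

# Indexing with a missing key raises KeyError;
# get() returns a default value instead.
print(directory.get('meena', 'not listed'))  # not listed

# Assignment adds a new entry (or overwrites an existing one).
directory['meena'] = '555-0103'
print(len(directory))                        # 3

# Keys need not be strings: any hashable value, such as an integer, works.
roll_to_name = {15163: 'JOSEPH RAJ S'}
print(roll_to_name[15163])                   # JOSEPH RAJ S
```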
    57 
    66 
    58 The dictionaries can be searched for the presence of a certain key by typing
    67 The dictionaries can be searched for the presence of a certain key by typing
    59 In []: 'py' in d
    68 'py' in d
    60 Out[]: True
    69 True
    61 
    70 
    62 In []: 'jpg' in d
    71 'jpg' in d
    63 Out[]: False
    72 False
    64 Please note that `in' looks only at the keys; it does not search the values of a dictionary.
    73 Please note that `in' looks only at the keys; it does not search the values of a dictionary.
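The `in` test applied to a dictionary looks only at its keys. As a small sketch, a value can still be looked for by going through `values()` or `items()`; the dictionary is repeated here so the example runs on its own.

```python
d = {'png': 'image', 'txt': 'text', 'py': 'python'}

# `in` applied to the dictionary itself checks the keys only.
print('py' in d)               # True
print('python' in d)           # False: 'python' is a value, not a key

# To test for a value, search the collection returned by values().
print('python' in d.values())  # True

# To find which keys map to a given value, loop over the items.
matches = [key for key, value in d.items() if value == 'image']
print(matches)                 # ['png']
```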
    65 
    74 
    66 In []: d.keys()
    75 d.keys()
    67 Out[]: ['py', 'txt', 'png']
    76 ['py', 'txt', 'png']
    68 is used to obtain the list of all keys in a dictionary
    77 is used to obtain the list of all keys in a dictionary
    69 
    78 
    70 In []: d.values()
    79 d.values()
    71 Out[]: ['python', 'text', 'image']
    80 ['python', 'text', 'image']
    72 is used to obtain the list of all values in a dictionary
    81 is used to obtain the list of all values in a dictionary
    73 
    82 
    74 In []: d
    83 d
    75 Out[]: {'png': 'image', 'py': 'python', 'txt': 'text'}
    84 
    76 Please observe that dictionaries do not preserve the order in which the items
    85 Please observe that dictionaries do not preserve the order in which the items were entered. The order of the elements in a dictionary should not be relied upon.
    77 were entered. The order of the elements in a dictionary should not be relied upon.
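When a predictable order is needed, one option is to sort the keys explicitly rather than rely on the dictionary. This is a sketch of that approach; Python 3.7 and later do keep insertion order, but sorting states the intent and behaves the same on older versions.

```python
d = {'png': 'image', 'txt': 'text', 'py': 'python'}

# sorted() returns the keys in alphabetical order,
# whatever order the dictionary itself happens to use.
ordered_keys = sorted(d)

for key in ordered_keys:
    print('%s -> %s' % (key, d[key]))
```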
    86 
       
    87 ------------------------------------------------------------------------------------------------------------------
       
    88 
       
    89 Parsing and string processing
       
    90 
       
    91 As we saw previously we will be dealing with lines with such content
       
    92 A;015162;JENIL T P;081;060;77;41;74;333;P;;
       
    93 so ';' is the delimiter we have to look for.
       
    94 We will create a string variable to see how we can process it to get the desired output.
       
    95 
       
    96 line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'
       
    97 a = line.split(';')
       
    98 We used split earlier to split on whitespace.
       
    99 a 
       
   100 
       
   101 is a list of all the fields, split at each ';'.
       
   102 a[0] is the region we want.
       
   103 and a[6] will give us the science marks of that particular student.
       
   104 So we create a dictionary mapping each region to the number of students who scored more than 90 marks.
       
   105 Something like 
       
   106 d = {'A': 729, 'C': 764, 'B': 1120,'E': 414, 'D': 603, 'F': 500}
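Before anything can be counted, the science field has to be turned from a string into a number. A minimal sketch using the sample line from this section; the names `region`, `science_marks` and `english_marks` are chosen here for illustration.

```python
line = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'
a = line.split(';')

region = a[0]                # 'A'
science_marks = int(a[6])    # the string '41' becomes the integer 41

# int() also accepts digit strings with leading zeros,
# such as the English marks in this record.
english_marks = int(a[3])    # '081' becomes 81

print(region)                # A
print(science_marks > 90)    # False
```

A record with 'AA' in the science field cannot be converted this way, since int('AA') raises ValueError, so absent students have to be checked for first.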
       
   107 
       
   108 ------------------------------------------------------------------------------------------------------------------
       
   109 
       
   110 code
       
   111 
       
   112 We first create an empty dictionary
       
   113 
       
   114 science = {}
       
   115 now we read the record data one by one
       
   116 
       
   117 for record in open('sslc1.txt'):
       
   118 
       
   119     we split the record on ';' and store the list in 'fields'
       
   120     fields = record.split(';')
       
   121 
       
   122     now we strip the leading and trailing whitespace from the region code
       
   123     region_code = fields[0].strip()
       
   124 
       
   125     now we check if the region code is already there in the dictionary by writing an 'if' statement
       
   126     if region_code not in science:    
       
   127        when this condition is true, we add a new entry to the dictionary with
       
   128        science[region_code] = 0
       
   129 
       
   130     we again strip the string (stripping is good practice)
       
   131     score_str = fields[6].strip()
       
   132 
       
   133     we check that the student was not absent
       
   134     if score_str != 'AA':
       
   135        then we check if his marks are above 90 or not
       
   136        if int(score_str) > 90:
       
   137           science[region_code] += 1
       
   138 
       
   139     Hit return twice
       
   140 
       
   141 by the end of this loop we will have our desired output in the dictionary 'science'
       
   142 we can check the values by
       
   143 science
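The loop above can also be packaged as a function and tried on a few records before running it on the whole file. This is a sketch for Python 3, not the tutorial's own code: the function name `count_high_scores`, its parameters and the sample records are all made up for illustration.

```python
import io


def count_high_scores(lines, column=6, cutoff=90):
    """Count, per region, the records whose marks in `column` exceed `cutoff`."""
    counts = {}
    for record in lines:
        fields = record.split(';')
        region_code = fields[0].strip()
        if region_code not in counts:
            counts[region_code] = 0
        score_str = fields[column].strip()
        # 'AA' marks an absent student; skip it before converting to int.
        if score_str != 'AA' and int(score_str) > cutoff:
            counts[region_code] += 1
    return counts


# Invented records standing in for sslc1.txt.
sample = io.StringIO(
    'A;000001;STUDENT ONE;081;060;77;95;74;387;P;;\n'
    'A;000002;STUDENT TWO;083;042;47;AA;72;244;;;\n'
    'B;000003;STUDENT THREE;090;080;85;92;88;435;P;;\n'
    'B;000004;STUDENT FOUR;070;055;60;90;74;349;P;;\n'
)
science = count_high_scores(sample)
print(science == {'A': 1, 'B': 1})   # True
```

A score of exactly 90 is not counted, because the comparison is strictly greater than the cutoff. With the real data, the same function could be called as count_high_scores(open('sslc1.txt')).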
       
   144 
       
   145 now to create a pie chart we use
       
   146 
       
   147 pie(science.values(), labels=science.keys())
       
   148 title('Students scoring more than 90% in science by region')
       
   149 savefig('science.png')
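A pie chart draws each wedge in proportion to its share of the total. The sketch below prepares the labels and sizes in a fixed order and prints each region's share; it uses the illustrative counts quoted earlier in this tutorial, and `labels`, `sizes` and `total` are names chosen here.

```python
# Example counts of the kind shown earlier in the tutorial.
science = {'A': 729, 'C': 764, 'B': 1120, 'E': 414, 'D': 603, 'F': 500}

# Fix the wedge order by sorting the region codes, then look up each
# count so that every label stays paired with its own size.
labels = sorted(science)
sizes = [science[region] for region in labels]

# Each wedge's share of the pie is its count divided by the total.
total = sum(sizes)
for region, size in zip(labels, sizes):
    print('%s: %.1f%%' % (region, 100.0 * size / total))
```

The two lists can then be passed to the plotting call as pie(sizes, labels=labels), which keeps the wedges in alphabetical order of region.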
       
   150