\section{Computing the mean}
\frametitle{Value of acceleration due to gravity?}
\item We already have \typ{pendulum.txt}
\item We know that $ T = 2\pi \sqrt{\frac{L}{g}} $
\item So $ g = \frac{4 \pi^2 L}{T^2} $
\item Calculate $g$, the acceleration due to gravity, for each pair of
$L$ and $T$
\item Hence calculate mean $g$
\frametitle{Acceleration due to gravity - $g$\ldots}
In []: g_list = []
In []: for line in open('pendulum.txt'):
  ....:     point = line.split()
  ....:     L = float(point[0])
  ....:     t = float(point[1])
  ....:     g = 4 * pi * pi * L / (t * t)
  ....:     g_list.append(g)
\frametitle{Mean $g$ - Classical method}
In []: total = 0
In []: for g in g_list:
  ....:     total += g
In []: g_mean = total / len(g_list)
In []: print 'Mean: ', g_mean
\frametitle{Mean $g$ - Slightly improved method}
In []: g_mean = sum(g_list) / len(g_list)
In []: print 'Mean: ', g_mean
\frametitle{Mean $g$ - One liner}
In []: g_mean = mean(g_list)
In []: print 'Mean: ', g_mean
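The loop and the \typ{sum} version compute the same number. A minimal sketch, assuming a few made-up $(L, T)$ pairs in place of \typ{pendulum.txt} (the values are illustrative, not the course data) and plain Python 3 with \typ{math.pi} rather than the interactive namespace used on the slides:

```python
from math import pi

# Hypothetical (L in metres, T in seconds) pairs; NOT the real pendulum.txt data.
samples = [(0.5, 1.42), (1.0, 2.01), (1.5, 2.46)]

# g = 4 * pi^2 * L / T^2 for each pair
g_list = [4 * pi * pi * L / (t * t) for L, t in samples]

# Classical method: accumulate the total in a loop.
total = 0.0
for g in g_list:
    total += g
mean_loop = total / len(g_list)

# Slightly improved method: the built-in sum().
mean_sum = sum(g_list) / len(g_list)

print('Mean (loop): %.3f' % mean_loop)
print('Mean (sum):  %.3f' % mean_sum)
```

With realistic measurements both values should land close to the familiar figure for $g$.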
\section{Processing voluminous data}
\frametitle{More on data processing}
We have a huge data file: 180,000 records.\\How do we do
\emph{efficient} statistical computations, i.e. find the mean, median,
standard deviation, etc.?\\How do we draw pie charts?
\frametitle{Structure of the file}
Understanding the structure of \typ{sslc1.txt}
\item Each line in the file holds one student's details (a record)
\item Each record consists of fields separated by ';'
\emphbar{A;015162;JENIL T P;081;060;77;41;74;333;P;;}
\frametitle{Structure of the file \ldots}
\emphbar{A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;}
Each record consists of:
\item Region Code
\item Roll Number
\item Name
\item Marks in 5 subjects: second language, first language, Math, Science,
Social Studies
\item Total marks
\item Pass/Fail (P/F)
\item Withheld (W)
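The field layout above can be checked against the first sample record. A minimal sketch in Python 3; the variable names are illustrative choices, not names used elsewhere in these slides:

```python
# Sample record copied from the slide above.
record = 'A;015162;JENIL T P;081;060;77;41;74;333;P;;'

fields = record.split(';')

region_code = fields[0]
roll_number = fields[1]
name = fields[2]
# Five subject marks, in the order listed above.
# Note: int() would fail on a non-numeric entry such as 'AA'.
subject_marks = [int(mark) for mark in fields[3:8]]
total_marks = int(fields[8])
result = fields[9]

print(region_code, name, subject_marks, total_marks, result)
```

For this record the five subject marks add up to the total field, which suggests the indices line up with the list above.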
\frametitle{Statistical Analysis: Problem statement}
1. Read the data supplied in the file \typ{sslc1.txt} and carry out the following:
\item[a] Draw a pie chart representing proportion of students who scored more than 90\% in each region in Science.
\item[b] Print mean, median and standard deviation of math scores for all regions combined.
\frametitle{Problem statement: explanation}
\emphbar{a. Draw a pie chart representing proportion of students who scored more than 90\% in each region in Science.}
\includegraphics[height=2.6in, interpolate=true]{data/science}
\frametitle{Machinery Required}
\item File reading
\item Parsing
\item Dictionaries
\item Arrays
\item Statistical operations
\subsection{Data processing}
\frametitle{File reading and parsing \ldots}
\emphbar{Reading the file line by line works just as it did in the pendulum example.}
for record in open('sslc1.txt'):
    fields = record.split(';')
\frametitle{Dictionaries: Introduction}
\item Lists are indexed by integers\\
Recall \typ{p = [2, 3, 5, 7]} and\\
\typ{p[1]} is equal to \typ{3}
\item Dictionaries are indexed by keys, here strings
\frametitle{Dictionaries \ldots}
In []: d = {'png' : 'image file',
'txt' : 'text file',
'py' : 'python code',
'java': 'bad code',
'cpp': 'complex code'}
In []: d['txt']
Out[]: 'text file'
\frametitle{Dictionaries \ldots}
In []: 'py' in d
Out[]: True
In []: 'jpg' in d
Out[]: False
\frametitle{Dictionaries \ldots}
In []: d.keys()
Out[]: ['cpp', 'py', 'txt', 'java', 'png']
In []: d.values()
Out[]: ['complex code', 'python code',
'text file', 'bad code',
'image file']
\frametitle{Inserting elements into dictionary}
\emphbar{\alert{d[key] = value}}
In []: d['bin'] = 'binary file'
In []: d
Out[]:
{'bin': 'binary file',
'cpp': 'complex code',
'java': 'bad code',
'png': 'image file',
'py': 'python code',
'txt': 'text file'}
\frametitle{Getting back to the problem}
Let our dictionary be:
science = {}
\item Keys will be region codes
\item Values will be the number of students who scored more than 90\% in that region in Science
\begin{block}{Sample \typ{science} dictionary}
\{'A': 729, 'C': 764, 'B': 1120,'E': 414, 'D': 603, 'F': 500\}
\frametitle{Building parsed data \ldots}
science = {}
for record in open('sslc1.txt'):
    fields = record.split(';')
    region_code = fields[0].strip()
\frametitle{Building parsed data \ldots}
    if region_code not in science:
        science[region_code] = 0
    score_str = fields[6].strip()
    score = int(score_str) if \
        score_str != 'AA' else 0
    if score > 90:
        science[region_code] += 1
\frametitle{Building parsed data \ldots}
print science
print science.keys()
print science.values()
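The counting loop can be tried without the full data file. A minimal sketch in Python 3, assuming a small in-memory list of records: the first two are copied from the slides, the last two are made up (and marked as such) so that one region has a non-zero count. Treating \typ{AA} as an absent student is an assumption; the slides only say it is mapped to 0.

```python
records = [
    'A;015162;JENIL T P;081;060;77;41;74;333;P;;',
    'A;015163;JOSEPH RAJ S;083;042;47;AA;72;244;;;',
    'B;000001;HYPOTHETICAL ONE;090;085;88;95;91;449;P;;',  # made up
    'B;000002;HYPOTHETICAL TWO;070;065;68;92;71;366;P;;',  # made up
]

science = {}
for record in records:
    fields = record.split(';')
    region_code = fields[0].strip()
    # First time we see a region, start its count at zero.
    if region_code not in science:
        science[region_code] = 0
    score_str = fields[6].strip()
    # 'AA' appears in place of a score; it is counted as 0 here.
    score = int(score_str) if score_str != 'AA' else 0
    if score > 90:
        science[region_code] += 1

print(science)
```

Initialising every region to zero on first sight means regions with no high scorers still appear in the dictionary.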
\subsection{Visualizing data}
\frametitle{Pie Chart}
\includegraphics[height=2in, interpolate=true]{data/science_nolabel}
\frametitle{Pie chart}
pie(science.values(),
    labels = science.keys())
title('Students scoring more than 90% '
      'in science by region')
\includegraphics[height=2in, interpolate=true]{data/science}
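Each wedge of the pie is drawn in proportion to its value's share of the total. A minimal sketch in Python 3 that computes those shares by hand, using the sample \typ{science} dictionary shown earlier; the name \typ{shares} is introduced here for illustration only:

```python
# Sample dictionary copied from the earlier slide.
science = {'A': 729, 'C': 764, 'B': 1120,
           'E': 414, 'D': 603, 'F': 500}

total = sum(science.values())

# Percentage share of each region, i.e. the relative size of its wedge.
shares = {}
for region in sorted(science):
    shares[region] = 100.0 * science[region] / total
    print('%s: %5.1f%%' % (region, shares[region]))
```

Printing the shares this way is a quick sanity check on the chart before reading anything off it.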
\frametitle{Problem statement}
\emphbar{b. Print mean, median and standard deviation of math scores for all regions combined.}
\frametitle{Building data for statistics}
math_scores = []
for record in open('sslc1.txt'):
    fields = record.split(';')
    score_str = fields[5].strip()
    score = int(score_str) if \
        score_str != 'AA' else 0
    math_scores.append(score)
\subsection{Obtaining statistics}
\frametitle{Obtaining statistics}
print 'Mean: ', mean(math_scores)
print 'Median: ', median(math_scores)
print 'Standard Deviation: ', std(math_scores)
\frametitle{Obtaining statistics: efficiently!}
math_array = array(math_scores)
print 'Mean: ', mean(math_array)
print 'Median: ', median(math_array)
print 'Standard Deviation: ', std(math_array)
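For readers without the array functions at hand, the same three statistics can be sketched with Python 3's standard \typ{statistics} module. This is a swapped-in alternative, not the method used on the slides, and the scores below are made up rather than read from \typ{sslc1.txt}:

```python
import statistics

# Made-up math scores; NOT taken from sslc1.txt.
math_scores = [77, 47, 88, 68, 0, 92, 55]

mean_score = statistics.mean(math_scores)
median_score = statistics.median(math_scores)
# pstdev is the population standard deviation, which is what
# numpy's std computes by default (ddof=0).
std_score = statistics.pstdev(math_scores)

print('Mean: ', mean_score)
print('Median: ', median_score)
print('Standard Deviation: ', std_score)
```

The choice of \typ{pstdev} over \typ{stdev} matters: the latter divides by $n - 1$ and would give a slightly larger value.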
\frametitle{IPython tip: Timing}
Try the following:
In []: %timeit mean(math_scores)
In []: %timeit mean(math_array)
In []: %timeit?
\item \typ{\%timeit}: accurate, many measurements
\item Can also use \typ{\%time}
\item \typ{\%time}: less accurate, one measurement
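Outside IPython, a similar measurement can be sketched with the standard \typ{timeit} module. This only approximates what the magic command does; the data and the helper function below are made up for illustration:

```python
import timeit

# Made-up data standing in for math_scores.
scores = list(range(1000))

def list_mean():
    # Plain-Python mean of the list.
    return sum(scores) / float(len(scores))

# Call the function 1000 times per run, do 3 runs, keep the fastest.
best = min(timeit.repeat(list_mean, number=1000, repeat=3))
print('best of 3: %.6f s per 1000 calls' % best)
```

Taking the minimum over several runs reduces the influence of other activity on the machine, which is the idea behind making many measurements.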
\frametitle{What tools did we use?}
\item More data parsing
\item Dictionaries for storing data
\item Facilities for drawing pie charts
\item Functions for statistical computations - mean, median, standard deviation
\item Efficient array manipulations
\item Timing in IPython
\frametitle{\incqno }
A sample line from a Comma Separated Values (CSV) file:\\
\emph{Rossum, Guido, 42, 56, 34, 54}\\
What code would you use to separate the line into fields?
\frametitle{\incqno }
In []: a = [1, 2, 5, 9]
How do you find the length of this list?
\frametitle{\incqno }
In [1]: d = {
        'a': 1,
        'b': 2
        }
In [2]: print d['c']
What is the output?
\frametitle{\incqno }
In []: sc = {'A': 10, 'B': 20,
'C': 70}
Given the above dictionary, what command will you give to plot a
pie chart?
\frametitle{\incqno }
In []: marks = [10, 20, 30, 50, 55,
75, 83]
Given the above marks, how will you calculate the \alert{mean} and
\alert{standard deviation}?
\frametitle{\incqno }
In []: marks = [10, 20, 30, 50, 55,
75, 83]
How will you convert the list \texttt{marks} to an \alert{array}?
