Merging heads
author Amit Sethi
Thu, 11 Nov 2010 01:43:26 +0530
changeset 451 c61ed190af5e
parent 450 d49aee7ab1b9
parent 449 471e1e5844a1
child 457 68813d8d80fb
child 466 00c1ba1cb9ef
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/statistics/football.txt	Thu Nov 11 01:43:26 2010 +0530
@@ -0,0 +1,8 @@
+Cristiano Ronaldo, 86 ,67
+Zlatan Ibrahimovic, 34 , 65
+Lionel Messi     , 45 , 41
+Samuel Eto’o 	 , 23 , 24
+Kaka             , 43 , 45
+Frank Lampard    , 65  ,34
+
+
--- a/statistics/script.rst	Wed Nov 10 21:17:33 2010 +0530
+++ b/statistics/script.rst	Thu Nov 11 01:43:26 2010 +0530
@@ -13,6 +13,8 @@
 .. Getting started with IPython
 .. Loading Data from files
 .. Getting started with Lists
+.. Accessing Pieces of Arrays
+
      
 .. Author              : Amit Sethi
    Internal Reviewer   : Puneeth
@@ -28,8 +30,12 @@
 {{{ Show the slide containing the outline slide }}}
 
 In this tutorial, we shall learn
- * Doing simple statistical operations in Python  
- * Applying these to real world problems 
+ * Doing statistical operations in Python  
+   * Summing a set of numbers
+   * Finding their mean
+   * Finding their median
+   * Finding their standard deviation
+   
 
 
 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you
@@ -45,88 +51,13 @@
 .. smaller data-set or something. Using lists doesn't seem natural.]
 
 
-We will first start with the most necessary statistical operation i.e
-finding mean.
-
-We have a list of ages of a random group of people ::
-   
-   age_list = [4,45,23,34,34,38,65,42,32,7]
-
-One way of getting the mean could be getting sum of all the ages and
-dividing by the number of people in the group. ::
-
-    sum_age_list = sum(age_list)
-
-sum function gives us the sum of the elements. Note that the
-``sum_age_list`` variable is an integer and the number of people or
-length of the list is also an integer. We will need to convert one of
-them to a float before carrying out the division. ::
-
-    mean_using_sum = float(sum_age_list)/len(age_list)
-
-This obviously gives the mean age but there is a simpler way to do
-this in Python - using the mean function::
-
-       mean(age_list)
-
-Mean can be used in more ways in case of 2 dimensional lists.  Take a
-two dimensional list ::
-     
-     two_dimension=[[1,5,6,8],[1,3,4,5]]
-
-The mean function by default gives the mean of the flattened sequence.
-A Flattened sequence means a list obtained by concatenating all the
-smaller lists into a large long list. In this case, the list obtained
-by writing the two lists one after the other. ::
-
-    mean(two_dimension)
-    flattened_seq=[1,5,6,8,1,3,4,5]
-    mean(flattened_seq)
-
-As you can see both the results are same. ``mean`` function can also
-give us the mean of each column, or the mean of corresponding elements
-in the smaller lists. ::
-   
-   mean(two_dimension, 0)
-   array([ 1. ,  4. ,  5. ,  6.5])
-
-we pass an extra argument 0 in that case.
-
-If we use an argument 1, we obtain the mean along the rows. ::
-   
-   mean(two_dimension, 1)
-   array([ 5.  ,  3.25])
-
-We can see more option of mean using ::
-   
-   mean?
-
-Similarly we can calculate median and stanard deviation of a list
-using the functions median and std::
-      
-      median(age_list)
-      std(age_list)
-
-Median and std can also be calculated for two dimensional arrays along
-columns and rows just like mean.
-
-For example ::
-       
-       median(two_dimension, 0)
-       std(two_dimension, 1)
-
-This gives us the median along the colums and standard devition along
-the rows.
-       
-Now lets apply this to a real world example 
-    
-We will a data file that is at the a path ``/home/fossee/sslc2.txt``.
-It contains record of students and their performance in one of the
-State Secondary Board Examination. It has 180, 000 lines of record. We
-are going to read it and process this data.  We can see the content of
-file by double clicking on it. It might take some time to open since
-it is quite a large file.  Please don't edit the data.  This file has
-a particular structure.
+For this tutorial we will use a data file located at
+``/home/fossee/sslc2.txt``.  It contains records of students and their
+performance in one of the State Secondary Board Examinations. It has
+about 180,000 lines of records. We are going to read and process this
+data.  We can see the contents of the file by double-clicking on it. It
+might take some time to open since it is quite a large file.  Please
+don't edit the data.  This file has a particular structure.
 
 We can do ::
    
@@ -134,6 +65,9 @@
 
 to check the contents of the file.
 
+
+{{{ Show the data structure on a slide }}}
+
 Each line in the file is a set of 11 fields separated 
 by semi-colons Consider a sample line from this file.  
 A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;; 
@@ -147,45 +81,97 @@
 * Total marks 244
 
 
-Now lets try and find the mean of English marks of all students.
-
-For this we do. ::
+Let's load this data as an array and then run various functions on
+it.
 
-     L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';')
+To get the data as an array we do ::
+   
+     L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7,),delimiter=';')
      L
-     mean(L)
+     
 
 loadtxt function loads data from an external file.Delimiter specifies
-the kind of character are the fields of data seperated by. 
-usecols specifies  the columns to be used so (3,). The 'comma' is added
-because usecols is a sequence.
+the character that separates the fields of data.  usecols specifies
+the columns to be used, so (3,4,5,6,7) loads those columns. The
+trailing comma is optional; only a single column, as in (3,), needs it.
 
-To get the median marks. ::
+As we can see L is an array. We can get the shape of this array using::
    
-    median(L)
+   L.shape
+   (185667, 5)
+
+Let's start applying statistical operations to this data. We will start
+with the most basic, summing. How do you find the sum of marks of all
+subjects for the first student?
+
+As we know from accessing pieces of arrays, to access the first row
+we do ::
    
-Standard deviation. ::
-	
-    std(L)
+   L[0,:]
+
+Now to sum this we can say ::
 
+    totalmarks=sum(L[0,:]) 
+    totalmarks
 
-Now lets try and and get the mean for all the subjects ::
+To get the mean we can do ::
+
+   totalmarks/len(L[0,:])
+
+or simply ::
+
+   mean(L[0,:])
 
-     L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
-     mean(L,0)
-     array([ 73.55452504,  53.79828941,  62.83342759,  50.69806158,  63.17056881])
+But with such a large data set, calculating the mean for each student
+one by one is impractical. Is there a way to reduce the work?
+
+For this we will look into the documentation of mean by doing::
 
-As we can see from the result mean(L,0). The resultant sequence  
-is the mean marks of all students that gave the exam for the five subjects.
+    mean?
 
-and ::
-    
+As we know, L is a two-dimensional array. We can calculate the mean
+along either axis of the array. Axis 0 runs down the rows and axis 1
+runs across the columns. So to calculate each student's mean across
+all columns we pass the extra parameter 1 for the axis ::
+
     mean(L,1)
 
-    
-is the average accumalative marks of individual students. Clearly, mean(L,0)
-was a row wise calcultaion while mean(L,1) was a column wise calculation.
+L here is the two dimensional array.
+
+Similarly, the average marks scored by all the students in each
+subject can be calculated using ::
+
+   mean(L,0)
+
+Next, let's calculate the median of English marks for all the students.
+We can access the English marks of all students using ::
+
+   L[:,0]
+   
+To get the median we will do ::
+
+   median(L[:,0])
 
+For all the subjects we can use the same syntax as mean and calculate
+median across all rows using ::
+
+       median(L,0)
+  
+
+Similarly to calculate standard deviation for English we can do::
+
+    std(L[:,0])
+
+and for all rows::
+
+    std(L,0)
+
+Following is an exercise that you must do. 
+
+%% %% In the file football.txt at path /home/fossee/football.txt, the first column is the player name, the second is goals at home and the third is goals away.
+   1. Find the total goals for each player
+   2. Find the mean of home and away goals
+   3. Find the standard deviation of home and away goals
 
 {{{ Show summary slide }}}
 
--- a/statistics/slides.org	Wed Nov 10 21:17:33 2010 +0530
+++ b/statistics/slides.org	Thu Nov 11 01:43:26 2010 +0530
@@ -2,32 +2,73 @@
 #+LaTeX_CLASS_OPTIONS: [presentation]
 #+BEAMER_FRAME_LEVEL: 1
 
-#+BEAMER_HEADER_EXTRA: \usetheme{Warsaw}\useoutertheme{infolines}\usecolortheme{default}\setbeamercovered{transparent}
+#+BEAMER_HEADER_EXTRA: \usetheme{Warsaw}\usecolortheme{default}\useoutertheme{infolines}\setbeamercovered{transparent}
 #+COLUMNS: %45ITEM %10BEAMER_env(Env) %10BEAMER_envargs(Env Args) %4BEAMER_col(Col) %8BEAMER_extra(Extra)
 #+PROPERTY: BEAMER_col_ALL 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 :ETC
-#+OPTIONS:   H:5 num:t toc:nil \n:nil @:t ::t |:t ^:t -:t f:t *:t <:t
+
+#+LaTeX_CLASS: beamer
+#+LaTeX_CLASS_OPTIONS: [presentation]
+
+#+LaTeX_HEADER: \usepackage[english]{babel} \usepackage{ae,aecompl}
+#+LaTeX_HEADER: \usepackage{mathpazo,courier,euler} \usepackage[scaled=.95]{helvet}
+
+#+LaTeX_HEADER: \usepackage{listings}
+
+#+LaTeX_HEADER:\lstset{language=Python, basicstyle=\ttfamily\bfseries,
+#+LaTeX_HEADER:  commentstyle=\color{red}\itshape, stringstyle=\color{darkgreen},
+#+LaTeX_HEADER:  showstringspaces=false, keywordstyle=\color{blue}\bfseries}
 
 #+TITLE: Statistics
 #+AUTHOR: FOSSEE
-#+DATE: 2010-09-14 Tue
+#+DATE
 #+EMAIL:     info@fossee.in
 
-# \author[FOSSEE] {FOSSEE}
+#+DESCRIPTION: 
+#+KEYWORDS: 
+#+LANGUAGE:  en
+#+OPTIONS:   H:3 num:nil toc:nil \n:nil @:t ::t |:t ^:t -:t f:t *:t <:t
+#+OPTIONS:   TeX:t LaTeX:nil skip:nil d:nil todo:nil pri:nil tags:not-in-toc
+
+
+* Outline 
+ - Doing statistical operations in Python  
+   - Summing a set of numbers
+   - Finding their mean
+   - Finding their median
+   - Finding their standard deviation
+
+* Data set
+  - A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;; 
 
-# \institute[IIT Bombay] {Department of Aerospace Engineering\\IIT Bombay}
-# \date{}
+  The following are the fields in any given line.
+   - Region Code which is 'A'
+   - Roll Number 015163
+   - Name JOSEPH RAJ S
+   - Marks of 5 subjects: -- English 083 -- 
+     Hindi 042 -- Maths 47 --
+     Science 00 -- Social 72
+   - Total marks 244
 
-* Tutorial Plan 
-** Doing simple statistical operations in Python  
-** Using loadtxt to solve statistics problem
+* Question
+  - In the file football.txt at path /home/fossee/football.txt,
+    the first column is the player name, the second is goals at home
+    and the third is goals away.
+    - Find the total goals for each player
+    - Find the mean of home and away goals
+    - Find the standard deviation of home and away goals
+
+* Solution
+#+begin_src python
+  L=loadtxt('/home/fossee/football.txt',usecols=(1,2),
+            delimiter=',')
+  sum(L,1)
+  mean(L,0)
+  std(L,0)
+#+end_src
 
 * Summary 
-**  seq=[1,5,6,8,1,3,4,5]
-**  sum(seq)
-**  mean(seq)
-**  median(seq)
-**  std(seq)
+  - sum
+  - mean
+  - median
+  - std
 
-* Summary
-
-** loadtxt
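The operations listed on the Summary slide can be checked on a small array before running them on the full data set. The following is a minimal sketch, assuming plain NumPy imported as `np` rather than the pylab namespace the tutorial relies on. The `marks` array is hypothetical: its first row is the sample record from the Data set slide, and the other two rows are made up for illustration.

```python
import numpy as np

# Hypothetical marks table: one row per student, one column per subject
# (English, Hindi, Maths, Science, Social). The first row is the sample
# record from the slides; the other two rows are invented for illustration.
marks = np.array([
    [83, 42, 47, 0, 72],
    [50, 60, 70, 80, 90],
    [10, 20, 30, 40, 50],
], dtype=float)

first_total = marks[0, :].sum()          # total marks of the first student
per_student_mean = marks.mean(axis=1)    # one value per row (student)
per_subject_mean = marks.mean(axis=0)    # one value per column (subject)
english_median = np.median(marks[:, 0])  # median of the first column
per_subject_std = marks.std(axis=0)      # population std (ddof=0) per subject

print(first_total)
print(per_student_mean)
print(per_subject_mean)
print(english_median)
print(per_subject_std)
```

The shape of each result shows which axis was collapsed: `axis=1` leaves one value per student, `axis=0` leaves one value per subject.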
--- a/statistics/slides.tex	Wed Nov 10 21:17:33 2010 +0530
+++ b/statistics/slides.tex	Thu Nov 11 01:43:26 2010 +0530
@@ -1,106 +1,122 @@
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%Tutorial slides on Python.
-%
-% Author: FOSSEE 
-% Copyright (c) 2009, FOSSEE, IIT Bombay
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\documentclass[14pt,compress]{beamer}
-%\documentclass[draft]{beamer}
-%\documentclass[compress,handout]{beamer}
-%\usepackage{pgfpages} 
-%\pgfpagesuselayout{2 on 1}[a4paper,border shrink=5mm]
-
-% Modified from: generic-ornate-15min-45min.de.tex
-\mode<presentation>
-{
-  \usetheme{Warsaw}
-  \useoutertheme{infolines}
-  \setbeamercovered{transparent}
-}
-
-\usepackage[english]{babel}
+% Created 2010-11-11 Thu 01:35
+\documentclass[presentation]{beamer}
 \usepackage[latin1]{inputenc}
-%\usepackage{times}
 \usepackage[T1]{fontenc}
-
-\usepackage{ae,aecompl}
-\usepackage{mathpazo,courier,euler}
-\usepackage[scaled=.95]{helvet}
-
-\definecolor{darkgreen}{rgb}{0,0.5,0}
-
+\usepackage{fixltx2e}
+\usepackage{graphicx}
+\usepackage{longtable}
+\usepackage{float}
+\usepackage{wrapfig}
+\usepackage{soul}
+\usepackage{t1enc}
+\usepackage{textcomp}
+\usepackage{marvosym}
+\usepackage{wasysym}
+\usepackage{latexsym}
+\usepackage{amssymb}
+\usepackage{hyperref}
+\tolerance=1000
+\usepackage[english]{babel} \usepackage{ae,aecompl}
+\usepackage{mathpazo,courier,euler} \usepackage[scaled=.95]{helvet}
 \usepackage{listings}
-\lstset{language=Python,
-    basicstyle=\ttfamily\bfseries,
-    commentstyle=\color{red}\itshape,
-  stringstyle=\color{darkgreen},
-  showstringspaces=false,
-  keywordstyle=\color{blue}\bfseries}
+\lstset{language=Python, basicstyle=\ttfamily\bfseries,
+commentstyle=\color{red}\itshape, stringstyle=\color{darkgreen},
+showstringspaces=false, keywordstyle=\color{blue}\bfseries}
+\providecommand{\alert}[1]{\textbf{#1}}
 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-% Macros
-\setbeamercolor{emphbar}{bg=blue!20, fg=black}
-\newcommand{\emphbar}[1]
-{\begin{beamercolorbox}[rounded=true]{emphbar} 
-      {#1}
- \end{beamercolorbox}
-}
-\newcounter{time}
-\setcounter{time}{0}
-\newcommand{\inctime}[1]{\addtocounter{time}{#1}{\tiny \thetime\ m}}
+\title{Statistics}
+\author{FOSSEE}
+\date{11 November 2010}
 
-\newcommand{\typ}[1]{\lstinline{#1}}
-
-\newcommand{\kwrd}[1]{ \texttt{\textbf{\color{blue}{#1}}}  }
-
-% Title page
-\title{Your Title Here}
-
-\author[FOSSEE] {FOSSEE}
-
-\institute[IIT Bombay] {Department of Aerospace Engineering\\IIT Bombay}
-\date{}
-
-% DOCUMENT STARTS
+\usetheme{Warsaw}\usecolortheme{default}\useoutertheme{infolines}\setbeamercovered{transparent}
 \begin{document}
 
-\begin{frame}
-  \maketitle
-\end{frame}
+\maketitle
 
-\begin{frame}[fragile]
-  \frametitle{Outline}
-  \begin{itemize}
-    \item 
-  \end{itemize}
-\end{frame}
+
+
 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%              All other slides here.                  %%
-%% The same slides will be used in a classroom setting. %% 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
 
-\begin{frame}[fragile]
-  \frametitle{Summary}
-  \begin{itemize}
-    \item 
-  \end{itemize}
-\end{frame}
+
+
 
 \begin{frame}
-  \frametitle{Thank you!}  
-  \begin{block}{}
-  \begin{center}
-  This spoken tutorial has been produced by the
-  \textcolor{blue}{FOSSEE} team, which is funded by the 
-  \end{center}
-  \begin{center}
-    \textcolor{blue}{National Mission on Education through \\
-      Information \& Communication Technology \\ 
-      MHRD, Govt. of India}.
-  \end{center}  
-  \end{block}
+\frametitle{Outline}
+\label{sec-1}
+
+\begin{itemize}
+\item Doing statistical operations in Python
+
+\begin{itemize}
+\item Summing a set of numbers
+\item Finding their mean
+\item Finding their median
+\item Finding their standard deviation
+\end{itemize}
+
+\end{itemize}
+\end{frame}
+\begin{frame}
+\frametitle{Data set}
+\label{sec-2}
+
+\begin{itemize}
+\item A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;
+\end{itemize}
+
+
+  The following are the fields in any given line.
+\begin{itemize}
+\item Region Code which is `A'
+\item Roll Number 015163
+\item Name JOSEPH RAJ S
+\item Marks of 5 subjects: -- English 083 -- 
+     Hindi 042 -- Maths 47 --
+     Science 00 -- Social 72
+\item Total marks 244
+\end{itemize}
+\end{frame}
+\begin{frame}
+\frametitle{Question}
+\label{sec-3}
+
+\begin{itemize}
+\item In the file football.txt at path /home/fossee/football.txt,
+    the first column is the player name, the second is goals at home
+    and the third is goals away.
+
+\begin{itemize}
+\item Find the total goals for each player
+\item Find the mean of home and away goals
+\item Find the standard deviation of home and away goals
+\end{itemize}
+
+\end{itemize}
+\end{frame}
+\begin{frame}[fragile]
+\frametitle{Solution}
+\label{sec-4}
+
+\begin{verbatim}
+L=loadtxt('/home/fossee/football.txt',usecols=(1,2),
+          delimiter=',')
+sum(L,1)
+mean(L,0)
+std(L,0)
+\end{verbatim}
+\end{frame}
+\begin{frame}
+\frametitle{Summary}
+\label{sec-5}
+
+\begin{itemize}
+\item sum
+\item mean
+\item median
+\item std
+\end{itemize}
 \end{frame}
 
 \end{document}
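The Solution slide can be tried without the file on disk. The sketch below assumes plain NumPy imported as `np` instead of the pylab namespace, and feeds `loadtxt` an in-memory copy of the rows, adapted from football.txt with the spacing normalised. The names `FOOTBALL_TXT`, `goals`, `totals`, `means` and `stds` are illustrative and do not appear in the tutorial.

```python
import io

import numpy as np

# Rows adapted from football.txt (name, home goals, away goals), held in
# memory so that no file on disk is needed. Spacing has been normalised.
FOOTBALL_TXT = """\
Cristiano Ronaldo, 86, 67
Zlatan Ibrahimovic, 34, 65
Lionel Messi, 45, 41
Samuel Eto'o, 23, 24
Kaka, 43, 45
Frank Lampard, 65, 34
"""

# Skip column 0 (the player name) and load only the two numeric columns.
goals = np.loadtxt(io.StringIO(FOOTBALL_TXT), usecols=(1, 2), delimiter=",")

totals = goals.sum(axis=1)   # home + away goals for each player
means = goals.mean(axis=0)   # mean home goals, mean away goals
stds = goals.std(axis=0)     # population std of home and of away goals

print(totals)
print(means)
print(stds)
```

Summing along axis 1 gives one total per player, while the mean and standard deviation along axis 0 give one value each for the home and away columns.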