Rewrite statistics script as suggested by punch and update slides accordingly
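
For reviewers, the exercise added in this change can be sketched as below. This is a minimal illustration rather than the tutorial's exact session: the tutorial types bare `loadtxt`, `sum`, `mean` and `std` (assumed to come from an IPython pylab session), while this sketch imports NumPy explicitly, reads the data from an in-memory string instead of a file path, and normalises the whitespace of the values in football.txt. The variable names are hypothetical.

```python
import io

import numpy as np

# Contents of football.txt from this change (whitespace normalised):
# player name, goals at home, goals away.
data = io.StringIO(
    "Cristiano Ronaldo,86,67\n"
    "Zlatan Ibrahimovic,34,65\n"
    "Lionel Messi,45,41\n"
    "Samuel Eto'o,23,24\n"
    "Kaka,43,45\n"
    "Frank Lampard,65,34\n"
)

# Skip the name column; load only the two numeric columns.
goals = np.loadtxt(data, usecols=(1, 2), delimiter=",")

# axis=1 collapses the columns: one total per player (row).
totals = np.sum(goals, axis=1)

# axis=0 collapses the rows: one value per column (home, away).
means = np.mean(goals, axis=0)
stds = np.std(goals, axis=0)

print(totals)
print(means)
print(stds)
```

The same axis convention applies to the marks array in the script: axis 1 gives one value per student, axis 0 gives one value per subject.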
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/statistics/football.txt Thu Nov 11 01:37:32 2010 +0530
@@ -0,0 +1,8 @@
+Cristiano Ronaldo, 86 ,67
+Zlatan Ibrahimovic, 34 , 65
+Lionel Messi , 45 , 41
+Samuel Eto’o , 23 , 24
+Kaka , 43 , 45
+Frank Lampard , 65 ,34
+
+
--- a/statistics/script.rst Wed Nov 10 17:25:18 2010 +0530
+++ b/statistics/script.rst Thu Nov 11 01:37:32 2010 +0530
@@ -13,6 +13,8 @@
.. Getting started with IPython
.. Loading Data from files
.. Getting started with Lists
+.. Accessing Pieces of Arrays
+
.. Author : Amit Sethi
Internal Reviewer : Puneeth
@@ -28,8 +30,12 @@
{{{ Show the slide containing the outline slide }}}
In this tutorial, we shall learn
- * Doing simple statistical operations in Python
- * Applying these to real world problems
+ * Doing statistical operations in Python
+ * Summing a set of numbers
+ * Finding their mean
+ * Finding their median
+ * Finding their standard deviation
+
.. #[punch: since loadtxt is anyway a pre-req, I would recommend you
@@ -45,88 +51,13 @@
.. smaller data-set or something. Using lists doesn't seem natural.]
-We will first start with the most necessary statistical operation i.e
-finding mean.
-
-We have a list of ages of a random group of people ::
-
- age_list = [4,45,23,34,34,38,65,42,32,7]
-
-One way of getting the mean could be getting sum of all the ages and
-dividing by the number of people in the group. ::
-
- sum_age_list = sum(age_list)
-
-sum function gives us the sum of the elements. Note that the
-``sum_age_list`` variable is an integer and the number of people or
-length of the list is also an integer. We will need to convert one of
-them to a float before carrying out the division. ::
-
- mean_using_sum = float(sum_age_list)/len(age_list)
-
-This obviously gives the mean age but there is a simpler way to do
-this in Python - using the mean function::
-
- mean(age_list)
-
-Mean can be used in more ways in case of 2 dimensional lists. Take a
-two dimensional list ::
-
- two_dimension=[[1,5,6,8],[1,3,4,5]]
-
-The mean function by default gives the mean of the flattened sequence.
-A Flattened sequence means a list obtained by concatenating all the
-smaller lists into a large long list. In this case, the list obtained
-by writing the two lists one after the other. ::
-
- mean(two_dimension)
- flattened_seq=[1,5,6,8,1,3,4,5]
- mean(flattened_seq)
-
-As you can see both the results are same. ``mean`` function can also
-give us the mean of each column, or the mean of corresponding elements
-in the smaller lists. ::
-
- mean(two_dimension, 0)
- array([ 1. , 4. , 5. , 6.5])
-
-we pass an extra argument 0 in that case.
-
-If we use an argument 1, we obtain the mean along the rows. ::
-
- mean(two_dimension, 1)
- array([ 5. , 3.25])
-
-We can see more option of mean using ::
-
- mean?
-
-Similarly we can calculate median and stanard deviation of a list
-using the functions median and std::
-
- median(age_list)
- std(age_list)
-
-Median and std can also be calculated for two dimensional arrays along
-columns and rows just like mean.
-
-For example ::
-
- median(two_dimension, 0)
- std(two_dimension, 1)
-
-This gives us the median along the colums and standard devition along
-the rows.
-
-Now lets apply this to a real world example
-
-We will a data file that is at the a path ``/home/fossee/sslc2.txt``.
-It contains record of students and their performance in one of the
-State Secondary Board Examination. It has 180, 000 lines of record. We
-are going to read it and process this data. We can see the content of
-file by double clicking on it. It might take some time to open since
-it is quite a large file. Please don't edit the data. This file has
-a particular structure.
+For this tutorial we will use a data file that is at the path
+``/home/fossee/sslc2.txt``. It contains records of students and their
+performance in one of the State Secondary Board Examinations. It has
+about 180,000 lines of records. We are going to read it and process this
+data. We can see the contents of the file by double clicking on it. It
+might take some time to open since it is quite a large file. Please
+don't edit the data. This file has a particular structure.
We can do ::
@@ -134,6 +65,9 @@
to check the contents of the file.
+
+{{{ Show the data structure on a slide }}}
+
Each line in the file is a set of 11 fields separated
by semi-colons Consider a sample line from this file.
A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;
@@ -147,45 +81,97 @@
* Total marks 244
-Now lets try and find the mean of English marks of all students.
-
-For this we do. ::
+Let's try to load this data as an array and then run various functions on
+it.
- L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';')
+To get the data as an array we do ::
+
+ L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7,),delimiter=';')
L
- mean(L)
+
loadtxt function loads data from an external file.Delimiter specifies
-the kind of character are the fields of data seperated by.
-usecols specifies the columns to be used so (3,). The 'comma' is added
-because usecols is a sequence.
+the character by which the fields of data are separated. usecols
+specifies the columns to be used, so (3,4,5,6,7) loads those
+columns. The 'comma' is added because usecols is a sequence.
-To get the median marks. ::
+As we can see L is an array. We can get the shape of this array using::
- median(L)
+ L.shape
+ (185667, 5)
+
+Let's start applying statistical operations to this array. We will start
+with the most basic one, summing. How do you find the sum of marks of all
+subjects for the first student?
+
+We know how to access pieces of arrays. To access
+the first row we will do ::
-Standard deviation. ::
-
- std(L)
+ L[0,:]
+
+Now to sum this we can say ::
+ totalmarks=sum(L[0,:])
+ totalmarks
-Now lets try and and get the mean for all the subjects ::
+To get the mean we can do ::
+
+ totalmarks/len(L[0,:])
+
+or simply ::
+
+ mean(L[0,:])
- L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
- mean(L,0)
- array([ 73.55452504, 53.79828941, 62.83342759, 50.69806158, 63.17056881])
+But with such a large data set, calculating the mean for each student
+one by one is impractical. Is there a way to reduce the work?
+
+For this we will look into the documentation of mean by doing::
-As we can see from the result mean(L,0). The resultant sequence
-is the mean marks of all students that gave the exam for the five subjects.
+ mean?
-and ::
-
+As we know, L is a two-dimensional array. We can calculate the mean
+along each axis of the array. The axis of rows is referred to by the
+number 0 and that of columns by 1. So to calculate the mean across all
+columns, i.e. for each student, we pass the extra parameter 1 for the axis ::
+
mean(L,1)
-
-is the average accumalative marks of individual students. Clearly, mean(L,0)
-was a row wise calcultaion while mean(L,1) was a column wise calculation.
+L here is the two dimensional array.
+
+Similarly, the average marks scored by all the students in each
+subject can be calculated using ::
+
+ mean(L,0)
+
+Next, let's calculate the median of English marks for all the students.
+We can access the English marks of all students using ::
+
+ L[:,0]
+
+To get the median we will do ::
+
+ median(L[:,0])
+For all the subjects we can use the same syntax as mean and calculate
+median across all rows using ::
+
+ median(L,0)
+
+
+Similarly, to calculate the standard deviation for English we can do::
+
+ std(L[:,0])
+
+and for all rows::
+
+ std(L,0)
+
+Following is an exercise that you must do.
+
+%% %% In the given file football.txt at path /home/fossee/football.txt, one column is the player name, the second is goals at home and the third is goals away.
+ 1. Find the total goals for each player
+ 2. Find the mean of home and away goals
+ 3. Find the standard deviation of home and away goals
{{{ Show summary slide }}}
--- a/statistics/slides.org Wed Nov 10 17:25:18 2010 +0530
+++ b/statistics/slides.org Thu Nov 11 01:37:32 2010 +0530
@@ -2,32 +2,73 @@
#+LaTeX_CLASS_OPTIONS: [presentation]
#+BEAMER_FRAME_LEVEL: 1
-#+BEAMER_HEADER_EXTRA: \usetheme{Warsaw}\useoutertheme{infolines}\usecolortheme{default}\setbeamercovered{transparent}
+#+BEAMER_HEADER_EXTRA: \usetheme{Warsaw}\usecolortheme{default}\useoutertheme{infolines}\setbeamercovered{transparent}
#+COLUMNS: %45ITEM %10BEAMER_env(Env) %10BEAMER_envargs(Env Args) %4BEAMER_col(Col) %8BEAMER_extra(Extra)
#+PROPERTY: BEAMER_col_ALL 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 :ETC
-#+OPTIONS: H:5 num:t toc:nil \n:nil @:t ::t |:t ^:t -:t f:t *:t <:t
+
+#+LaTeX_CLASS: beamer
+#+LaTeX_CLASS_OPTIONS: [presentation]
+
+#+LaTeX_HEADER: \usepackage[english]{babel} \usepackage{ae,aecompl}
+#+LaTeX_HEADER: \usepackage{mathpazo,courier,euler} \usepackage[scaled=.95]{helvet}
+
+#+LaTeX_HEADER: \usepackage{listings}
+
+#+LaTeX_HEADER:\lstset{language=Python, basicstyle=\ttfamily\bfseries,
+#+LaTeX_HEADER: commentstyle=\color{red}\itshape, stringstyle=\color{darkgreen},
+#+LaTeX_HEADER: showstringspaces=false, keywordstyle=\color{blue}\bfseries}
#+TITLE: Statistics
#+AUTHOR: FOSSEE
-#+DATE: 2010-09-14 Tue
+#+DATE
#+EMAIL: info@fossee.in
-# \author[FOSSEE] {FOSSEE}
+#+DESCRIPTION:
+#+KEYWORDS:
+#+LANGUAGE: en
+#+OPTIONS: H:3 num:nil toc:nil \n:nil @:t ::t |:t ^:t -:t f:t *:t <:t
+#+OPTIONS: TeX:t LaTeX:nil skip:nil d:nil todo:nil pri:nil tags:not-in-toc
+
+
+* Outline
+ - Doing statistical operations in Python
+ - Summing a set of numbers
+ - Finding their mean
+ - Finding their median
+ - Finding their standard deviation
+
+* Data set
+ - A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;
-# \institute[IIT Bombay] {Department of Aerospace Engineering\\IIT Bombay}
-# \date{}
+ The following are the fields in any given line.
+ - Region Code which is 'A'
+ - Roll Number 015163
+ - Name JOSEPH RAJ S
+ - Marks of 5 subjects: -- English 083 --
+ Hindi 042 -- Maths 47 --
+ Science 00 -- Social 72
+ - Total marks 244
-* Tutorial Plan
-** Doing simple statistical operations in Python
-** Using loadtxt to solve statistics problem
+* Question
+ - In the given file football.txt at path /home/fossee/football.txt,
+ one column is the player name, the second is goals at home
+ and the third is goals away.
+ - Find the total goals for each player
+ - Mean home and away goals
+ - Standard deviation of home and away goals
+
+* Solution
+#+begin_src python
+ L=loadtxt('/home/fossee/football.txt',usecols=(1,2),
+ delimiter=',')
+ sum(L,1)
+ mean(L,0)
+ std(L,0)
+#+end_src
* Summary
-** seq=[1,5,6,8,1,3,4,5]
-** sum(seq)
-** mean(seq)
-** median(seq)
-** std(seq)
+ - sum
+ - mean
+ - median
+ - std
-* Summary
-
-** loadtxt
--- a/statistics/slides.tex Wed Nov 10 17:25:18 2010 +0530
+++ b/statistics/slides.tex Thu Nov 11 01:37:32 2010 +0530
@@ -1,106 +1,122 @@
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%Tutorial slides on Python.
-%
-% Author: FOSSEE
-% Copyright (c) 2009, FOSSEE, IIT Bombay
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\documentclass[14pt,compress]{beamer}
-%\documentclass[draft]{beamer}
-%\documentclass[compress,handout]{beamer}
-%\usepackage{pgfpages}
-%\pgfpagesuselayout{2 on 1}[a4paper,border shrink=5mm]
-
-% Modified from: generic-ornate-15min-45min.de.tex
-\mode<presentation>
-{
- \usetheme{Warsaw}
- \useoutertheme{infolines}
- \setbeamercovered{transparent}
-}
-
-\usepackage[english]{babel}
+% Created 2010-11-11 Thu 01:35
+\documentclass[presentation]{beamer}
\usepackage[latin1]{inputenc}
-%\usepackage{times}
\usepackage[T1]{fontenc}
-
-\usepackage{ae,aecompl}
-\usepackage{mathpazo,courier,euler}
-\usepackage[scaled=.95]{helvet}
-
-\definecolor{darkgreen}{rgb}{0,0.5,0}
-
+\usepackage{fixltx2e}
+\usepackage{graphicx}
+\usepackage{longtable}
+\usepackage{float}
+\usepackage{wrapfig}
+\usepackage{soul}
+\usepackage{t1enc}
+\usepackage{textcomp}
+\usepackage{marvosym}
+\usepackage{wasysym}
+\usepackage{latexsym}
+\usepackage{amssymb}
+\usepackage{hyperref}
+\tolerance=1000
+\usepackage[english]{babel} \usepackage{ae,aecompl}
+\usepackage{mathpazo,courier,euler} \usepackage[scaled=.95]{helvet}
\usepackage{listings}
-\lstset{language=Python,
- basicstyle=\ttfamily\bfseries,
- commentstyle=\color{red}\itshape,
- stringstyle=\color{darkgreen},
- showstringspaces=false,
- keywordstyle=\color{blue}\bfseries}
+\lstset{language=Python, basicstyle=\ttfamily\bfseries,
+commentstyle=\color{red}\itshape, stringstyle=\color{darkgreen},
+showstringspaces=false, keywordstyle=\color{blue}\bfseries}
+\providecommand{\alert}[1]{\textbf{#1}}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-% Macros
-\setbeamercolor{emphbar}{bg=blue!20, fg=black}
-\newcommand{\emphbar}[1]
-{\begin{beamercolorbox}[rounded=true]{emphbar}
- {#1}
- \end{beamercolorbox}
-}
-\newcounter{time}
-\setcounter{time}{0}
-\newcommand{\inctime}[1]{\addtocounter{time}{#1}{\tiny \thetime\ m}}
+\title{Statistics}
+\author{FOSSEE}
+\date{11 November 2010}
-\newcommand{\typ}[1]{\lstinline{#1}}
-
-\newcommand{\kwrd}[1]{ \texttt{\textbf{\color{blue}{#1}}} }
-
-% Title page
-\title{Your Title Here}
-
-\author[FOSSEE] {FOSSEE}
-
-\institute[IIT Bombay] {Department of Aerospace Engineering\\IIT Bombay}
-\date{}
-
-% DOCUMENT STARTS
+\usetheme{Warsaw}\usecolortheme{default}\useoutertheme{infolines}\setbeamercovered{transparent}
\begin{document}
-\begin{frame}
- \maketitle
-\end{frame}
+\maketitle
-\begin{frame}[fragile]
- \frametitle{Outline}
- \begin{itemize}
- \item
- \end{itemize}
-\end{frame}
+
+
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%% All other slides here. %%
-%% The same slides will be used in a classroom setting. %%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
-\begin{frame}[fragile]
- \frametitle{Summary}
- \begin{itemize}
- \item
- \end{itemize}
-\end{frame}
+
+
\begin{frame}
- \frametitle{Thank you!}
- \begin{block}{}
- \begin{center}
- This spoken tutorial has been produced by the
- \textcolor{blue}{FOSSEE} team, which is funded by the
- \end{center}
- \begin{center}
- \textcolor{blue}{National Mission on Education through \\
- Information \& Communication Technology \\
- MHRD, Govt. of India}.
- \end{center}
- \end{block}
+\frametitle{Outline}
+\label{sec-1}
+
+\begin{itemize}
+\item Doing statistical operations in Python
+
+\begin{itemize}
+\item Summing a set of numbers
+\item Finding their mean
+\item Finding their median
+\item Finding their standard deviation
+\end{itemize}
+
+\end{itemize}
+\end{frame}
+\begin{frame}
+\frametitle{Data set}
+\label{sec-2}
+
+\begin{itemize}
+\item A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;
+\end{itemize}
+
+
+ The following are the fields in any given line.
+\begin{itemize}
+\item Region Code which is `A'
+\item Roll Number 015163
+\item Name JOSEPH RAJ S
+\item Marks of 5 subjects: -- English 083 --
+ Hindi 042 -- Maths 47 --
+ Science 00 -- Social 72
+\item Total marks 244
+\end{itemize}
+\end{frame}
+\begin{frame}
+\frametitle{Question}
+\label{sec-3}
+
+\begin{itemize}
+\item In the given file football.txt at path /home/fossee/football.txt,
+ one column is the player name, the second is goals at home
+ and the third is goals away.
+
+\begin{itemize}
+\item Find the total goals for each player
+\item Mean home and away goals
+\item Standard deviation of home and away goals
+\end{itemize}
+
+\end{itemize}
+\end{frame}
+\begin{frame}[fragile]
+\frametitle{Solution}
+\label{sec-4}
+
+\begin{verbatim}
+L=loadtxt('/home/fossee/football.txt',usecols=(1,2),
+ delimiter=',')
+sum(L,1)
+mean(L,0)
+std(L,0)
+\end{verbatim}
+\end{frame}
+\begin{frame}
+\frametitle{Summary}
+\label{sec-5}
+
+\begin{itemize}
+\item sum
+\item mean
+\item median
+\item std
+\end{itemize}
\end{frame}
\end{document}