# HG changeset patch
# User Amit Sethi
# Date 1289419652 -19800
# Node ID d49aee7ab1b9bdf104e51bb4f1894dc67f932295
# Parent 79a7ca3073d4bd51addb4f889ed001213255d8d1
Rewrite of statistics script as suggested by punch and change in slides accordingly
diff -r 79a7ca3073d4 -r d49aee7ab1b9 statistics/football.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/statistics/football.txt Thu Nov 11 01:37:32 2010 +0530
@@ -0,0 +1,8 @@
+Cristiano Ronaldo, 86 ,67
+Zlatan Ibrahimovic, 34 , 65
+Lionel Messi , 45 , 41
+Samuel Eto’o , 23 , 24
+Kaka , 43 , 45
+Frank Lampard , 65 ,34
+
+
diff -r 79a7ca3073d4 -r d49aee7ab1b9 statistics/script.rst
--- a/statistics/script.rst Wed Nov 10 17:25:18 2010 +0530
+++ b/statistics/script.rst Thu Nov 11 01:37:32 2010 +0530
@@ -13,6 +13,8 @@
 .. Getting started with IPython
 .. Loading Data from files
 .. Getting started with Lists
+.. Accessing Pieces of Arrays
+
 .. Author : Amit Sethi
    Internal Reviewer : Puneeth
@@ -28,8 +30,12 @@
 {{{ Show the slide containing the outline slide }}}
 In this tutorial, we shall learn
- * Doing simple statistical operations in Python
- * Applying these to real world problems
+ * Doing statistical operations in Python
+   * Summing a set of numbers
+   * Finding their mean
+   * Finding their Median
+   * Finding their Standard Deviation
+
+
 .. #[punch: since loadtxt is anyway a pre-req, I would recommend you
@@ -45,88 +51,13 @@
 .. smaller data-set or something. Using lists doesn't seem natural.]
-We will first start with the most necessary statistical operation i.e
-finding mean.
-
-We have a list of ages of a random group of people ::
-
-    age_list = [4,45,23,34,34,38,65,42,32,7]
-
-One way of getting the mean could be getting sum of all the ages and
-dividing by the number of people in the group. ::
-
-    sum_age_list = sum(age_list)
-
-sum function gives us the sum of the elements. Note that the
-``sum_age_list`` variable is an integer and the number of people or
-length of the list is also an integer.
We will need to convert one of
-them to a float before carrying out the division. ::
-
-    mean_using_sum = float(sum_age_list)/len(age_list)
-
-This obviously gives the mean age but there is a simpler way to do
-this in Python - using the mean function::
-
-    mean(age_list)
-
-Mean can be used in more ways in case of 2 dimensional lists. Take a
-two dimensional list ::
-
-    two_dimension=[[1,5,6,8],[1,3,4,5]]
-
-The mean function by default gives the mean of the flattened sequence.
-A Flattened sequence means a list obtained by concatenating all the
-smaller lists into a large long list. In this case, the list obtained
-by writing the two lists one after the other. ::
-
-    mean(two_dimension)
-    flattened_seq=[1,5,6,8,1,3,4,5]
-    mean(flattened_seq)
-
-As you can see both the results are same. ``mean`` function can also
-give us the mean of each column, or the mean of corresponding elements
-in the smaller lists. ::
-
-    mean(two_dimension, 0)
-    array([ 1. , 4. , 5. , 6.5])
-
-we pass an extra argument 0 in that case.
-
-If we use an argument 1, we obtain the mean along the rows. ::
-
-    mean(two_dimension, 1)
-    array([ 5. , 3.25])
-
-We can see more option of mean using ::
-
-    mean?
-
-Similarly we can calculate median and stanard deviation of a list
-using the functions median and std::
-
-    median(age_list)
-    std(age_list)
-
-Median and std can also be calculated for two dimensional arrays along
-columns and rows just like mean.
-
-For example ::
-
-    median(two_dimension, 0)
-    std(two_dimension, 1)
-
-This gives us the median along the colums and standard devition along
-the rows.
-
-Now lets apply this to a real world example
-
-We will a data file that is at the a path ``/home/fossee/sslc2.txt``.
-It contains record of students and their performance in one of the
-State Secondary Board Examination. It has 180, 000 lines of record. We
-are going to read it and process this data. We can see the content of
-file by double clicking on it.
It might take some time to open since
-it is quite a large file. Please don't edit the data. This file has
-a particular structure.
+For this tutorial we will use a data file that is at the path
+``/home/fossee/sslc2.txt``. It contains records of students and their
+performance in one of the State Secondary Board Examinations. It has
+about 180,000 lines of records. We are going to read it and process this
+data. We can see the contents of the file by double clicking on it. It
+might take some time to open since it is quite a large file. Please
+don't edit the data. This file has a particular structure.
 We can do ::
@@ -134,6 +65,9 @@
 to check the contents of the file.
+
+{{{ Show the data structure on a slide }}}
+
 Each line in the file is a set of 11 fields separated by semi-colons
 Consider a sample line from this file.
 A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;
@@ -147,45 +81,97 @@
 * Total marks 244
-Now lets try and find the mean of English marks of all students.
-
-For this we do. ::
+Let's load this data as an array and then run various functions on
+it.
-    L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';')
+To get the data as an array we do. ::
+
+    L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7,),delimiter=';')
     L
-    mean(L)
+
 loadtxt function loads data from an external file.Delimiter specifies
-the kind of character are the fields of data seperated by.
-usecols specifies the columns to be used so (3,). The 'comma' is added
-because usecols is a sequence.
+the character by which the fields of data are separated. usecols
+specifies the columns to be used, so (3,4,5,6,7) loads those
+columns. The trailing comma is optional here; usecols is a sequence.
-To get the median marks. ::
+As we can see, L is an array. We can get the shape of this array using::
-    median(L)
+    L.shape
+    (185667, 5)
+
+Let's start applying statistical operations to these. We will start with
+the most basic, summing. How do you find the sum of marks of all
+subjects for the first student?
+
+As we know from our knowledge of accessing pieces of arrays, to access
+the first row we will do ::
-Standard deviation. ::
-
-    std(L)
+    L[0,:]
+
+Now to sum this we can say ::
+    totalmarks=sum(L[0,:])
+    totalmarks
-Now lets try and and get the mean for all the subjects ::
+To get the mean we can do ::
+
+    totalmarks/len(L[0,:])
+
+or simply ::
+
+    mean(L[0,:])
-    L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
-    mean(L,0)
-    array([ 73.55452504, 53.79828941, 62.83342759, 50.69806158, 63.17056881])
+But with such a large data set, calculating the mean one by one for
+each student is impractical. Is there a way to reduce the work?
+
+For this we will look into the documentation of mean by doing::
-As we can see from the result mean(L,0). The resultant sequence
-is the mean marks of all students that gave the exam for the five subjects.
+    mean?
-and ::
-
+As we know, L is a two dimensional array. We can calculate the mean
+along each axis of the array. The axis of rows is referred to by
+number 0 and columns by 1. So to calculate the mean across all columns,
+that is for each student, we pass the extra parameter 1 for the axis. ::
+
     mean(L,1)
-
-is the average accumalative marks of individual students. Clearly, mean(L,0)
-was a row wise calcultaion while mean(L,1) was a column wise calculation.
+L here is the two dimensional array.
+
+Similarly, the average marks scored by all the students for each
+subject can be calculated using ::
+
+    mean(L,0)
+
+Next, let's calculate the median of English marks for all the students.
+We can access English marks of all students using ::
+
+    L[:,0]
+
+To get the median we will do ::
+
+    median(L[:,0])
+For all the subjects we can use the same syntax as mean and calculate
+the median across all rows using ::
+
+    median(L,0)
+
+
+Similarly, to calculate the standard deviation for English we can do::
+
+    std(L[:,0])
+
+and for all the subjects::
+
+    std(L,0)
+
+Following is an exercise that you must do.
+
+%% %% In the file football.txt at path /home/fossee/football.txt, the first column is the player name, the second is goals at home and the third is goals away.
+     1. Find the total goals for each player
+     2. Find the mean of home and away goals
+     3. Find the standard deviation of home and away goals
 {{{ Show summary slide }}}
diff -r 79a7ca3073d4 -r d49aee7ab1b9 statistics/slides.org
--- a/statistics/slides.org Wed Nov 10 17:25:18 2010 +0530
+++ b/statistics/slides.org Thu Nov 11 01:37:32 2010 +0530
@@ -2,32 +2,73 @@
 #+LaTeX_CLASS_OPTIONS: [presentation]
 #+BEAMER_FRAME_LEVEL: 1
-#+BEAMER_HEADER_EXTRA: \usetheme{Warsaw}\useoutertheme{infolines}\usecolortheme{default}\setbeamercovered{transparent}
+#+BEAMER_HEADER_EXTRA: \usetheme{Warsaw}\usecolortheme{default}\useoutertheme{infolines}\setbeamercovered{transparent}
 #+COLUMNS: %45ITEM %10BEAMER_env(Env) %10BEAMER_envargs(Env Args) %4BEAMER_col(Col) %8BEAMER_extra(Extra)
 #+PROPERTY: BEAMER_col_ALL 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 :ETC
-#+OPTIONS: H:5 num:t toc:nil \n:nil @:t ::t |:t ^:t -:t f:t *:t <:t
+
+#+LaTeX_CLASS: beamer
+#+LaTeX_CLASS_OPTIONS: [presentation]
+
+#+LaTeX_HEADER: \usepackage[english]{babel} \usepackage{ae,aecompl}
+#+LaTeX_HEADER: \usepackage{mathpazo,courier,euler} \usepackage[scaled=.95]{helvet}
+
+#+LaTeX_HEADER: \usepackage{listings}
+
+#+LaTeX_HEADER:\lstset{language=Python, basicstyle=\ttfamily\bfseries,
+#+LaTeX_HEADER: commentstyle=\color{red}\itshape, stringstyle=\color{darkgreen},
+#+LaTeX_HEADER: showstringspaces=false, keywordstyle=\color{blue}\bfseries}
 #+TITLE: Statistics
 #+AUTHOR: FOSSEE
-#+DATE: 2010-09-14 Tue
+#+DATE
 #+EMAIL: info@fossee.in
-# \author[FOSSEE] {FOSSEE}
+#+DESCRIPTION:
+#+KEYWORDS:
+#+LANGUAGE: en
+#+OPTIONS: H:3 num:nil toc:nil \n:nil @:t ::t |:t ^:t -:t f:t *:t <:t
+#+OPTIONS: TeX:t LaTeX:nil skip:nil d:nil todo:nil pri:nil tags:not-in-toc
+
+
+* Outline
+ - Doing statistical operations in Python
+   - Summing a set of numbers
+   - Finding their mean
+   - Finding their Median
+   - Finding their
Standard Deviation
+
+* Data set
+ - A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;
-# \institute[IIT Bombay] {Department of Aerospace Engineering\\IIT Bombay}
-# \date{}
+   The following are the fields in any given line.
+ - Region Code which is 'A'
+ - Roll Number 015163
+ - Name JOSEPH RAJ S
+ - Marks of 5 subjects: -- English 083 --
+   Hindi 042 -- Maths 47 --
+   Science 00 -- Social 72
+ - Total marks 244
-* Tutorial Plan
-** Doing simple statistical operations in Python
-** Using loadtxt to solve statistics problem
+* Question
+ - In the file football.txt at path /home/fossee/football.txt,
+   the first column is the player name, the second is goals at home
+   and the third is goals away.
+   - Find the total goals for each player
+   - Mean home and away goals
+   - Standard deviation of home and away goals
+
+* Solution
+#+begin_src python
+  L=loadtxt('/home/fossee/football.txt',usecols=(1,2),
+            delimiter=',')
+  sum(L,1)
+  mean(L,0)
+  std(L,0)
+#+end_src
 * Summary
-** seq=[1,5,6,8,1,3,4,5]
-** sum(seq)
-** mean(seq)
-** median(seq)
-** std(seq)
+ - sum
+ - mean
+ - median
+ - std
-* Summary
-
-** loadtxt
diff -r 79a7ca3073d4 -r d49aee7ab1b9 statistics/slides.tex
--- a/statistics/slides.tex Wed Nov 10 17:25:18 2010 +0530
+++ b/statistics/slides.tex Thu Nov 11 01:37:32 2010 +0530
@@ -1,106 +1,122 @@
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%Tutorial slides on Python.
-%
-% Author: FOSSEE
-% Copyright (c) 2009, FOSSEE, IIT Bombay
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\documentclass[14pt,compress]{beamer}
-%\documentclass[draft]{beamer}
-%\documentclass[compress,handout]{beamer}
-%\usepackage{pgfpages}
-%\pgfpagesuselayout{2 on 1}[a4paper,border shrink=5mm]
-
-% Modified from: generic-ornate-15min-45min.de.tex
-\mode<presentation>
-{
-  \usetheme{Warsaw}
-  \useoutertheme{infolines}
-  \setbeamercovered{transparent}
-}
-
-\usepackage[english]{babel}
+% Created 2010-11-11 Thu 01:35
+\documentclass[presentation]{beamer}
 \usepackage[latin1]{inputenc}
-%\usepackage{times}
 \usepackage[T1]{fontenc}
-
-\usepackage{ae,aecompl}
-\usepackage{mathpazo,courier,euler}
-\usepackage[scaled=.95]{helvet}
-
-\definecolor{darkgreen}{rgb}{0,0.5,0}
-
+\usepackage{fixltx2e}
+\usepackage{graphicx}
+\usepackage{longtable}
+\usepackage{float}
+\usepackage{wrapfig}
+\usepackage{soul}
+\usepackage{t1enc}
+\usepackage{textcomp}
+\usepackage{marvosym}
+\usepackage{wasysym}
+\usepackage{latexsym}
+\usepackage{amssymb}
+\usepackage{hyperref}
+\tolerance=1000
+\usepackage[english]{babel} \usepackage{ae,aecompl}
+\usepackage{mathpazo,courier,euler} \usepackage[scaled=.95]{helvet}
 \usepackage{listings}
-\lstset{language=Python,
-  basicstyle=\ttfamily\bfseries,
-  commentstyle=\color{red}\itshape,
-  stringstyle=\color{darkgreen},
-  showstringspaces=false,
-  keywordstyle=\color{blue}\bfseries}
+\lstset{language=Python, basicstyle=\ttfamily\bfseries,
+commentstyle=\color{red}\itshape, stringstyle=\color{darkgreen},
+showstringspaces=false, keywordstyle=\color{blue}\bfseries}
+\providecommand{\alert}[1]{\textbf{#1}}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-% Macros
-\setbeamercolor{emphbar}{bg=blue!20, fg=black}
-\newcommand{\emphbar}[1]
-{\begin{beamercolorbox}[rounded=true]{emphbar}
-    {#1}
-  \end{beamercolorbox}
-}
-\newcounter{time}
-\setcounter{time}{0}
-\newcommand{\inctime}[1]{\addtocounter{time}{#1}{\tiny \thetime\ m}}
+\title{Statistics}
+\author{FOSSEE}
+\date{11 November 2010}
-\newcommand{\typ}[1]{\lstinline{#1}}
-
-\newcommand{\kwrd}[1]{ \texttt{\textbf{\color{blue}{#1}}} }
-
-% Title page
-\title{Your Title Here}
-
-\author[FOSSEE] {FOSSEE}
-
-\institute[IIT Bombay] {Department of Aerospace Engineering\\IIT Bombay}
-\date{}
-
-% DOCUMENT STARTS
+\usetheme{Warsaw}\usecolortheme{default}\useoutertheme{infolines}\setbeamercovered{transparent}
 \begin{document}
-\begin{frame}
-  \maketitle
-\end{frame}
+\maketitle
-\begin{frame}[fragile]
-  \frametitle{Outline}
-  \begin{itemize}
-    \item
-  \end{itemize}
-\end{frame}
+
+
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%% All other slides here. %%
-%% The same slides will be used in a classroom setting. %%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
-\begin{frame}[fragile]
-  \frametitle{Summary}
-  \begin{itemize}
-    \item
-  \end{itemize}
-\end{frame}
+
+
 \begin{frame}
-  \frametitle{Thank you!}
-  \begin{block}{}
-  \begin{center}
-  This spoken tutorial has been produced by the
-  \textcolor{blue}{FOSSEE} team, which is funded by the
-  \end{center}
-  \begin{center}
-    \textcolor{blue}{National Mission on Education through \\
-      Information \& Communication Technology \\
-      MHRD, Govt. of India}.
-  \end{center}
-  \end{block}
+\frametitle{Outline}
+\label{sec-1}
+
+\begin{itemize}
+\item Doing statistical operations in Python
+
+\begin{itemize}
+\item Summing a set of numbers
+\item Finding their mean
+\item Finding their Median
+\item Finding their Standard Deviation
+\end{itemize}
+
+\end{itemize}
+\end{frame}
+\begin{frame}
+\frametitle{Data set}
+\label{sec-2}
+
+\begin{itemize}
+\item A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;
+\end{itemize}
+
+
+  The following are the fields in any given line.
+\begin{itemize}
+\item Region Code which is `A'
+\item Roll Number 015163
+\item Name JOSEPH RAJ S
+\item Marks of 5 subjects: -- English 083 --
+  Hindi 042 -- Maths 47 --
+  Science 00 -- Social 72
+\item Total marks 244
+\end{itemize}
+\end{frame}
+\begin{frame}
+\frametitle{Question}
+\label{sec-3}
+
+\begin{itemize}
+\item In the file football.txt at path /home/fossee/football.txt,
+  the first column is the player name, the second is goals at home
+  and the third is goals away.
+
+\begin{itemize}
+\item Find the total goals for each player
+\item Mean home and away goals
+\item Standard deviation of home and away goals
+\end{itemize}
+
+\end{itemize}
+\end{frame}
+\begin{frame}[fragile]
+\frametitle{Solution}
+\label{sec-4}
+
+\begin{verbatim}
+L=loadtxt('/home/fossee/football.txt',usecols=(1,2),
+          delimiter=',')
+sum(L,1)
+mean(L,0)
+std(L,0)
+\end{verbatim}
+\end{frame}
+\begin{frame}
+\frametitle{Summary}
+\label{sec-5}
+
+\begin{itemize}
+\item sum
+\item mean
+\item median
+\item std
+\end{itemize}
 \end{frame}
 \end{document}
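
As a companion to the Solution slide in this patch, here is a minimal sketch of the football exercise using explicit NumPy calls. It assumes NumPy is installed and embeds the six rows of football.txt inline (with the first player's name spelled out in full) so that it runs without the file at /home/fossee/football.txt. The names `FOOTBALL`, `goals`, `total_per_player` and the other result variables are illustrative and not part of the patch.

```python
import io

import numpy as np

# Inline copy of the rows that this patch adds to football.txt
# (assumption: used here instead of reading /home/fossee/football.txt).
FOOTBALL = """\
Cristiano Ronaldo, 86 ,67
Zlatan Ibrahimovic, 34 , 65
Lionel Messi , 45 , 41
Samuel Eto'o , 23 , 24
Kaka , 43 , 45
Frank Lampard , 65 ,34
"""

# Column 0 is the player name and is skipped; columns 1 and 2 hold the
# home and away goals, so the result is a 6 x 2 float array.
goals = np.loadtxt(io.StringIO(FOOTBALL), delimiter=",", usecols=(1, 2))

# axis=1 works across the columns of each row: one value per player.
total_per_player = goals.sum(axis=1)

# axis=0 works down the rows of each column: one value for home goals
# and one for away goals.
mean_home_away = goals.mean(axis=0)
median_home_away = np.median(goals, axis=0)
std_home_away = goals.std(axis=0)  # population standard deviation (ddof=0)

print(total_per_player)
print(mean_home_away, median_home_away, std_home_away)
```

The same axis convention applies to the marks array `L` in the script: `mean(L,1)` gives one value per student and `mean(L,0)` one value per subject.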