st-scripts: statistics.rst@2f30ecfd6007 (annotated)

Hello friends and welcome to the tutorial on statistics using Python.

{{{ Show the slide containing title }}}

{{{ Show the slide containing the outline slide }}}

In this tutorial, we shall learn to

* Do simple statistical operations in Python
* Apply these to real world problems

You will need IPython with pylab running on your computer
to use this tutorial.

You will also need to know about loading data using loadtxt to be
able to follow the real world application.

We will start with the most basic statistical operation, i.e. finding
the mean.

We have a list of ages of a random group of people ::

    age_list=[4,45,23,34,34,38,65,42,32,7]

One way of getting the mean is to take the sum of all the elements and
divide it by the length of the list. The sum function gives us the sum
of the elements ::

    sum_age_list=sum(age_list)

We then divide by the length of the list. The length is converted to a
float so that the division is not truncated to an integer on Python 2 ::

    mean_using_sum=sum_age_list/float(len(age_list))

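As a minimal sketch of why the float conversion matters, the snippet below compares floor division with true division on the same list. It uses only built-in Python; the names ``total``, ``count``, ``truncated`` and ``exact`` are chosen here for illustration and are not part of the tutorial.

```python
age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

total = sum(age_list)   # sum of all ages
count = len(age_list)   # number of people

# Floor division drops the fractional part. This is what a plain
# integer / integer division does on Python 2.
truncated = total // count

# Converting one operand to float gives the true mean on Python 2 and 3.
exact = total / float(count)

print(truncated)
print(exact)
```

If the two printed values differ, the integer result has lost the fractional part of the mean.
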
This gives the mean age, but pylab provides a more direct way of
getting the mean. This is the mean function ::

    mean(age_list)

mean can be used in more ways in the case of two dimensional lists.
Take a two dimensional list ::

    two_dimension=[[1,5,6,8],[1,3,4,5]]

Used in its default manner, the mean function gives the mean of the
flattened sequence. The flattened sequence is the two lists taken
as if they were a single list of elements ::

    mean(two_dimension)
    flattened_seq=[1,5,6,8,1,3,4,5]
    mean(flattened_seq)

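The flattening described above can be sketched outside the pylab session as well. This assumes NumPy is installed and imported as ``np`` (under pylab, ``mean`` is NumPy's ``mean``); the list comprehension used to flatten the list is one illustrative way of doing it, not something the tutorial prescribes.

```python
import numpy as np

two_dimension = [[1, 5, 6, 8], [1, 3, 4, 5]]

# Flatten by walking over each row and then over each element of the row.
flattened_seq = [value for row in two_dimension for value in row]

mean_2d = np.mean(two_dimension)    # no axis given: uses every element
mean_flat = np.mean(flattened_seq)  # same elements as a single list

print(mean_2d, mean_flat)
```
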
As you can see, both results are the same. Another option is the mean
of each column ::

    mean(two_dimension,0)
    array([ 1. , 4. , 5. , 6.5])

or the mean of each of the two rows separately ::

    mean(two_dimension,1)
    array([ 5. , 3.25])

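The second argument of mean is the axis along which the averaging is done. A small sketch, assuming NumPy imported as ``np`` and spelling the argument out with the ``axis`` keyword, shows how the shape of the result relates to the shape of the input. The variable names are illustrative.

```python
import numpy as np

two_dimension = np.array([[1, 5, 6, 8], [1, 3, 4, 5]])  # 2 rows, 4 columns

# axis=0 averages down the rows, leaving one value per column.
column_means = np.mean(two_dimension, axis=0)

# axis=1 averages across the columns, leaving one value per row.
row_means = np.mean(two_dimension, axis=1)

print(two_dimension.shape, column_means.shape, row_means.shape)
print(column_means)
print(row_means)
```
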
We can see more options of mean using ::

    mean?

Similarly, we can calculate the median and the standard deviation of a
list using the functions median and std ::

    median(age_list)
    std(age_list)

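Here is a short sketch of the same two calls with NumPy imported as ``np``. One point worth knowing is that NumPy's ``std`` computes the population standard deviation by default; passing ``ddof=1`` gives the sample standard deviation instead. Whether the population or the sample form is wanted depends on the problem, so this is only an illustration of the two options.

```python
import numpy as np

age_list = [4, 45, 23, 34, 34, 38, 65, 42, 32, 7]

median_age = np.median(age_list)

# Default (ddof=0): divide the summed squared deviations by N.
population_std = np.std(age_list)

# ddof=1: divide by N - 1 instead, the usual sample estimate.
sample_std = np.std(age_list, ddof=1)

print(median_age)
print(population_std, sample_std)
```
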
Now let's apply this to a real world example.

We will use a data file located at the path
``/home/fossee/sslc2.txt``. It contains records of students and their
performance in one of the State Secondary Board Examinations. It has
180,000 lines of records. We are going to read it and process this
data. We can see the content of the file by double clicking on it. It
might take some time to open since it is quite a large file. Please
don't edit the data. This file has a particular structure.

We can do ::

    cat /home/fossee/sslc2.txt

to check the contents of the file.

Each line in the file is a set of 11 fields separated
by semi-colons. Consider a sample line from this file ::

    A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;

The following are the fields in any given line.

* Region Code, which is 'A'
* Roll Number 015163
* Name JOSEPH RAJ S
* Marks of 5 subjects: English 083, Hindi 042, Maths 47,
  Science 00, Social 72
* Total marks 244

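To make the field layout concrete, the sample line can be taken apart with plain Python string handling. This is only a sketch for one record; the variable names are illustrative, and the column positions are read off the sample line above.

```python
sample_line = "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;"

# Split the record on the semi-colon delimiter.
fields = sample_line.split(";")

region_code = fields[0]
roll_number = fields[1]
name = fields[2]

# Columns 3 to 7 hold the five subject marks, in the order
# English, Hindi, Maths, Science, Social.
subject_marks = [int(mark) for mark in fields[3:8]]

# Column 8 holds the total marks.
total_marks = int(fields[8])

print(region_code, roll_number, name)
print(subject_marks, total_marks)
```

Note that English sits at index 3, counting from zero, which is the column number used when loading the English marks.
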
Now let's try and find the mean of English marks of all students.

For this we do ::

    L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,),delimiter=';')
    L
    mean(L)

The loadtxt function loads data from an external file. delimiter
specifies the character that separates the fields of data.
usecols specifies the columns to be used, here (3,). The comma is
needed because usecols takes a sequence, and (3,) is a tuple with a
single element.

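For readers who do not have the data file at hand, the same call can be tried on a few records held in memory. In this sketch the first record is the sample line from the text, while the other two records are hypothetical and made up purely for illustration. NumPy is assumed to be imported as ``np``, and ``io.StringIO`` stands in for the file.

```python
import io
import numpy as np

# First record: sample line from the tutorial.
# Second and third records: hypothetical, for illustration only.
records = io.StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "A;000001;STUDENT ONE;070;055;60;45;68;298;;;\n"
    "B;000002;STUDENT TWO;091;062;75;58;80;366;;;\n"
)

# Only column 3 (English) is read, so the text columns are never
# converted to numbers.
english = np.loadtxt(records, usecols=(3,), delimiter=";")

print(english)
print(np.mean(english))
```
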
To get the median marks ::

    median(L)

and the standard deviation ::

    std(L)

Now let's try and get the mean for all the subjects ::

    L=loadtxt('/home/fossee/sslc2.txt',usecols=(3,4,5,6,7),delimiter=';')
    mean(L,0)
    array([ 73.55452504, 53.79828941, 62.83342759, 50.69806158, 63.17056881])

The sequence returned by mean(L,0) holds, for each of the five
subjects, the mean marks of all the students who took the exam,

and ::

    mean(L,1)

gives the average marks of each individual student across the five
subjects. mean(L,0) averages down each column, giving one value per
subject, while mean(L,1) averages along each row, giving one value per
student.

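One way to convince ourselves which axis gives which result is to cross-check against the total marks column. The sketch below again uses in-memory records: the first is the sample line from the text and the other two are hypothetical, made up for illustration. NumPy is assumed to be imported as ``np``, and all variable names are illustrative.

```python
import io
import numpy as np

# First record: sample line from the tutorial.
# Second and third records: hypothetical, for illustration only.
text = (
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "A;000001;STUDENT ONE;070;055;60;45;68;298;;;\n"
    "B;000002;STUDENT TWO;091;062;75;58;80;366;;;\n"
)

marks = np.loadtxt(io.StringIO(text), usecols=(3, 4, 5, 6, 7), delimiter=";")
totals = np.loadtxt(io.StringIO(text), usecols=(8,), delimiter=";")

subject_means = np.mean(marks, axis=0)  # one value per subject
student_means = np.mean(marks, axis=1)  # one value per student

print(marks.shape, subject_means.shape, student_means.shape)

# A student's mean over five subjects, times five, should equal the total.
print(np.allclose(student_means * 5, totals))
```
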
{{{ Show summary slide }}}

This brings us to the end of the tutorial.
We have learnt

* How to do the standard statistical operations sum, mean,
  median and standard deviation in Python.
* How to combine text loading and the statistical operations to solve
  real world problems.

{{{ Show the "sponsored by FOSSEE" slide }}}

This tutorial was created as a part of the FOSSEE project, NME ICT, MHRD India.

Hope you have enjoyed it and found it useful.
Thank you

.. Author              : Amit Sethi
   Internal Reviewer 1 :
   Internal Reviewer 2 :
   External Reviewer   :


author	Madhusudan.C.S <madhusudancs@gmail.com>
	Thu, 23 Sep 2010 16:42:47 +0530
changeset 207	2f30ecfd6007
parent 177	eb5dd4c7c5be
permissions	-rw-r--r--