statistics-script
author Shantanu <shantanu@fossee.in>
Sun, 11 Apr 2010 01:56:59 +0530
changeset 38 f248e91b1510
Added changes to 3.1 session script.

Hello friends and welcome to the third tutorial in the series of tutorials on "Python for scientific computing."

In the previous tutorial we learnt how to read data from a file and plot the data.
We used 'for' loops and lists to get the data in the desired format.
IPython with -pylab also provides a function, 'loadtxt', which can get us the data without much hassle.

We know that pendulum.txt contains two columns, with length as the first column and time period as the second. So to get both columns into two separate variables we type

l, t = loadtxt('pendulum.txt', unpack=True)

unpack=True gives us the whole first column (length) in l and the whole second column (time period) in t.
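
As a minimal sketch of what unpack=True does, the snippet below writes a tiny two-column file and loads it both ways. The file name 'sample_pendulum.txt' and its three rows are made-up values for illustration, not the real pendulum.txt, and numpy is imported explicitly (an assumption, so the snippet also runs outside an IPython -pylab session).

```python
import os
import tempfile

import numpy as np

# Made-up sample data: length in the first column, time period in the second.
sample = "0.10 0.69\n0.20 0.94\n0.30 1.12\n"

# Write the sample to a temporary file (hypothetical name, not pendulum.txt).
path = os.path.join(tempfile.mkdtemp(), "sample_pendulum.txt")
with open(path, "w") as f:
    f.write(sample)

# Without unpack, loadtxt returns one 2-D array with a row per line.
data = np.loadtxt(path)
print(data.shape)   # (3, 2)

# With unpack=True the array is transposed, so each column
# can be assigned to its own variable.
l, t = np.loadtxt(path, unpack=True)
print(l)            # the three lengths
print(t)            # the three time periods
```

In other words, unpack=True saves us from slicing the columns out of the 2-D array ourselves.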

To get more help, type

loadtxt?

This is a really powerful tool for loading data directly from files which are well structured and formatted. It supports many features, like getting particular columns.

Now to get the squared values of t we can simply do

tsq = t*t

and we don't have to use a 'for' loop anymore. This is the benefit of arrays. If we try to do something similar with lists, we can't escape a 'for' loop.
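
To make the contrast concrete, here is a small sketch that squares the same values once as a list and once as an array. The three time periods are made-up numbers, and numpy is imported explicitly as an assumption so that the snippet runs outside the -pylab session.

```python
import numpy as np

# Made-up time periods, for illustration only.
t_list = [0.69, 0.94, 1.12]
t = np.array(t_list)

# With a list we need an explicit loop (here written as a list comprehension).
tsq_list = [value * value for value in t_list]

# With an array the multiplication is applied element by element.
tsq = t * t

print(tsq_list)
print(tsq)

# Note: t_list * t_list would raise a TypeError,
# because two lists cannot be multiplied together.
```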

Now plotting l vs tsq is the same as we did in the previous session

plot(l, tsq, 'o')

The basic equation for the time period of a simple pendulum is:

T = 2*pi*sqrt(L/g)

In our case we have t and l already, so to find the value of g for each element we can simply use:

g = 4*pi**2*l/t**2

g is an array with 90 elements, so we take the average of all these values to get the acceleration due to gravity ('g') by

print mean(g)

mean is again provided by the pylab module; it calculates the average of a given set of values.
There are other handy statistical functions available, such as median and std (for standard deviation).
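
The whole calculation can be sketched end to end as follows. The lengths and time periods are made-up values rather than the contents of pendulum.txt, and the functions are called through an explicit numpy import (an assumption; in the -pylab session the same names are available without the np. prefix).

```python
import numpy as np

# Made-up lengths and time periods, not the real pendulum.txt data.
l = np.array([0.10, 0.20, 0.30])
t = np.array([0.69, 0.94, 1.12])

# Element-wise rearrangement of T = 2*pi*sqrt(L/g).
# In Python, powers are written with **.
g = 4 * np.pi**2 * l / t**2

print(np.mean(g))     # average of the g values
print(np.median(g))   # middle value
print(np.std(g))      # standard deviation
```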

So in this small session we have covered a 'better' way of loading data from text files.
We also saw why arrays are a better choice than lists in some cases, and how they help with some mathematical operations.

Thank you!
-----------------------------------------------------------------------------------------------------------
In this tutorial we shall learn how to compute statistics using Python.
We shall also learn how to represent data in the form of pie charts.

Let us start with the most basic need in statistics, the mean.

We shall calculate the mean acceleration due to gravity using the same 'pendulum.txt' that we used in the previous session.

As we know, 'pendulum.txt' contains two values in each line, the first being the length of the pendulum and the second the time period.
To calculate the acceleration due to gravity from these values, we shall use the expression T = 2*pi*sqrt(L/g).
Re-arranging this equation, we get g = 4*pi**2*L/T**2.

We shall calculate the value of g for each pair of L and t and then calculate the mean of all those g values.

## if we do loadtxt and numpy arrays then this part will change
	First we need something to store each value of g that we are going to compute.
	So we start by initialising an empty list called 'g_list'.

	In []: g_list = []

	Now we read each line from the file 'pendulum.txt', calculate the g value for that pair of L and t, and then append the computed g to our 'g_list'.

	In []: for line in open('pendulum.txt'):
	  ....     point = line.split()
	  ....     L = float(point[0])
	  ....     t = float(point[1])
	  ....     g = 4 * pi * pi * L / (t * t)
	  ....     g_list.append(g)

	The first four lines of this code should be familiar. We read the file and store the values.
	The fifth line, where we do g equals 4 star pi star and so on, is the line which calculates g for each pair of L and t values from the file. The last line simply stores the computed g value; in technical terms, it appends the computed value to g_list.

	Let us type this code in and see what g_list contains.
###############################
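
For anyone who wants to try the loop without the original data file, here is a self-contained sketch. It first writes a small stand-in file with made-up values (the name 'sample_pendulum.txt' is hypothetical) and then runs the same steps on it, with pi imported from the math module instead of relying on the -pylab session.

```python
import os
import tempfile
from math import pi

# Made-up sample file standing in for pendulum.txt (hypothetical values).
path = os.path.join(tempfile.mkdtemp(), "sample_pendulum.txt")
with open(path, "w") as f:
    f.write("0.10 0.69\n0.20 0.94\n0.30 1.12\n")

g_list = []                      # must exist before we can append to it
with open(path) as f:
    for line in f:
        point = line.split()     # e.g. ['0.10', '0.69']
        L = float(point[0])      # length
        t = float(point[1])      # time period
        g = 4 * pi * pi * L / (t * t)
        g_list.append(g)

print(len(g_list))   # 3, one g value per line of the file
print(g_list)
```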

Each value in g_list is the g value computed from a pair of L and t values.

Now we have all the values for g. We must find the mean of these values, that is, the sum of all these values divided by the total number of values.

The number of values can be found using len(g_list).

So we are left with the problem of finding the sum.
We shall create a variable, loop over the list and add each g value to that variable.
Let's call it total.

In []: total = 0 
In []: for g in g_list:
 ....:     total += g
 ....:

So at the end of this piece of code we will have the sum of all the g values in the variable total.

Now calculating mean of g is as simple as doing total divided by len(g_list)

In []: g_mean = total / len(g_list)
In []: print 'Mean: ', g_mean
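
Put together as a plain script, the steps above might look like this. The three g values are made up for illustration, and print is written with parentheses and a single formatted string so that the snippet runs under both Python 2 and Python 3.

```python
# Made-up g values, for illustration only.
g_list = [9.7, 9.8, 9.9]

total = 0.0                 # start from a float
for g in g_list:
    total += g              # running sum

g_mean = total / len(g_list)
print("Mean: %s" % g_mean)
```

Starting total at 0.0 rather than 0 is a small design choice: under Python 2, dividing one integer by another truncates the result, so a float total keeps the mean correct even if every value in the list happened to be an integer.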

As we can see, we had to write a loop to do a very simple thing: finding the sum of a list of values.
Python has a built-in function called sum to ease things.

sum takes a list of values and returns the sum of those values.
Now calculating the mean is much simpler.
We don't have to write any 'for' loop.
We can directly use g_mean = sum(g_list) / len(g_list)
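
A short sketch of the same calculation with the built-in sum, using made-up g values. The result is stored in g_mean rather than in a variable called mean, on the assumption that the session was started with -pylab, where a variable named mean would hide the function of that name.

```python
# Made-up g values, for illustration only.
g_list = [9.7, 9.8, 9.9]

# sum adds up the values; len counts them.
g_mean = sum(g_list) / len(g_list)
print("Mean: %s" % g_mean)
```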

Still, calculating the mean needs writing an expression.
What if we had a built-in for calculating the mean directly?
We do, and it is available through the pylab library.

Now the job of calculating mean is just a function away.
Call mean(g_list) directly and it gives you the mean of values in g_list.
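
As a quick check that the two approaches agree, the sketch below compares the hand-written expression with the library function. It assumes that the mean available in the -pylab session is the one provided by numpy, and calls it through an explicit numpy import; the g values are made up.

```python
import numpy as np

# Made-up g values, for illustration only.
g_list = [9.7, 9.8, 9.9]

by_hand = sum(g_list) / len(g_list)
by_numpy = np.mean(g_list)    # accepts a plain list as well as an array

# The two results agree up to floating-point rounding.
print(abs(by_hand - by_numpy) < 1e-12)   # True
```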

Isn't that sweet? Yes, and that is why I use Python.