SciPy – Stats

All of the statistics functions are located in the sub-package scipy.stats and a fairly complete listing of these functions can be obtained using info(stats) function. A list of random variables available can also be obtained from the docstring for the stats sub-package. This module contains a large number of probability distributions as well as a growing library of statistical functions.

Each univariate distribution has its own subclass as described in the following table −

Sr. No.Class & Description
1rv_continuousA generic continuous random variable class meant for subclassing
2rv_discreteA generic discrete random variable class meant for subclassing
3rv_histogramGenerates a distribution given by a histogram

Normal Continuous Random Variable

A probability distribution in which the random variable X can take any value is continuous random variable. The location (loc) keyword specifies the mean. The scale (scale) keyword specifies the standard deviation.

As an instance of the rv_continuous class, norm object inherits from it a collection of generic methods and completes them with details specific for this particular distribution.

To compute the CDF at a number of points, we can pass a list or a NumPy array. Let us consider the following example.

from scipy.stats import norm
import numpy as np
print norm.cdf(np.array([1,-1., 0, 1, 3, 4, -2, 6]))

The above program will generate the following output.

array([ 0.84134475, 0.15865525, 0.5 , 0.84134475, 0.9986501 ,
0.99996833, 0.02275013, 1. ])

To find the median of a distribution, we can use the Percent Point Function (PPF), which is the inverse of the CDF. Let us understand by using the following example.

from scipy.stats import norm
print norm.ppf(0.5)

The above program will generate the following output.

0.0

To generate a sequence of random variates, we should use the size keyword argument, which is shown in the following example.

from scipy.stats import norm
print norm.rvs(size = 5)

The above program will generate the following output.

array([ 0.20929928, -1.91049255, 0.41264672, -0.7135557 , -0.03833048])

The above output is not reproducible. To generate the same random numbers, use the seed function.

Uniform Distribution

A uniform distribution can be generated using the uniform function. Let us consider the following example.

from scipy.stats import uniform
print uniform.cdf([0, 1, 2, 3, 4, 5], loc = 1, scale = 4)

The above program will generate the following output.

array([ 0. , 0. , 0.25, 0.5 , 0.75, 1. ])

Build Discrete Distribution

Let us generate a random sample and compare the observed frequencies with the probabilities.

Binomial Distribution

As an instance of the rv_discrete class, the binom object inherits from it a collection of generic methods and completes them with details specific for this particular distribution. Let us consider the following example.

from scipy.stats import uniform
print uniform.cdf([0, 1, 2, 3, 4, 5], loc = 1, scale = 4)

The above program will generate the following output.

array([ 0. , 0. , 0.25, 0.5 , 0.75, 1. ])

Descriptive Statistics

The basic stats such as Min, Max, Mean and Variance takes the NumPy array as input and returns the respective results. A few basic statistical functions available in the scipy.stats package are described in the following table.

Sr. No.Function & Description
1describe()Computes several descriptive statistics of the passed array
2gmean()Computes geometric mean along the specified axis
3hmean()Calculates the harmonic mean along the specified axis
4kurtosis()Computes the kurtosis
5mode()Returns the modal value
6skew()Tests the skewness of the data
7f_oneway()Performs a 1-way ANOVA
8iqr()Computes the interquartile range of the data along the specified axis
9zscore()Calculates the z score of each value in the sample, relative to the sample mean and standard deviation
10sem()Calculates the standard error of the mean (or standard error of measurement) of the values in the input array

Several of these functions have a similar version in the scipy.stats.mstats, which work for masked arrays. Let us understand this with the example given below.

from scipy import stats
import numpy as np
x = np.array([1,2,3,4,5,6,7,8,9])
print x.max(),x.min(),x.mean(),x.var()

The above program will generate the following output.

(9, 1, 5.0, 6.666666666666667)

T-test

Let us understand how T-test is useful in SciPy.

ttest_1samp

Calculates the T-test for the mean of ONE group of scores. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations ‘a’ is equal to the given population mean, popmean. Let us consider the following example.

from scipy import stats
rvs = stats.norm.rvs(loc = 5, scale = 10, size = (50,2))
print stats.ttest_1samp(rvs,5.0)

The above program will generate the following output.

Ttest_1sampResult(statistic = array([-1.40184894, 2.70158009]),
pvalue = array([ 0.16726344, 0.00945234]))

Comparing two samples

In the following examples, there are two samples, which can come either from the same or from different distribution, and we want to test whether these samples have the same statistical properties.

ttest_ind − Calculates the T-test for the means of two independent samples of scores. This is a two-sided test for the null hypothesis that two independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.

We can use this test, if we observe two independent samples from the same or different population. Let us consider the following example.

from scipy import stats
rvs1 = stats.norm.rvs(loc = 5,scale = 10,size = 500)
rvs2 = stats.norm.rvs(loc = 5,scale = 10,size = 500)
print stats.ttest_ind(rvs1,rvs2)

The above program will generate the following output.

Ttest_indResult(statistic = -0.67406312233650278, pvalue = 0.50042727502272966)

You can test the same with a new array of the same length, but with a varied mean. Use a different value in loc and test the same.

Leave a Reply