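Statistical Analysis with Python (the scipy.stats documentation)
Keywords: Python statistical analysis, Python data analysis, Python data mining, python scipy.stats
This document covers the topics listed below and may interest anyone who wants to do statistical analysis in Python; Python's libraries are, after all, somewhat scattered. Functionality that looks as if it belongs together is spread across scipy, pandas, sympy, and other libraries. What follows covers general-purpose statistics, which lives in scipy. Topics such as time series are handled in other libraries, and those libraries in turn lack the features described here.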
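Drawing samples from random variables
84 continuous distributions (counted here, not described individually)
12 discrete distributions
Distribution functions: probability density function, cumulative distribution function, survival function, percent point (quantile) function, inverse survival function
Distribution statistics: mean, variance, kurtosis, skewness, moments
Shifting and scaling distributions (linear transformations)
Fitting distributions to data
Building distributions
Descriptive statistics
t-test, KS test, chi-square test, normality tests, tests for identical distribution
Kernel density estimation (estimating the probability density function from a sample)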
Statistics (scipy.stats)
Introduction
In this tutorial we discuss many, but certainly not all, features of scipy.stats. The intention here is to provide the user with a working knowledge of this package. We refer to the reference manual for further details.
Note: This documentation is a work in progress.
Random Variables
There are two general distribution classes that have been implemented for encapsulating continuous random variables and discrete random variables. Over 80 continuous random variables (RVs) and 10 discrete random variables have been implemented using these classes. Besides this, new routines and distributions can easily be added by the end user. (If you create one, please contribute it.)
All of the statistics functions are located in the sub-package scipy.stats, and a fairly complete listing of these functions can be obtained using info(stats). The list of the random variables available can also be obtained from the docstring for the stats sub-package.
In the discussion below we mostly focus on continuous RVs. Nearly everything also applies to discrete variables, but we point out some differences in Specific Points for Discrete Distributions.
Getting Help
First of all, all distributions are accompanied by help functions. To obtain just some basic information, we can call
>>> from scipy import stats
>>> from scipy.stats import norm
>>> print(norm.__doc__)
To find the support, i.e., the upper and lower bounds of the distribution, call:
>>> print('bounds of distribution lower: %s, upper: %s' % (norm.a, norm.b))
bounds of distribution lower: -inf, upper: inf
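Besides the norm.a and norm.b attributes, newer SciPy releases expose the bounds through a support() method. The sketch below assumes such a release; the method does not appear in the original tutorial.

```python
import numpy as np
from scipy.stats import expon, norm

# Support of the standard normal: the whole real line.
n_lower, n_upper = norm.support()
print(n_lower, n_upper)

# Support of the standard exponential: the non-negative half line.
e_lower, e_upper = expon.support()
print(e_lower, e_upper)
```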
We can list all methods and properties of the distribution with dir(norm). As it turns out, some of the methods are private, although they are not named as such (their names do not start with a leading underscore). For example, veccdf is only available for internal calculation; such methods give warnings when one tries to use them and will be removed at some point.
To obtain the real main methods, we list the methods of the frozen distribution. (We explain the meaning of a frozen distribution below.)
>>> rv = norm()
>>> dir(rv) # reformatted
['__class__', '__delattr__', '__dict__', '__doc__', '__getattribute__',
'__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__str__', '__weakref__', 'args', 'cdf', 'dist',
'entropy', 'isf', 'kwds', 'moment', 'pdf', 'pmf', 'ppf', 'rvs', 'sf', 'stats']
Finally, we can obtain the list of available distributions through introspection:
>>> import warnings
>>> warnings.simplefilter('ignore', DeprecationWarning)
>>> dist_continu = [d for d in dir(stats) if
...                 isinstance(getattr(stats, d), stats.rv_continuous)]
>>> dist_discrete = [d for d in dir(stats) if
...                  isinstance(getattr(stats, d), stats.rv_discrete)]
>>> print('number of continuous distributions:', len(dist_continu))
number of continuous distributions: 84
>>> print('number of discrete distributions:', len(dist_discrete))
number of discrete distributions: 12
Common Methods
The main public methods for continuous RVs are:
rvs: Random Variates
pdf: Probability Density Function
cdf: Cumulative Distribution Function
sf: Survival Function (1-CDF)
ppf: Percent Point Function (Inverse of CDF)
isf: Inverse Survival Function (Inverse of SF)
stats: Return mean, variance, (Fisher's) skew, or (Fisher's) kurtosis
moment: non-central moments of the distribution
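The relationships stated in the list (sf as 1-CDF, ppf and isf as inverses) can be checked numerically. This is a minimal sketch on the standard normal; the evaluation points are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import norm

x = np.array([-1.0, 0.0, 1.5])
q = np.array([0.05, 0.5, 0.975])

# sf is the complement of cdf.
sf_matches = np.allclose(norm.sf(x), 1.0 - norm.cdf(x))
# ppf inverts cdf, and isf inverts sf.
ppf_roundtrip = np.allclose(norm.cdf(norm.ppf(q)), q)
isf_roundtrip = np.allclose(norm.sf(norm.isf(q)), q)
# isf(q) is the same point as ppf(1 - q).
isf_is_ppf = np.allclose(norm.isf(q), norm.ppf(1.0 - q))

print(sf_matches, ppf_roundtrip, isf_roundtrip, isf_is_ppf)
```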
Let's take a normal RV as an example.
>>> norm.cdf(0)
0.5
To compute the cdf at a number of points, we can pass a list or a numpy array.
>>> norm.cdf([-1., 0, 1])
array([ 0.15865525,  0.5       ,  0.84134475])
>>> import numpy as np
>>> norm.cdf(np.array([-1., 0, 1]))
array([ 0.15865525,  0.5       ,  0.84134475])
Thus, the basic methods such as pdf, cdf, and so on are vectorized with np.vectorize.
Other generally useful methods are supported too:
>>> norm.mean(), norm.std(), norm.var()
(0.0, 1.0, 1.0)
>>> norm.stats(moments="mv")
(array(0.0), array(1.0))
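The moments string can request more than the mean and variance. The following sketch asks for all four statistics of the standard normal and also calls moment for two non-central moments; the choice of orders 2 and 4 is only for illustration.

```python
from scipy.stats import norm

# "mvsk" requests mean, variance, skewness and (excess) kurtosis.
mean, var, skew, kurt = norm.stats(moments="mvsk")

# moment(n) returns the n-th non-central moment E[X**n].
second_moment = norm.moment(2)
fourth_moment = norm.moment(4)

print(float(mean), float(var), float(skew), float(kurt))
print(second_moment, fourth_moment)
```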
To find the median of a distribution, we can use the percent point function ppf, which is the inverse of the cdf:
>>> norm.ppf(0.5)
0.0
To generate a set of random variates:
>>> norm.rvs(size=5)
array([-0.35687759,  1.34347647, -0.11710531, -1.00725181, -0.51275702])
Don't think that norm.rvs(5) generates 5 variates:
>>> norm.rvs(5)
7.131624370075814
This brings us, in fact, to the topic of the next subsection.
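Since rvs returns different numbers on every call, it can help to seed the draw. The sketch below assumes a SciPy version in which rvs accepts the random_state keyword; the seed values are arbitrary.

```python
import numpy as np
from scipy.stats import norm

# size controls how many variates are drawn; random_state seeds the draw.
sample_a = norm.rvs(size=5, random_state=12345)
sample_b = norm.rvs(size=5, random_state=12345)

# The same seed gives the same sample.
same_sample = np.array_equal(sample_a, sample_b)
print(same_sample)

# size may also be a tuple, giving an array of that shape.
grid = norm.rvs(size=(2, 3), random_state=0)
print(grid.shape)
```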
Shifting and Scaling
All continuous distributions take loc and scale as keyword parameters to adjust the location and scale of the distribution; e.g., for the normal distribution the location is the mean and the scale is the standard deviation.
>>> norm.stats(loc=3, scale=4, moments="mv")
(array(3.0), array(16.0))
In general, the standardized distribution for a random variable X is obtained through the transformation (X - loc) / scale. The default values are loc = 0 and scale = 1.
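The standardization rule can be verified directly. In this sketch (loc, scale and the evaluation points are illustrative values), the cdf at x with loc and scale matches the standard cdf at (x - loc) / scale, while the density picks up an additional factor 1/scale.

```python
import numpy as np
from scipy.stats import norm

loc, scale = 3.0, 4.0
x = np.array([-1.0, 3.0, 7.0, 11.0])
z = (x - loc) / scale  # standardized points

# The cdf of the shifted and scaled variable equals the standard cdf at z.
cdf_direct = norm.cdf(x, loc=loc, scale=scale)
cdf_standardized = norm.cdf(z)

# The density picks up an extra factor 1/scale from the change of variable.
pdf_direct = norm.pdf(x, loc=loc, scale=scale)
pdf_standardized = norm.pdf(z) / scale

print(np.allclose(cdf_direct, cdf_standardized))
print(np.allclose(pdf_direct, pdf_standardized))
```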
Smart use of loc and scale can help modify the standard distributions in many ways. To illustrate the scaling further, the cdf of an exponentially distributed RV with mean 1/λ is given by
F(x) = 1 − exp(−λx)
By applying the scaling rule above, it can be seen that taking scale = 1./lambda gives the proper scale.
>>> from scipy.stats import expon
>>> expon.mean(scale=3.)
3.0
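As a quick check of the scale = 1/λ rule, the sketch below compares expon.cdf with the closed-form expression F(x) = 1 − exp(−λx); the rate lam is an arbitrary illustrative value.

```python
import numpy as np
from scipy.stats import expon

lam = 0.5  # rate parameter (illustrative value)
x = np.linspace(0.0, 10.0, 6)

# scipy parametrizes the exponential by scale = 1 / lambda.
cdf_scipy = expon.cdf(x, scale=1.0 / lam)
cdf_formula = 1.0 - np.exp(-lam * x)

print(np.allclose(cdf_scipy, cdf_formula))
print(expon.mean(scale=1.0 / lam))
```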
The uniform distribution is also interesting:
>>> from scipy.stats import uniform
>>> uniform.cdf([0, 1, 2, 3, 4, 5], loc=1, scale=4)
array([ 0.  ,  0.  ,  0.25,  0.5 ,  0.75,  1.  ])
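The output above reflects that uniform is defined on the interval [loc, loc + scale]. A minimal sketch of converting interval endpoints into loc and scale follows; the names a and b are introduced here purely for illustration.

```python
import numpy as np
from scipy.stats import uniform

a, b = 1.0, 5.0          # desired interval endpoints (illustrative)
loc, scale = a, b - a    # uniform covers [loc, loc + scale]

cdf_values = uniform.cdf([0, 1, 2, 3, 4, 5], loc=loc, scale=scale)
mean = uniform.mean(loc=loc, scale=scale)
var = uniform.var(loc=loc, scale=scale)

print(cdf_values)
print(mean, var)
```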
Finally, recall from the previous paragraph that we are left with the problem of the meaning of norm.rvs(5). As it turns out, when a distribution is called like this, the first argument, i.e., the 5, is passed as the loc parameter. Let's see:
>>> np.mean(norm.rvs(5, size=500))
4.983550784784704
Thus, to explain the output of the example in the last section: norm.rvs(5) generates a single normally distributed random variate with mean loc=5.
I prefer to set the loc and scale parameters explicitly, by passing the values as keywords rather than as positional arguments. This is less of a hassle than it may seem. We clarify this below when we explain the topic of freezing a RV.
Shape Parameters
While a general continuous random variable can be shifted and scaled with the loc and scale parameters, some distributions require additional shape parameters. For instance, the gamma distribution, with density
$$\gamma(x, a) = \frac{\lambda (\lambda x)^{a-1}}{\Gamma(a)} e^{-\lambda x},$$
requires the shape parameter a. Observe that setting λ can be obtained by setting the scale keyword to 1/λ.
Let's check the number and name of the shape parameters of the gamma distribution. (We know from the above that this should be 1.)
>>> from scipy.stats import gamma
>>> gamma.numargs
1
>>> gamma.shapes
'a'
Now we set the value of the shape variable to 1 to obtain the exponential distribution, so that we can easily check whether we get the results we expect.
>>> gamma(1, scale=2.).stats(moments="mv")
(array(2.0), array(4.0))
Notice that we can also specify shape parameters as keywords:
>>> gamma(a=1, scale=2.).stats(moments="mv")
(array(2.0), array(4.0))
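To tie the gamma density formula to the scipy parametrization, the following sketch evaluates the formula by hand and compares it with gamma.pdf using scale = 1/λ. The shape a and rate lam are arbitrary illustrative values, and scipy.special.gamma supplies Γ(a).

```python
import numpy as np
from scipy.special import gamma as gamma_function
from scipy.stats import expon, gamma

x = np.linspace(0.1, 8.0, 5)
a, lam = 2.5, 0.5  # illustrative shape and rate

# Density written out from the formula, with rate lambda.
pdf_formula = lam * (lam * x) ** (a - 1) * np.exp(-lam * x) / gamma_function(a)
# The same density from scipy, using scale = 1 / lambda.
pdf_scipy = gamma.pdf(x, a, scale=1.0 / lam)
print(np.allclose(pdf_formula, pdf_scipy))

# With a = 1 the gamma distribution reduces to the exponential.
reduces_to_expon = np.allclose(gamma.pdf(x, 1, scale=2.0),
                               expon.pdf(x, scale=2.0))
print(reduces_to_expon)
```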
Freezing a Distribution
Passing the loc and scale keywords time and again can become quite bothersome. The concept of freezing a RV is used to solve such problems.
>>> rv = gamma(1, scale=2.)
By using rv we no longer have to include the scale or the shape parameters. Thus, distributions can be used in one of two ways, either by passing all distribution parameters to each method call (as we did earlier) or by freezing the parameters for the instance of the distribution. Let us check this:
>>> rv.mean(), rv.std()
(2.0, 2.0)
This is indeed what we should get.
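The two calling styles give identical results. This sketch compares them on a few arbitrary points and inspects the args and kwds attributes (both appear in the dir(rv) listing earlier) where the frozen object stores its parameters.

```python
import numpy as np
from scipy.stats import gamma

x = np.array([0.5, 1.0, 2.0, 4.0])

# Parameters passed on every call...
cdf_unfrozen = gamma.cdf(x, 1, scale=2.0)
# ...or fixed once in a frozen distribution.
rv = gamma(1, scale=2.0)
cdf_frozen = rv.cdf(x)

print(np.allclose(cdf_unfrozen, cdf_frozen))
# The frozen object remembers its shape and keyword parameters.
print(rv.args, rv.kwds)
```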
Broadcasting
The basic methods pdf and so on satisfy the usual numpy broadcasting rules. For example, we can calculate the critical values for the upper tail of the t distribution for different probabilities and degrees of freedom.
>>> stats.t.isf([0.1, 0.05, 0.01], [[10], [11]])
array([[ 1.37218364,  1.81246112,  2.76376946],
       [ 1.36343032,  1.79588482,  2.71807918]])
Here, the first row contains the critical values for 10 degrees of freedom (d.o.f.) and the second row those for 11 d.o.f. Thus, the broadcasting rules give the same result as calling isf twice:
>>> stats.t.isf([0.1, 0.05, 0.01], 10)
array([ 1.37218364,  1.81246112,  2.76376946])
>>> stats.t.isf([0.1, 0.05, 0.01], 11)
array([ 1.36343032,  1.79588482,  2.71807918])
If the array with probabilities, i.e., [0.1, 0.05, 0.01], and the array of degrees of freedom, i.e., [10, 11, 12], have the same array shape, then element-wise matching is used. As an example, we can obtain the 10% tail for 10 d.o.f., the 5% tail for 11 d.o.f. and the 1% tail for 12 d.o.f. by calling
>>> stats.t.isf([0.1, 0.05, 0.01], [10, 11, 12])
array([ 1.37218364,  1.79588482,  2.68099799])
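The two broadcasting cases are related: the element-wise result is the diagonal of the full table. A small sketch makes the shapes explicit, using np.newaxis to turn the degrees of freedom into a column.

```python
import numpy as np
from scipy import stats

probs = np.array([0.1, 0.05, 0.01])   # shape (3,)
dofs = np.array([10, 11, 12])         # shape (3,)

# Turning dofs into a column of shape (3, 1) broadcasts against probs
# to give a (3, 3) table: one row per d.o.f., one column per probability.
table = stats.t.isf(probs, dofs[:, np.newaxis])
print(table.shape)

# With equal shapes the arguments are matched element by element,
# which picks out the diagonal of that table.
paired = stats.t.isf(probs, dofs)
print(np.allclose(paired, np.diag(table)))
```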
Specific Points for Discrete Distributions
Discrete distributions have mostly the same basic methods as the continuous distributions. However, pdf is replaced by the probability mass function pmf, no estimation methods, such as fit, are available, and scale is not a valid keyword parameter. The location parameter, keyword loc, can still be used to shift the distribution.
The computation of the cdf requires some extra attention. In the case of a continuous distribution, the cumulative distribution function is, in most standard cases, strictly monotonically increasing within the bounds (a, b) and therefore has a unique inverse. The cdf of a discrete distribution, however, is a step function; hence the inverse cdf, i.e., the percent point function, requires a different definition:
ppf(q) = min{x : cdf(x) >= q, x integer}
For further info, see the docs here.
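The definition of ppf for discrete distributions can be implemented literally and compared with the library result. The sketch below does this for a hypergeometric distribution; the helper ppf_by_definition is a hypothetical name introduced here, and the probabilities in q are chosen away from the steps of the cdf to avoid floating-point ties.

```python
import numpy as np
from scipy.stats import hypergeom

M, n, N = 20, 7, 12
support = np.arange(0, n + 1)            # possible outcomes 0..7 here
cdf_values = hypergeom.cdf(support, M, n, N)

def ppf_by_definition(q):
    """Smallest integer x in the support with cdf(x) >= q."""
    # side="left" returns the first index whose cdf value is >= q.
    return support[np.searchsorted(cdf_values, q, side="left")]

q = np.array([0.1, 0.5, 0.9])
by_definition = ppf_by_definition(q)
by_scipy = hypergeom.ppf(q, M, n, N)
print(by_definition)
print(by_scipy)
```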
We can look at the hypergeometric distribution as an example
>>> from scipy.stats import hypergeom
>>> [M, n, N] = [20, 7, 12]
If we use the cdf at some integer points and then evaluate the ppf at those cdf values, we get the initial integers back, for example
>>> x = np.arange(4)*2
>>> x
array([0, 2, 4, 6])
>>> prb = hypergeom.cdf(x, M, n, N)
>>> prb
array([ 0.0001031991744066,  0.0521155830753351,  0.6083591331269301,
        0.9897832817337386])
>>> hypergeom.ppf(prb, M, n, N)
array([ 0.,  2.,  4.,  6.])
If we use values that are not at the kinks of the cdf step function, we get the next higher integer back:
>>> hypergeom.ppf(prb + 1e-8, M, n, N)
array([ 1.,  3.,  5.,  7.])
>>> hypergeom.ppf(prb - 1e-8, M, n, N)
array([ 0.,  2.,  4.,  6.])
Fitting Distributions
The main additional methods of the not frozen distribution are related to the estimation of distribution parameters:
fit: maximum likelihood estimation of distribution parameters, including location and scale
fit_loc_scale: estimation of location and scale when shape parameters are given
nnlf: negative log likelihood function
expect: calculate the expectation of a function against the pdf or pmf
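A minimal sketch of fit, nnlf and expect on simulated normal data follows. The true parameters, sample size and seed are illustrative assumptions, and random_state is assumed to be accepted by rvs in the installed SciPy version.

```python
import numpy as np
from scipy.stats import norm

# Simulated data with known parameters (illustrative values).
data = norm.rvs(loc=3.0, scale=2.0, size=2000, random_state=0)

# fit returns the estimates in the order (shape parameters..., loc, scale);
# the normal distribution has no shape parameters.
loc_hat, scale_hat = norm.fit(data)
print(loc_hat, scale_hat)

# nnlf evaluates the negative log-likelihood at a parameter vector.
nll_at_fit = norm.nnlf((loc_hat, scale_hat), data)
nll_elsewhere = norm.nnlf((0.0, 1.0), data)
print(nll_at_fit < nll_elsewhere)

# expect integrates a function against the pdf; here E[X**2] for N(0, 1).
second_moment = norm.expect(lambda x: x**2)
print(second_moment)
```

The estimates should land close to the true values 3 and 2, and the negative log-likelihood at the fitted parameters should be lower than at an arbitrary other parameter vector.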
Performance Issues and Cautionary Remarks
The performance of the individual methods, in terms of speed, varies widely by distribution and method. The results of a method are obtained in one of two ways: either by explicit calculation, or by a generic algorithm that is independent of the specific distribution.
Explicit calculation, on the one hand, requires that the method is directly specified for the given distribution, either through analytic formulas or through special functions in scipy.special or numpy.random for rvs. These are usually relatively fast calculations.
The generic methods, on the other hand, are used if the distribution does not specify any explicit calculation. To define a distribution, only one of pdf or cdf is necessary; all other methods can be derived using numeric integration and root finding. However, these indirect methods can be very slow. As an example, rgh = stats.gausshyper.rvs(0.5, 2, 2, 2, size=100) creates random variables in a very indirect way and takes about 19 seconds for 100 random variables on my computer (translator's note: it took 3.5 seconds on mine), while one million random variables from the standard normal or from the t distribution take just above one second.
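The gap between explicit and generic calculation can be observed on a small scale. In this sketch, pdf_only_normal_gen is a hypothetical class that defines only _pdf, so its cdf falls back on numerical integration; the timings printed depend entirely on the machine and are not meant as benchmarks.

```python
import time

import numpy as np
from scipy import stats

class pdf_only_normal_gen(stats.rv_continuous):
    """Standard normal defined by its pdf alone (hypothetical example class)."""

    def _pdf(self, x):
        return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

pdf_only_normal = pdf_only_normal_gen(name="pdf_only_normal")
x = np.linspace(-3.0, 3.0, 200)

start = time.perf_counter()
cdf_generic = pdf_only_normal.cdf(x)   # numerical integration of _pdf
generic_seconds = time.perf_counter() - start

start = time.perf_counter()
cdf_explicit = stats.norm.cdf(x)       # closed form via special functions
explicit_seconds = time.perf_counter() - start

print(np.allclose(cdf_generic, cdf_explicit, atol=1e-6))
print(generic_seconds, explicit_seconds)
```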
Remaining Issues
The distributions in scipy.stats have recently been corrected and improved and have gained a considerable test suite; however, a few issues remain:
The distributions have been tested over some range of parameters; however, in some corner ranges, a few incorrect results may remain.
The maximum likelihood estimation in fit does not work with default starting parameters for all distributions, and the user needs to supply good starting parameters. Also, for some distributions, using a maximum likelihood estimator might inherently not be the best choice.
Building Specific Distributions
The next examples show how to build your own distributions. Further examples show the usage of the distributions and some statistical tests.
Making a Continuous Distribution, i.e., Subclassing rv_continuous
Making continuous distributions is fairly simple.
>>> from scipy import stats
>>> class deterministic_gen(stats.rv_continuous):
...     def _cdf(self, x):
...         return np.where(x < 0, 0., 1.)
...     def _stats(self):
...         return 0., 0., 0., 0.
...
>>> deterministic = deterministic_gen(name="deterministic")
>>> deterministic.cdf(np.arange(-3, 3, 0.5))
array([ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.])
Interestingly, the pdf is now computed automatically:
>>> deterministic.pdf(np.arange(-3, 3, 0.5))
array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        5.83333333e+04,  4.16333634e-12,  4.16333634e-12,
        4.16333634e-12,  4.16333634e-12,  4.16333634e-12])
Be aware of the performance issues mentioned in Performance Issues and Cautionary Remarks. The computation of unspecified common methods can become very slow, since only general methods are called which, by their very nature, cannot use any specific information about the distribution. Thus, as a cautionary example:
>>> from scipy.integrate import quad
>>> quad(deterministic.pdf, -1e-1, 1e-1)
(4.163336342344337e-13, 0.0)
But this is not correct: the integral over this pdf should be 1. Let's make the integration interval smaller:
>>> quad(deterministic.pdf, -1e-3, 1e-3) # warning removed
(1.000076872229173, 0.0010625571718182458)
This looks better. However, the problem originated from the fact that the pdf is not specified in the class definition of the deterministic distribution.
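For contrast, here is a sketch of a subclass that does specify its pdf, along with the cdf and ppf. The distribution, with density f(x) = 2x on [0, 1], and the class name linear_density_gen are hypothetical and chosen only because every method has a simple closed form.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

class linear_density_gen(stats.rv_continuous):
    """Density f(x) = 2x on [0, 1] (hypothetical example distribution)."""

    def _pdf(self, x):
        return 2.0 * x

    def _cdf(self, x):
        return x**2

    def _ppf(self, q):
        return np.sqrt(q)

# a and b set the support, so the methods are only evaluated on [0, 1].
linear_density = linear_density_gen(a=0.0, b=1.0, name="linear_density")

total, abs_error = quad(linear_density.pdf, 0.0, 1.0)
print(total, abs_error)
print(linear_density.cdf(0.5), linear_density.ppf(0.25))
```

With the pdf given explicitly, quad integrates a smooth function, so the total probability should come out at 1 without any need to tune the integration interval.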