引言
最近写论文关于预测的特征选择遇到一些问题,想把自己查询学习到的东西整理记录一下,理一理头绪,希望能加深自己对这些东西的理解。首先介绍引入几个概念:自相关函数(autocorrelation function,ACF)、偏自相关函数(partial autocorrelation,PACF)和互相关函数(cross-correlation function,CCF)。接下来介绍每个指标的计算方法和用途:
1. ACF
顾名思义,自相关函数是针对单个时间序列的,对于时间序列
,滞后k阶的自协方差函数(autocovariance function,ACVF)为[1]:
,
即![c_0 = \frac{1}{n}\sum_{t=1}^{n} \left(x_t-\bar{x}\right) ^2](https://private.codecogs.com/gif.latex?c_0%20%3D%20%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bt%3D1%7D%5E%7Bn%7D%20%5Cleft%28x_t-%5Cbar%7Bx%7D%5Cright%29%20%5E2)
ACF被定义为:
![acf_k = \frac{c_k}{c_0} = \text{Cor}(x_t,x_{t+k})](https://private.codecogs.com/gif.latex?acf_k%20%3D%20%5Cfrac%7Bc_k%7D%7Bc_0%7D%20%3D%20%5Ctext%7BCor%7D%28x_t%2Cx_%7Bt+k%7D%29)
我理解的也等同于序列
与
的pearson相关系数。置信区间一般使用下式进行计算[2]:
![\pm z_{1-\alpha/2} \sqrt{\frac{1}{N} (1 + 2 \sum_{i=1}^{k}{x_{i}^2})}](https://private.codecogs.com/gif.latex?%5Cpm%20z_%7B1-%5Calpha/2%7D%20%5Csqrt%7B%5Cfrac%7B1%7D%7BN%7D%20%281%20+%202%20%5Csum_%7Bi%3D1%7D%5E%7Bk%7D%7Bx_%7Bi%7D%5E2%7D%29%7D)
下面是一个简单的计算程序,是对statsmodels模块的源代码进行简化得到的[3]。
from scipy.stats import norm
import numpy as np
def acf(x,nlags=40, alpha=None):
# Calculate the autocorrelation function.
nobs = len(x)
avf = acovf(x,nlag = nlags)
acf = avf[:nlags + 1] / avf[0]
if alpha is not None:
varacf = np.ones(nlags + 1) / nobs
varacf[0] = 0
varacf[1] = 1. / nobs
varacf[2:] *= 1 + 2 * np.cumsum(acf[1:-1]**2)
interval = norm.ppf(1 - alpha / 2.) * np.sqrt(varacf)
confint = np.array(lzip(acf - interval, acf + interval))
return acf, confint
else:
return acf
def acovf(x,nlag=None):
# Estimate autocovariances.
xo = x - x.mean()
n = len(x)
lag_len = nlag
if nlag is None:
lag_len = n - 1
elif nlag > n - 1:
raise ValueError('nlag must be smaller than nobs - 1')
acov = np.empty(lag_len + 1)
acov[0] = xo.dot(xo)
for i in range(lag_len):
acov[i + 1] = xo[i + 1:].dot(xo[:-(i + 1)])
return acov
理解了原理可以这样。
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf
x = pd.read_excel('resample.xls',index_col=0,date_parse=True)
acf_x, interval = acf(x=x,nlags=10,alpha=0.05)
print('ACF:\n',acf_x)
print('ACF95%置信区间下界:\n',interval[:,0]-acf_x)
print('ACF95%置信区间上界:\n',interval[:,1]-acf_x)
plot_acf(x=x,lags=10,alpha=0.05)
plt.show()
输出结果:
ACF:
[1. 0.97339745 0.95013662 0.93091665 0.91399405 0.89897186
0.88282934 0.86933546 0.85708885 0.84621534 0.83720026]
ACF95%置信区间下界:
[ 0. -0.03625835 -0.06169254 -0.07861061 -0.09196861 -0.10322176
-0.11304703 -0.12177398 -0.12967655 -0.13692158 -0.14363264]
ACF95%置信区间上界:
[0. 0.03625835 0.06169254 0.07861061 0.09196861 0.10322176
0.11304703 0.12177398 0.12967655 0.13692158 0.14363264]
![acf](https://img-blog.csdnimg.cn/2020013110022152.jpg?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L1dpbGxfWmhhbg==,size_16,color_FFFFFF,t_70)
[1] https://nwfsc-timeseries.github.io/atsa-labs/sec-tslab-correlation-within-and-among-time-series.html : Book, Applied Time Series Analysis for Fisheries and Environmental Sciences
[2] https://www.statsmodels.org/stable/_modules/statsmodels/tsa/stattools.html#acf : Python modular, statsmodels
[3] https://www.itl.nist.gov/div898/handbook/eda/section3/autocopl.htm : Engineering statistics handbook
2. PACF
偏自相关函数也是针对单个时间序列的,关于其和ACF的区别,[4] 进行了一般性的易于理解的解释,我在这里根据自己的理解进行简要翻译一下。
偏自相关函数是序列
与滞后k阶的序列
的线性相关性,移除
的线性依赖,计算公式为:
![pacf_k = \begin{cases} \text{Cor}(x_1,x_0)=r_1 & \text{if } k = 1;\\ \text{Cor}(x_k-x_k^{k-1},x_0-x_0^{k-1}) & \text{if } k \geq 2; \end{cases}](https://private.codecogs.com/gif.latex?pacf_k%20%3D%20%5Cbegin%7Bcases%7D%20%5Ctext%7BCor%7D%28x_1%2Cx_0%29%3Dr_1%20%26%20%5Ctext%7Bif%20%7D%20k%20%3D%201%3B%5C%5C%20%5Ctext%7BCor%7D%28x_k-x_k%5E%7Bk-1%7D%2Cx_0-x_0%5E%7Bk-1%7D%29%20%26%20%5Ctext%7Bif%20%7D%20k%20%5Cgeq%202%3B%20%5Cend%7Bcases%7D)
滞后1阶的pacf与滞后1阶的acf相等,不存在滞后0阶的pacf。
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import pacf
from statsmodels.graphics.tsaplots import plot_pacf
x = pd.read_excel('resample.xls',index_col=0,date_parse=True)
pacf_x, interval = pacf(x=x,nlags=10,alpha=0.05)
print('PACF:\n',pacf_x)
print('PACF95%置信区间下界:\n',interval[:,0]-pacf_x)
print('PACF95%置信区间上界:\n',interval[:,1]-pacf_x)
plot_pacf(x=x,lags=10,alpha=0.05)
plt.savefig('acf1.jpg',dpi=200)
plt.show()
PACF:
[ 1. 0.97373069 0.05083912 0.07012207 0.04486007 0.04104969
-0.01595041 0.0505279 0.02666616 0.03388021 0.04459499]
PACF95%置信区间下界:
[ 0. -0.03625835 -0.03625835 -0.03625835 -0.03625835 -0.03625835
-0.03625835 -0.03625835 -0.03625835 -0.03625835 -0.03625835]
PACF95%置信区间上界:
[0. 0.03625835 0.03625835 0.03625835 0.03625835 0.03625835
0.03625835 0.03625835 0.03625835 0.03625835 0.03625835]
![pacf](https://img-blog.csdnimg.cn/20200131100546203.jpg?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L1dpbGxfWmhhbg==,size_16,color_FFFFFF,t_70)
3. CCF
交叉相关系数是针对两个时间序列的,对于时间序列
和
,先计算互协方差函数:
![g_k^{xy} = \frac{1}{n}\sum_{t=1}^{n-k} \left(y_t-\bar{y}\right) \left(x_{t+k}-\bar{x}\right)](https://private.codecogs.com/gif.latex?g_k%5E%7Bxy%7D%20%3D%20%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bt%3D1%7D%5E%7Bn-k%7D%20%5Cleft%28y_t-%5Cbar%7By%7D%5Cright%29%20%5Cleft%28x_%7Bt+k%7D-%5Cbar%7Bx%7D%5Cright%29)
CCF的定义类似ACF:
![r_k^{xy} = \frac{g_k^{xy}}{\sqrt{\text{SD}_x\text{SD}_y}}](https://private.codecogs.com/gif.latex?r_k%5E%7Bxy%7D%20%3D%20%5Cfrac%7Bg_k%5E%7Bxy%7D%7D%7B%5Csqrt%7B%5Ctext%7BSD%7D_x%5Ctext%7BSD%7D_y%7D%7D)
其中,
分别为
的标准差。同时,应该注意的是
,
。其中,y是解释变量,x是预测因子。CCF的95%的置信区间计算公式定义为[1]:
![-\frac{1}{n} \pm \frac{2}{\sqrt{n}}](https://private.codecogs.com/gif.latex?-%5Cfrac%7B1%7D%7Bn%7D%20%5Cpm%20%5Cfrac%7B2%7D%7B%5Csqrt%7Bn%7D%7D)
其中n是用于计算互相关系数的样本数量。
import pandas as pd
from statsmodels.tsa.stattools import ccf
x = pd.read_excel('resample.xls',index_col=0,date_parse=True)
ccf_x = ccf(x=x.values.ravel(),y=x.values.ravel())
print('CCF:\n',ccf_x[:10])
输出是:
CCF:
[1. 0.97373069 0.95078739 0.9318734 0.91524696 0.90051278
0.88464586 0.87142306 0.85944187 0.8488298 ]
附录:
1. 计算ccf,acf和pcf的置信区间的时候,和数据序列的长度有关系,长度取多少的时候比较合适?