我正在为具有 34 个因变量的 logit 模型进行数据建模,并且它不断抛出奇异矩阵错误,如下所示 -:
Traceback (most recent call last):
File "<pyshell#1116>", line 1, in <module>
test_scores = smf.Logit(m['event'], train_cols,missing='drop').fit()
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 1186, in fit
disp=disp, callback=callback, **kwargs)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 164, in fit
disp=disp, callback=callback, **kwargs)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 357, in fit
hess=hess)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 405, in _fit_mle_newton
newparams = oldparams - np.dot(np.linalg.inv(H),
File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 328, in solve
raise LinAlgError, 'Singular matrix'
LinAlgError: Singular matrix
就在那时,我偶然发现了这种将矩阵简化为其独立列的方法
def independent_columns(A, tol = 0):#1e-05):
"""
Return an array composed of independent columns of A.
Note the answer may not be unique; this function returns one of many
possible answers.
https://stackoverflow.com/q/13312498/190597 (user1812712)
http://math.stackexchange.com/a/199132/1140 (Gerry Myerson)
http://mail.scipy.org/pipermail/numpy-discussion/2008-November/038705.html
(Anne Archibald)
>>> A = np.array([(2,4,1,3),(-1,-2,1,0),(0,0,2,2),(3,6,2,5)])
2 4 1 3
-1 -2 1 0
0 0 2 2
3 6 2 5
# try with checking the rank of matrixs
>>> independent_columns(A)
np.array([[1, 4],
[2, 5],
[3, 6]])
"""
Q, R = linalg.qr(A)
independent = np.where(np.abs(R.diagonal()) > tol)[0]
#print independent
return A[:, independent], independent
A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None))
#train_cols will not be converted back from a df to a matrix object,so doing this explicitly
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])
test_scores = smf.Logit(m['event'],A2,missing='drop').fit()
我仍然得到 LinAlgError ,尽管我希望现在矩阵等级会降低。
另外,我看到np.linalg.matrix_rank(train_cols)
返回 33(即,在调用独立列函数之前,“x”列总数为 34(即,len(train_cols.ix[0])=34
),这意味着我没有满秩矩阵),而np.linalg.matrix_rank(A2)
返回 33 (这意味着我已经删除了一列,但当我运行时我仍然看到 LinAlgError )test_scores = smf.Logit(m['event'],A2,missing='drop').fit()
,我缺少什么?
参考上面的代码 -如何找到协方差矩阵中的简并行/列
我尝试开始向前构建模型,通过一次引入每个变量,这不会给我带来奇异矩阵误差,但我宁愿有一种确定性的方法,让我知道我做错了什么&如何消除这些列。
编辑(更新了@@的帖子建议
下面是user333700)
1.你是对的,“A2”并没有降低33的排名。 IE。len(A2.ix[0]) =34
-> 意味着可能的共线列不会被删除 - 我应该增加“tol”,以获得 A2 的排名(及其列数)的容差,如 33。如果我将 tol 更改为上面的“1e-05”,然后我确实得到了len(A2.ix[0]) =33
,这对我来说意味着 tol >0 (严格来说)是一个指标。
之后我也做了同样的事情,test_scores = smf.Logit(m['event'],A2,missing='drop').fit()
,无需 nm 即可收敛。
2.尝试“nm”方法后出错。但奇怪的是,如果我只获取 20,000 行,我确实会得到结果。因为它没有显示内存错误,但是“Inverting hessian failed, no bse or cov_params available
" - 我假设有多个几乎相似的记录 - 你会怎么说?
m = smf.Logit(data['event_custom'].ix[0:1000000] , train_cols.ix[0:1000000],missing='drop')
test_scores=m.fit(start_params=None,method='nm',maxiter=200,full_output=1)
Warning: Maximum number of iterations has been exceeded
Warning (from warnings module):
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 374
warn(warndoc, Warning)
Warning: Inverting hessian failed, no bse or cov_params available
test_scores.summary()
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
test_scores.summary()
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2396, in summary
yname_list)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2253, in summary
use_t=False)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 826, in add_table_params
use_t=use_t)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 447, in summary_params
std_err = results.bse
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/tools/decorators.py", line 95, in __get__
_cachedval = self.fget(obj)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1037, in bse
return np.sqrt(np.diag(self.cov_params()))
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1102, in cov_params
raise ValueError('need covariance of parameters for computing '
ValueError: need covariance of parameters for computing (unnormalized) covariances
Edit 2:(下面更新了@user333700 的建议)
重申我正在尝试建模的内容 - 不到总数的约 1%
用户“转化”(成功结果) - 所以我采取了一个平衡的样本
35(+ve) /65(-ve)
我怀疑该模型虽然收敛,但并不稳健。因此,将使用来自不同数据集的“start_params”作为早期迭代的参数。
此编辑是为了确认“start_params”是否可以输入到结果中,如下所示 -:
A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None))
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])
m = smf.Logit(data['event_custom'], A2,missing='drop')
#m = smf.Logit(data['event_custom'], train_cols,missing='drop')#,method='nm').fit()#This doesnt work, so tried 'nm' which work, but used lasso, as nm did not converge.
test_scores=m.fit_regularized(start_params=None, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)
a_good_looking_previous_result.params=test_scores.params #storing the parameters of pass1 to feed into pass2
test_scores.params
bidfloor_Quartile_modified_binned_0 0.305765
connectiontype_binned_0 -0.436798
day_custom_binned_Fri -0.040269
day_custom_binned_Mon 0.138599
day_custom_binned_Sat -0.319997
day_custom_binned_Sun -0.236507
day_custom_binned_Thu -0.058922
user_agent_device_family_binned_iPad -10.793270
user_agent_device_family_binned_iPhone -8.483099
user_agent_masterclass_binned_apple 9.038889
user_agent_masterclass_binned_generic -0.760297
user_agent_masterclass_binned_samsung -0.063522
log_height_width 0.593199
log_height_width_ScreenResolution -0.520836
productivity -1.495373
games 0.706340
entertainment -1.806886
IAB24 2.531467
IAB17 0.650327
IAB14 0.414031
utilities 9.968253
IAB1 1.850786
social_networking -2.814148
IAB3 -9.230780
music 0.019584
IAB9 -0.415559
C(time_day_modified)[(6, 12]]:C(country)[AUS] -0.103003
C(time_day_modified)[(0, 6]]:C(country)[HKG] 0.769272
C(time_day_modified)[(6, 12]]:C(country)[HKG] 0.406882
C(time_day_modified)[(0, 6]]:C(country)[IDN] 0.073306
C(time_day_modified)[(6, 12]]:C(country)[IDN] -0.207568
C(time_day_modified)[(0, 6]]:C(country)[IND] 0.033370
... more params here
现在在不同的数据集(pass2,用于索引)上,我的模型如下所示 -:
IE。我读取了一个新的数据帧,进行所有变量转换,然后像之前一样通过 Logit 进行建模。
m_pass2 = smf.Logit(data['event_custom'], A2_pass2,missing='drop')
test_scores_pass2=m_pass2.fit_regularized(start_params=a_good_looking_previous_result.params, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)
并且,可能通过从早期传递中获取“start_params”来继续迭代。