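Linear regression does not suit every situation. Some outcomes are binary (for example, heads versus tails) and others are counts. Generalized linear models (GLMs) can model such data while still using a linear combination of the predictors.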
Contents
Logistic regression
Using statsmodels
Using sklearn
Poisson regression
Using statsmodels
Negative binomial regression
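Logistic regression
When the response variable is binary, logistic regression is commonly used to model the data.
The data below comes from the dataset that accompanies Pandas for Everyone (pandas活用). If needed, it can be downloaded from https://download.csdn.net/download/qq_57099024/79301082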
import pandas as pd

d = pd.read_csv('D:/pandas活用/pandas_for_everyone-master/data/acs_ny.csv')
print(d.columns)
print('@' * 66)  # print a separator to tell the two outputs apart
print(d.head())
'''
Output:
Index(['Acres', 'FamilyIncome', 'FamilyType', 'NumBedrooms', 'NumChildren',
       'NumPeople', 'NumRooms', 'NumUnits', 'NumVehicles', 'NumWorkers',
       'OwnRent', 'YearBuilt', 'HouseCosts', 'ElectricBill', 'FoodStamp',
       'HeatingFuel', 'Insurance', 'Language'],
      dtype='object')
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  Acres  FamilyIncome   FamilyType  NumBedrooms  NumChildren  NumPeople
0  1-10           150      Married            4            1          3
1  1-10           180  Female Head            3            2          4
2  1-10           280  Female Head            4            0          2
3  1-10           330  Female Head            2            1          2
4  1-10           330    Male Head            3            1          2

   NumRooms         NumUnits  NumVehicles  NumWorkers   OwnRent    YearBuilt
0         9  Single detached            1           0  Mortgage    1950-1959
1         6  Single detached            2           0    Rented  Before 1939
2         8  Single detached            3           1  Mortgage    2000-2004
3         4  Single detached            1           0    Rented    1950-1959
4         5  Single attached            1           0  Mortgage  Before 1939

   HouseCosts  ElectricBill FoodStamp HeatingFuel  Insurance        Language
0        1800            90        No         Gas       2500         English
1         850            90        No         Oil          0         English
2        2600           260        No         Oil       6600  Other European
3        1800           140        No         Oil          0         English
4         860           150        No         Gas        660         Spanish
'''
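Next, bin FamilyIncome to create a binary response variable, income_15w, which is 1 when the family income is above 150,000 and 0 otherwise:
d['income_15w'] = pd.cut(d['FamilyIncome'], [0, 150000, d['FamilyIncome'].max()], labels=[0, 1])
d['income_15w'] = d['income_15w'].astype(int)  # convert the categorical labels to integers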
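One detail of pd.cut is worth checking before relying on the binned column. By default the bins are open on the left and closed on the right, so with edges [0, 150000, max] a value of exactly 0 falls outside every bin and becomes NaN, which would make astype(int) fail. The sketch below illustrates this on a small made-up Series (the values are hypothetical, not taken from acs_ny.csv) and shows include_lowest=True as one possible way to keep the lower edge.

```python
import pandas as pd

# Hypothetical incomes chosen to sit on and around the bin edges.
incomes = pd.Series([0, 50, 150000, 150001, 300000])
edges = [0, 150000, incomes.max()]

# Default behaviour: the bins are (0, 150000] and (150000, 300000].
binned = pd.cut(incomes, edges, labels=[0, 1])

# include_lowest=True closes the first bin on the left: [0, 150000].
binned_closed = pd.cut(incomes, edges, labels=[0, 1], include_lowest=True)

print(binned.isna().tolist())              # which values fell outside every bin
print(binned_closed.astype(int).tolist())  # integer labels once no value is missing
```

Whether this matters for the real data depends on whether FamilyIncome contains zeros, which is easy to check with d['FamilyIncome'].min().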
Using statsmodels
import statsmodels.formula.api as smf

# Fit a logistic regression with a formula; string columns are treated as categorical
model = smf.logit('income_15w~HouseCosts+NumWorkers+OwnRent+NumBedrooms+FamilyType', data=d)
results = model.fit()
print(results.summary())
Optimization terminated successfully.
         Current function value: 0.391651
         Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:             income_15w   No. Observations:                22745
Model:                          Logit   Df Residuals:                    22737
Method:                           MLE   Df Model:                            7
Date:                Sat, 05 Feb 2022   Pseudo R-squ.:                  0.2078
Time:                        08:46:18   Log-Likelihood:                -8908.1
converged:                       True   LL-Null:                       -11244.
Covariance Type:            nonrobust   LLR p-value:                     0.000
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  -5.8081      0.120    -48.456      0.000      -6.043      -5.573
OwnRent[T.Outright]         1.8276      0.208      8.782      0.000       1.420       2.236
OwnRent[T.Rented]          -0.8763      0.101     -8.647      0.000      -1.075      -0.678
FamilyType[T.Male Head]     0.2874      0.150      1.913      0.056      -0.007       0.582
FamilyType[T.Married]       1.3877      0.088     15.781      0.000       1.215       1.560
HouseCosts                  0.0007   1.72e-05     42.453      0.000       0.001       0.001
NumWorkers                  0.5873      0.026     22.393      0.000       0.536       0.639
NumBedrooms                 0.2365      0.017     13.985      0.000       0.203       0.270
===========================================================================================
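The coefficients in a logistic regression summary are on the log-odds scale, which is hard to read directly. As a minimal sketch, the snippet below copies the rounded coefficients from the summary above, exponentiates them into odds ratios, and turns the linear predictor into a probability for one household. The household is hypothetical, and because HouseCosts is printed as a rounded 0.0007, the resulting probability is only approximate.

```python
import math

# Coefficients copied from the Logit summary (rounded as printed).
coefs = {
    "Intercept": -5.8081,
    "OwnRent[T.Outright]": 1.8276,
    "OwnRent[T.Rented]": -0.8763,
    "FamilyType[T.Male Head]": 0.2874,
    "FamilyType[T.Married]": 1.3877,
    "HouseCosts": 0.0007,
    "NumWorkers": 0.5873,
    "NumBedrooms": 0.2365,
}

# exp(coef) is the multiplicative change in the odds for a one-unit increase
# in that predictor, holding the others fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items() if name != "Intercept"}

# Hypothetical household: married, paying a mortgage (the baseline OwnRent level),
# house costs of 2000, two workers, three bedrooms.
household = {"FamilyType[T.Married]": 1, "HouseCosts": 2000, "NumWorkers": 2, "NumBedrooms": 3}

# Linear predictor (log-odds), then the inverse logit to get a probability.
log_odds = coefs["Intercept"] + sum(coefs[k] * v for k, v in household.items())
probability = 1 / (1 + math.exp(-log_odds))

print(round(odds_ratios["NumWorkers"], 2))
print(round(probability, 3))
```

With a fitted model in hand, the same odds ratios would normally be computed from results.params rather than typed in by hand.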
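Using sklearn
from sklearn import linear_model

# sklearn does not parse formulas, so one-hot encode the categorical predictors by hand;
# drop_first=True drops one level of each variable to serve as the baseline
predictors = pd.get_dummies(d[['HouseCosts', 'NumWorkers', 'OwnRent', 'NumBedrooms', 'FamilyType']], drop_first=True)

# Note: LogisticRegression applies L2 regularization by default,
# so its coefficients need not match the statsmodels estimates
lr = linear_model.LogisticRegression()
results = lr.fit(X=predictors, y=d['income_15w'])
print(results.coef_)
print('-*-' * 10)
print(results.intercept_)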
[[ 5.86894916e-04  7.32489391e-01  2.86764784e-01  7.17542587e-02
  -2.13282748e+00 -1.03910262e+00  2.63647146e-01]]
-*--*--*--*--*--*--*--*--*--*-
[-4.86108187]
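Poisson regression
Poisson regression is commonly used to analyze count data.
Using statsmodels
# Model the number of children (a count) with a Poisson regression
results = smf.poisson('NumChildren~FamilyIncome+FamilyType+OwnRent', data=d).fit()
print(results.summary())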
Optimization terminated successfully.
         Current function value: nan
         Iterations 1
                          Poisson Regression Results
==============================================================================
Dep. Variable:            NumChildren   No. Observations:                22745
Model:                        Poisson   Df Residuals:                    22739
Method:                           MLE   Df Model:                            5
Date:                Sat, 05 Feb 2022   Pseudo R-squ.:                     nan
Time:                        09:05:28   Log-Likelihood:                    nan
converged:                       True   LL-Null:                       -30977.
Covariance Type:            nonrobust   LLR p-value:                       nan
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                      nan        nan        nan        nan         nan         nan
FamilyType[T.Male Head]        nan        nan        nan        nan         nan         nan
FamilyType[T.Married]          nan        nan        nan        nan         nan         nan
OwnRent[T.Outright]            nan        nan        nan        nan         nan         nan
OwnRent[T.Rented]              nan        nan        nan        nan         nan         nan
FamilyIncome                   nan        nan        nan        nan         nan         nan
===========================================================================================
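The summary above reports nan for the function value and for every coefficient, so this fit did not yield usable estimates even though it prints converged: True. One plausible cause (an assumption, not something the output confirms) is numerical overflow. Poisson regression models the mean as the exponential of the linear predictor, and FamilyIncome enters in raw units, so even a small trial coefficient during optimization can push the exponent beyond floating-point range. The sketch below only illustrates that mechanism; the income value and the trial coefficient are hypothetical.

```python
import math


def poisson_mean(beta, x):
    """Mean of a one-predictor Poisson model with a log link: exp(beta * x)."""
    return math.exp(beta * x)


income = 150000.0   # hypothetical raw FamilyIncome value
trial_beta = 0.01   # hypothetical coefficient tried by an optimizer

try:
    raw_mean = poisson_mean(trial_beta, income)
except OverflowError:
    # exp(1500) is outside the range of a double-precision float.
    raw_mean = float("inf")

# Same trial coefficient after rescaling income to units of 100,000.
scaled_mean = poisson_mean(trial_beta, income / 100000)

print(raw_mean, scaled_mean)
```

If overflow is indeed the problem, rescaling the predictor inside the formula, for example with I(FamilyIncome / 100000), may be worth trying before refitting.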
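Negative binomial regression
If the assumptions of Poisson regression do not hold (for example, the data are overdispersed), negative binomial regression can be used instead.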
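A Poisson distribution has variance equal to its mean, so a quick first check for overdispersion is to compare the two. The sketch below does this on made-up counts (hypothetical values, not taken from acs_ny.csv). It is only a rough marginal check; in a regression setting the comparison that matters is conditional on the predictors.

```python
import statistics

# Hypothetical counts, e.g. number of children in ten households.
counts = [0, 0, 0, 0, 0, 1, 1, 2, 5, 8]

mean = statistics.mean(counts)
variance = statistics.variance(counts)  # sample variance (n - 1 in the denominator)

# A ratio well above 1 suggests more spread than a Poisson model expects.
dispersion_ratio = variance / mean

print(mean, round(variance, 2), round(dispersion_ratio, 2))
```

For the real data the same comparison can be made with d['NumChildren'].mean() and d['NumChildren'].var().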
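The statsmodels GLM documentation lists the distribution families that can be passed to GLM through the family argument. The link functions available for each family are listed under its .links attribute in sm.families. The families include:
Binomial (binomial distribution)
Gamma (gamma distribution)
InverseGaussian (inverse Gaussian distribution)
NegativeBinomial (negative binomial distribution)
Poisson (Poisson distribution)
Tweedie (Tweedie distribution)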
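import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit a GLM with a negative binomial family and a log link.
# The link is passed as an instance (Log()), which current statsmodels versions expect.
model = smf.glm('NumChildren~FamilyIncome+FamilyType+OwnRent',
                data=d,
                family=sm.families.NegativeBinomial(link=sm.families.links.Log()))
results = model.fit()
print(results.summary())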
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            NumChildren   No. Observations:                22745
Model:                            GLM   Df Residuals:                    22739
Model Family:        NegativeBinomial   Df Model:                            5
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -29749.
Date:                Sat, 05 Feb 2022   Deviance:                       20731.
Time:                        10:06:21   Pearson chi2:                 1.77e+04
No. Iterations:                     6   Covariance Type:             nonrobust
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  -0.3345      0.029    -11.672      0.000      -0.391      -0.278
FamilyType[T.Male Head]    -0.0468      0.052     -0.905      0.365      -0.148       0.055
FamilyType[T.Married]       0.1529      0.029      5.200      0.000       0.095       0.211
OwnRent[T.Outright]        -1.9737      0.243     -8.113      0.000      -2.450      -1.497
OwnRent[T.Rented]           0.4164      0.030     13.754      0.000       0.357       0.476
FamilyIncome             5.398e-07   9.55e-08      5.652      0.000    3.53e-07    7.27e-07
===========================================================================================
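Because this model uses a log link, exponentiating a coefficient gives a rate ratio, and exponentiating the whole linear predictor gives an expected count. As a minimal sketch, the snippet below copies the rounded coefficients from the summary above and computes both for one household. The household is hypothetical, chosen only to show the arithmetic.

```python
import math

# Coefficients copied from the negative binomial GLM summary (rounded as printed).
coefs = {
    "Intercept": -0.3345,
    "FamilyType[T.Male Head]": -0.0468,
    "FamilyType[T.Married]": 0.1529,
    "OwnRent[T.Outright]": -1.9737,
    "OwnRent[T.Rented]": 0.4164,
    "FamilyIncome": 5.398e-07,
}

# exp(coef) is the multiplicative change in the expected count
# for a one-unit increase in that predictor, holding the others fixed.
rate_ratios = {name: math.exp(b) for name, b in coefs.items() if name != "Intercept"}

# Hypothetical household: married, renting, family income of 100,000.
household = {"FamilyType[T.Married]": 1, "OwnRent[T.Rented]": 1, "FamilyIncome": 100000}

# Linear predictor on the log scale, then the inverse of the log link.
linear_predictor = coefs["Intercept"] + sum(coefs[k] * v for k, v in household.items())
expected_children = math.exp(linear_predictor)

print(round(rate_ratios["OwnRent[T.Rented]"], 2))
print(round(expected_children, 2))
```

With the fitted model available, results.predict on a small DataFrame with the same columns would be the more usual way to obtain such expected counts.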