
[Data Preparation and Feature Engineering] Data Cleaning

Date: 2023-08-17

Contents:
1. Basic concepts
2. Converting data types
3. Handling duplicate data
4. Handling missing data
   a. Checking for missing data
   b. Dropping missing data
   c. Filling missing data with specified values
   d. Filling missing values based on patterns
5. Outliers
   5.1 Detection by visualization (box plot, scatter plot)
   5.2 Detection by calculation (Z-score, IQR score)
   5.3 Handling outliers
References

1. Basic Concepts

import pandas as pd

df = pd.read_csv("test.csv")
df.sample(10)  # randomly sample 10 rows to get a feel for the data

Get the first few rows:

df.head()

Get the dimensions of the data:

df.shape

Get summary information about each column:

df.info()

Get the data type of each column:

df.dtypes

2. Converting Data Types

import pandas as pd

df = pd.DataFrame([{'col1': 'a', 'col2': '1'}, {'col1': 'b', 'col2': '2'}])
df.dtypes

Converting an object column to a numeric type:

df['col2-int'] = df['col2'].astype(int)

Converting an object column that contains non-numeric values:

# Because the series contains a non-numeric value, the dtype stays object
# even when conversion errors are ignored.
s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
s.astype(float, errors='ignore')

# to_numeric converts the series to a numeric dtype; the value 'pandas' becomes NaN.
pd.to_numeric(s, errors='coerce')

Converting to a datetime type:

# Combine the three columns Month, Day and Year into a single datetime column.
pd.to_datetime(df[['Month', 'Day', 'Year']])
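The one-liner above assumes a DataFrame that already has Month, Day and Year columns. A minimal, self-contained sketch with made-up values (the column data here is hypothetical):

import pandas as pd

dates = pd.DataFrame({'Year': [2015, 2016], 'Month': [1, 7], 'Day': [10, 24]})
pd.to_datetime(dates)  # pandas assembles a datetime from the Year/Month/Day columns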

The final version of the code:

import pandas as pd
import numpy as np

def convert_money(value):
    new_value = value.replace(",", "").replace("$", "")
    return float(new_value)

df2 = pd.read_csv("/home/aistudio/data/data20506/sales_types.csv",
                  dtype={'Customer Number': 'int'},
                  converters={'2016': convert_money,
                              '2017': convert_money,
                              'Percent Growth': lambda x: float(x.replace("%", "")) / 100,
                              'Jan Units': lambda x: pd.to_numeric(x, errors='coerce'),
                              'Active': lambda x: np.where(x == 'Y', 1, 0),
                              })
df2['Date'] = pd.to_datetime(df2[['Month', 'Day', 'Year']])
df2

3. Handling Duplicate Data

import pandas as pd

d = {'Name': ['Newton', 'Galilei', 'Einstein', 'Feynman', 'Newton', 'Maxwell', 'Galilei'],
     'Age': [26, 30, 28, 28, 26, 39, 40],
     'Score': [90, 80, 90, 100, 90, 70, 90]}
df = pd.DataFrame(d, columns=['Name', 'Age', 'Score'])
df

Using the df.duplicated() function (examples from the pandas docstring):

Consider a dataset containing ramen ratings.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, for each set of duplicated values, the first occurrence is set on False and all others on True.

>>> df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool

By using 'last', the last occurrence of each set of duplicated values is set on False and all others on True.

>>> df.duplicated(keep='last')
0     True
1    False
2    False
3    False
4    False
dtype: bool

By setting ``keep`` on False, all duplicates are True.

>>> df.duplicated(keep=False)
0     True
1     True
2    False
3    False
4    False
dtype: bool

To find duplicates on specific column(s), use ``subset``.

>>> df.duplicated(subset=['brand'])
0    False
1     True
2    False
3     True
4     True
dtype: bool

Dropping duplicates:

# Keep the first occurrence of each duplicated row.
df.drop_duplicates()
# Keep the last occurrence of each duplicated 'Age' value.
df.drop_duplicates('Age', keep='last')
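When no copy of a duplicated record should survive, keep=False drops them all. A small sketch on the same df (the choice of the Name column is only for illustration):

# Drop every row whose Name appears more than once.
df.drop_duplicates(subset=['Name'], keep=False)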

4. Handling Missing Data

a. Checking for missing data

import numpy as np

# Method 1: isna() returns True where a value is missing and False otherwise.
df = pd.DataFrame({"one": [1, 2, np.nan], "two": [np.nan, 3, 4]})
df.isna()        # an m x n frame with a True/False flag for every element
df.isna().any()  # one value per column: True if the column has at least one missing value

# Method 2: compute the missing rate per column.
(df.shape[0] - df.count()) / df.shape[0]
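Since the mean of a boolean column is the fraction of True values, the same missing rate can be written more compactly as:

df.isna().mean()  # fraction of missing values in each column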

b. Dropping missing data

The dropna() function (examples from the pandas docstring):

Examples
--------
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Drop the rows where at least one element is missing.

>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')
       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep only the rows with at least 2 non-NA values.

>>> df.dropna(thresh=2)
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Define in which columns to look for missing values.

>>> df.dropna(subset=['name', 'born'])
     name        toy       born
1  Batman  Batmobile 1940-04-25

Keep the DataFrame with valid entries in the same variable (i.e. modify df in place).

>>> df.dropna(inplace=True)
>>> df
     name        toy       born
1  Batman  Batmobile 1940-04-25

c. Filling missing data with specified values

df = pd.DataFrame({'ColA': [1, np.nan, np.nan, 4, 5, 6, 7],
                   'ColB': [1, 1, 1, 1, 2, 2, 2]})
df['ColA'].fillna(method='ffill')  # fill each gap with the previous value (forward fill)
df['ColA'].fillna(method='bfill')  # fill each gap with the next value (backward fill)

Using SimpleImputer from sklearn.impute to fill missing data:

from sklearn.impute import SimpleImputer

# Fill missing values with the column mean.
# (pdf2 is assumed to be a DataFrame from an earlier part of the original tutorial,
#  with a 'Height-na' column that contains NaNs.)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
col_values = imp_mean.fit_transform(pdf2['Height-na'].values.reshape((-1, 1)))

# Fill a sentinel value (-1) with a constant.
df = pd.DataFrame({"name": ["Google", "Huawei", "Facebook", "Alibaba"],
                   "price": [100, -1, -1, 90]})
imp = SimpleImputer(missing_values=-1, strategy='constant', fill_value=110)
imp.fit_transform(df['price'].values.reshape((-1, 1)))

d. Filling missing values based on patterns in the data

Using LinearRegression from sklearn.linear_model to predict the unknown values:

df = pd.Dataframe({"one":np.random.randint(1, 100, 10),"two": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],"three":[5, 9, 13, np.nan, 21, np.nan, 29, 33, 37, 41]})from sklearn.linear_model import LinearRegression # ⑥df_train = df.dropna() #训练集df_test = df[df['three'].isnull()] #测试集regr = LinearRegression()regr.fit(df_train['two'].values.reshape(-1, 1), df_train['three'].values.reshape(-1, 1)) # ⑦df_three_pred = regr.predict(df_test['two'].values.reshape(-1, 1)) # 将所得数值填补到原数据集中df.loc[(df.three.isnull()), 'three'] = df_three_preddf

Using RandomForestRegressor from sklearn.ensemble to predict the missing ages:

import pandas as pd

train_data = pd.read_csv("/home/aistudio/data/data20507/train.csv")
train_data.info()

df = train_data[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]  # features that may be related to age
known_age = df[df['Age'].notnull()].values
unknown_age = df[df['Age'].isnull()].values
y = known_age[:, 0]
X = known_age[:, 1:]

from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(X, y)
pred_age = rfr.predict(unknown_age[:, 1:])
pred_age.mean()

train_data.loc[(train_data.Age.isnull()), 'Age'] = pred_age
train_data.isna().any()

Missing values can also be filled with KNN:

%matplotlib inline
import seaborn as sns
from sklearn.datasets import load_iris  # the iris dataset
import numpy as np

iris = load_iris()
X = iris.data

# Build a copy of the data that contains missing values.
rng = np.random.RandomState(0)
X_missing = X.copy()
mask = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=0.7, size=X.shape[0])) < 0.6
X_missing[mask, 3] = np.nan  # X_missing is the dataset with missing values

from missingpy import KNNImputer  # KNN-based imputation model

imputer = KNNImputer(n_neighbors=3, weights="uniform")
X_imputed = imputer.fit_transform(X_missing)

# Distribution of the data before imputation
sns.distplot(X.reshape((-1, 1)))
# Distribution after imputation
sns.distplot(X_imputed.reshape((-1, 1)))
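If missingpy is not installed, scikit-learn 0.22+ ships sklearn.impute.KNNImputer with the same n_neighbors/weights parameters; a minimal sketch reusing X_missing from the block above:

from sklearn.impute import KNNImputer  # scikit-learn >= 0.22

imputer = KNNImputer(n_neighbors=3, weights="uniform")
X_imputed = imputer.fit_transform(X_missing)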

5. Outliers

5.1 Detection by visualization

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_boston

boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names

# Create the DataFrame.
boston_df = pd.DataFrame(x)
boston_df.columns = columns
boston_df.head()

Box-plot

import seaborn as sns

sns.boxplot(x=boston_df['INDUS']);

![[箱线图.png]]

sns.boxplot(x=boston_df['PTRATIO']);
sns.boxplot(x=boston_df['DIS']);

Scatter-plot

fig, ax = plt.subplots(figsize=(16, 8))
ax.scatter(boston_df['INDUS'], boston_df['TAX'])
ax.set_xlabel('Proportion of non-retail business acres per town')
ax.set_ylabel('Full-value property-tax rate per $10,000')
plt.show();

5.2 Detection by calculation

Z-Score

The Z-score is the signed number of standard deviations by which an observation lies above the mean of the data: z = (x - mean) / std.

from scipy import stats
import numpy as np

z = np.abs(stats.zscore(boston_df))
print(z)

threshold = 3
print(np.where(z > threshold))
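To actually drop the flagged rows, the boolean mask can be applied back to the DataFrame, keeping only rows where every feature has |z| below the threshold (a sketch based on the z array above):

boston_df_z = boston_df[(z < 3).all(axis=1)]
boston_df_z.shape  # fewer rows than the original boston_df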

IQR score

The interquartile range (IQR), also called the midspread or middle 50%, or technically the H-spread, is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles: IQR = Q3 - Q1. (See the Baidu Baike entry.)
![[IQR.png]]

Q1 = boston_df.quantile(0.25)
Q3 = boston_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Now that we have the IQR estimates, we can flag the outliers. The code below produces a frame of True/False values: False means the value is valid, while True marks it as an outlier.

print((boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 + 1.5 * IQR)))
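To remove every row that contains at least one flagged value, the same condition can be negated and applied row-wise; a sketch:

boston_df_out = boston_df[~((boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 + 1.5 * IQR))).any(axis=1)]
boston_df_out.shape  # the flagged rows have been dropped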

5.3 Handling outliers

The general workflow is: identify the outliers, compare the model results with and without them, and then draw a conclusion.

If you are certain a value is wrong, correct it.
If an outlier does not belong to the distribution of interest, delete it.
Given possible gaps in the data, use statistical tools that are resistant to outliers, for example robust regression, which estimates the parameters in a different way (Robust_regression); see the sketch after this list.
A common cause of outliers is a mixture of two distributions, perhaps two distinct sub-populations, or "correct measurement" versus "measurement error"; this is usually modeled with a mixture model (Mixture model).
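As an illustration of the robust-regression option, here is a minimal sketch using scikit-learn's HuberRegressor (my own choice of estimator, not one prescribed by the article), which down-weights points with large residuals:

import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(scale=1.0, size=100)
y[:5] += 50  # inject a few extreme outliers

huber = HuberRegressor().fit(X, y)  # robust: large residuals are down-weighted
ols = LinearRegression().fit(X, y)  # ordinary least squares, pulled toward the outliers
print(huber.coef_, ols.coef_)       # the Huber slope stays much closer to 3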

References

https://blog.csdn.net/weixin_42199542/article/details/106898892
https://blog.csdn.net/bbbeoy/article/details/72124004
