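1. Converting features to numbers
1.1 The replace() function
1.2 LabelEncoder in sklearn
1.3 The category_encoders package
2. Feature binarization
2.1 Manual thresholding
2.2 Binarizer in sklearn
3. One-Hot encoding
3.1 Manual conversion
3.2 OneHotEncoder in sklearn
4. Data transformation
4.1 Taking the logarithm of a variable
4.2 PolynomialFeatures in sklearn
5. Feature discretization
5.1 Unsupervised discretization
5.2 Supervised discretization
6. Data normalization
6.1 StandardScaler in sklearn
6.2 MinMaxScaler in sklearn
6.3 RobustScaler in sklearn
6.4 Examples

1. Converting features to numbers

1.1 The replace() function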

    import pandas as pd

    df = pd.DataFrame({
        "gene_segA": [1, 0, 0, 1, 1, 1, 0, 0, 1, 0],
        "gene_segB": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
        "hypertension": ["Y", 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N'],
        "Gallstones": ['Y', 'N', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y']
    })
    df

    # Map the string flags to integers: "N" -> 0, "Y" -> 1
    df.replace({"N": 0, 'Y': 1})

1.2 LabelEncoder in sklearn

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()                                    # create the encoder
    le.fit(['white', 'green', 'red', 'green', 'white'])    # learn the set of labels
    le.classes_                                            # the learned labels, sorted
    le.transform(["green", 'green', 'green', 'white'])     # labels -> integer codes

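To make the encoding rule concrete, here is a minimal pure-Python sketch of what the fit and transform steps amount to: the distinct labels are sorted, and each label is replaced by its position in that sorted list. The helper names `fit_classes` and `transform` are hypothetical and only serve this illustration; they are not part of sklearn.

```python
def fit_classes(values):
    """Return the sorted distinct labels (the role played by le.classes_)."""
    return sorted(set(values))


def transform(classes, values):
    """Replace each label by its index in the sorted class list."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values]


classes = fit_classes(['white', 'green', 'red', 'green', 'white'])
codes = transform(classes, ['green', 'green', 'green', 'white'])

print(classes)  # ['green', 'red', 'white']
print(codes)    # [0, 0, 0, 2]
```

Because the codes follow alphabetical order, they carry no meaningful ranking; that is worth keeping in mind before feeding them to a model that treats numbers as ordered.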
1.3 The category_encoders package

This package bundles many different encoding schemes. An overview:
https://mattzheng.blog.csdn.net/article/details/107851162
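2. Feature binarization

2.1 Manual thresholding

    import numpy as np

    # pm25 is assumed to be a DataFrame loaded earlier with an "Exposed days" column
    # Rows above the column mean get 1, all other rows get 0
    pm25['bdays'] = np.where(pm25["Exposed days"] > pm25["Exposed days"].mean(), 1, 0)
    pm25.sample(10)
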
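2.2 Binarizer in sklearn

Binarizer converts numeric values to binary values according to a configurable threshold. It only works on numeric data, and its input must be a 2D array of shape (m, n) rather than (n,), so a Series cannot be passed in directly. An example: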
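
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import Binarizer

    df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])
    df

The first column is the index:

       A   B   C
    0  0   1   2
    1  3   4   5
    2  6   7   8
    3  9  10  11

Values less than or equal to 5 become 0, and values greater than 5 become 1:

    binarize = Binarizer(threshold=5)
    binarize.fit_transform(df)

    array([[0, 0, 0],
           [0, 0, 0],
           [1, 1, 1],
           [1, 1, 1]])

You can also pass df[['A', 'B']] to convert only those two columns. Note that df['A'] or df.A will not work, because df.A is a Series and is not two-dimensional.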
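The thresholding rule itself is a strict "greater than" comparison. As a rough sketch, the same result can be reproduced with plain NumPy on the 4x3 array from the example, without going through sklearn:

```python
import numpy as np

X = np.arange(12).reshape(4, 3)
threshold = 5

# Strictly greater than the threshold -> 1, everything else (including 5) -> 0
binary = (X > threshold).astype(int)
print(binary)
```

The value 5 sits exactly on the threshold and is mapped to 0, which is why the second row comes out as all zeros.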

    from sklearn.preprocessing import Binarizer

    # Use the column mean as the threshold
    bn = Binarizer(threshold=pm25["Exposed days"].mean())
    # Double brackets keep the input two-dimensional
    result = bn.fit_transform(pm25[["Exposed days"]])
    pm25['sk-bdays'] = result
    pm25.sample(10)

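Supplementary note:

reshape changes the shape of an array without changing its data.
reshape(m, -1) reshapes to m rows; the number of columns is inferred.
reshape(-1, m) reshapes to m columns; the number of rows is inferred.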
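A short sketch of how -1 behaves in reshape, using an illustrative 12-element array. The last line shows the common trick for turning a one-dimensional array into the single-column 2D shape that sklearn transformers expect.

```python
import numpy as np

a = np.arange(12)                # shape (12,), one-dimensional

rows_fixed = a.reshape(3, -1)    # 3 rows, columns inferred
cols_fixed = a.reshape(-1, 3)    # 3 columns, rows inferred
column = a.reshape(-1, 1)        # a single column, one row per element

print(rows_fixed.shape)  # (3, 4)
print(cols_fixed.shape)  # (4, 3)
print(column.shape)      # (12, 1)
```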
3. One-Hot encoding

    df = pd.DataFrame({
        "color": ['green', 'red', 'blue', 'red'],
        "size": ['M', 'L', 'XL', 'L'],
        "price": [29.9, 69.9, 99.9, 59.9],
        "classlabel": ['class1', 'class2', 'class1', 'class1']
    })
    df

3.1 Manual conversion

    # Sizes have a natural order, so map them to increasing integers
    size_mapping = {'XL': 3, 'L': 2, 'M': 1}
    df['size'] = df['size'].map(size_mapping)
    df

3.2 OneHotEncoder in sklearn

    from sklearn.preprocessing import OneHotEncoder

    ohe = OneHotEncoder()
    fs = ohe.fit_transform(df[['color']])
    # Categories are sorted (blue, green, red); drop the first column to avoid redundancy
    fs_ohe = pd.DataFrame(fs.toarray()[:, 1:], columns=["color_green", 'color_red'])
    df = pd.concat([df, fs_ohe], axis=1)
    df

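To see why dropping the first column leaves exactly color_green and color_red, here is a small pure-Python sketch of the same idea on the color column. It assumes the categories are ordered alphabetically, which is how the column names in the code above line up.

```python
colors = ['green', 'red', 'blue', 'red']

# Sorted distinct categories: one indicator column per category
categories = sorted(set(colors))
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Drop the first category (blue); a row of all zeros now means "blue"
kept_columns = ['color_' + cat for cat in categories[1:]]
reduced = [row[1:] for row in one_hot]

print(categories)    # ['blue', 'green', 'red']
print(kept_columns)  # ['color_green', 'color_red']
print(reduced)       # [[1, 0], [0, 1], [0, 0], [0, 1]]
```

With k categories, k-1 indicator columns already carry all the information, since the dropped category is implied when every remaining indicator is 0.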
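4. Data transformation

4.1 Taking the logarithm of a variable

    %matplotlib inline
    import seaborn as sns

    # data is assumed to be a DataFrame loaded earlier with 'time' and 'location' columns
    ax = sns.scatterplot(x='time', y='location', data=data)

    import numpy as np

    data.drop([0], inplace=True)                   # drop row 0 so that log(0) is never computed
    data['logtime'] = np.log10(data['time'])       # log-transform the x variable
    data['logloc'] = np.log10(data['location'])    # log-transform the y variable
    data.head()

    ax2 = sns.scatterplot(x='logtime', y='logloc', data=data)

    from sklearn.linear_model import LinearRegression

    # Fit a straight line in log-log space
    reg = LinearRegression()
    reg.fit(data['logtime'].values.reshape(-1, 1), data['logloc'].values.reshape(-1, 1))
    (reg.coef_, reg.intercept_)
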
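A straight line in log-log space corresponds to a power law in the original space: if log10(y) = a * log10(t) + b, then y = 10^b * t^a. The sketch below illustrates this on synthetic, noise-free data (made up for this illustration, not the dataset used above), and swaps in `np.polyfit` for `LinearRegression` to keep it short.

```python
import numpy as np

# Synthetic data following y = 3 * t**2 exactly (illustrative only)
t = np.arange(1, 11, dtype=float)
y = 3.0 * t ** 2

# Fit log10(y) = slope * log10(t) + intercept
slope, intercept = np.polyfit(np.log10(t), np.log10(y), deg=1)

exponent = slope          # the power-law exponent
scale = 10 ** intercept   # the multiplicative constant

print(round(exponent, 6), round(scale, 6))
```

The slope recovers the exponent and 10 raised to the intercept recovers the constant, which is how the coefficients returned by the regression above can be read.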
4.2 PolynomialFeatures in sklearn

    import numpy as np

    X = np.arange(6).reshape(3, 2)
    X

    from sklearn.preprocessing import PolynomialFeatures

    poly = PolynomialFeatures(2)    # generate all terms up to degree 2
    poly.fit_transform(X)

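For two input columns a and b, a degree-2 expansion produces the columns 1, a, b, a², ab, b². The following sketch builds those columns by hand with NumPy for the same 3x2 input, as a way to check what the transformer returns.

```python
import numpy as np

X = np.arange(6).reshape(3, 2)
a, b = X[:, 0], X[:, 1]

# Bias term, the two linear terms, then the three degree-2 terms
expanded = np.column_stack([np.ones_like(a), a, b, a ** 2, a * b, b ** 2])
print(expanded)
# [[ 1  0  1  0  0  1]
#  [ 1  2  3  4  6  9]
#  [ 1  4  5 16 20 25]]
```

The number of generated columns grows quickly with the degree and the number of input features, so high degrees are usually paired with regularization, as in the Ridge pipeline that follows.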
Comprehensive example

    %matplotlib inline
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("/home/aistudio/data/data20514/xsin.csv")
    colors = ['teal', 'yellowgreen', 'gold']
    plt.scatter(df['x'], df['y'], color='navy', s=30, marker='o', label="training points")

    for count, degree in enumerate([3, 4, 5]):
        # Polynomial expansion followed by ridge regression
        model = make_pipeline(PolynomialFeatures(degree), Ridge())
        model.fit(df[['x']], df[['y']])
        y_pre = model.predict(df[['x']])
        plt.plot(df['x'], y_pre, color=colors[count], linewidth=2,
                 label="degree %d" % degree)
    plt.legend()

5. Feature discretization

5.1 Unsupervised discretization

Use the pandas cut() function to group values into bins.

    ages2 = pd.DataFrame({'years': [10, 14, 30, 53, 300, 32, 45],
                          'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})
    # Split the range of 'years' into 3 equal-width bins
    klass2 = pd.cut(ages2['years'], 3, labels=['Young', 'Middle', 'Senior'])
    ages2['label'] = klass2
    ages2


    ages2 = pd.DataFrame({'years': [10, 14, 30, 53, 300, 32, 45],
                          'name': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})
    # Use explicit bin edges instead: (9, 30], (30, 50], (50, 300]
    klass2 = pd.cut(ages2['years'], bins=[9, 30, 50, 300], labels=['Young', 'Middle', 'Senior'])
    ages2['label'] = klass2
    ages2

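Implementation with sklearn's KBinsDiscretizer

    from sklearn.preprocessing import KBinsDiscretizer

    # 3 equal-width bins, encoded as ordinal integers 0, 1, 2
    kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
    # Double brackets keep the input two-dimensional (ages2 is the DataFrame defined above)
    trans = kbd.fit_transform(ages2[['years']])
    # The result is a 2D array; take its only column
    ages2['kbd'] = trans[:, 0]
    ages2
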
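Equal-width binning is sensitive to outliers, and the 'years' values used above show this well. The sketch below is a rough pure-Python version of equal-width binning: the edges are evenly spaced between the minimum and the maximum, and each value's bin index is the number of interior edges it has reached. The helper names are hypothetical, and the handling of values that fall exactly on an edge may differ from pandas or sklearn.

```python
def uniform_bin_edges(values, n_bins):
    """Evenly spaced edges from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def assign_bins(values, edges):
    """Bin index = number of interior edges that the value has reached."""
    inner = edges[1:-1]
    return [sum(v >= e for e in inner) for v in values]


years = [10, 14, 30, 53, 300, 32, 45]
edges = uniform_bin_edges(years, 3)
labels = assign_bins(years, edges)

print([round(e, 2) for e in edges])  # [10.0, 106.67, 203.33, 300.0]
print(labels)                        # [0, 0, 0, 0, 2, 0, 0]
```

The single value 300 stretches the range so much that every other value lands in the first bin and the middle bin stays empty. Explicit bin edges, or a quantile-based strategy, avoid this.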
About KBinsDiscretizer
https://scikit-learn.org.cn/view/722.html
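Official sklearn example

This example compares the predictions of linear regression (a linear model) and a decision tree (a tree-based model), with and without discretization of real-valued features.

As the result before discretization shows, a linear model is fast to build and relatively easy to interpret, but it can only model linear relationships, whereas a decision tree can build a much more complex model of the data. One way to make linear models more powerful on continuous data is discretization (also known as binning). In the example, we discretize the feature and one-hot encode the transformed data. Note that if the bins are not reasonably wide, the risk of overfitting appears to increase substantially, so the discretizer parameters should usually be tuned under cross-validation.

After discretization, linear regression and the decision tree make exactly the same predictions. Because the features are constant within each bin, any model must predict the same value for all points in a bin. Compared with the result before discretization, the linear model becomes much more flexible, while the decision tree becomes much less flexible. Note that binning features generally has no beneficial effect on tree-based models, since these models can learn to split the data anywhere.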

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.tree import DecisionTreeRegressor

    print(__doc__)

    # Build the dataset
    rnd = np.random.RandomState(42)
    X = rnd.uniform(-3, 3, size=100)
    y = np.sin(X) + rnd.normal(size=len(X)) / 3
    X = X.reshape(-1, 1)

    # Transform the dataset with KBinsDiscretizer
    enc = KBinsDiscretizer(n_bins=10, encode='onehot')
    X_binned = enc.fit_transform(X)

    # Predict with the original dataset
    fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
    line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
    reg = LinearRegression().fit(X, y)
    ax1.plot(line, reg.predict(line), linewidth=2, color='green',
             label="linear regression")
    reg = DecisionTreeRegressor(min_samples_split=3, random_state=0).fit(X, y)
    ax1.plot(line, reg.predict(line), linewidth=2, color='red',
             label="decision tree")
    ax1.plot(X[:, 0], y, 'o', c='k')
    ax1.legend(loc="best")
    ax1.set_ylabel("Regression output")
    ax1.set_xlabel("Input feature")
    ax1.set_title("Result before discretization")

    # Predict with the transformed dataset
    line_binned = enc.transform(line)
    reg = LinearRegression().fit(X_binned, y)
    ax2.plot(line, reg.predict(line_binned), linewidth=2, color='green',
             linestyle='-', label='linear regression')
    reg = DecisionTreeRegressor(min_samples_split=3, random_state=0).fit(X_binned, y)
    ax2.plot(line, reg.predict(line_binned), linewidth=2, color='red',
             linestyle=':', label='decision tree')
    ax2.plot(X[:, 0], y, 'o', c='k')
    ax2.vlines(enc.bin_edges_[0], *plt.gca().get_ylim(), linewidth=1, alpha=.2)
    ax2.legend(loc="best")
    ax2.set_xlabel("Input feature")
    ax2.set_title("Result after discretization")

    plt.tight_layout()
    plt.show()

![[KBinsDiscretizer.png]]
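5.2 Supervised discretization

Using the entropy_based_binning package

    import numpy as np
    import entropy_based_binning as ebb

    A = np.array([[1, 1, 2, 3, 3], [1, 1, 0, 1, 0]])
    # Bin each row (axis=1) into 2 bins
    ebb.bin_array(A, nbins=2, axis=1)
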
Usage

    Docstring:
    Find and apply the maximum entropy binning to an integer array,
    given the number of target bins.
    Convenience wrapper around bin_sequence().

    Arguments:
    ----------
    A: (N, M) ndarray
        input array; must be integer
    nbins: int
        number of bins
    axis: None or int (default None)
        axis along which to bin; if None, the optimal binning is
        chosen based on all values in the array;

    Returns:
    --------
    B: (N, M) ndarray
        binned array

About MDLP
https://github.com/hlin117/mdlp-discretization
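This is an implementation of Usama Fayyad's entropy-based expert binning method.

Example

    from mdlp.discretization import MDLP
    from sklearn.datasets import load_iris

    transformer = MDLP()
    iris = load_iris()
    X, y = iris.data, iris.target
    # Supervised: the class labels y guide where the cut points are placed
    X_disc = transformer.fit_transform(X, y)
    X_disc
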
6. Data normalization

6.1 StandardScaler in sklearn

    from sklearn import datasets
    from sklearn.preprocessing import StandardScaler

    iris = datasets.load_iris()
    # Rescale every feature to zero mean and unit variance
    iris_std = StandardScaler().fit_transform(iris.data)

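Standardization computes z = (x - mean) / std for each column, where std is the population standard deviation (divide by n, not n - 1). A minimal pure-Python sketch on a made-up column follows; the helper name `standardize` is hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """z-score each value using the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, population std 2
z = standardize(column)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```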
6.2 MinMaxScaler in sklearn

    from sklearn.preprocessing import MinMaxScaler

    # Rescale every feature to the range [0, 1]
    iris_mm = MinMaxScaler().fit_transform(iris.data)

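With the default range of [0, 1], min-max scaling maps each value to (x - min) / (max - min). A small pure-Python sketch on made-up values; the helper name is hypothetical, and it assumes the column is not constant, since max equal to min would divide by zero.

```python
def min_max_scale(values):
    """Map values linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10.0, 20.0, 25.0, 50.0])
print(scaled)  # [0.0, 0.25, 0.375, 1.0]
```

Because the minimum and maximum define the scale, a single extreme value compresses all the other values into a narrow band.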
6.3 RobustScaler in sklearn

    from sklearn.preprocessing import RobustScaler

    # Center on the median and scale by the interquartile range
    iris_rs = RobustScaler().fit_transform(iris.data)

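With its default settings, robust scaling subtracts the median and divides by the interquartile range (the 75th percentile minus the 25th), so a few extreme values have little influence on the scale. A short NumPy sketch on made-up values with one outlier:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100 is an outlier

median = np.median(x)                        # 3.0
q1, q3 = np.percentile(x, [25, 75])          # 2.0 and 4.0
robust = (x - median) / (q3 - q1)

print(robust)  # [-1.  -0.5  0.   0.5 48.5]
```

The four ordinary values stay evenly spread around 0, while the outlier remains clearly visible instead of squeezing everything else together.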
6.4 Examples

Example 1

    # np.random.normal(mean, std, count): first argument is the mean,
    # second is the standard deviation, third is the number of samples.
    # np.concatenate() joins arrays together.

    # Build the data: a main cluster plus a small group of outliers per column
    import numpy as np
    import pandas as pd

    X = pd.DataFrame({
        'x1': np.concatenate([np.random.normal(20, 1, 1000), np.random.normal(1, 1, 25)]),
        'x2': np.concatenate([np.random.normal(30, 1, 1000), np.random.normal(50, 1, 25)]),
    })
    X.sample(10)

    # Create the RobustScaler and MinMaxScaler models
    from sklearn.preprocessing import RobustScaler, MinMaxScaler

    robust = RobustScaler()
    robust_scaled = robust.fit_transform(X)
    robust_scaled = pd.DataFrame(robust_scaled, columns=['x1', 'x2'])

    minmax = MinMaxScaler()
    minmax_scaled = minmax.fit_transform(X)
    minmax_scaled = pd.DataFrame(minmax_scaled, columns=['x1', 'x2'])

    # Plot
    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(9, 5))
    ax1.set_title('Before Scaling')
    sns.kdeplot(X['x1'], ax=ax1)
    sns.kdeplot(X['x2'], ax=ax1)

    ax2.set_title('After Robust Scaling')
    sns.kdeplot(robust_scaled['x1'], ax=ax2)
    sns.kdeplot(robust_scaled['x2'], ax=ax2)

    ax3.set_title('After Min-Max Scaling')
    sns.kdeplot(minmax_scaled['x1'], ax=ax3)
    sns.kdeplot(minmax_scaled['x2'], ax=ax3)

Example 2

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import Normalizer, MinMaxScaler
    from mpl_toolkits.mplot3d import Axes3D

    df = pd.DataFrame({
        'x1': np.random.randint(-100, 100, 1000).astype(float),
        'y1': np.random.randint(-80, 80, 1000).astype(float),
        'z1': np.random.randint(-150, 150, 1000).astype(float),
    })

    # Effect of Normalizer()
    scaler = Normalizer()
    scaled_df = scaler.fit_transform(df)
    scaled_df = pd.DataFrame(scaled_df, columns=df.columns)

    fig = plt.figure(figsize=(9, 5))
    ax1 = fig.add_subplot(121, projection='3d')
    ax2 = fig.add_subplot(122, projection='3d')
    ax1.scatter(df['x1'], df['y1'], df['z1'])
    ax2.scatter(scaled_df['x1'], scaled_df['y1'], scaled_df['z1'])

    # Effect of MinMaxScaler()
    scaler = MinMaxScaler()
    scaled_df = scaler.fit_transform(df)
    scaled_df = pd.DataFrame(scaled_df, columns=df.columns)

    fig = plt.figure(figsize=(9, 5))
    ax1 = fig.add_subplot(121, projection='3d')
    ax2 = fig.add_subplot(122, projection='3d')
    ax1.scatter(df['x1'], df['y1'], df['z1'])
    ax2.scatter(scaled_df['x1'], scaled_df['y1'], scaled_df['z1'])

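Unlike the column-wise scalers, Normalizer works row by row: with its default L2 norm, each sample is divided by its own Euclidean length, so every row ends up on the unit sphere. That is why the normalized 3D scatter plot forms a sphere. A pure-Python sketch on two made-up rows follows; the helper name is hypothetical, and it assumes no row is all zeros.

```python
import math


def l2_normalize_row(row):
    """Divide a row by its Euclidean length so that its L2 norm becomes 1."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


rows = [[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]]    # lengths 5 and 3
unit_rows = [l2_normalize_row(r) for r in rows]

for r in unit_rows:
    print(r, math.sqrt(sum(v * v for v in r)))
```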