Contents
1. What problems does pandas solve?
Getting to know the DataFrame through an example
About columns
2. Reading and writing tabular data
Reading data
Writing data
3. Selecting a subset of a table
4. Creating plots in pandas
A single column, plotted with plt
Multiple columns, object-oriented style
5. Creating new columns and renaming columns
Renaming columns
6. Calculating summary statistics
Aggregating statistics
Aggregating statistics grouped by category
Counting records by category
7. Sorting
Sorting table rows by the values in a column
8. Combining data from multiple tables
Concatenating tables along rows or columns
Joining two tables with merge
1. What problems does pandas solve?
What kind of data does pandas handle?
When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data.
In pandas, a data table is called a DataFrame.
Getting to know the DataFrame through an example
import pandas as pd

df = pd.DataFrame(
    {
        "name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "age": [22, 35, 58],
        "sex": ["male", "female", "male"],
    }
)
print(df)
print(df.describe())  # summary statistics, numeric columns only
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns.
About columns
Each column in a DataFrame is a Series.
import pandas as pd

df = pd.DataFrame(
    {
        "name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "age": [22, 35, 58],
        "sex": ["male", "female", "male"],
    }
)
print(df["age"])  # selecting one column returns a Series
A Series on its own
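A Series does not have to come from a DataFrame. As a minimal sketch, one can be built directly from a list; the values and the name "age" below are only illustrative, reused from the example table.

```python
import pandas as pd

# Build a Series directly; "age" is just a label for the values.
ages = pd.Series([22, 35, 58], name="age")

print(ages)
print(ages.max())  # largest value in the Series
```

Like a DataFrame column, a standalone Series carries an index (0, 1, 2 by default) alongside its values.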
2. Reading and writing tabular data
Reading data
import pandas as pd

ti_data = pd.read_excel("titanic.xlsx")  # read the Excel file
print(ti_data)
print(ti_data.dtypes)   # data type of each column
print(ti_data.head(3))  # first three rows only
print(ti_data.tail(2))  # last two rows
Writing data
import pandas as pd

df = pd.DataFrame(
    {
        "name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "age": [22, 35, 58],
        "sex": ["male", "female", "male"],
    }
)
# Write the table to an Excel file (the row index is written as the first column)
df.to_excel("df.xlsx")
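The write-then-read round trip can also be tried without an Excel engine installed. The sketch below swaps in to_csv and read_csv for the Excel calls and writes to an in-memory buffer rather than a file; the short names are placeholders. It also passes index=False, which leaves the row index out of the output.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "age": [22, 35, 58]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False: do not write the row index

# Rewind the buffer and read the table back.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Without index=False the row index is saved as an extra column, which then shows up as an unnamed column when the file is read back.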
3. Selecting a subset of a table
The original table
import pandas as pd

t_data = pd.read_excel("df.xlsx")
print(t_data)

age = t_data[["age"]]  # select specific columns
print(age)

age30 = t_data[t_data["age"] > 30]  # keep only the rows where age > 30
print(age30)

# Select rows and columns together
print("by label")
age_over_30 = t_data.loc[t_data["age"] > 30, "age"]
print(age_over_30)

print("by position")
row_col = t_data.iloc[1:2, 1:3]
print(row_col)
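To try the same selections without df.xlsx on disk, here is a small self-contained sketch. The table is built in memory with shortened names, so the column positions differ from the file-based example above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Braund", "Allen", "Bonnell"],
        "age": [22, 35, 58],
        "sex": ["male", "female", "male"],
    }
)

# loc: rows chosen by a boolean condition, columns chosen by label.
older = df.loc[df["age"] > 30, ["name", "age"]]

# iloc: rows and columns chosen by integer position;
# the end of each slice is excluded, so this is rows 0-1 and columns 1-2.
block = df.iloc[0:2, 1:3]

print(older)
print(block)
```

loc works with labels and conditions, while iloc only looks at positions, so iloc results change if the column order changes.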
4. Creating plots in pandas
A single column, plotted with plt
import pandas as pd
import matplotlib.pyplot as plt

t_data = pd.read_excel("df.xlsx")
t_data["ID"] = [1, 2, 3]  # add an ID column with the values 1, 2, 3
print(t_data)

ax = t_data["age"].plot()  # Series.plot() returns a matplotlib Axes
ax.set_title("age")
plt.show()
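Multiple columns, object-oriented style
import pandas as pd
import matplotlib.pyplot as plt

t_data = pd.read_excel("df.xlsx")
t_data["ID"] = [1, 2, 3]  # add an ID column with the values 1, 2, 3
print(t_data)

fig, ax = plt.subplots(figsize=(12, 4))
# Plot only the two columns named in the title; plotting the whole table
# would also draw the index column that was saved to df.xlsx.
t_data[["age", "ID"]].plot(ax=ax)
ax.set_title("age and ID")
plt.show()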
5. Creating new columns and renaming columns
import pandas as pd

t_data = pd.read_excel("df.xlsx")
print(t_data)

print("table after the change")
t_data["ID"] = [1, 2, 3]  # add an ID column with the values 1, 2, 3
t_data["age's cubic"] = t_data["age"] ** 3  # cube of age, computed element-wise
print(t_data)
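New columns can also be derived from more than one existing column. A minimal sketch, using an in-memory table; the column names age_plus_id and age_months are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 35, 58], "ID": [1, 2, 3]})

# Element-wise arithmetic between two columns; no explicit loop is needed.
df["age_plus_id"] = df["age"] + df["ID"]

# assign() returns a new DataFrame and leaves df itself unchanged.
with_months = df.assign(age_months=df["age"] * 12)

print(df)
print(with_months)
```

Assigning with df["..."] = ... modifies the table in place, whereas assign() is convenient when the original should stay untouched.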
Renaming columns
import pandas as pd

t_data = pd.read_excel("df.xlsx")
print(t_data)

print("table after the change")
# Rename a column: map the old name to the new one
t_data = t_data.rename(columns={"age": "年龄"})
print(t_data)
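rename accepts either a mapping of old to new names or a function that is applied to every column name. A short sketch of both forms; the new names full_name and age_years are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["A"], "age": [22]})

# Mapping form: only the listed columns are renamed.
renamed = df.rename(columns={"name": "full_name", "age": "age_years"})

# Function form: the function is applied to each column name.
upper = df.rename(columns=str.upper)

print(renamed.columns.tolist())
print(upper.columns.tolist())
```

In both cases rename returns a new DataFrame, which is why the result has to be assigned back to a variable.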
6. Calculating summary statistics
Aggregating statistics
import pandas as pd

t_data = pd.read_excel("df.xlsx")
print(t_data)
print("mean of age: ", t_data["age"].mean())
print(t_data.describe())
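When describe() reports more than is needed, agg() lets you pick specific statistics per column. A minimal sketch on an in-memory table:

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 35, 58], "ID": [1, 2, 3]})

# Choose which statistics to compute for each column.
stats = df.agg({"age": ["min", "max", "mean"], "ID": ["min", "max"]})
print(stats)

# A single statistic on a single column returns a plain number.
print(df["age"].median())
```

Statistics that were not requested for a column (here the mean of ID) appear as NaN in the result.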
Aggregating statistics grouped by category
import pandas as pd

t_data = pd.read_excel("df.xlsx")
t_data["ID"] = [1, 2, 3]  # add an ID column with the values 1, 2, 3
t_data["age's cubic"] = t_data["age"] ** 3
print(t_data)

# Group by name, then take the mean of the numeric columns
# group = t_data.groupby("name").mean()
group = t_data[["age", "ID", "name"]].groupby("name").mean()
print(group)
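Grouping by name puts every row in its own group, because each name is unique. Grouping by a column with repeated values shows the effect more clearly; the sketch below uses the sex and age values from the example table.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [22, 35, 58],
        "sex": ["male", "female", "male"],
    }
)

# Split the rows by sex, then average the age within each group.
mean_age = df.groupby("sex")["age"].mean()
print(mean_age)
```

Selecting the age column after groupby keeps the calculation to the one column of interest.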
Counting records by category
import pandas as pd

t_data = pd.read_excel("df.xlsx")
t_data["ID"] = [1, 2, 3]  # add an ID column with the values 1, 2, 3
t_data["age's cubic"] = t_data["age"] ** 3
print(t_data)

print("first way")
print(t_data["age"].value_counts())

print("second way")
print(t_data.groupby("age")["age"].count())  # the groupby approach from the previous section
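Counting is more informative on a column that has repeated values. A short sketch counting records per value of sex, again with the values from the example table:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female", "male"]})

# value_counts() sorts the result by count, largest first.
counts = df["sex"].value_counts()
print(counts)

# The groupby form gives the same numbers, ordered by group label instead.
by_group = df.groupby("sex")["sex"].count()
print(by_group)
```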
7. Sorting
Sorting table rows by the values in a column
import pandas as pd

t_data = pd.read_excel("df.xlsx")
print(t_data)

# Sort by age, ascending
print(t_data.sort_values(by="age"))

# Sort by age, descending
print(t_data.sort_values(by="age", ascending=False))
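sort_values also accepts a list of columns, with a separate direction for each. A minimal sketch; the four-row table is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "sex": ["male", "female", "male", "female"],
        "age": [22, 35, 58, 19],
    }
)

# Sort by sex ascending first, then by age descending within each sex.
ordered = df.sort_values(by=["sex", "age"], ascending=[True, False])
print(ordered)
```

The original row labels are kept after sorting; reset_index(drop=True) can be chained on if a fresh 0, 1, 2, ... index is wanted.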
8. Combining data from multiple tables
Concatenating tables along rows or columns
import pandas as pd

data1 = pd.DataFrame(
    {
        "name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ]
    }
)
print(data1)

data2 = pd.DataFrame({"age": [22, 35, 58], "sex": ["male", "female", "male"]})
print(data2)

# Concatenate: axis=1 places the tables side by side (adds columns);
# axis=0 would stack them on top of each other (adds rows)
data = pd.concat([data1, data2], axis=1)
print(data)
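The two axis values are easy to mix up, so here is a small sketch showing both directions on in-memory tables with placeholder names.

```python
import pandas as pd

top = pd.DataFrame({"name": ["A", "B"], "age": [22, 35]})
bottom = pd.DataFrame({"name": ["C"], "age": [58]})

# axis=0: stack the tables vertically, so rows are added.
# ignore_index=True renumbers the rows 0, 1, 2 in the result.
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1: place the tables side by side, so columns are added.
side_by_side = pd.concat([top[["name"]], top[["age"]]], axis=1)

print(stacked)
print(side_by_side)
```

With axis=1 the rows are matched by index label, so both tables should share the same index.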
Joining two tables with merge
import pandas as pd

data1 = pd.DataFrame(
    {
        "name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "age": [22, 35, 58],
    }
)
print(data1)

data2 = pd.DataFrame({"age": [22, 35, 58], "sex": ["male", "female", "male"]})
print(data2)

# Merge on the shared age column, keeping every row of data1 (left join)
data = pd.merge(data1, data2, how="left", on="age")
print(data)
merge can also join on several columns at once, by passing a list of column names to on.
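A minimal sketch of a merge on more than one key column. The tables and the city column are made up for illustration; the point is that a row matches only when both name and age agree.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "name": ["A", "A", "B"],
        "age": [22, 35, 22],
        "sex": ["male", "female", "male"],
    }
)
right = pd.DataFrame(
    {
        "name": ["A", "A", "B"],
        "age": [22, 35, 22],
        "city": ["X", "Y", "Z"],  # hypothetical extra column
    }
)

# Pass a list to `on` to match rows on the combination of both columns.
merged = pd.merge(left, right, how="left", on=["name", "age"])
print(merged)
```

Merging on age alone would pair the two rows with age 22 against each other in every combination, which is why the second key is needed here.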