欢迎您访问365答案网,请分享给你的朋友!
生活常识 学习资料

数据分析实际案例之:pandas在泰坦尼特号乘客数据中的使用

时间:2023-04-18
文章目录

简介泰坦尼特号乘客数据使用pandas对数据进行分析

引入依赖包读取和分析数据 图形化表示和矩阵转换 简介

1912年4月15日,号称永不沉没的泰坦尼克号因为和冰山相撞沉没了。因为没有足够的救援设备,2224个乘客中有1502个乘客不幸遇难。事故已经发生了,但是我们可以从泰坦尼克号中的历史数据中发现一些数据规律吗?今天本文将会带领大家灵活的使用pandas来进行数据分析。

泰坦尼特号乘客数据

我们从kaggle官网中下载了部分泰坦尼特号的乘客数据,主要包含下面几个字段:

变量名含义取值survival是否生还0 = No, 1 = Yespclass船票的级别1 = 1st, 2 = 2nd, 3 = 3rdsex性别Age年龄sibsp配偶信息parch父母或者子女信息ticket船票编码fare船费cabin客舱编号embarked登录的港口C = Cherbourg, Q = Queenstown, S = Southampton

下载下来的文件是一个csv文件。接下来我们来看一下怎么使用pandas来对其进行数据分析。

使用pandas对数据进行分析 引入依赖包

本文主要使用pandas和matplotlib,所以需要首先进行下面的通用设置:

from numpy.random import randnimport numpy as npnp.random.seed(123)import osimport matplotlib.pyplot as pltimport pandas as pdplt.rc('figure', figsize=(10, 6))np.set_printoptions(precision=4)pd.options.display.max_rows = 20

读取和分析数据

pandas提供了一个read_csv方法可以很方便的读取一个csv数据,并将其转换为Dataframe:

path = '../data/titanic.csv'df = pd.read_csv(path)df

我们看下读入的数据:

PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked08923Kelly, Mr、Jamesmale34.5003309117.8292NaNQ18933Wilkes, Mrs、James (Ellen Needs)female47.0103632727.0000NaNS28942Myles, Mr、Thomas Francismale62.0002402769.6875NaNQ38953Wirz, Mr、Albertmale27.0003151548.6625NaNS48963Hirvonen, Mrs、Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS58973Svensson, Mr、Johan Cervinmale14.00075389.2250NaNS68983Connolly, Miss、Katefemale30.0003309727.6292NaNQ78992Caldwell, Mr、Albert Francismale26.01124873829.0000NaNS89003Abrahim, Mrs、Joseph (Sophie Halaut Easu)female18.00026577.2292NaNC99013Davies, Mr、John Samuelmale21.020A/4 4887124.1500NaNS………………………………40813003Riordan, Miss、Johanna Hannah""femaleNaN003349157.7208NaNQ40913013Peacock, Miss、Treasteallfemale3.011SOTON/O.Q、310131513.7750NaNS41013023Naughton, Miss、HannahfemaleNaN003652377.7500NaNQ41113031Minahan, Mrs、William Edward (Lillian E Thorpe)female37.0101992890.0000C78Q41213043Henriksson, Miss、Jenny Lovisafemale28.0003470867.7750NaNS41313053Spector, Mr、WoolfmaleNaN00A.5、32368.0500NaNS41413061Oliva y Ocana, Dona、Ferminafemale39.000PC 17758108.9000C105C41513073Saether, Mr、Simon Sivertsenmale38.500SOTON/O.Q、31012627.2500NaNS41613083Ware, Mr、FrederickmaleNaN003593098.0500NaNS41713093Peter, Master、Michael JmaleNaN11266822.3583NaNC

418 rows × 11 columns

调用df的describe方法可以查看基本的统计信息:

PassengerIdPclassAgeSibSpParchFarecount418.000000418.000000332.000000418.000000418.000000417.000000mean1100.5000002.26555030.2725900.4473680.39234435.627188std120.8104580.84183814.1812090.8967600.98142955.907576min892.0000001.0000000.1700000.0000000.0000000.00000025%996.2500001.00000021.0000000.0000000.0000007.89580050%1100.5000003.00000027.0000000.0000000.00000014.45420075%1204.7500003.00000039.0000001.0000000.00000031.500000max1309.0000003.00000076.0000008.0000009.000000512.329200

如果要想查看乘客登录的港口,可以这样选择:

df['Embarked'][:10]

0 Q1 S2 Q3 S4 S5 S6 Q7 S8 C9 SName: Embarked, dtype: object

使用value_counts 可以对其进行统计:

embark_counts=df['Embarked'].value_counts()embark_counts[:10]

S 270C 102Q 46Name: Embarked, dtype: int64

从结果可以看出,从S港口登录的乘客有270个,从C港口登录的乘客有102个,从Q港口登录的乘客有46个。

同样的,我们可以统计一下age信息:

age_counts=df['Age'].value_counts()age_counts.head(10)

前10位的年龄如下:

24.0 1721.0 1722.0 1630.0 1518.0 1327.0 1226.0 1225.0 1123.0 1129.0 10Name: Age, dtype: int64

计算一下年龄的平均数:

df['Age'].mean()

30.272590361445783

实际上有些数据是没有年龄的,我们可以使用平均数对其填充:

clean_age1 = df['Age'].fillna(df['Age'].mean())clean_age1.value_counts()

可以看出平均数是30.27,个数是86。

30.27259 8624.00000 1721.00000 1722.00000 1630.00000 1518.00000 1326.00000 1227.00000 1225.00000 1123.00000 11 ..36.50000 140.50000 111.50000 134.00000 115.00000 17.00000 160.50000 126.50000 176.00000 134.50000 1Name: Age, Length: 80, dtype: int64

使用平均数来作为年龄可能不是一个好主意,还有一种办法就是丢弃平均数:

clean_age2=df['Age'].dropna()clean_age2age_counts = clean_age2.value_counts()ageset=age_counts.head(10)ageset

24.0 1721.0 1722.0 1630.0 1518.0 1327.0 1226.0 1225.0 1123.0 1129.0 10Name: Age, dtype: int64

图形化表示和矩阵转换

图形化对于数据分析非常有帮助,我们对于上面得出的前10名的age使用柱状图来表示:

import seaborn as snssns.barplot(x=ageset.index, y=ageset.values)

接下来我们来做一个复杂的矩阵变换,我们先来过滤掉age和sex都为空的数据:

cframe=df[df.Age.notnull() & df.Sex.notnull()]cframe

PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked08923Kelly, Mr、Jamesmale34.5003309117.8292NaNQ18933Wilkes, Mrs、James (Ellen Needs)female47.0103632727.0000NaNS28942Myles, Mr、Thomas Francismale62.0002402769.6875NaNQ38953Wirz, Mr、Albertmale27.0003151548.6625NaNS48963Hirvonen, Mrs、Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS58973Svensson, Mr、Johan Cervinmale14.00075389.2250NaNS68983Connolly, Miss、Katefemale30.0003309727.6292NaNQ78992Caldwell, Mr、Albert Francismale26.01124873829.0000NaNS89003Abrahim, Mrs、Joseph (Sophie Halaut Easu)female18.00026577.2292NaNC99013Davies, Mr、John Samuelmale21.020A/4 4887124.1500NaNS………………………………40312951Carrau, Mr、Jose Pedromale17.00011305947.1000NaNS40412961Frauenthal, Mr、Isaac Geraldmale43.0101776527.7208D40C40512972Nourney, Mr、Alfred (Baron von Drachstedt")"male20.000SC/PARIS 216613.8625D38C40612982Ware, Mr、William Jefferymale23.0102866610.5000NaNS40712991Widener, Mr、George Duntonmale50.011113503211.5000C80C40913013Peacock, Miss、Treasteallfemale3.011SOTON/O.Q、310131513.7750NaNS41113031Minahan, Mrs、William Edward (Lillian E Thorpe)female37.0101992890.0000C78Q41213043Henriksson, Miss、Jenny Lovisafemale28.0003470867.7750NaNS41413061Oliva y Ocana, Dona、Ferminafemale39.000PC 17758108.9000C105C41513073Saether, Mr、Simon Sivertsenmale38.500SOTON/O.Q、31012627.2500NaNS

332 rows × 11 columns

接下来使用groupby对age和sex进行分组:

by_sex_age = cframe.groupby(['Age', 'Sex'])by_sex_age.size()

Age Sex 0.17 female 10.33 male 10.75 male 10.83 male 10.92 female 11.00 female 32.00 female 1 male 13.00 female 15.00 male 1 ..60.00 female 360.50 male 161.00 male 262.00 male 163.00 female 1 male 164.00 female 2 male 167.00 male 176.00 female 1Length: 115, dtype: int64

使用unstack将Sex的列数据变成行:

SexfemalemaleAge0.171.00.00.330.01.00.750.01.00.830.01.00.921.00.01.003.00.02.001.01.03.001.00.05.000.01.06.000.03.0………58.001.00.059.001.00.060.003.00.060.500.01.061.000.02.062.000.01.063.001.01.064.002.01.067.000.01.076.001.00.0

79 rows × 2 columns

我们把同样age的人数加起来,然后使用argsort进行排序,得到排序过后的index:

indexer = agg_counts.sum(1).argsort()indexer.tail(10)

Age58.0 3759.0 3160.0 2960.5 3261.0 3462.0 2263.0 3864.0 2767.0 2676.0 30dtype: int64

从agg_counts中取出最后的10个,也就是最大的10个:

count_subset = agg_counts.take(indexer.tail(10))count_subset=count_subset.tail(10)count_subset

SexfemalemaleAge29.05.05.025.01.010.023.05.06.026.04.08.027.04.08.018.07.06.030.06.09.022.010.06.021.03.014.024.05.012.0

上面的操作可以简化为下面的代码:

agg_counts.sum(1).nlargest(10)

Age21.0 17.024.0 17.022.0 16.030.0 15.018.0 13.026.0 12.027.0 12.023.0 11.025.0 11.029.0 10.0dtype: float64

将count_subset 进行stack操作,方便后面的画图:

stack_subset = count_subset.stack()stack_subset

Age Sex 29.0 female 5.0 male 5.025.0 female 1.0 male 10.023.0 female 5.0 male 6.026.0 female 4.0 male 8.027.0 female 4.0 male 8.018.0 female 7.0 male 6.030.0 female 6.0 male 9.022.0 female 10.0 male 6.021.0 female 3.0 male 14.024.0 female 5.0 male 12.0dtype: float64

stack_subset.name = 'total'stack_subset = stack_subset.reset_index()stack_subset

AgeSextotal029.0female5.0129.0male5.0225.0female1.0325.0male10.0423.0female5.0523.0male6.0626.0female4.0726.0male8.0827.0female4.0927.0male8.01018.0female7.01118.0male6.01230.0female6.01330.0male9.01422.0female10.01522.0male6.01621.0female3.01721.0male14.01824.0female5.01924.0male12.0

作图如下:

sns.barplot(x='total', y='Age', hue='Sex', data=stack_subset)

本文例子可以参考: https://github.com/ddean2009/learn-ai/

本文已收录于 http://www.flydean.com/01-pandas-titanic/

最通俗的解读,最深刻的干货,最简洁的教程,众多你不

欢迎关注我的公众号:「程序那些事」,懂技术,更懂你!

Copyright © 2016-2020 www.365daan.com All Rights Reserved. 365答案网 版权所有 备案号:

部分内容来自互联网,版权归原作者所有,如有冒犯请联系我们,我们将在三个工作时内妥善处理。