final_data is the anonymized dataset.
final_data.head()
data = final_data[['user_id', 'item_id', 'behavior_type', 'time']]
data.head()
data.shape
(12256906, 4)
data['date'] = data['time'].map(lambda x:x.split(' ')[0])
data['hour'] = data['time'].map(lambda x:x.split(' ')[1])
data.head()
C:worksoftwareAnaconda5.3.0libsite-packagesipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
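The warning appears because `data` was created by selecting columns from `final_data`, so pandas cannot tell whether the new column should be written to a view or to a copy. One common way to avoid it is to take an explicit copy before adding columns. The sketch below uses a tiny stand-in for `final_data`; the sample values and the `extra` column are hypothetical, and the `"YYYY-MM-DD HH"` time format is an assumption inferred from the split on a space.

```python
import pandas as pd

# Toy stand-in for final_data (hypothetical values; "extra" is an illustrative column).
final_data = pd.DataFrame({
    "user_id": [1, 2],
    "item_id": [10, 20],
    "behavior_type": [1, 3],
    "extra": ["x", "y"],
    "time": ["2014-12-12 08", "2014-12-12 21"],
})

# .copy() makes `data` an independent DataFrame, so adding columns is unambiguous.
data = final_data[["user_id", "item_id", "behavior_type", "time"]].copy()

# Split "YYYY-MM-DD HH" once, then assign each part to its own column.
parts = data["time"].str.split(" ", expand=True)
data["date"] = parts[0]
data["hour"] = parts[1]

print(data)
```

With the explicit copy, the assignments modify only `data`, and `final_data` is left unchanged.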
data.drop(['time'], axis=1, inplace=True)
data.head()
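map() is a method of Series objects; in the pandas version used here, DataFrame has no map(). It applies a custom function to every element of a Series.
apply() applies a custom function to each row or each column of a DataFrame.
applymap() applies a custom function to every element of a DataFrame (newer pandas releases add DataFrame.map as its element-wise replacement).
Summary: the three differ in what they operate on: the elements of a Series, the rows or columns of a DataFrame, or all elements of a DataFrame.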
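As a rough illustration of the three methods, here is a minimal sketch on a made-up two-column DataFrame (the column names `a` and `b` are hypothetical). Because `applymap` is deprecated in newer pandas in favor of `DataFrame.map`, the sketch picks whichever of the two is available.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.map: the function sees one element of a single column at a time.
doubled = df["a"].map(lambda x: x * 2)

# DataFrame.apply: the function sees a whole column (axis=0) or a whole row (axis=1).
col_sums = df.apply(lambda col: col.sum(), axis=0)
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Element-wise over the whole DataFrame: applymap in older pandas, map in pandas 2.1+.
elementwise = df.map if hasattr(df, "map") else df.applymap
plus_one = elementwise(lambda x: x + 1)

print(doubled.tolist())
print(col_sums.to_dict())
print(row_sums.tolist())
print(plus_one)
```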
data.shape
(12256906, 5)
data.dtypes
user_id           int64
item_id           int64
behavior_type     int64
date             object
hour             object
dtype: object
# Convert the date and hour columns to proper types
data['date'] = pd.to_datetime(data['date'])
data['hour'] = data['hour'].astype('int32')
data.dtypes
user_id                   int64
item_id                   int64
behavior_type             int64
date             datetime64[ns]
hour                      int32
dtype: object
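An alternative to splitting the string and converting each piece is to parse the timestamp once and read the date and hour from the `.dt` accessor. This is only a sketch: it assumes the raw `time` values look like `"2014-12-12 21"` (date, space, two-digit hour), which is inferred from the split above rather than stated in the post, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample in the assumed "YYYY-MM-DD HH" format.
raw = pd.DataFrame({"time": ["2014-11-18 00", "2014-12-12 21"]})

# Parse once with an explicit format, then derive both columns from the timestamp.
ts = pd.to_datetime(raw["time"], format="%Y-%m-%d %H")
raw["date"] = ts.dt.normalize()            # midnight of the same day, datetime dtype
raw["hour"] = ts.dt.hour.astype("int32")   # hour of day as an integer

print(raw.dtypes)
```

Parsing with an explicit `format` also fails loudly on malformed rows instead of silently producing odd strings.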
# Check the number of distinct users
data['user_id'].nunique()
10000
# Check for missing values
data.isnull().sum()
user_id          0
item_id          0
behavior_type    0
date             0
hour             0
dtype: int64
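Traffic metrics analysis
Traffic metrics: quantitative measures recorded for each step a user performs on the site.
The main metrics are page views (PV) and unique visitors (UV).
For each visitor, the following metrics can also be used to gauge visitor quality:
Average time online: the average time each UV spends on the pages visited
Average visit depth: the average number of PVs per UV
Bounce rate: the number of visits that leave after viewing a page / the total number of visits to that page
# Total PV: the number of page views within a given period (here, the whole dataset), i.e. the total number of records
total_pv = data.shape[0]
total_pv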
12256906
# Compute the PV for each day
pv = data.groupby(['date'])['user_id'].count().reset_index()
pv
pv = pv.rename(columns={'user_id':'pv'})
pv.head()
# Compute the UV for each day (number of distinct users per date)
uv = data.groupby(['date'])['user_id'].apply(lambda x:x.drop_duplicates().count())
uv.head()
date
2014-11-18    6343
2014-11-19    6420
2014-11-20    6333
2014-11-21    6276
2014-11-22    6187
Name: user_id, dtype: int64
uv = uv.reset_index().rename(columns={'user_id':'uv'})
uv.head()
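Daily PV and UV can also be computed in a single pass with named aggregation: `count` gives the number of records per day and `nunique` the number of distinct users. The sketch below runs on a small made-up DataFrame rather than the real data; the column names follow the post, the values are hypothetical.

```python
import pandas as pd

# Toy data: 3 records on each of two days (hypothetical values).
data = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime(["2014-12-11"] * 3 + ["2014-12-12"] * 3),
})

# One groupby produces both metrics as named columns.
daily = (
    data.groupby("date")["user_id"]
        .agg(pv="count", uv="nunique")
        .reset_index()
)

print(daily)
```

On large data, `nunique` is generally a more direct choice than `apply` with `drop_duplicates().count()`, and both give the same UV figures when `user_id` has no missing values.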
# Plot daily PV
import matplotlib.pyplot as plt
font = {'family':'SimHei', 'size':'20'}
plt.rc('font', **font)
plt.figure(figsize=(20,5))
plt.xticks(rotation=30)
plt.plot(pv['date'], pv['pv'])
plt.title('Daily PV')
plt.show()
# Plot daily UV
plt.figure(figsize=(20,5))
plt.xticks(rotation=30)
plt.plot(uv['date'], uv['uv'])
plt.title('Daily UV')
# Save the figure
plt.savefig('daily_uv.png')
plt.show()
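Both PV and UV peak on December 12. Fluctuations are larger around Double Twelve (the December 12 shopping festival), while on ordinary days both metrics are relatively stable.
PV and UV for each hour of the day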
data.head()
pv_hour = data.groupby(['hour'])['user_id'].count()
pv_hour.head()
hour
0    517404
1    267682
2    147090
3     98516
4     80487
Name: user_id, dtype: int64
pv_hour = pv_hour.reset_index().rename(columns={'user_id':'pv'})
pv_hour.head()
uv_hour = data.groupby(['hour'])['user_id'].apply(lambda x:x.drop_duplicates().count())
uv_hour = uv_hour.reset_index().rename(columns={'user_id':"uv"})
uv_hour.head()
plt.figure(figsize=(20,5))
plt.plot(uv_hour['hour'],uv_hour['uv'])
plt.xticks(rotation=30)
plt.title('Hourly UV')
# Save the figure
plt.savefig('hourly_uv.png')
plt.show()
plt.figure(figsize=(20,5))
plt.plot(pv_hour['hour'],pv_hour['pv'])
plt.xticks(rotation=30)
plt.title('Hourly PV')
# Save the figure
plt.savefig('hourly_pv.png')
plt.show()
# Average visit depth per UV over the whole period
# Total PV divided by the number of distinct users
round(data.shape[0] / data['user_id'].nunique(), 2)
1225.69
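# Average daily visit depth per UV
# Total PV divided by the number of distinct users, then by the number of days
round(data['user_id'].shape[0] / data['user_id'].nunique() / data['date'].nunique(), 2)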
39.54
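The two depth figures above divide by the total number of distinct users, including users who were inactive on a given day. As a possible variation (not the post's method), one could instead average each day's PV/UV ratio, which measures depth per user who was active that day. The sketch below shows both calculations on made-up data; all values are hypothetical.

```python
import pandas as pd

# Toy data: 6 records, 3 distinct users, 2 days (hypothetical values).
data = pd.DataFrame({
    "user_id": [1, 1, 2, 1, 3, 3],
    "date": pd.to_datetime(["2014-12-11"] * 3 + ["2014-12-12"] * 3),
})

total_pv = len(data)
n_users = data["user_id"].nunique()
n_days = data["date"].nunique()

# Same formulas as in the post.
depth_per_user = round(total_pv / n_users, 2)
depth_per_user_per_day = round(total_pv / n_users / n_days, 2)

# Variation: depth per active user, averaged over days.
daily = data.groupby("date")["user_id"].agg(pv="count", uv="nunique")
depth_per_active_user = (daily["pv"] / daily["uv"]).mean()

print(depth_per_user, depth_per_user_per_day, depth_per_active_user)
```

The per-active-user figure is never lower than the post's daily figure, because each day's UV can be at most the total number of distinct users.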