User Behavior Analysis for an E-commerce Website [Anonymized]

Date: 2023-05-25

User Behavior Analysis

final_data is the anonymized dataset.

final_data.head()

   user_id + item_id (fused)  behavior_type  user_geohash  item_category  time
0  5400719579633535           1              NaN           3940           2014-11-24 16
1  136952642337800294         1              NaN           4830           2014-11-22 12
2  121255158108926788         1              NaN           1970           2014-11-22 08
3  72256073144090786          1              NaN           4008           2014-12-09 20
4  65645933250029185          1              9t4qqgn       2825           2014-11-25 17

data = final_data[['user_id', 'item_id', 'behavior_type', 'time']]
data.head()

   user_id + item_id (fused)  behavior_type  time
0  5400719579633535           1              2014-11-24 16
1  136952642337800294         1              2014-11-22 12
2  121255158108926788         1              2014-11-22 08
3  72256073144090786          1              2014-12-09 20
4  65645933250029185          1              2014-11-25 17

data.shape

(12256906, 4)

data['date'] = data['time'].map(lambda x:x.split(' ')[0])
data['hour'] = data['time'].map(lambda x:x.split(' ')[1])
data.head()

C:worksoftwareAnaconda5.3.0libsite-packagesipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
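The warning appears because data was created by selecting columns from final_data and was then modified. One common way to avoid it is to take an explicit copy before adding columns. The sketch below assumes a tiny hypothetical stand-in for final_data with the same column names; it is not the original dataset.

```python
import pandas as pd

# Hypothetical stand-in for final_data, using the same column names.
final_data = pd.DataFrame({
    "user_id": [1, 2],
    "item_id": [10, 20],
    "behavior_type": [1, 1],
    "time": ["2014-11-24 16", "2014-11-22 08"],
})

# .copy() makes `data` an independent DataFrame rather than a selection
# from final_data, so assigning new columns no longer raises the warning.
data = final_data[["user_id", "item_id", "behavior_type", "time"]].copy()

data["date"] = data["time"].map(lambda x: x.split(" ")[0])
data["hour"] = data["time"].map(lambda x: x.split(" ")[1])
print(data)
```

The copy costs extra memory on a 12-million-row frame, so whether it is worth it depends on the machine; the results of the analysis are the same either way.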

   user_id + item_id (fused)  behavior_type  time           date        hour
0  5400719579633535           1              2014-11-24 16  2014-11-24  16
1  136952642337800294         1              2014-11-22 12  2014-11-22  12
2  121255158108926788         1              2014-11-22 08  2014-11-22  08
3  72256073144090786          1              2014-12-09 20  2014-12-09  20
4  65645933250029185          1              2014-11-25 17  2014-11-25  17

data.drop(['time'], axis=1, inplace=True)
data.head()

   user_id + item_id (fused)  behavior_type  date        hour
0  5400719579633535           1              2014-11-24  16
1  136952642337800294         1              2014-11-22  12
2  121255158108926788         1              2014-11-22  08
3  72256073144090786          1              2014-12-09  20
4  65645933250029185          1              2014-11-25  17

Differences between map(), apply(), and applymap() in pandas

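map() is a Series method; in the pandas version used here, DataFrame has no map() (pandas 2.1 later added DataFrame.map() and deprecated applymap() in its favor). It applies a custom function to each element of a Series.
apply() applies a custom function to each row or each column of a DataFrame.
applymap() applies a custom function to every element of a DataFrame.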

Summary: the three differ in what they operate on.
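A small illustration of the three may help. The DataFrame df and its columns a and b are made up for this example. Because applymap() is deprecated in newer pandas, the sketch uses DataFrame.map() when it exists and falls back to applymap() otherwise.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.map: the function is called once per element of one column.
doubled = df["a"].map(lambda x: x * 2)

# DataFrame.apply: the function receives a whole column (axis=0)
# or a whole row (axis=1) as a Series.
col_sums = df.apply(lambda col: col.sum(), axis=0)
row_sums = df.apply(lambda row: row.sum(), axis=1)

# Element-wise over the whole DataFrame: applymap() in older pandas,
# DataFrame.map() from pandas 2.1 on. Use whichever is available.
if hasattr(pd.DataFrame, "map"):
    plus_one = df.map(lambda x: x + 1)
else:
    plus_one = df.applymap(lambda x: x + 1)

print(doubled.tolist(), col_sums.to_dict(), row_sums.tolist())
print(plus_one)
```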

data.shape

(12256906, 5)

data.dtypes

user_id           int64
item_id           int64
behavior_type     int64
date             object
hour             object
dtype: object

# Convert the date and hour columns to proper types
data['date'] = pd.to_datetime(data['date'])
data['hour'] = data['hour'].astype('int32')
data.dtypes

user_id                   int64
item_id                   int64
behavior_type             int64
date             datetime64[ns]
hour                      int32
dtype: object
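As an alternative to splitting the string with a lambda and converting afterwards, the time column can be parsed once and both columns derived from the parsed timestamp. This is only a sketch on three sample values in the same "YYYY-MM-DD HH" layout; the intermediate name ts is introduced here for illustration.

```python
import pandas as pd

data = pd.DataFrame({"time": ["2014-11-24 16", "2014-11-22 08", "2014-12-09 20"]})

# Parse "YYYY-MM-DD HH" once, then derive both columns from the timestamp.
ts = pd.to_datetime(data["time"], format="%Y-%m-%d %H")
data["date"] = ts.dt.normalize()            # same day at midnight, datetime dtype
data["hour"] = ts.dt.hour.astype("int32")   # integer hour, 0 to 23

print(data.dtypes)
```

On a frame with millions of rows this vectorized form is usually faster than calling a Python lambda per row, though the difference was not measured on this dataset.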

# Count distinct users
data['user_id'].nunique()

10000

# Check for missing values
data.isnull().sum()

user_id          0
item_id          0
behavior_type    0
date             0
hour             0
dtype: int64

Traffic Metrics Analysis

Traffic metrics: quantitative measures recorded for each step a user takes on the site.

The main metrics are page views (PV) and unique visitors (UV).

For each visitor, the following metrics can also be used to gauge visitor quality:

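Average time online: the average time each UV spends on the pages visited.
Average visit depth: the average number of PVs per UV.
Bounce rate: the number of visits that leave after viewing a given page, divided by the total number of visits to that page.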
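To make two of these definitions concrete, here is a minimal sketch on a made-up page-view log. The columns visit_id and page are hypothetical (the dataset in this article has neither), and a "bounce" is assumed to mean a visit whose only page view was the page in question. Average time online is left out because it needs dwell-time data.

```python
import pandas as pd

# Hypothetical page-view log: one row per page view.
log = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 3, 3],
    "visit_id": ["a", "a", "b", "c", "d", "d"],
    "page":     ["home", "item", "home", "home", "item", "cart"],
})

# Average visit depth: total PV divided by UV.
avg_depth = len(log) / log["user_id"].nunique()

# Bounce rate of one page, assuming a bounce is a visit whose only
# page view was that page.
views_per_visit = log.groupby("visit_id")["page"].transform("count")
page = "home"
on_page = log["page"] == page
visits_to_page = log.loc[on_page, "visit_id"].nunique()
bounces = log.loc[on_page & (views_per_visit == 1), "visit_id"].nunique()
bounce_rate = bounces / visits_to_page

print(avg_depth, round(bounce_rate, 3))
```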

# Total PV: over the period covered by this dataset, PV is simply the total number of records
total_pv = data.shape[0]
total_pv

12256906

# Compute PV for each day
pv = data.groupby(['date'])['user_id'].count().reset_index()
pv

    date        user_id
0   2014-11-18  366701
1   2014-11-19  358823
2   2014-11-20  353429
3   2014-11-21  333104
4   2014-11-22  361355
5   2014-11-23  382702
6   2014-11-24  378342
7   2014-11-25  370239
8   2014-11-26  360896
9   2014-11-27  371384
10  2014-11-28  340638
11  2014-11-29  364697
12  2014-11-30  401620
13  2014-12-01  394611
14  2014-12-02  405216
15  2014-12-03  411606
16  2014-12-04  399952
17  2014-12-05  361878
18  2014-12-06  389610
19  2014-12-07  399751
20  2014-12-08  386667
21  2014-12-09  398025
22  2014-12-10  421910
23  2014-12-11  488508
24  2014-12-12  691712
25  2014-12-13  407160
26  2014-12-14  402541
27  2014-12-15  398356
28  2014-12-16  395085
29  2014-12-17  384791
30  2014-12-18  375597

pv = pv.rename(columns={'user_id':'pv'})
pv.head()

   date        pv
0  2014-11-18  366701
1  2014-11-19  358823
2  2014-11-20  353429
3  2014-11-21  333104
4  2014-11-22  361355

# Compute UV for each day
uv = data.groupby(['date'])['user_id'].apply(lambda x:x.drop_duplicates().count())
uv.head()

date
2014-11-18    6343
2014-11-19    6420
2014-11-20    6333
2014-11-21    6276
2014-11-22    6187
Name: user_id, dtype: int64

uv = uv.reset_index().rename(columns={'user_id':'uv'})
uv.head()

   date        uv
0  2014-11-18  6343
1  2014-11-19  6420
2  2014-11-20  6333
3  2014-11-21  6276
4  2014-11-22  6187
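Daily PV and UV can also be computed in a single groupby using named aggregation, with nunique standing in for the drop_duplicates().count() lambda. The sketch below runs on a small made-up sample with the same column names; daily is a name introduced here.

```python
import pandas as pd

# Small hypothetical sample with the same columns as `data`.
data = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "date": pd.to_datetime(["2014-11-18", "2014-11-18", "2014-11-18",
                            "2014-11-19", "2014-11-19"]),
})

# Named aggregation: PV is the row count, UV the distinct-user count.
daily = (
    data.groupby("date")["user_id"]
        .agg(pv="count", uv="nunique")
        .reset_index()
)
print(daily)
```

The same pattern should carry over to the hourly figures by grouping on hour instead of date.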

# Plot
import matplotlib.pyplot as plt
font = {'family':'SimHei', 'size':'20'}
plt.rc('font', **font)
plt.figure(figsize=(20,5))
plt.xticks(rotation=30)
plt.plot(pv['date'], pv['pv'])
plt.title('Daily PV')
plt.show()

# Daily UV
plt.figure(figsize=(20,5))
plt.xticks(rotation=30)
plt.plot(uv['date'], uv['uv'])
plt.title('Daily UV')
# Save the figure
plt.savefig('日UV.png')
plt.show()
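The peak day can also be read from the numbers rather than the chart. The following sketch uses five rows copied from the daily PV table above (2014-12-10 to 2014-12-14); comparing the peak with the median of those five days is just one possible yardstick, chosen here for illustration.

```python
import pandas as pd

# Five rows copied from the daily PV table above.
pv = pd.DataFrame({
    "date": pd.to_datetime(["2014-12-10", "2014-12-11", "2014-12-12",
                            "2014-12-13", "2014-12-14"]),
    "pv": [421910, 488508, 691712, 407160, 402541],
})

peak = pv.loc[pv["pv"].idxmax()]          # row with the largest PV
ratio = peak["pv"] / pv["pv"].median()    # peak relative to a typical day in this window

print(peak["date"].date(), peak["pv"], round(ratio, 2))
```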

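PV and UV both peak on December 12. Traffic swings sharply around the Double Twelve (12.12) sale and stays comparatively stable on ordinary days.

PV and UV by Hour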

data.head()

   user_id + item_id (fused)  behavior_type  date        hour
0  5400719579633535           1              2014-11-24  16
1  136952642337800294         1              2014-11-22  12
2  121255158108926788         1              2014-11-22  8
3  72256073144090786          1              2014-12-09  20
4  65645933250029185          1              2014-11-25  17

pv_hour = data.groupby(['hour'])['user_id'].count()
pv_hour.head()

hour
0    517404
1    267682
2    147090
3     98516
4     80487
Name: user_id, dtype: int64

pv_hour = pv_hour.reset_index().rename(columns={'user_id':'pv'})
pv_hour.head()

   hour  pv
0  0     517404
1  1     267682
2  2     147090
3  3     98516
4  4     80487

uv_hour = data.groupby(['hour'])['user_id'].apply(lambda x:x.drop_duplicates().count())
uv_hour = uv_hour.reset_index().rename(columns={'user_id':"uv"})
uv_hour.head()

   hour  uv
0  0     5786
1  1     3780
2  2     2532
3  3     1937
4  4     1765

plt.figure(figsize=(20,5))
plt.plot(uv_hour['hour'],uv_hour['uv'])
plt.xticks(rotation=30)
plt.title('Hourly UV')
# Save the figure
plt.savefig('每小时UV.png')
plt.show()

plt.figure(figsize=(20,5))
plt.plot(pv_hour['hour'],pv_hour['pv'])
plt.xticks(rotation=30)
plt.title('Hourly PV')
# Save the figure
plt.savefig('每小时PV.png')
plt.show()

# Average visit depth per UV
# Total PV divided by the number of users
round(data.shape[0] / data['user_id'].nunique(), 2)

1225.69

# Average daily visit depth per UV
round(data['user_id'].shape[0] / data['user_id'].nunique() / data['date'].nunique(), 2)

39.54
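The 39.54 figure divides total PV by all 10000 users and by the 31 days in the data, so users who were inactive on a given day still count in the denominator. A different reading, offered here only as an alternative and not as the original author's metric, divides each day's PV by that day's UV. The sketch reuses the totals and the first five daily PV and UV values shown earlier.

```python
import pandas as pd

# Totals reported above: 12256906 records, 10000 users, 31 distinct dates.
total_pv, total_uv, n_days = 12256906, 10000, 31
depth_all_users = round(total_pv / total_uv / n_days, 2)

# First five days of the daily PV and UV tables above.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2014-11-18", "2014-11-19", "2014-11-20",
                            "2014-11-21", "2014-11-22"]),
    "pv": [366701, 358823, 353429, 333104, 361355],
    "uv": [6343, 6420, 6333, 6276, 6187],
})

# Depth among users who were actually active that day.
daily["depth_active"] = (daily["pv"] / daily["uv"]).round(2)

print(depth_all_users)
print(daily)
```

Because only part of the user base is active on any one day, the per-active-user depth comes out higher than the all-user average on each of these five days.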
