Tencent 2019 Data Science Competition[Ads Exposure Rate] Part 2
Posted onEdited onInProject Symbols count in article: 11kReading time ≈10 mins.
In my previous post I summarized the project information and challenges, now let’s take a look at the data.
Data download
First, the raw dataset can be downloaded here.
1 2
Download data from [https://pan.baidu.com/s/1ASQMms_u70psRgW_KEyT2Q] Password: burw
Exam the data
load libs
1 2 3 4 5 6 7 8
import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns import pandas_profiling from scipy.stats import norm from scipy import stats from loguru import logger
load test data
1 2 3 4
collist_test = ['Id','AdId','Date','AdSize','AdIndustryId','CommodityType','CommodityId', 'AdAccountId','ExpoTime','Crowd','Bid'] with open('test_sample.dat') as ft: df_test = pd.read_csv(ft,sep='\t',names = collist_test)
check the data
1
df_test.head()
Id
AdId
Date
AdSize
AdIndustryId
CommodityType
CommodityId
AdAccountId
ExpoTime
Crowd
Bid
0
1
394352
1529648412
34
84
13
29663
26657
281474976645120,281474976645120,28147497664512...
age:819,608,988,741,202,837,400,394,942,361,72...
120
1
2
585401
1553076190
40
221
1
-1
6262
281474976579587,281474976579587,28147497657958...
age:819,433,479,741,229,347,522,79,753,601|edu...
42
2
3
419408
1553031394
30
122
13
32110
17436
17592185782272,17592185782272,17592185782272,1...
all
6
3
4
405326
1553238836
64
136
1
-1
22359
281474976694272,281474976694272,28147497669427...
age:333,1|gender:2|area:11505,1874,3790,4566,5...
181
4
5
578942
1541191585
34
12
13
6372
24082
68719214592,68719214592,68719214592,6871921459...
age:819,608,988,741,202,837,400,394,942,361,72...
31
check nulls: good, no null values. the object values can be handled later.
print('test dataset has %i records' % df_test['Id'].shape[0]) print('however, it has %i unique adid' % df_test.nunique()[1]) print('which means for the same adid, if other parameters are changed, then it shouldnot be treated as same ad')
test dataset has 20290 records
however, it has 1954 unique adid
which means for the same adid, if other parameters are changed, then it shouldnot be treated as same ad
Interesting fact: the ad provider will modify their ad parameters a lot, so if any parameter changes, the ad should be treated as a new ad.
Ok there is nothing more we can do about the testset, we will save it to use it later.
1
df_test.to_hdf('df_testset.h5', key = 'df')
ad_Static_feature
Now open ad_static_feature file and ad_operation file
1 2 3 4 5 6 7 8
# open ad_static_feature file collist = ['Id','Date','AdAccountId','CommodityId','CommodityType','AdIndustryId','AdSize'] with open('ad_static_feature.out') as f: df = pd.read_csv(f,sep='\t',names = collist) # open ad_operation file collist2 = ['Id','Date','Type','UpdateAttr','UpdateAttrValue'] with open('ad_operation.dat') as f: df2 = pd.read_csv(f,sep='\t',names = collist2)
understand the data
what we need in training dataset:
ID, ADID: discard before use
Date: date the ad is created
AdSize:
AdIndustryId:
CommodityType:
CommodityId:
AdAccountId:
ExposureTime:
Crowd:
Bid:
1
df.head()
Id
Date
AdAccountId
CommodityId
CommodityType
AdIndustryId
AdSize
0
106452
1529958950
22226
16088
13
225
NaN
1
233649
1538221936
25681
7356
13
136
1
2
547531
1550731020
20696
-1
1
186
40
3
707841
1551857857
3968
-1
3
186
40
4
457009
1550439402
23614
7447
13
172
NaN
Notice AdSize has NAN values, we need to discard these records.
check the description, we know CommodityID = -1 is actually another type of commodity, so shouldn’t be discard
Now we have a useful ad_feature dataframe, next check ad operation file.
ad operation file
1
df2.head()
Id
Date
Type
UpdateAttr
UpdateAttrValue
0
593323
0
2
2
90
1
593323
0
2
3
all
2
593323
0
2
4
281474976710655,281474976710655,28147497671065...
3
593323
20190217000000
1
1
0
4
593323
20190218233855
1
2
90
Column description:
Id: the AdId can be found in static feature file
Date: the operation date in %yyyy%mm%dd%hh%mm%ss format
Type: Type of operations: 1 = modify, 2 = create
UpdateAttr: Which attribute is updated: 1= ad off(0) or on(1), 2 = Bid, 3 = Crowd, 4 = ExpoTime
UpdateAttrValue: the values the update is
The first thing is to seperate ad creation operations and ad modification operations
1 2 3 4 5 6 7 8 9 10 11 12 13
defcreate_ad_create(df2): ad_create = df2[df2['Type'] == 2] temp = ad_create.set_index(['Id','Date','Type','UpdateAttr']).unstack() temp.columns = [col for col in temp.columns.values] temp.columns = ['Bid','Crowd','Time'] ad_create_sort = temp.reset_index() return ad_create_sort
df_ad_create = create_ad_create(df2) # note when date = 0, we have duplicated results for the same id, # when the ad is create, date = 0, the initial setings. # first check when ads are created # for each new ad, three attributes are updated: 2,3,4. corresponding to: bid(unit cent),crowd,and exposuretime
1
df_ad_create.head()
Id
Date
Type
Bid
Crowd
Time
0
31
0
2
100
area:11442
281474976694272,281474976694272,28147497669427...
1
32
0
2
83
area:7572
70368475742208,70368475742208,70368475742208,7...
2
69
0
2
180
area:7572
281474976694272,281474976694272,28147497669427...
3
84
0
2
100
area:6410
4398045462528,4398045462528,4398045462528,4398...
4
85
0
2
70
age:217,601,202,837,942,638,287,5,394,347,731,...
281474976694272,281474976694272,28147497669427...
When the ad is created, some default parameters are given. the date is zero.
Next process the ad_modify data.
1 2 3 4 5 6 7 8 9 10 11
defcreate_ad_modify(df2): # get all modify histoy ad_modify = df2[df2['Type'] == 1] # change of the state value is not necessary information. discard ad_modify = ad_modify[ad_modify['UpdateAttr'] != 1] # convert date into epoch seconds temp = ad_modify['Date'] ad_modify['Date'] = (pd.to_datetime(temp,format="%Y%m%d%H%M%S").astype(int) / 10**9).astype(int).astype(str) return ad_modify
df_ad_modify = create_ad_modify(df2)
1
df_ad_modify.sort_values('Id').head()
Id
Date
Type
UpdateAttr
UpdateAttrValue
250517
32
1552627896
1
2
72
250493
32
1551721282
1
2
92
250510
32
1552282289
1
2
86
250488
32
1551504704
1
2
97
250483
32
1551331895
1
2
106
Merge static and operation file
Next we merge ad_static, ad_create and ad_modify to summerize a complete dataset of the ads information.
Since in training set we need to know Adsize and some information, so only ads exist in both files are kept.
Once the ads have some parameter updated, even with the same AdID, it’s no longer the same ad anymore.
this file is huuuuuuuge. but we only need certain inforamtion from it.
We only need exposures of the adid list we already have. so we filter the dataset and save it.
1 2 3 4 5 6 7 8 9 10 11
# next deal with totalexposure.log # this file is huge...break it into chunks row_count = 500000 expo_colname = ['AdRequestId','Time','LocationId','UserId','Id','AdSize','Bid','pctr','quality_ecpm','total_ecpm'] AdIdSet = set(df_ad_merge['Id']) chunkcount = 0 for chunk in pd.read_csv('totalExposureLog.out', names = expo_colname, sep='\t', chunksize=row_count): x = chunk[chunk['Id'].isin(AdIdSet)] chunkcount += len(x) print("useful request in this chunk is ",chunkcount) x.to_hdf('store_test.h5', key = 'df',format='table',append=True, mode='a')
useful request in this chunk is 19620
useful request in this chunk is 39010
...
useful request in this chunk is 6005879
1
opt = pd.read_hdf('store_test.h5', 'df')
1
opt.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6005879 entries, 2 to 102386677
Data columns (total 10 columns):
AdRequestId int64
Time int64
LocationId int64
UserId int64
Id int64
AdSize int64
Bid int64
pctr float64
quality_ecpm float64
total_ecpm float64
dtypes: float64(3), int64(7)
memory usage: 504.0 MB
Note the userid file is not very important in this project, but we can reveal some interesting insights from it. we can explore it later as an interesting project.