The poor rabbit chased by Python and Anaconda :p


Tencent 2019 Data Science Competition [Ads Exposure Rate] Part 2

In my previous post I summarized the project information and challenges, now let’s take a look at the data.

Data download

First, the raw dataset can be downloaded here.

Download data from [https://pan.baidu.com/s/1ASQMms_u70psRgW_KEyT2Q]
Password: burw

Examine the data

load libs

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling
from scipy.stats import norm
from scipy import stats
from loguru import logger

load test data

collist_test = ['Id','AdId','Date','AdSize','AdIndustryId','CommodityType','CommodityId',
                'AdAccountId','ExpoTime','Crowd','Bid']
with open('test_sample.dat') as ft:
    df_test = pd.read_csv(ft, sep='\t', names=collist_test)

check the data

df_test.head()

Id AdId Date AdSize AdIndustryId CommodityType CommodityId AdAccountId ExpoTime Crowd Bid
0 1 394352 1529648412 34 84 13 29663 26657 281474976645120,281474976645120,28147497664512... age:819,608,988,741,202,837,400,394,942,361,72... 120
1 2 585401 1553076190 40 221 1 -1 6262 281474976579587,281474976579587,28147497657958... age:819,433,479,741,229,347,522,79,753,601|edu... 42
2 3 419408 1553031394 30 122 13 32110 17436 17592185782272,17592185782272,17592185782272,1... all 6
3 4 405326 1553238836 64 136 1 -1 22359 281474976694272,281474976694272,28147497669427... age:333,1|gender:2|area:11505,1874,3790,4566,5... 181
4 5 578942 1541191585 34 12 13 6372 24082 68719214592,68719214592,68719214592,6871921459... age:819,608,988,741,202,837,400,394,942,361,72... 31

Check for nulls: good, there are no null values. The object columns can be handled later.

df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20290 entries, 0 to 20289
Data columns (total 11 columns):
Id               20290 non-null int64
AdId             20290 non-null int64
Date             20290 non-null int64
AdSize           20290 non-null int64
AdIndustryId     20290 non-null int64
CommodityType    20290 non-null int64
CommodityId      20290 non-null int64
AdAccountId      20290 non-null int64
ExpoTime         20290 non-null object
Crowd            20290 non-null object
Bid              20290 non-null int64
dtypes: int64(9), object(2)
memory usage: 1.7+ MB
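As a preview of how those two object columns could be handled later, here is a minimal sketch of parsing a Crowd targeting string. The `parse_crowd` helper is hypothetical (not part of the competition code), and the field format is inferred from the sample rows above: pipe-separated groups of `key:comma,separated,ids`, with the literal `all` meaning no targeting restriction.

```python
# Hypothetical helper: parse one Crowd targeting string into a dict.
# Format inferred from the sample rows: "age:333,1|gender:2|area:11505,..."
# and the special value "all" for untargeted ads.
def parse_crowd(crowd):
    if crowd == 'all':
        return {}
    targeting = {}
    for group in crowd.split('|'):
        key, _, ids = group.partition(':')
        targeting[key] = ids.split(',')
    return targeting

print(parse_crowd('age:333,1|gender:2'))
# {'age': ['333', '1'], 'gender': ['2']}
```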
print('test dataset has %i records' % df_test['Id'].shape[0])
print('however, it has %i unique adid' % df_test['AdId'].nunique())
print('which means for the same adid, if other parameters are changed, then it should not be treated as the same ad')
test dataset has 20290 records
however, it has 1954 unique adid
which means for the same adid, if other parameters are changed, then it should not be treated as the same ad

Interesting fact: advertisers modify their ad parameters frequently, so whenever any parameter changes, the ad should be treated as a new ad.
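One way to operationalize this rule (a sketch of my own, not from the competition kit) is to build a composite key from AdId plus the mutable parameters, so that any parameter change yields a new key:

```python
import pandas as pd

# Made-up rows: same AdId, but the Bid was changed between them.
ads = pd.DataFrame({
    'AdId': [585401, 585401],
    'Bid': [42, 50],
    'Crowd': ['all', 'all'],
})
# Hypothetical composite key: AdId plus the mutable parameters.
# The two rows get different keys, i.e. count as two distinct ads.
ads['AdKey'] = ads['AdId'].astype(str) + '_' + ads['Bid'].astype(str) + '_' + ads['Crowd']
print(ads['AdKey'].nunique())  # 2
```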

OK, there is nothing more we can do with the test set, so we save it for later use.

df_test.to_hdf('df_testset.h5', key = 'df')

ad_static_feature

Now open the ad_static_feature file and the ad_operation file.

# open ad_static_feature file
collist = ['Id','Date','AdAccountId','CommodityId','CommodityType','AdIndustryId','AdSize']
with open('ad_static_feature.out') as f:
    df = pd.read_csv(f, sep='\t', names=collist)
# open ad_operation file
collist2 = ['Id','Date','Type','UpdateAttr','UpdateAttrValue']
with open('ad_operation.dat') as f:
    df2 = pd.read_csv(f, sep='\t', names=collist2)

understand the data

what we need in training dataset:

  • Id, AdId: discard before use
  • Date: date the ad is created
  • AdSize:
  • AdIndustryId:
  • CommodityType:
  • CommodityId:
  • AdAccountId:
  • ExposureTime:
  • Crowd:
  • Bid:
df.head()

Id Date AdAccountId CommodityId CommodityType AdIndustryId AdSize
0 106452 1529958950 22226 16088 13 225 NaN
1 233649 1538221936 25681 7356 13 136 1
2 547531 1550731020 20696 -1 1 186 40
3 707841 1551857857 3968 -1 3 186 40
4 457009 1550439402 23614 7447 13 172 NaN
  • Notice AdSize has NaN values; we need to discard these records.
  • From the data description, we know CommodityId = -1 is actually another type of commodity, so it shouldn't be discarded.
  • Date is represented as epoch seconds.
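To sanity-check the epoch-seconds interpretation, one can decode a sample Date value with pandas (the value below is the first row's Date from df.head()):

```python
import pandas as pd

# Date values like 1529958950 are epoch seconds; unit='s' decodes them
ts = pd.to_datetime(1529958950, unit='s')
print(ts)  # a mid-2018 timestamp, consistent with the competition period
```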
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 735911 entries, 0 to 735910
Data columns (total 7 columns):
Id               735911 non-null int64
Date             735911 non-null int64
AdAccountId      735911 non-null int64
CommodityId      735911 non-null object
CommodityType    735911 non-null int64
AdIndustryId     735911 non-null object
AdSize           509252 non-null object
dtypes: int64(4), object(3)
memory usage: 119.8 MB
def create_ad_static_feature(df):
    # if AdSize is NaN, discard the record
    df = df.dropna(how='any')
    return df

df_ad_static_feature = create_ad_static_feature(df)

Now we have a useful ad_feature dataframe; next, check the ad operation file.

ad operation file

df2.head()

Id Date Type UpdateAttr UpdateAttrValue
0 593323 0 2 2 90
1 593323 0 2 3 all
2 593323 0 2 4 281474976710655,281474976710655,28147497671065...
3 593323 20190217000000 1 1 0
4 593323 20190218233855 1 2 90

Column description:

  • Id: the AdId, which can be found in the static feature file
  • Date: the operation date in %Y%m%d%H%M%S format
  • Type: type of operation: 1 = modify, 2 = create
  • UpdateAttr: which attribute is updated: 1 = ad off(0)/on(1), 2 = Bid, 3 = Crowd, 4 = ExpoTime
  • UpdateAttrValue: the new value of the attribute

The first thing to do is separate ad creation operations from ad modification operations.

def create_ad_create(df2):
    ad_create = df2[df2['Type'] == 2]
    # pivot UpdateAttr into columns: one row per ad creation
    temp = ad_create.set_index(['Id','Date','Type','UpdateAttr']).unstack()
    temp.columns = ['Bid','Crowd','Time']
    ad_create_sort = temp.reset_index()
    return ad_create_sort

df_ad_create = create_ad_create(df2)
# note: Date = 0 marks the initial settings recorded at creation time.
# for each new ad, three attributes are set: 2, 3, 4, corresponding to
# Bid (unit: cent), Crowd, and ExpoTime
df_ad_create.head()

Id Date Type Bid Crowd Time
0 31 0 2 100 area:11442 281474976694272,281474976694272,28147497669427...
1 32 0 2 83 area:7572 70368475742208,70368475742208,70368475742208,7...
2 69 0 2 180 area:7572 281474976694272,281474976694272,28147497669427...
3 84 0 2 100 area:6410 4398045462528,4398045462528,4398045462528,4398...
4 85 0 2 70 age:217,601,202,837,942,638,287,5,394,347,731,... 281474976694272,281474976694272,28147497669427...

When an ad is created, some default parameters are given and the date is zero.
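The set_index(...).unstack() pivot used in create_ad_create can be illustrated on a tiny hand-made frame (the values below are made up to mirror the table above):

```python
import pandas as pd

# three creation rows for one ad: UpdateAttr 2/3/4 hold Bid, Crowd, ExpoTime
ops = pd.DataFrame({
    'Id': [31, 31, 31],
    'Date': [0, 0, 0],
    'Type': [2, 2, 2],
    'UpdateAttr': [2, 3, 4],
    'UpdateAttrValue': ['100', 'area:11442', '281474976694272'],
})
# unstack pivots UpdateAttr into columns: one wide row per (Id, Date, Type)
wide = ops.set_index(['Id', 'Date', 'Type', 'UpdateAttr']).unstack()
wide.columns = ['Bid', 'Crowd', 'Time']
print(wide.reset_index())
```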

Next process the ad_modify data.

def create_ad_modify(df2):
    # get all modification history (copy to avoid SettingWithCopyWarning)
    ad_modify = df2[df2['Type'] == 1].copy()
    # changes of the on/off state are not needed; discard them
    ad_modify = ad_modify[ad_modify['UpdateAttr'] != 1]
    # convert the date into epoch seconds
    temp = ad_modify['Date']
    ad_modify['Date'] = (pd.to_datetime(temp, format="%Y%m%d%H%M%S").astype(int) / 10**9).astype(int).astype(str)
    return ad_modify

df_ad_modify = create_ad_modify(df2)
df_ad_modify.sort_values('Id').head()

Id Date Type UpdateAttr UpdateAttrValue
250517 32 1552627896 1 2 72
250493 32 1551721282 1 2 92
250510 32 1552282289 1 2 86
250488 32 1551504704 1 2 97
250483 32 1551331895 1 2 106

Merge static and operation file

Next, we merge ad_static, ad_create, and ad_modify to summarize a complete dataset of ad information.

  • Since the training set needs AdSize and other static information, only ads that exist in both files are kept.
  • Once an ad has a parameter updated, even with the same AdId, it is no longer the same ad.
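The join semantics described in the first bullet can be shown on a toy example (the Ids below are hypothetical):

```python
import pandas as pd

static = pd.DataFrame({'Id': [31, 32], 'AdSize': [40, 40]})
created = pd.DataFrame({'Id': [32, 99], 'Bid': [83, 10]})
# how='inner' keeps only ads present in both files; Ids 31 and 99 drop out
both = pd.merge(static, created, how='inner', on='Id')
print(both)
```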
def create_ad_merge(df_ad_static_feature, df_ad_create, df_ad_modify):
    merge1 = pd.merge(df_ad_static_feature, df_ad_create, how='inner', on='Id', sort=True)
    merge2 = pd.merge(merge1, df_ad_modify, how='left', on='Id')
    # split by which attribute was updated, then write the new value into its column
    merge2_p1 = merge2[merge2['UpdateAttr'] == 2].copy()
    merge2_p1['Bid'] = merge2_p1['UpdateAttrValue']
    merge2_p2 = merge2[merge2['UpdateAttr'] == 3].copy()
    merge2_p2['Crowd'] = merge2_p2['UpdateAttrValue']
    merge2_p3 = merge2[merge2['UpdateAttr'] == 4].copy()
    merge2_p3['Time'] = merge2_p3['UpdateAttrValue']
    # ads that were never modified
    merge2_p4 = merge2[merge2['UpdateAttr'].isnull()]
    merge_final = pd.concat([merge2_p1, merge2_p2, merge2_p3, merge2_p4])
    drop_list = ['Type_y','UpdateAttr','UpdateAttrValue','Date_y','Type_x']
    merge_final.drop(columns=drop_list, inplace=True)
    renamelist = ['Id','Date_x','Date','Bid','Crowd','Time','AdAccountId','CommodityId',
                  'CommodityType','AdIndustryId','AdSize']
    merge_final = merge_final.reindex(columns=renamelist)
    merge_final = merge_final.sort_values(['Id','Date'], axis=0)
    return merge_final

df_ad_merge = create_ad_merge(df_ad_static_feature, df_ad_create, df_ad_modify)
df_ad_merge.head()

Id Date_x Date Bid Crowd Time AdAccountId CommodityId CommodityType AdIndustryId AdSize
0 31 1552432970 NaN 100 area:11442 281474976694272,281474976694272,28147497669427... 12577 18683 13 224 40
1 32 1543563617 1550727091 91 area:7572 70368475742208,70368475742208,70368475742208,7... 18752 32534 13 136 40
2 32 1543563617 1551110479 90 area:7572 70368475742208,70368475742208,70368475742208,7... 18752 32534 13 136 40
3 32 1543563617 1551245501 96 area:7572 70368475742208,70368475742208,70368475742208,7... 18752 32534 13 136 40
4 32 1543563617 1551331895 106 area:7572 70368475742208,70368475742208,70368475742208,7... 18752 32534 13 136 40

In this dataset, each record is a time-stamped ad status. Further processing is needed, but let's save it for now.

df_ad_merge.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 138530 entries, 0 to 138529
Data columns (total 11 columns):
Id               138530 non-null int64
Date_x           138530 non-null int64
Date             135686 non-null object
Bid              138530 non-null object
Crowd            138530 non-null object
Time             138530 non-null object
AdAccountId      138530 non-null int64
CommodityId      138530 non-null object
CommodityType    138530 non-null int64
AdIndustryId     138530 non-null object
AdSize           138530 non-null object
dtypes: int64(4), object(7)
memory usage: 12.7+ MB
df_ad_merge.to_hdf('df_ad_merge.h5', key = 'df')

load total exposure log

This file is huuuuuuuge, but we only need certain information from it: the exposures of the AdIds we already have. So we filter the dataset and save it.

# next deal with totalExposureLog.out
# this file is huge... break it into chunks
row_count = 500000
expo_colname = ['AdRequestId','Time','LocationId','UserId','Id','AdSize','Bid','pctr','quality_ecpm','total_ecpm']
AdIdSet = set(df_ad_merge['Id'])
chunkcount = 0  # running total of useful records across chunks
for chunk in pd.read_csv('totalExposureLog.out', names=expo_colname, sep='\t', chunksize=row_count):
    x = chunk[chunk['Id'].isin(AdIdSet)]
    chunkcount += len(x)
    print("useful request in this chunk is ", chunkcount)
    x.to_hdf('store_test.h5', key='df', format='table', append=True, mode='a')
useful request in this chunk is  19620

useful request in this chunk is  39010

...
useful request in this chunk is  6005879
opt = pd.read_hdf('store_test.h5', 'df')
opt.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6005879 entries, 2 to 102386677
Data columns (total 10 columns):
AdRequestId     int64
Time            int64
LocationId      int64
UserId          int64
Id              int64
AdSize          int64
Bid             int64
pctr            float64
quality_ecpm    float64
total_ecpm      float64
dtypes: float64(3), int64(7)
memory usage: 504.0 MB
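The filtered log still occupies about 500 MB in memory. One optional trick (not used in the original pipeline) is to downcast the numeric columns to the smallest dtype that fits; the sketch below applies it to a small made-up frame:

```python
import pandas as pd

def downcast(df):
    # shrink int64/float64 columns to the smallest dtype that fits the data
    out = df.copy()
    for col in out.select_dtypes('int64').columns:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    for col in out.select_dtypes('float64').columns:
        out[col] = pd.to_numeric(out[col], downcast='float')
    return out

demo = pd.DataFrame({'Bid': [6, 181, 42], 'pctr': [0.1, 0.2, 0.3]})
print(downcast(demo).dtypes)
```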

Note: the userid file is not very important for this project, but it can reveal some interesting insights; we may explore it later as a separate project.