The poor rabbit chased by Python and Anaconda :p


Tencent 2019 Data Science Competition [Ads Exposure Rate] Part 1

First, why did I choose this project?

This competition is unlike a typical Kaggle competition: it is built on a real-world problem.

The goal of this project is to estimate an ad’s daily exposure rate from the files provided. When a company decides to place an ad on a certain website, it has to settle many parameters:

  • what size?
  • for how long?
  • at which part of the page?
  • how much to pay?
  • etc

Most of the time, the company has no idea how to choose the parameters above. This project therefore aims to build a model that takes those parameters as input and returns the estimated exposure rate, which is what the company cares about most. The company can then decide which parameters to use when it bids for an ad.

Background Information

This competition surprised a lot of people with how raw the data is and how much time you need to spend studying the project's background information, which makes it VERY close to a real-world problem.

What is ad bidding?

Tencent uses the “Generalized Second Price” auction for ad providers bidding on a given ad position. However, because these are online ads, there are other factors to consider. The provider with the highest bid CANNOT occupy the ad position all the time. For example:

The same ad position will display fashion-related ads to people who are interested in fashion, and gardening-related ads to people who are interested in gardening.

Therefore, the exposure rate is not determined directly by the parameters the ad provider sets.
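
To make the “Generalized Second Price” idea concrete, here is a minimal sketch of the textbook version of the mechanism: advertisers are ranked by bid, and each winner pays the bid of the advertiser ranked just below. The function name `gsp_auction`, the advertiser names and the zero price for the last bidder are my own assumptions for illustration. This is not Tencent's actual system, which also takes audience targeting into account, as described above.

```python
from typing import Dict, List, Tuple


def gsp_auction(bids: Dict[str, float], n_slots: int) -> List[Tuple[str, float]]:
    """Textbook Generalized Second Price auction (illustrative sketch only).

    Returns one (advertiser, price) pair per filled slot, best slot first.
    """
    # Rank advertisers from highest to lowest bid.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)

    results = []
    for i in range(min(n_slots, len(ranked))):
        advertiser = ranked[i][0]
        # The winner of slot i pays the next-highest bid.
        # Assumption: with nobody ranked below, the price is 0.0 (no reserve price).
        price = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        results.append((advertiser, price))
    return results


# Hypothetical advertisers A, B, C competing for two ad positions.
print(gsp_auction({"A": 5.0, "B": 3.0, "C": 1.0}, n_slots=2))
# [('A', 3.0), ('B', 1.0)]
```

In this toy version the highest bidder always wins the top position. In the competition setting the winner changes with the audience, which is exactly why the exposure rate has to be estimated by a model.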

Features

The dataset has the following features:

  • Raw dataset: a huge (~10 GB) dataset containing 4 different tables.
  • No direct labels: the training set has no y label giving the exposure rate of a certain ad. You need to calculate it from the 4 tables provided.
  • The test set contains old ads as well as new ads.
  • A special evaluation score called “Monoscore”, which examines whether a higher bid leads to a higher exposure rate.
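
The idea behind a monotonicity check like “Monoscore” can be sketched as follows. This is NOT the official competition formula. It is a simplified stand-in that I made up for illustration: for one ad evaluated at several bids, it returns the fraction of neighbouring bid pairs where the predicted exposure does not go down as the bid goes up. The function name `monotonicity_fraction` and the sample numbers are hypothetical.

```python
from typing import Sequence


def monotonicity_fraction(bids: Sequence[float], exposures: Sequence[float]) -> float:
    """Fraction of neighbouring bid pairs whose exposure is non-decreasing.

    Simplified illustration only, not the official "Monoscore" definition.
    """
    if len(bids) != len(exposures):
        raise ValueError("bids and exposures must have the same length")

    # Order the predictions by bid, lowest first.
    ranked = sorted(zip(bids, exposures), key=lambda pair: pair[0])

    comparable = 0
    consistent = 0
    for (bid_lo, exp_lo), (bid_hi, exp_hi) in zip(ranked, ranked[1:]):
        if bid_hi == bid_lo:
            continue  # equal bids say nothing about monotonicity
        comparable += 1
        if exp_hi >= exp_lo:
            consistent += 1

    # Assumption: with no comparable pair, treat the result as trivially monotone.
    return consistent / comparable if comparable else 1.0


# Hypothetical predictions for one ad at three different bids.
print(monotonicity_fraction([1.0, 2.0, 3.0], [10.0, 5.0, 30.0]))
# 0.5
```

A model that scores well on accuracy but badly on a check like this would tell an advertiser that paying more can reduce exposure, which makes no sense to someone choosing a bid.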

Conclusion

Overall, I think it is a very interesting project to work on.

Skills practiced:

  • Handling a real-world machine learning problem
  • Learning background information and cleaning data
  • Handling huge datasets
  • Feature engineering