Melbourne Rental Project Part 3: word2vec

Posted on 2019-11-26 Edited on 2019-11-27 In Project
Symbols count in article: 9.6k Reading time ≈ 9 mins.

Now let’s try word2vec on the property details page and see if we can find something interesting.

load libs

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gensim
import nltk
import string
import re
%matplotlib inline

Melbourne Rental Project Part 2: Data Visualization

Posted on 2019-11-25 In Project
Symbols count in article: 8.3k Reading time ≈ 8 mins.

While we’re waiting for the data, let’s do some data visualization, taking one day’s rental data as an example.

load library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
%matplotlib inline

matplotlib: Have you ever been confused by plt./ax./fig.?

Posted on 2019-11-24 In Tips & Tricks
Symbols count in article: 3.8k Reading time ≈ 3 mins.

If you:

Followed some tutorials called “how to plot in python in 5 mins” and learned matplotlib.
Know to add ‘import matplotlib.pyplot as plt’.
But always need to open Google whenever you make a plot.
Always struggle to add labels/text/ticks to your plots.
Know that some tasks (like adding xlabels) can be done in several ways, but don’t know which way is best.
Have opened the official matplotlib docs several times but have no idea what they are talking about.
Are confused by the terms: figure, axes, axis.

Tencent 2019 Data Science Competition[Ads Exposure Rate] Extra: Data Visualization

Posted on 2019-11-22 In Project
Symbols count in article: 3.4k Reading time ≈ 3 mins.

Just for some extra fun, let’s do some plots to explore the ads dataset a bit.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
%matplotlib inline

SVM / CART / LG / NB ?

Posted on 2019-11-21 Edited on 2019-11-27 In Tips & Tricks
Symbols count in article: 1.3k Reading time ≈ 1 mins.

I think there is no solid evidence that one is better than the other. The two algorithms are built on different methods and have different hyper-parameters to tune. Therefore, I think the right approach is to understand the pros and cons of each and recall them when solving your specific problem.

Understand XGBoost and LightGBM

Posted on 2019-11-21 In Tips & Tricks
Symbols count in article: 1.5k Reading time ≈ 1 mins.

Gradient Boosting Tree Basic

start with a weak, simple learner (e.g. mean()) and get a prediction
use a loss function J to compute the error between y_true and y_predict
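
To make these first two steps concrete, here is a minimal sketch in plain Python. It assumes a squared-error loss for J and a tiny made-up target list; the helper names (`squared_error_loss`, `residuals`) are hypothetical and only for illustration. The residuals at the end are the negative gradient of that particular loss, which is what a later boosting round would typically fit; that later round is not shown here.

```python
# Illustrative sketch only: squared-error loss and toy data are assumptions.

def mean(values):
    """The weak starting learner: predict the average target for every row."""
    return sum(values) / len(values)

def squared_error_loss(y_true, y_predict):
    """J = 1/2 * sum((y_true - y_predict)^2), one common choice of loss."""
    return 0.5 * sum((t - p) ** 2 for t, p in zip(y_true, y_predict))

def residuals(y_true, y_predict):
    """Negative gradient of the squared-error loss w.r.t. the prediction."""
    return [t - p for t, p in zip(y_true, y_predict)]

y_true = [3.0, 5.0, 7.0, 9.0]            # made-up targets
baseline = mean(y_true)                  # step 1: weak learner
y_predict = [baseline] * len(y_true)     # same prediction for every row
loss = squared_error_loss(y_true, y_predict)   # step 2: error under J
errors = residuals(y_true, y_predict)    # what a next learner would fit

print(baseline, loss, errors)
```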

How to Learn Pandas from Zero: Use a Mindmap!

Posted on 2019-11-20 In Tips & Tricks
Symbols count in article: 749 Reading time ≈ 1 mins.

When I first started learning the Pandas library, I suffered from a serious memory issue:

My brain said:

It’s hard to write the pandas functions to disk when your memory is not enough.

Tencent 2019 Data Science Competition[Ads Exposure Rate] Part 3

Posted on 2019-11-19 Edited on 2019-11-21 In Project
Symbols count in article: 7.8k Reading time ≈ 7 mins.

Just another day of data cleaning… this dataset really requires lots and lots of cleaning…

Data preparation: continued

From the last notebook we’ve got the correct label: the next 24-hour exposure rate.

Multiple values per cell: How to one-hot encode it

Posted on 2019-11-19 In Tips & Tricks
Symbols count in article: 6.5k Reading time ≈ 6 mins.

In the data cleaning stage of a data science project, we may encounter this situation:

One column holds multiple features, each feature has multiple values, and they are all stacked together in that single column.
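
As a rough illustration of that situation, the sketch below one-hot encodes such a column in plain Python. The cell format (`feature:value,value|feature:value`), the separators, and the function names `parse_cell` and `one_hot_encode` are all assumptions made up for this example; the real dataset may use a different layout.

```python
# Illustrative sketch only: the "name:v1,v2|name:v3" cell format is assumed.

def parse_cell(cell, feature_sep="|", key_sep=":", value_sep=","):
    """Turn 'age:1,2|gender:3' into the tokens {'age_1', 'age_2', 'gender_3'}."""
    tokens = set()
    for part in cell.split(feature_sep):
        if not part:
            continue  # skip empty fragments such as a trailing separator
        name, _, values = part.partition(key_sep)
        for value in values.split(value_sep):
            if value:
                tokens.add(f"{name}_{value}")
    return tokens

def one_hot_encode(cells):
    """Return (column names, 0/1 rows), one column per feature/value pair."""
    parsed = [parse_cell(cell) for cell in cells]
    columns = sorted(set().union(*parsed))
    rows = [[1 if col in tokens else 0 for col in columns] for tokens in parsed]
    return columns, rows

cells = ["age:1,2|gender:3", "age:2", "gender:3,4"]   # made-up cells
columns, rows = one_hot_encode(cells)

print(columns)
for row in rows:
    print(row)
```

Each feature/value pair becomes its own 0/1 column, so a cell with several values simply sets several columns to 1 in the same row.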

Tencent 2019 Data Science Competition[Ads Exposure Rate] Part 2

Posted on 2019-11-14 Edited on 2019-11-15 In Project
Symbols count in article: 11k Reading time ≈ 10 mins.

In my previous post I summarized the project information and challenges. Now let’s take a look at the data.

Data download

First, the raw dataset can be downloaded here.