The poor rabbit chased by Python and Anaconda :p

Melbourne Rental Project Part 2: Data Visualization

While we’re waiting for the data, let’s do some data visualization, using one day’s rental data as an example.

Load libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
%matplotlib inline

Load data

with open('14_11_2019_rentals_raw.csv') as f:
    df = pd.read_csv(f)

Some basic cleaning before EDA

def data_clean(df):
    # Drop rows with a missing price, then keep only rows whose price is all digits.
    temp = df[~df['price'].isnull()]
    df = temp[temp['price'].str.isdigit()].reset_index(drop=True)
    # Convert the price to a number.
    df['price'] = df['price'].astype(float)
    # Prices of 140 or below are not reasonable, so drop them.
    df = df.loc[df['price'] > 140].reset_index(drop=True)
    # The suburb is the part of the address after the last comma.
    df['suburb'] = df['address'].str.rsplit(',', n=1, expand=True)[1].str.strip()
    # Keep the part of the bond after "$" and remove the thousands separators.
    df['bond'] = df['bond'].str.split('$', expand=True)[1].str.replace(',', '').astype(float)
    # Split agent_brand on "-" into the brand and the agent's suburb.
    df['agent_suburb'] = df['agent_brand'].str.split('-', expand=True)[1]
    df['agent_brand'] = df['agent_brand'].str.split('-', expand=True)[0]
    df['agent_brand'] = df['agent_brand'].str.strip()
    # If agent_name starts with an "agent:" label, keep the part after the colon;
    # otherwise keep the part before it.
    temp = df['agent_name'].str.split(':', expand=True)
    replace = temp.loc[temp[0].str.lower() == 'agent'].copy()  # copy avoids SettingWithCopyWarning
    old = temp.loc[temp[0].str.lower() != 'agent']
    replace[0] = replace[1]
    df['agent_name'] = pd.concat([old, replace]).sort_index()[0]
    return df

df_clean = data_clean(df)
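
To make the string handling easier to follow, here is a minimal sketch on a toy frame. The rows below are made up for illustration (they are not from the scraped data), and the agent name step uses a slightly different formulation from data_clean: a case-insensitive regex that drops a leading "Agent:" label.

```python
import pandas as pd

# Hypothetical rows that mimic the raw columns used in data_clean.
raw = pd.DataFrame({
    "address": ["1/20 Example St, Southbank", "5 Sample Rd, Glen Waverley"],
    "bond": ["$2,173", "$1,950"],
    "agent_name": ["Agent: Jane Doe", "John Smith"],
})

# Suburb: everything after the last comma in the address.
raw["suburb"] = raw["address"].str.rsplit(",", n=1).str[-1].str.strip()

# Bond: remove the dollar sign and thousands separator, then convert to float.
raw["bond"] = (
    raw["bond"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Agent name: drop a leading "Agent:" label (case-insensitive) when present.
raw["agent_name"] = raw["agent_name"].str.replace(
    r"^\s*agent\s*:\s*", "", case=False, regex=True
)

print(raw[["suburb", "bond", "agent_name"]])
```

Passing n=1 by keyword matters here: newer pandas versions only accept the split pattern positionally.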

EDA

How many properties are listed today for each property type?

def plot_property_type(df_clean):
    # Count listings per suburb and property type, in long format for seaborn.
    plotdata = (df_clean.groupby(['suburb', 'property_type']).count()['price']
                .unstack(level=1).reset_index()
                .melt(id_vars='suburb', value_name='property_count'))
    # Set the style before creating the figure so that it takes effect.
    sns.set(palette='muted', style="whitegrid", color_codes=True)
    fig, ax = plt.subplots(figsize=(24, 10))
    sns.barplot(x='property_type', y='property_count', hue='suburb', data=plotdata, ax=ax)
    ax.set_title('Property distributions across suburbs of Melbourne', fontsize=20)
    ax.tick_params(axis='both', which='major', labelsize=16)
    ax.set_xlabel('Property Type', fontsize=18)
    ax.set_ylabel('Property Counts', fontsize=18)
    ax.legend(fontsize='x-large', title_fontsize=40, loc='upper right')

plot_property_type(df_clean)

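[Figure: Property distributions across suburbs of Melbourne (property counts by property type, colored by suburb)]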

  • Apparently most of the properties are apartments, which makes sense :)
  • The Melbourne suburb has A LOT of apartments… Impressive.
  • The Melbourne suburb has quite a few studios as well. Since it’s so expensive to live in the city, studios are a popular choice.

def plot_suburb_type(df_clean):
    # Same counts as before, this time grouped by suburb on the x-axis.
    plotdata = (df_clean.groupby(['suburb', 'property_type']).count()['price']
                .unstack(level=1).reset_index()
                .melt(id_vars='suburb', value_name='property_count'))
    # Set the style before creating the figure so that it takes effect.
    sns.set(palette='muted', style="whitegrid", color_codes=True)
    plt.figure(figsize=(24, 10))
    ax = sns.barplot(x='suburb', y='property_count', hue='property_type', data=plotdata)
    ax.set_title('Property distributions across suburbs of Melbourne', fontsize=20)
    ax.tick_params(axis='both', which='major', labelsize=16)
    ax.set_xlabel('Suburbs', fontsize=18)
    ax.set_ylabel('Property Counts', fontsize=18)
    ax.legend(fontsize='x-large', title_fontsize=40, loc='upper right')

plot_suburb_type(df_clean)

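[Figure: Property distributions across suburbs of Melbourne (property counts by suburb, colored by property type)]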

  • Just another view of the property counts, to see if I can find something else.
  • Chadstone, Springvale, Mount Waverley, and Glen Waverley are the suburbs with more houses than apartments.
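
Instead of reading this off the chart, the same comparison can be computed directly. This is a minimal sketch on made-up listings. The suburb names "A" and "B" are placeholders, and the 'House' label is an assumption about how the property_type column is coded ('Apartment' is the label used elsewhere in this post).

```python
import pandas as pd

# Hypothetical listings; in the notebook this would be df_clean.
listings = pd.DataFrame({
    "suburb": ["A", "A", "A", "B", "B", "B"],
    "property_type": ["House", "House", "Apartment",
                      "Apartment", "Apartment", "House"],
})

# Rows are suburbs, columns are property types, values are listing counts.
counts = pd.crosstab(listings["suburb"], listings["property_type"])

# Suburbs where houses outnumber apartments.
more_houses = counts.index[counts["House"] > counts["Apartment"]].tolist()
print(more_houses)
```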

What’s the average rental price for suburbs?

def plot_avg_price(df_clean):
    plotdata = df_clean.loc[:, ['property_type', 'price', 'suburb']]
    # Set the style before creating the figure so that it takes effect.
    sns.set(palette='muted', style="whitegrid", color_codes=True)
    f, axes = plt.subplots(2, 1, figsize=(24, 20))
    # Top panel: every listing as a point. Bottom panel: box plot summary.
    sns.swarmplot(x='suburb', y='price', hue='property_type', data=plotdata, ax=axes[0])
    sns.boxplot(x='suburb', y='price', hue='property_type', data=plotdata, ax=axes[1])
    axes[0].set_title('Swarm Plot of Rental Price of Melbourne Suburbs', fontsize=16)
    axes[1].set_title('Box Plot of Rental Price of Melbourne Suburbs', fontsize=16)
    for ax in axes:
        ax.tick_params(axis='both', which='major', labelsize=16)
        ax.set_xlabel('Suburbs', fontsize=18)
        ax.set_ylabel('Price', fontsize=18)

plot_avg_price(df_clean)

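[Figure: two panels, Swarm Plot of Rental Price of Melbourne Suburbs and Box Plot of Rental Price of Melbourne Suburbs]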

  • Two different views of the rental prices across all suburbs.
  • The swarm plot shows the distribution more clearly. There are some extreme values present.
  • The box plot shows the median price better, but I just don’t like the style.
  • Consider splitting by property_type and suburb for a better visualization.
  • I also plotted the bond, but the plots are almost the same. The bond should be one month’s rent, so bond and price are linearly related.
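
The bond claim is easy to sanity-check numerically. Here is a rough sketch on made-up numbers, assuming the price is a weekly rent and the bond is one month’s rent: in that case the bond should be about 52 / 12 ≈ 4.33 weeks of rent, and the correlation between the two columns should be close to 1.

```python
import pandas as pd

# Hypothetical weekly prices and bonds (not taken from the scraped data).
sample = pd.DataFrame({
    "price": [350.0, 420.0, 500.0, 650.0],
    "bond": [1517.0, 1820.0, 2167.0, 2817.0],
})

# How many weeks of rent each bond corresponds to.
sample["bond_weeks"] = sample["bond"] / sample["price"]

# Pearson correlation between weekly price and bond.
correlation = sample["price"].corr(sample["bond"])

print(sample)
print(correlation)
```

On the real data, rows whose bond_weeks is far from the typical value would be worth a closer look.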

def plot_avg_apt_price(df_clean):
    plotdata = df_clean.loc[df_clean['property_type'] == 'Apartment', ['property_type', 'price', 'suburb']]
    # Order the suburbs by median price, highest first.
    median_by_suburb = plotdata.groupby('suburb')['price'].median().sort_values(ascending=False)
    plotorder = median_by_suburb.index
    medians = median_by_suburb.values
    median_labels = [str(np.round(s, 2)) for s in medians]
    # plot
    sns.set(palette='muted', style="whitegrid", color_codes=True)
    plt.figure(figsize=(24, 10))
    ax = sns.boxplot(x='suburb', y='price', data=plotdata, order=plotorder)
    ax.set_ylim(0, 1500)
    ax.set_title('Apartment Rental Price of Melbourne Suburbs', fontsize=16)
    ax.tick_params(axis='both', which='major', labelsize=16)
    ax.set_xlabel('Suburbs', fontsize=18)
    ax.set_ylabel('Price', fontsize=18)
    # refine plot: write each median just above its median line
    for tick, (median, label) in enumerate(zip(medians, median_labels)):
        ax.text(tick, median + 3, label,
                horizontalalignment='center', size='large', color='k', weight='semibold')

plot_avg_apt_price(df_clean)

[Figure: Apartment Rental Price of Melbourne Suburbs (box plot per suburb, ordered by median price)]

  • The figure shows the median rental price for apartments in each suburb.
  • The most expensive suburb is Southbank, while the Melbourne suburb has quite a few very, very expensive apartments, probably those sky-high apartments.
  • Apartment rental prices vary a lot in Toorak.
  • These figures pool all apartments together. Of course a 1b1b is cheaper than a 2b1b, so next let’s examine them separately.
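
“Vary a lot” can be put into numbers with the interquartile range, which is the height of each box in the plot. This is a minimal sketch on made-up prices with placeholder suburbs "A" and "B". The helper interquartile_range is my own name, not something from pandas.

```python
import pandas as pd

# Hypothetical apartment prices for two placeholder suburbs.
apartments = pd.DataFrame({
    "suburb": ["A"] * 4 + ["B"] * 4,
    "price": [400.0, 410.0, 420.0, 430.0, 300.0, 450.0, 700.0, 1200.0],
})


def interquartile_range(prices):
    """Distance between the 75th and 25th percentiles."""
    return prices.quantile(0.75) - prices.quantile(0.25)


# Median and spread per suburb, widest spread first.
spread = (
    apartments.groupby("suburb")["price"]
    .agg(median="median", iqr=interquartile_range)
    .sort_values("iqr", ascending=False)
)
print(spread)
```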

Rental Price of Apartments: details

def plot_detail_apt_price(df_clean):
    # One- and two-bedroom apartments only.
    plotdata = df_clean.loc[(df_clean['property_type'] == 'Apartment') & (df_clean['feature_bedroom'] <= 2) & (df_clean['feature_bedroom'] != 0)]
    plotdata = plotdata.loc[:, ['price', 'suburb', 'feature_bedroom']]
    # Order the suburbs by median price, highest first.
    plotorder = plotdata.groupby('suburb')['price'].median().sort_values(ascending=False).index
    # plot
    sns.set(palette='muted', style="whitegrid", color_codes=True)
    plt.figure(figsize=(24, 10))
    ax = sns.boxplot(x='suburb', y='price', hue='feature_bedroom', data=plotdata, order=plotorder)
    ax.set_ylim(0, 1500)
    ax.set_title('Apartment Rental Price of Melbourne Suburbs', fontsize=16)
    ax.tick_params(axis='both', which='major', labelsize=16)
    ax.set_xlabel('Suburbs', fontsize=18)
    ax.set_ylabel('Price', fontsize=18)

plot_detail_apt_price(df_clean)

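[Figure: Apartment Rental Price of Melbourne Suburbs (box plot per suburb, split by number of bedrooms)]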

  • It seems like Glen Waverley’s 1-bedroom apartments are just cheaper than usual.
  • I used to think Toorak was very, very expensive to live in, but apparently not. Southbank is a more expensive place to live.
  • Glen Waverley has the third most expensive rental price but no very high-end apartments.

Anything interesting about the agency?

def plot_top10_agency(df_clean):
    # The ten agent brands with the most listings.
    plotdata = (df_clean.groupby('agent_brand').count()['price']
                .sort_values(ascending=False).head(10)
                .reset_index())
    sns.set(palette='muted', style="white", color_codes=True)
    fig, ax = plt.subplots(figsize=(30, 8))
    # Pass x and y by keyword: seaborn 0.12 and later no longer accept them positionally.
    sns.barplot(x='agent_brand', y='price', data=plotdata, ax=ax)
    ax.set_title('Top 10 Agent Brands dominating the Melbourne market', fontsize=20)
    ax.set_xlabel('Agents', fontsize=18)
    ax.set_ylabel('Number of properties', fontsize=18)
    ax.tick_params(axis='both', which='major', labelsize=16)
    ax.xaxis.set_tick_params(rotation=15, labelsize=18)

plot_top10_agency(df_clean)

[Figure: Top 10 Agent Brands dominating the Melbourne market (number of properties per agent brand)]

def plot_price_agent(df_clean):
    # The ten agent brands with the most listings.
    plotdata = (df_clean.groupby('agent_brand').count()['price']
                .sort_values(ascending=False).head(10)
                .reset_index())
    # One- and two-bedroom apartments only.
    plot_data2 = df_clean.loc[(df_clean['property_type'] == 'Apartment') & (df_clean['feature_bedroom'] <= 2) & (df_clean['feature_bedroom'] != 0)]
    # Mean price per brand and bedroom count. Group by a list (a tuple is read as a
    # single key) and select 'price' before averaging to skip non-numeric columns.
    temp = (plot_data2.loc[plot_data2['agent_brand'].isin(plotdata.agent_brand)]
            .groupby(['agent_brand', 'feature_bedroom'])['price'].mean().unstack(level=1))
    temp.columns = ['bed1', 'bed2']
    plotorder = temp.mean(axis=1).sort_values(ascending=False).index
    temp = temp.reset_index()
    fig, ax = plt.subplots(figsize=(14, 10))
    # Draw the 2-bedroom bars first, then overlay the 1-bedroom bars in a darker shade.
    sns.set_color_codes("pastel")
    sns.barplot(x='bed2', y='agent_brand', data=temp, label="2 bedrooms", color='b', order=plotorder, ax=ax)
    sns.set_color_codes("muted")
    sns.barplot(x='bed1', y='agent_brand', data=temp, label="1 bedroom", color='b', order=plotorder, ax=ax)
    ax.legend(ncol=2, loc="lower right", frameon=True)
    ax.set(xlabel="Average Rental Price per Week", ylabel="Agency Name")
    sns.despine(left=True, bottom=True)

plot_price_agent(df_clean)

[Figure: average weekly rental price of 1-bedroom and 2-bedroom apartments for the top 10 agencies]

  • It’s hard to say that Kay & Burton is simply more expensive than the rest of the agencies. The price can be related to many factors.
  • But it’s interesting to see that Ray White has such a small difference between 1-bedroom and 2-bedroom prices.
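
The gap between 1-bedroom and 2-bedroom prices can also be computed per agency rather than read off the bars. This is a minimal sketch with made-up numbers. "Agency X" and "Agency Y" are placeholder names, and the column names follow df_clean.

```python
import pandas as pd

# Hypothetical 1- and 2-bedroom apartment listings.
apartments = pd.DataFrame({
    "agent_brand": ["Agency X", "Agency X", "Agency Y", "Agency Y", "Agency Y"],
    "feature_bedroom": [1, 2, 1, 2, 2],
    "price": [400.0, 420.0, 380.0, 550.0, 650.0],
})

# Mean weekly price per agency (rows) and bedroom count (columns).
mean_price = (
    apartments.groupby(["agent_brand", "feature_bedroom"])["price"]
    .mean()
    .unstack("feature_bedroom")
)

# Price gap between 2-bedroom and 1-bedroom apartments, smallest first.
gap = (mean_price[2] - mean_price[1]).sort_values()
print(gap)
```

An agency with only one of the two bedroom counts ends up with a missing gap, which seems like the right outcome for such a comparison.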