Melbourne Rental Project Part 1: Web Scraping

Before anything

How to convert a Jupyter notebook to a Markdown file?

$ jupyter nbconvert --to markdown input.ipynb

Introduction

Searching for a rental under time pressure is always a painful experience. For me, the process is so painful that even when I'm seriously unhappy with the property I'm renting, I will sometimes just choose to live with it.

So here comes this project. I will use www.realestate.com.au as the data source. The key steps are:

  • Web scraping rental information for selected suburbs of Melbourne
  • Data analysis and exploration
    • Which suburb is the most expensive?
    • Is there any market trend over the project period?
    • What are the common features in the rental descriptions?
    • Which is the biggest agency and who is the most popular agent?
    • etc…
  • Machine learning
    • If I have a property I want to rent out (oh I wish), what price should I list it at?
    • NLP: auto-generate property descriptions?
    • Estimate the rental price for a given property?
    • etc…

Import the libs

There are multiple libraries available for extracting information from HTML; as a newbie, Requests and BeautifulSoup should work for me.

import requests
import re
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from time import sleep

Install and use Requests

First, we need to install Requests to fetch the HTML. Then use

url = "https://www.realestate.com.au/rent/in-clayton/"
response = requests.get(url)

to get the HTML content.

However, this will likely give you a "permission denied" response, because most websites don't like being scraped. What we need to do is pretend we're a browser, not a script.

We can achieve this by setting up a user agent.

What is my user-agent?

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"}

Adding the headers parameter to the code above should solve the permission-denied problem.

url = "https://www.realestate.com.au/rent/in-clayton/"
response = requests.get(url, headers = headers)
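
Before moving on, it's worth a quick sanity check that the header trick actually worked; a minimal sketch (200 means the server accepted the request):

# check the status code before trying to parse anything
if response.status_code == 200:
    print("OK, received %i bytes of html" % len(response.content))
else:
    print("request blocked or failed with status %i" % response.status_code)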

Structure the web scraping code

The structure of the code should be:

for (each suburb):
    parse the HTML of the suburb, get the total number of pages for that suburb
    for (each page):
        get the information for each property
        save the information to a dataframe

To achieve this, we need to use Chrome's "Inspect" feature to examine the structure of the webpage and find the class names of the elements that hold the data we want.
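
Once you have found a candidate class name in the inspector (for example the pagination links used in the next step), you can quickly confirm it matches something on a real page; a minimal sketch, reusing the headers defined above (the class names may change if the site is redesigned):

# fetch one listing page and count the elements matching the class name from the inspector
response = requests.get("https://www.realestate.com.au/rent/in-clayton/list-1", headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
links = soup.find_all(class_="pagination__link rui-button-basic")
print("found %i pagination links" % len(links))
for link in links:
    print(link.get_text())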

Get the total number of pages for iteration

def get_num_pages(suburb, headers):
    '''
    a sub function to get the total number of pages of a suburb
    arguments:
        suburb: a string of suburb name
        headers: user agent headers

    return:
        page_num: total number of pages of a suburb, integer
    '''
    # get the number of pages available from the pagination links on the first page
    firstpage = "https://www.realestate.com.au/rent/in-" + suburb + "/list-1"
    response = requests.get(firstpage, headers=headers)
    if response.status_code != 200:
        print('Link Error, try again')
    else:
        soup = BeautifulSoup(response.content, 'html.parser')
        pages = soup.find_all(class_="pagination__link rui-button-basic")
        return int(pages[-1].get_text())
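
For example, calling it for a single suburb (the result depends on the listings online at the time):

# example call: how many pages of rentals does Clayton have right now?
num_pages = get_num_pages("clayton", headers)
print("clayton currently has %s pages of rental listings" % num_pages)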

Get the information for each property

def get_features(response, df):
    '''
    this function extracts features from the html and appends them to a dataframe
    arguments:
        response: output of requests.get(url)
        df: a dataframe to store the features

    return:
        df: the dataframe containing the features
        exit: 1 if the page has no records (so the caller can stop), otherwise 0
    '''
    if response.status_code != 200:
        print('Link Error, try again')
        return df, 0  # return the dataframe unchanged so the caller can continue
    exit = 0
    soup = BeautifulSoup(response.content, 'html.parser')
    html = list(soup.children)[2]
    cards = html.find(class_="tiered-results tiered-results--exact")
    articles = cards.select("div article")
    if len(articles) == 0:
        print("no records on this page")
        # exit flag so the caller stops scanning the remaining pages
        exit = 1
        return df, exit
    for article in articles:
        card = article.find(class_="residential-card__content")
        price = card.find(class_="property-price").get_text().split()[0][1:]
        address_card = card.find(class_="residential-card__address-heading")
        address = address_card.select("h2 a span")[0].get_text()
        address_link = 'https://www.realestate.com.au' + address_card.find_all("a")[0].get('href')
        try:
            feature_bedroom = card.find(class_="general-features__icon general-features__beds").get_text()
        except:
            feature_bedroom = 0
        try:
            feature_bathroom = card.find(class_="general-features__icon general-features__baths").get_text()
        except:
            feature_bathroom = 0
        try:
            feature_parking = card.find(class_="general-features__icon general-features__cars").get_text()
        except:
            feature_parking = 0
        property_type = card.find(class_="residential-card__property-type").get_text()
        try:
            agent_brand = article.find(class_="branding__image").get('alt')
        except:
            agent_brand = np.nan
        try:
            agent_name = article.find(class_="agent__name").get_text()
        except:
            agent_name = np.nan
        # follow the listing link for the bond, availability date and full description
        # (uses the global `headers` defined in the main script)
        detail_response = requests.get(address_link, headers=headers)
        if detail_response.status_code == 200:
            detail_soup = BeautifulSoup(detail_response.content, 'html.parser')
            try:
                bond = detail_soup.select("div.property-info__property-price-details p")[0].get_text()
            except:
                bond = np.nan
            available_date = detail_soup.find(class_="property-info__footer-content").get_text()
            property_details = detail_soup.find(class_="property-description__content").get_text()
        else:
            bond, available_date, property_details = np.nan, np.nan, np.nan
            print('LINK_ERROR')
        df_temp = pd.DataFrame({'property_type': [property_type],
                                'price': [price],
                                'bond': [bond],
                                'address': [address],
                                'feature_bedroom': [feature_bedroom],
                                'feature_bathroom': [feature_bathroom],
                                'feature_parking': [feature_parking],
                                'agent_brand': [agent_brand],
                                'agent_name': [agent_name],
                                'available_date': [available_date],
                                'property_details': [property_details]})
        df = df.append(df_temp)  # note: for pandas >= 2.0, use df = pd.concat([df, df_temp]) instead
    return df, exit
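
It's easier to debug the extractor on a single page before wiring up the full loop; a minimal sketch using the same column names as the dataframe defined later:

# try the extractor on the first page of one suburb
test_df = pd.DataFrame(columns=['property_type', 'price', 'bond', 'address', 'feature_bedroom',
                                'feature_bathroom', 'feature_parking', 'agent_brand',
                                'agent_name', 'available_date', 'property_details'])
test_response = requests.get("https://www.realestate.com.au/rent/in-clayton/list-1", headers=headers)
test_df, exit_flag = get_features(test_response, test_df)
print("extracted %i records" % test_df.shape[0])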

Define the main function to go through the suburb list

def webspider(suburbs, headers, df):
    '''
    grab rental information for each suburb in a list of suburbs
    arguments:
        suburbs: an array of suburb names
        headers: user agent headers for the browser
        df: an empty dataframe with column names

    return:
        df: a dataframe with rental data
    '''
    print('we will process %i suburbs' % len(suburbs))
    for index, suburb in enumerate(suburbs):
        page_num = get_num_pages(suburb, headers)
        print('processing %i suburb: %s, total of %i pages' % (index + 1, suburb, page_num))
        url_base = "https://www.realestate.com.au/rent/in-" + suburb + "/list-"
        # enumerate through all pages of a suburb (+1 so the last page is included)
        for i in range(1, page_num + 1):
            url = url_base + str(i)
            # sleep(3)
            response = requests.get(url, headers=headers)
            print('start processing page %s' % i)
            df, exit = get_features(response, df)
            print('end of this page, have a total of %i records' % df.shape[0])
            if exit == 1:
                print('no more records, exit')
                break

    return df

Go through all pages for all suburbs

Across all the suburbs, I get around 2000 records. Sometimes the web server detects the scraping and refuses my connection because of the sheer number of requests (sorry :( ). You can add sleep(3) to pause the code for 3 seconds before the next request; in my case this did not help, so I split the suburb list into two lists, run one list at a time, and merge the dataframes later.
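
Another option, which I did not end up using, is to retry a blocked request after a longer pause instead of splitting the list; a minimal sketch with a hypothetical get_with_retry helper:

def get_with_retry(url, headers, retries=3, wait=10):
    # hypothetical helper: retry a blocked request a few times, pausing between attempts
    for attempt in range(retries):
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response
        print("attempt %i failed with status %i, waiting %i seconds" % (attempt + 1, response.status_code, wait))
        sleep(wait)
    return response  # give up and return the last response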

# define inputs for the spider function
# define the user-agent header
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"}
# define the suburb lists (split in two to keep the number of requests per run down)
suburbs1 = ["clayton", "chadstone", "mount+waverley,+vic+3149",
            "huntingdale", "glen+waverley,+vic+3150",
            "springvale,+vic+3171", "melbourne", "south+melbourne,+vic+3205"]
suburbs2 = ["southbank,+vic+3006", "south+yarra,+vic+3141", "toorak,+vic+3142",
            "richmond,+vic+3121"]
# initialise empty dataframes with column names
df1 = pd.DataFrame(columns=['property_type', 'price', 'bond', 'address', 'feature_bedroom',
                            'feature_bathroom', 'feature_parking', 'agent_brand',
                            'agent_name', 'available_date', 'property_details'])
df2 = pd.DataFrame(columns=['property_type', 'price', 'bond', 'address', 'feature_bedroom',
                            'feature_bathroom', 'feature_parking', 'agent_brand',
                            'agent_name', 'available_date', 'property_details'])

df1 = webspider(suburbs1, headers, df1)  # fixed: previously passed suburbs2, scraping the same list twice
df2 = webspider(suburbs2, headers, df2)

Finally, merge the dataframes and save them to a CSV file

def df_output(df1, df2):
    df = pd.concat([df1, df2])
    from datetime import date
    today = date.today()
    d1 = today.strftime("%d_%m_%Y")
    csvtitle = d1 + "_rentals_raw.csv"
    print("output to file: %s" % csvtitle)
    df.to_csv(csvtitle, index=False)

df_output(df1,df2)
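
In the next part, the raw CSV can be loaded straight back into pandas; for example (the filename follows the date_rentals_raw.csv pattern above, so adjust it to your scrape date):

# load the raw data back for the analysis in Part 2
df_raw = pd.read_csv("01_01_2020_rentals_raw.csv")  # example filename only
print("loaded %i records" % df_raw.shape[0])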