Before anything
How to convert a Jupyter notebook to a Markdown file?
$ jupyter nbconvert --to markdown input.ipynb
Introduction
Hunting for a rental on a tight timeline is always a painful experience. For me, the process is so painful that sometimes, even when I'm genuinely unhappy with the property I'm renting, I'll choose to live with it rather than move.
So here comes this project. I will use www.realestate.com.au as the data source. The key steps are:
Web scraping rental information for selected suburbs of Melbourne
Data analysis and exploration
Which suburb is the most expensive?
Is there any market trend over the project period?
What are the common features in the rental description?
What is the biggest agency and who is the most popular agent?
etc…
Machine learning
If I had a property to rent out (oh, I wish), what price should I list it at?
NLP: can we auto-generate a property description?
Estimate the rental price for a given property?
etc…
Import the libs
There are multiple libraries available for extracting information from HTML; as a newbie, Requests and BeautifulSoup should work for me.
import requests
import re
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from time import sleep
Install and use Requests
First, we need to install Requests (pip install requests) so we can download the HTML. Then use requests.get(url) to fetch a page.
However, this will most likely be refused with a "permission denied" style response, because most websites don't welcome web scraping. What we need to do is pretend we're a browser, not a script.
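As a minimal sketch of that idea, the request below sends a browser-like User-Agent header (the same one defined later in this post) and checks the status code. The Clayton URL is just an example search page.

import requests

# pretend to be a regular Chrome browser; without this header the site
# tends to refuse the request
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"}

url = "https://www.realestate.com.au/rent/in-clayton/list-1"  # example search page
response = requests.get(url, headers=headers)
print(response.status_code)  # 200 means the page was returned normally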
The overall flow of the spider looks like this:

for (each suburb):
    parse the HTML of the suburb's first results page, get the total number of pages for that suburb
    for (each page):
        get the information for each property
        save the information to a dataframe
To achieve this, we need to use Chrome's "Inspect" feature to examine the webpage and find the class names of the elements we want.
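For example, once Inspect reveals the class name attached to an element, BeautifulSoup can pull that element out directly. The sketch below reuses the response fetched above and the "property-price" class used later in this post; the class names reflect the site's markup at the time and may have changed since.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')
# each listing's advertised price sits in an element whose class is "property-price"
prices = [tag.get_text() for tag in soup.find_all(class_="property-price")]
print(prices[:3])  # the first few advertised rents on the page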
Get the total number of pages for iteration
def get_num_pages(suburb, headers):
    '''
    a sub function to get the total number of pages for a suburb
    arguments:
        suburb: a string of the suburb name
        headers: user agent headers
    return:
        page_num: total number of pages for the suburb, integer
    '''
    # get the number of pages available
    firstpage = "https://www.realestate.com.au/rent/in-" + suburb + "/list-1"
    response = requests.get(firstpage, headers=headers)
    if response.status_code != 200:
        print('Link Error, try again')
        return 0  # return 0 instead of None so the caller's page loop simply does nothing
    else:
        soup = BeautifulSoup(response.content, 'html.parser')
        pages = soup.find_all(class_="pagination__link rui-button-basic")
        return int(pages[-1].get_text())
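A quick sanity check of this helper might look like the following; the suburb string comes from the lists defined later in the post, and headers is the browser User-Agent header shown above.

# assumes `headers` (the browser User-Agent dict) is already defined
page_num = get_num_pages("clayton", headers)
print("clayton has %i pages of rental listings" % page_num)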
def get_features(response, df):
    '''
    this function extracts features from the html and styles them into a dataframe
    arguments:
        response: output of requests.get(url)
        df: a dataframe to store the features
    return:
        df: the dataframe containing the features
    '''
    if response.status_code != 200:
        print('Link Error, try again')
        return df, 0  # skip this page; the caller can move on to the next one
    else:
        exit = 0
        soup = BeautifulSoup(response.content, 'html.parser')
        html = list(soup.children)[2]
        cards = html.find(class_="tiered-results tiered-results--exact")
        articles = cards.select("div article")
        if len(articles) == 0:
            print("no records on this page")
            # todo: add an exit feature so no scanning for the rest of the pages
            exit = 1
            return df, exit
        for article in articles:
            card = article.find(class_="residential-card__content")
            price = card.find(class_="property-price").get_text().split()[0][1:]
            address_card = card.find(class_="residential-card__address-heading")
            address = address_card.select("h2 a span")[0].get_text()
            address_link = 'https://www.realestate.com.au' + address_card.find_all("a")[0].get('href')
            try:
                feature_bedroom = card.find(class_="general-features__icon general-features__beds").get_text()
            except:
                feature_bedroom = 0
            try:
                feature_bathroom = card.find(class_="general-features__icon general-features__baths").get_text()
            except:
                feature_bathroom = 0
            try:
                feature_parking = card.find(class_="general-features__icon general-features__cars").get_text()
            except:
                feature_parking = 0
            property_type = card.find(class_="residential-card__property-type").get_text()
            try:
                agent_brand = article.find(class_="branding__image").get('alt')
            except:
                agent_brand = np.nan
            try:
                agent_name = article.find(class_="agent__name").get_text()
            except:
                agent_name = np.nan
            # follow the listing link for bond, availability and description
            # (headers comes from the enclosing module scope)
            detail_response = requests.get(address_link, headers=headers)
            if detail_response.status_code == 200:
                detail_soup = BeautifulSoup(detail_response.content, 'html.parser')
                try:
                    bond = detail_soup.select("div.property-info__property-price-details p")[0].get_text()
                except:
                    bond = np.nan
                available_date = detail_soup.find(class_="property-info__footer-content").get_text()
                property_details = detail_soup.find(class_="property-description__content").get_text()
            else:
                bond, available_date, property_details = np.nan, np.nan, np.nan
                print('LINK_ERROR')
            df_temp = pd.DataFrame({'property_type': [property_type],
                                    'price': [price],
                                    'bond': [bond],
                                    'address': [address],
                                    'feature_bedroom': [feature_bedroom],
                                    'feature_bathroom': [feature_bathroom],
                                    'feature_parking': [feature_parking],
                                    'agent_brand': [agent_brand],
                                    'agent_name': [agent_name],
                                    'available_date': [available_date],
                                    'property_details': [property_details]})
            df = df.append(df_temp)
        return df, exit
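Before wiring up the full spider, the extractor can be tried on a single results page. This is only a sketch that assumes headers is already defined; the column names are the same ones used for the empty dataframes later in the post.

# try the extractor on the first results page of one suburb
test_df = pd.DataFrame(columns=['property_type', 'price', 'bond', 'address',
                                'feature_bedroom', 'feature_bathroom',
                                'feature_parking', 'agent_brand', 'agent_name',
                                'available_date', 'property_details'])
response = requests.get("https://www.realestate.com.au/rent/in-clayton/list-1",
                        headers=headers)
test_df, exit_flag = get_features(response, test_df)
print(test_df.shape)  # one row per listing on the page, 11 columns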
Define the main function to go through the suburb list
def webspider(suburbs, headers, df):
    '''
    grab rental information for each suburb in a list of suburbs
    arguments:
        suburbs: an array of suburb names
        headers: user agent headers for the browser
        df: an empty dataframe with column names
    return:
        df: a dataframe with rental data
    '''
    print('we will process %i suburbs' % len(suburbs))
    for index, suburb in enumerate(suburbs):
        page_num = get_num_pages(suburb, headers)
        print('processing suburb %i: %s, total of %i pages' % (index + 1, suburb, page_num))
        url_base = "https://www.realestate.com.au/rent/in-" + suburb + "/list-"
        # enumerate through all pages of the suburb, from 1 up to and including page_num
        for i in range(1, page_num + 1):
            url = url_base + str(i)
            # sleep(3)
            response = requests.get(url, headers=headers)
            print('start processing page %s' % i)
            df, exit = get_features(response, df)
            print('end of this page, have a total of %i records' % df.shape[0])
            if exit == 1:
                print('no more records, exit')
                break
    return df
Go through all pages for all suburbs
Across all the suburbs I get around 2,000 records. Sometimes the web server detects the scraping and refuses my connection because of the sheer number of requests (I'm sorry... :( ). You can add sleep(3) to pause your code for 3 seconds before the next request; in my case this wasn't enough. So I split the suburb list into two lists, run one list at a time, and merge the dataframes later.
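If you would rather keep a single suburb list, another option (not what I did here, just a sketch) is to wrap requests.get in a small helper, say polite_get, that pauses between attempts and retries a few times when the server refuses the connection:

def polite_get(url, headers, retries=3, wait=3):
    '''fetch a url politely: pause between attempts and retry on failure'''
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return response
        except requests.exceptions.RequestException:
            pass  # connection refused or dropped; wait and try again
        sleep(wait)
    return None  # give up after `retries` failed attempts

The spider functions above could then call polite_get in place of requests.get.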
# define inputs for the spider function
# define the user agent header
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"}
# define the suburb lists (split in two to keep the request volume down)
suburbs1 = ["clayton", "chadstone", "mount+waverley,+vic+3149",
            "huntingdale", "glen+waverley,+vic+3150",
            "springvale,+vic+3171", "melbourne", "south+melbourne,+vic+3205"]
suburbs2 = ["southbank,+vic+3006", "south+yarra,+vic+3141", "toorak,+vic+3142",
            "richmond,+vic+3121"]
# initialise empty dataframes with the column names
columns = ['property_type', 'price', 'bond', 'address', 'feature_bedroom',
           'feature_bathroom', 'feature_parking', 'agent_brand',
           'agent_name', 'available_date', 'property_details']
df1 = pd.DataFrame(columns=columns)
df2 = pd.DataFrame(columns=columns)
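The setup above doesn't show the actual run, so the following is only my assumed wiring of the pieces: scrape each list separately, then merge the two results. The pd.concat call and the CSV filename are illustrative, not from the original post.

# scrape the two suburb lists separately to keep the request volume down
df1 = webspider(suburbs1, headers, df1)
df2 = webspider(suburbs2, headers, df2)

# merge the two runs into one dataframe and keep a snapshot on disk
df_all = pd.concat([df1, df2], ignore_index=True)
df_all.to_csv("rental_data.csv", index=False)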