Before anything
How to convert a Jupyter notebook to a Markdown file?
$ jupyter nbconvert --to markdown input.ipynb
Introduction
Hunting for a rental on a tight timeline is always a painful experience. For me, the process is so painful that sometimes, even when I'm genuinely unhappy with the property I'm renting, I'll choose to live with it rather than move.
So here comes this project. I will use www.realestate.com.au as the data source. The key steps are:
Web scraping rental information for selected suburbs of Melbourne
Data analysis and exploration
Which suburb is the most expensive?
Is there any market trend over the project period?
What are the common features in the rental description?
What is the biggest agency and who is the most popular agent?
etc…
Machine learning
If I had a property to rent out (oh, I wish), what price should I list it at?
NLP: can we auto-generate a property description?
Estimate the rental price for a given property?
etc…
Import the libs
There are multiple libraries available for extracting information from HTML; as a newbie, Requests and BeautifulSoup should work for me.
import requests
import re
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from time import sleep
Install and use Requests
First, we need to install Requests (pip install requests) so we can download the HTML. Then use requests.get(url) to fetch a page.
However, this will most likely be refused with a "permission denied" style response, because most websites don't welcome web scraping. What we need to do is pretend we're a browser, not a script.
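As a minimal sketch of that idea, the request below sends a browser-like User-Agent header (the same one defined later in this post) and checks the status code. The Clayton URL is just an example search page.

import requests

# pretend to be a regular Chrome browser; without this header the site
# tends to refuse the request
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"}

url = "https://www.realestate.com.au/rent/in-clayton/list-1"  # example search page
response = requests.get(url, headers=headers)
print(response.status_code)  # 200 means the page was returned normally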
The overall flow of the spider looks like this:

for (each suburb):
    parse the HTML of the suburb's first results page, get the total number of pages for that suburb
    for (each page):
        get the information for each property
        save the information to a dataframe
To achieve this, we need to use Chrome's "Inspect" feature to examine the webpage and find the class names of the elements we want.
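For example, once Inspect reveals the class name attached to an element, BeautifulSoup can pull that element out directly. The sketch below reuses the response fetched above and the "property-price" class used later in this post; the class names reflect the site's markup at the time and may have changed since.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')
# each listing's advertised price sits in an element whose class is "property-price"
prices = [tag.get_text() for tag in soup.find_all(class_="property-price")]
print(prices[:3])  # the first few advertised rents on the page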
Get the total number of pages for iteration
def get_num_pages(suburb, headers):
    '''
    a sub function to get the total number of pages for a suburb
    arguments:
        suburb: a string of the suburb name
        headers: user agent headers
    return:
        page_num: total number of pages for the suburb, integer
    '''
    # get the number of pages available
    firstpage = "https://www.realestate.com.au/rent/in-" + suburb + "/list-1"
    response = requests.get(firstpage, headers=headers)
    if response.status_code != 200:
        print('Link Error, try again')
        return 0  # return 0 instead of None so the caller's page loop simply does nothing
    else:
        soup = BeautifulSoup(response.content, 'html.parser')
        pages = soup.find_all(class_="pagination__link rui-button-basic")
        return int(pages[-1].get_text())
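A quick sanity check of this helper might look like the following; the suburb string comes from the lists defined later in the post, and headers is the browser User-Agent header shown above.

# assumes `headers` (the browser User-Agent dict) is already defined
page_num = get_num_pages("clayton", headers)
print("clayton has %i pages of rental listings" % page_num)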
def get_features(response, df):
    '''
    this function extracts features from the html and styles them into a dataframe
    arguments:
        response: output of requests.get(url)
        df: a dataframe to store the features
    return:
        df: the dataframe containing the features
    '''
    if response.status_code != 200:
        print('Link Error, try again')
        return df, 0  # skip this page; the caller can move on to the next one
    else:
        exit = 0
        soup = BeautifulSoup(response.content, 'html.parser')
        html = list(soup.children)[2]
        cards = html.find(class_="tiered-results tiered-results--exact")
        articles = cards.select("div article")
        if len(articles) == 0:
            print("no records on this page")
            # todo: add an exit feature so no scanning for the rest of the pages
            exit = 1
            return df, exit
        for article in articles:
            card = article.find(class_="residential-card__content")
            price = card.find(class_="property-price").get_text().split()[0][1:]
            address_card = card.find(class_="residential-card__address-heading")
            address = address_card.select("h2 a span")[0].get_text()
            address_link = 'https://www.realestate.com.au' + address_card.find_all("a")[0].get('href')
            try:
                feature_bedroom = card.find(class_="general-features__icon general-features__beds").get_text()
            except:
                feature_bedroom = 0
            try:
                feature_bathroom = card.find(class_="general-features__icon general-features__baths").get_text()
            except:
                feature_bathroom = 0
            try:
                feature_parking = card.find(class_="general-features__icon general-features__cars").get_text()
            except:
                feature_parking = 0
            property_type = card.find(class_="residential-card__property-type").get_text()
            try:
                agent_brand = article.find(class_="branding__image").get('alt')
            except:
                agent_brand = np.nan
            try:
                agent_name = article.find(class_="agent__name").get_text()
            except:
                agent_name = np.nan
            # follow the listing link for bond, availability and description
            # (headers comes from the enclosing module scope)
            detail_response = requests.get(address_link, headers=headers)
            if detail_response.status_code == 200:
                detail_soup = BeautifulSoup(detail_response.content, 'html.parser')
                try:
                    bond = detail_soup.select("div.property-info__property-price-details p")[0].get_text()
                except:
                    bond = np.nan
                available_date = detail_soup.find(class_="property-info__footer-content").get_text()
                property_details = detail_soup.find(class_="property-description__content").get_text()
            else:
                bond, available_date, property_details = np.nan, np.nan, np.nan
                print('LINK_ERROR')
            df_temp = pd.DataFrame({'property_type': [property_type],
                                    'price': [price],
                                    'bond': [bond],
                                    'address': [address],
                                    'feature_bedroom': [feature_bedroom],
                                    'feature_bathroom': [feature_bathroom],
                                    'feature_parking': [feature_parking],
                                    'agent_brand': [agent_brand],
                                    'agent_name': [agent_name],
                                    'available_date': [available_date],
                                    'property_details': [property_details]})
            df = df.append(df_temp)
        return df, exit
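Before wiring up the full spider, the extractor can be tried on a single results page. This is only a sketch that assumes headers is already defined; the column names are the same ones used for the empty dataframes later in the post.

# try the extractor on the first results page of one suburb
test_df = pd.DataFrame(columns=['property_type', 'price', 'bond', 'address',
                                'feature_bedroom', 'feature_bathroom',
                                'feature_parking', 'agent_brand', 'agent_name',
                                'available_date', 'property_details'])
response = requests.get("https://www.realestate.com.au/rent/in-clayton/list-1",
                        headers=headers)
test_df, exit_flag = get_features(response, test_df)
print(test_df.shape)  # one row per listing on the page, 11 columns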
Define the main function to go through the suburb list
def webspider(suburbs, headers, df):
    '''
    grab rental information for each suburb in a list of suburbs
    arguments:
        suburbs: an array of suburb names
        headers: user agent headers for the browser
        df: an empty dataframe with column names
    return:
        df: a dataframe with rental data
    '''
    print('we will process %i suburbs' % len(suburbs))
    for index, suburb in enumerate(suburbs):
        page_num = get_num_pages(suburb, headers)
        print('processing suburb %i: %s, total of %i pages' % (index + 1, suburb, page_num))
        url_base = "https://www.realestate.com.au/rent/in-" + suburb + "/list-"
        # enumerate through all pages of the suburb, from 1 up to and including page_num
        for i in range(1, page_num + 1):
            url = url_base + str(i)
            # sleep(3)
            response = requests.get(url, headers=headers)
            print('start processing page %s' % i)
            df, exit = get_features(response, df)
            print('end of this page, have a total of %i records' % df.shape[0])
            if exit == 1:
                print('no more records, exit')
                break
    return df
Go through all pages for all suburbs
Across all the suburbs I get around 2,000 records. Sometimes the web server detects the scraping and refuses my connection because of the sheer number of requests (I'm sorry... :( ). You can add sleep(3) to pause your code for 3 seconds before the next request; in my case this wasn't enough. So I split the suburb list into two lists, run one list at a time, and merge the dataframes later.
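If you would rather keep a single suburb list, another option (not what I did here, just a sketch) is to wrap requests.get in a small helper, say polite_get, that pauses between attempts and retries a few times when the server refuses the connection:

def polite_get(url, headers, retries=3, wait=3):
    '''fetch a url politely: pause between attempts and retry on failure'''
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return response
        except requests.exceptions.RequestException:
            pass  # connection refused or dropped; wait and try again
        sleep(wait)
    return None  # give up after `retries` failed attempts

The spider functions above could then call polite_get in place of requests.get.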
# define inputs for the spider function
# define the user agent header
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"}
# define the suburb lists (split in two to keep the request volume down)
suburbs1 = ["clayton", "chadstone", "mount+waverley,+vic+3149",
            "huntingdale", "glen+waverley,+vic+3150",
            "springvale,+vic+3171", "melbourne", "south+melbourne,+vic+3205"]
suburbs2 = ["southbank,+vic+3006", "south+yarra,+vic+3141", "toorak,+vic+3142",
            "richmond,+vic+3121"]
# initialise empty dataframes with the column names
columns = ['property_type', 'price', 'bond', 'address', 'feature_bedroom',
           'feature_bathroom', 'feature_parking', 'agent_brand',
           'agent_name', 'available_date', 'property_details']
df1 = pd.DataFrame(columns=columns)
df2 = pd.DataFrame(columns=columns)
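The setup above doesn't show the actual run, so the following is only my assumed wiring of the pieces: scrape each list separately, then merge the two results. The pd.concat call and the CSV filename are illustrative, not from the original post.

# scrape the two suburb lists separately to keep the request volume down
df1 = webspider(suburbs1, headers, df1)
df2 = webspider(suburbs2, headers, df2)

# merge the two runs into one dataframe and keep a snapshot on disk
df_all = pd.concat([df1, df2], ignore_index=True)
df_all.to_csv("rental_data.csv", index=False)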