Crawling Website Data with Python

For data science tasks, the first step is usually data collection. Among all data collection methods, collecting data with web crawler technology stands out because of its relatively low cost.

Web crawler technology usually consists of three steps, namely (1) retrieve web page URLs, (2) download the web pages and (3) extract and save data from the downloaded pages.

The first and the third steps are usually chained together: URLs extracted in step (3) feed back into step (1), and this loop is what the word "crawler" refers to.
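To make the loop concrete, here is a minimal sketch of such a crawler skeleton; the seed URL and the page limit are only placeholders, and the packages used (Requests, BeautifulSoup) are introduced below.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

seen = set()
queue = ['http://www.example.com/']  ## placeholder seed URL

while queue and len(seen) < 100:  ## stop after 100 pages in this sketch
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)
    page = requests.get(url).text  ## step (2): download the page
    for a in BeautifulSoup(page, 'lxml').find_all('a', href=True):
        queue.append(urljoin(url, a['href']))  ## back to step (1): new URLs to visit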

Since the Python programming language is well suited to the tasks of the second and third steps, web crawler technology is usually associated with Python.

In this article, I am going to explain how to use Python to perform the second and third steps of web crawler technology.

Download Web Page with Python

Two good Python packages for this step are Requests and Selenium. The difference is that Selenium drives a real browser (so JavaScript gets executed), while Requests only makes the raw HTTP request.

Demo of Requests

import requests

## request headers copied from a browser session; the Cookie and Referer
## values below are site-specific and should be replaced with your own
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': '_ga=GA1.2.1331385028.1520683261; _gid=GA1.2.74339605.1524632336; Hm_lvt_11e98055978f56e335e962403a8d0618=1524713182,1524726423,1524726430,1524726464; Hm_lpvt_11e98055978f56e335e962403a8d0618=1524741787',
    # 'Host': 'www.johannhuang.com',
    'Referer': 'http://www.johannhuang.com/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'
}

## optional proxy configuration; fill in your own proxy credentials,
## or drop the `proxies` argument from requests.get() to connect directly
proxyHost = 'your_proxy_host'
proxyPort = 'your_proxy_port'
proxyUser = 'your_username'
proxyPass = 'your_password'
proxyMeta = '%(user)s:%(pass)s@%(host)s:%(port)s' % {
    'host': proxyHost,
    'port': proxyPort,
    'user': proxyUser,
    'pass': proxyPass,
}
proxies = {
    'http': 'http://' + proxyMeta,
    'https': 'https://' + proxyMeta,
    # 'http': 'http://127.0.0.1:1087',
    # 'https': 'https://127.0.0.1:1087',
}

## make the request; drop the `proxies` argument if no proxy is needed
r = requests.get('http://www.google.com/', headers=header, proxies=proxies)

print(r.status_code)
print(r.text)
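A natural follow-up, not part of the original demo, is to save the downloaded page to disk so that the extraction step can work on it later; a minimal sketch:

## save the downloaded page for the extraction step
with open('cache.html', 'w', encoding='utf-8') as f:
    f.write(r.text)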

Demo of Selenium

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

## create the webdriver -- pick ONE of the following, depending on your setup
## (a) point executable_path at the geckodriver binary explicitly
# driver = webdriver.Firefox(executable_path='../Sites/~drivers/geckodriver-v0.15.0-macos/geckodriver')
## (b) point FirefoxBinary at the Firefox application binary (not at geckodriver), e.g. on macOS:
# driver = webdriver.Firefox(firefox_binary=FirefoxBinary('/Applications/Firefox.app/Contents/MacOS/firefox'))
## (c) rely on geckodriver being on PATH, e.g. after `brew install geckodriver` on macOS
driver = webdriver.Firefox()

## make the request
driver.get('https://www.google.com')

## extract information
cookies = driver.get_cookies()                    ## list of cookie dicts
driver.get_screenshot_as_file('./screenshot.png') ## screenshots are PNG images, not HTML

print(driver.page_source)

## mock mouse and keyboard actions (the XPaths below are site-specific examples)
# driver.find_element_by_xpath("html/body/div[1]/div[2]/div/div/div[2]/input").clear()
# driver.find_element_by_xpath("html/body/div[1]/div[2]/div/div/div[2]/input").send_keys("男鞋")  ## type a search keyword ("men's shoes")
# driver.find_element_by_xpath("html/body/div[1]/div[2]/div/div/div[2]/span").click()

Extract Data from Web Page with Python

A good helper package for this step is BeautifulSoup, used below with the lxml parser and combined with Python's built-in re module for regular expressions.

Demo of BeautifulSoup

import re
from bs4 import BeautifulSoup

## create BeautifulSoup from html string
bsobj = BeautifulSoup(r.text, "lxml") ## for requests
bsobj = BeautifulSoup(driver.page_source, "lxml") ## for selenium

## locate elements with the find() / find_all() methods
style = bsobj.find('i', attrs={'class': 'resource-tag'}).string

c1 = bsobj.find('div', attrs={'class': 'crumb'}).find_all('a')

## `review` below stands for one previously located review element,
## e.g. obtained from a find() or find_all() call such as:
review = bsobj.find('div', attrs={'class': 'review-item'})  ## class name is an example
name = review.find('div', attrs={'class': 'dper-info'}).find('a').text.strip()
link = review.find('div', attrs={'class': 'dper-info'}).find('a')['href']

## locate elements with methods like find_next_sibling()
actions = review.find('span', attrs={'class': 'actions'}).find_all('a')[0].find_next_sibling('em')

## locate content with a regex applied to the raw tag string
## (`date` is assumed to be a tag located earlier, e.g. via find())
date2 = re.findall('>(.+)<', str(date))
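To close the loop with step (3), the extracted fields are typically written to disk. Here is a minimal sketch that collects reviewer names and links into a CSV file, reusing the hypothetical 'review-item' class from above:

import csv

rows = []
for review in bsobj.find_all('div', attrs={'class': 'review-item'}):  ## example class name
    a = review.find('div', attrs={'class': 'dper-info'}).find('a')
    rows.append([a.text.strip(), a['href']])

## save the extracted data
with open('reviews.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)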

More

For the first step, retrieving web page URLs, Charles and the Chrome developer tools are two good tools for inspecting web requests and responses.

Acknowledgement

Some code pieces are based on code provided by Hu Xin, a Ph.D. candidate at SJTU.

