More often than not, easy access to data via an API is not possible. Scraping the webpage might then be the only practical way to get at the data. Doing this in Python is fairly straightforward, with the help of some libraries, and a basic understanding of HTML.
This is a super basic tutorial, but I am just writing these points down to remind myself.
We first import the following libraries -
import bs4
import requests
from slugify import slugify
import os
Note: Remember to install awesome-slugify (pip install awesome-slugify) instead of slugify.
Next, specify the webpage you would like to scrape data from, say the category page for lists of visual art topics on Wikipedia.
websource = ['https://en.wikipedia.org/wiki/Category:Lists_of_visual_art_topics']
Some string manipulation first. I need the Wikipedia domain later, so simply split the string at '/wiki' and take the first item that is returned.
domain = websource[0].split("/wiki")[0]
You will get this.
'https://en.wikipedia.org'
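Splitting on '/wiki' works for this address, but it depends on that exact text appearing in the URL. As an alternative sketch (not what the notebook does), the standard library's urllib.parse can pull out the scheme and host directly. The variable name parts is my own.

```python
from urllib.parse import urlsplit

websource = ['https://en.wikipedia.org/wiki/Category:Lists_of_visual_art_topics']

# urlsplit breaks the URL into scheme, netloc (host), path, query and fragment
parts = urlsplit(websource[0])

# Rebuild just the scheme and host, for prefixing relative links later
domain = f"{parts.scheme}://{parts.netloc}"
print(domain)
```

This gives the same value as the split above, without relying on where '/wiki' sits in the string.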
Next, we get the content of the page in websource.
html = requests.get(websource[0]).content
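The one-liner above assumes the request succeeds. A slightly more defensive sketch, assuming you want a timeout and an exception on HTTP errors, might look like this. The helper name fetch_html and the 10 second default are my own choices, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    # timeout stops the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

With this in place, html = fetch_html(websource[0]) would stand in for the line above.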
Then we parse it with BeautifulSoup, which lets us get at all the links using the findAll function (an older alias of find_all). The 'html5lib' parser needs to be installed separately (pip install html5lib).
soup = bs4.BeautifulSoup(html, 'html5lib')
links = set(soup.findAll('a', href=True))
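To see what the link extraction returns without hitting the network, here is a small sketch run against a hand-written HTML snippet. The sample markup is made up for illustration, and I use the built-in 'html.parser' so the example runs without html5lib.

```python
import bs4

# A tiny hand-written page standing in for the downloaded HTML
sample_html = """
<html><body>
  <a href="/wiki/List_of_mathematical_artists">Mathematical artists</a>
  <a href="/wiki/List_of_sculptors">Sculptors</a>
  <a name="top">Anchor without href</a>
</body></html>
"""

# 'html.parser' ships with Python; the post uses 'html5lib' instead
soup = bs4.BeautifulSoup(sample_html, 'html.parser')

# href=True keeps only <a> tags that actually have an href attribute
links = soup.find_all('a', href=True)
hrefs = [link['href'] for link in links]
print(hrefs)
```

The third anchor is skipped because it has no href, which is why link['href'] is safe to use in the loop that follows.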
Next, we loop over the links to find one whose address contains 'mathematical', which leads to the page with the list of mathematical artists, then download and parse that page.
for link in links:
    if 'mathematical' in link['href']:
        page = requests.get(domain+link['href']).content
        clean_page = bs4.BeautifulSoup(page, 'html5lib')
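Two things to keep in mind with the loop: it keeps only the last matching page if several links match, and joining domain and href by hand assumes every href is relative. A possible variation, sketched below, collects all matches and builds the addresses with urljoin. The function name find_links and the sample HTML are hypothetical, and 'html.parser' is used so the sketch runs without html5lib.

```python
from urllib.parse import urljoin
import bs4

def find_links(html, keyword, base_url):
    """Return absolute URLs of links whose href contains keyword."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    matches = []
    for link in soup.find_all('a', href=True):
        if keyword in link['href']:
            # urljoin handles both relative ('/wiki/...') and absolute hrefs
            matches.append(urljoin(base_url, link['href']))
    return matches

# Made-up snippet standing in for the category page
sample_html = (
    '<a href="/wiki/List_of_mathematical_artists">x</a>'
    '<a href="/wiki/Other">y</a>'
)
urls = find_links(sample_html, 'mathematical', 'https://en.wikipedia.org')
print(urls)
```

Each URL in the returned list can then be fetched and parsed in the same way as the category page.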
The Jupyter notebook with the code is here