Webscraping Metacritic
Posted on Wed 27 February 2019 in python
When looking for interesting data sets to play with, I often find myself searching through Kaggle thinking 'this looks awesome, but if only it had this attribute' or 'if only it covered a different time period'. Most recently, this happened when looking through the Kaggle data set for metacritic. The natural next step was to start scraping my own data, which is the process I want to cover in this post.
Inevitably, metacritic's html structure has changed since the process documented in the Kaggle kernel, so a reworking of the html parsing has been required, which also makes this an excellent excuse to get some practice in natural languages and parsing, thanks Chomsky!
Requesting pages
The search functionality within metacritic allows you to pull up all the games for a given year. While it's possible to use the same techniques to cycle through all years and pull out the maximum page number, in keeping with the ethics of scraping (requesting data at a reasonable rate), this sample code just pulls the first page.
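If you do want to loop over every page, the page count can be read from the pagination links. A minimal sketch, assuming a `last_page` list item like the one the results pages used at the time (the sample html and the `max_page_from` helper are my own, made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up pagination snippet; class names are an assumption based on
# how metacritic's results pages were structured at the time
sample_html = """
<ul class="pages">
  <li class="page first_page"><a href="?page=0">1</a></li>
  <li class="page last_page"><a href="?page=3">4</a></li>
</ul>
"""

def max_page_from(html):
    """Return the number of result pages, defaulting to 1 when unpaginated."""
    soup = BeautifulSoup(html, 'html.parser')
    last = soup.find('li', class_='last_page')
    return int(last.get_text(strip=True)) if last else 1

print(max_page_from(sample_html))  # 4
```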
We've also initialised a pandas dataframe, ready for the information to be stored.
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import seaborn as sns

platform = "all"  # "all", "ps4", "xboxone", "pc", "ps3", "wii-u", etc.
year = 2018
max_page = 1

meta = pd.DataFrame(columns=["title", "platform", "developer", "critic_score",
                             "critic_score_no", "user_score", "user_score_no",
                             "genre", "release_date", "rating", "url"])

for i in range(0, max_page):
    URL = 'http://www.metacritic.com/browse/games/score/metascore/year/all/filtered?year_selected=' + str(year) + '&sort=desc&page=' + str(i)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    request = requests.get(URL, headers=headers)
    soup_main = BeautifulSoup(request.content.decode(), 'html.parser')

    # Write the requested page to a text file for analysis of html structure
    #file = open(r'metacritic_output_main.txt', 'w', encoding='utf-8')
    #file.write(soup_main.prettify())
    #file.close()

    # For demonstration purposes, let's just print the first few lines...
    print(soup_main.prettify()[:207])
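When scaling this up to multiple pages, a simple delay between requests keeps the rate reasonable. A minimal sketch (`throttled` and the 2-second default are my own arbitrary choices, not anything metacritic requires):

```python
import time

def throttled(pages, delay=2.0):
    """Yield each page number, sleeping afterwards to space out requests."""
    for page in pages:
        yield page
        time.sleep(delay)

# Usage: replace `range(0, max_page)` in the loop above with
# `throttled(range(0, max_page))`
```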
Parsing
While the search page gives summary information for each of the games, the individual game pages have additional useful content. So the first step is to pull the url for each game; we'll then request each of the pages and extract the relevant information.
Writing the parsing commands is time consuming and requires some trial and error, but hopefully this gives a good overview of a varied range of data types. Some exception catching is required where data isn't present, which is something you'll inevitably come across during testing.
Finally, the data is written to the pandas dataframe, after some data-type conversion and conditioning. In this demonstration code, I'm just writing the first three games in the search to the database.
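The repeated try/except pattern in the code below can also be factored into a small helper, if you prefer. A sketch (`safe_extract` is my own name, not part of the original kernel):

```python
def safe_extract(getter, default=None):
    """Evaluate a chain of .find() calls, returning `default` if any
    step fails because an element is missing (i.e. returns None)."""
    try:
        return getter()
    except AttributeError:
        return default

# Usage with a soup object, mirroring the developer lookup below:
# developer = safe_extract(
#     lambda: soup_page.find('li', class_='summary_detail developer')
#                      .find('span', class_='data')
#                      .get_text(strip=True))
```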
link = []
for game in soup_main.find('div', class_='body').find_all('div', class_='product_wrap'):
    url = game.find('div', class_='basic_stat product_title').find('a').get("href")
    link.append(url)

for url in link[0:3]:  # Remove the indexing to add all of the games to the database
    URL = 'http://www.metacritic.com' + url
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    request = requests.get(URL, headers=headers)
    soup_page = BeautifulSoup(request.content.decode(), 'html.parser')

    title = soup_page.find('div', class_='product_title').find('h1').contents[0]
    platform = soup_page.find('div', class_='product_title').find('span', class_='platform').get_text(strip=True)

    try:
        developer = soup_page.find('li', class_='summary_detail developer').find('span', class_='data').get_text(strip=True)
    except AttributeError:
        developer = None

    critic_score = soup_page.find('span', itemprop='ratingValue').contents[0]
    if critic_score == 'tbd':
        critic_score = -990

    try:
        critic_score_no = soup_page.find('div', class_='details main_details').find('span', class_='count').find('a').find('span').get_text(strip=True)
    except AttributeError:
        critic_score_no = -990

    user_score = soup_page.find('div', class_=re.compile('metascore_w user large')).contents[0]
    if user_score == 'tbd':
        user_score = -99  # becomes -990 after the x10 scaling below

    try:
        user_score_no = soup_page.find('div', class_='details side_details').find('div', class_='summary').find('a').contents[0].split(' ')[0]
    except AttributeError:
        user_score_no = -990

    genre = soup_page.find('li', class_='summary_detail product_genre').find_all('span', class_='data')
    for i in range(0, len(genre)):
        genre[i] = genre[i].contents[0]  # Select primary genre entry

    release_date = soup_page.find('li', class_='summary_detail release_data').find('span', class_='data').contents[0]

    try:
        rating = soup_page.find('li', class_='summary_detail product_rating').find('span', class_='data').contents[0]
    except AttributeError:
        rating = None

    meta = meta.append({"title": str(title),
                        "platform": str(platform),
                        "developer": str(developer),
                        "critic_score": int(critic_score),
                        "critic_score_no": int(critic_score_no),
                        "user_score": int(float(user_score)*10),
                        "user_score_no": int(user_score_no),
                        "genre": str(genre[0]),
                        "release_date": str(release_date),
                        "rating": str(rating),
                        "url": url},
                       ignore_index=True)

    #print('Output: ', len(meta), ' of up to', max_page*100)

print(meta.head(3))
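Before moving on to analysis, the scraped dataframe can be written out to csv so that later sessions don't need to re-request any pages. A minimal sketch with a made-up stand-in dataframe (`demo_meta` is hypothetical, just to keep the example self-contained):

```python
import pandas as pd

# Stand-in for the scraped dataframe (two made-up rows)
demo_meta = pd.DataFrame({'title': ['Game A', 'Game B'],
                          'critic_score': [90, 85],
                          'user_score': [88, 70]})

# index=False keeps the row index out of the file, so a later
# read_csv reproduces the original columns exactly
demo_meta.to_csv('meta_2018.csv', index=False)
roundtrip = pd.read_csv('meta_2018.csv')
print(roundtrip.equals(demo_meta))  # True
```

On the real dataframe, the equivalent call would be `meta.to_csv('data/meta_2018.csv', index=False)`, matching the file loaded in the next section.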
Analysing the data
I'm still keen to get to grips with the dataset properly, and I have a few theories I'd like to test out (for example, do the stats really back up the idea that Nintendo fanboys view games through rose-tinted spectacles?). But just to give a quick impression, let's load the full data, do a quick clean-up and take a quick look at the relationship between critic and user scores. Feel free to download the data and take a look yourself.
meta = pd.read_csv('data/meta_2018.csv')

# Set datetime data type and create year feature
meta['release_date'] = pd.to_datetime(meta['release_date'])
meta['year'] = meta['release_date'].dt.year
#meta['release_date'] = meta['release_date'].dt.date

# Set feature data types
for col in ['platform', 'developer', 'genre', 'year']:
    meta[col] = meta[col].astype('category')
for col in ['critic_score', 'critic_score_no', 'user_score', 'user_score_no']:
    meta[col] = meta[col].astype('int')

# Set null values to np.nan
for col in ['critic_score', 'critic_score_no', 'user_score', 'user_score_no']:
    meta[col] = meta[col].replace(-990, np.nan)

sns.regplot(meta.critic_score, meta.user_score)
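To put a number on the relationship the regression plot shows, the Pearson correlation can be computed directly on the two columns. A sketch with made-up scores (the real values come from the scrape above):

```python
import pandas as pd

# Made-up scores purely for illustration
df = pd.DataFrame({'critic_score': [95, 80, 70, 60, 50],
                   'user_score':   [90, 85, 60, 65, 40]})

# Pairwise Pearson correlation; pandas skips NaN pairs automatically
r = df['critic_score'].corr(df['user_score'])
print(round(r, 2))  # 0.93
```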