Webscraping Metacritic

Posted on Wed 27 February 2019 in python

When looking for interesting data sets to play with, I often find myself searching through Kaggle thinking 'this looks awesome, but if only it had this attribute' or 'if only it covered a different time period'. Most recently this happened when looking through the Kaggle data set for Metacritic. The natural next step was to start scraping my own data, which is the process I want to cover in this post.

Inevitably the html syntax for metacritic has changed since the process documented in the Kaggle kernel, so a reworking of the html parsing has been required, which also makes this an excellent excuse to get some practice in languages and parsing, thanks Chomsky!

Requesting pages

The search functionality within metacritic allows you to pull up all the games for a given year. While it's possible to use the same techniques to cycle through all years and pull out the maximum page number, in an effort to stick to the ethics of scraping and request data at a reasonable rate, this sample code just pulls the first page.

We've also initialised a pandas dataframe, ready for the information to be stored.
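As a sketch of how the paging could be factored out while keeping the request rate polite, here are two small helpers. The names `page_urls` and `polite_get` are my own invention, not part of any library; the delay value is an arbitrary choice.

```python
import time

BASE = ('http://www.metacritic.com/browse/games/score/metascore/year/all/'
        'filtered?year_selected={year}&sort=desc&page={page}')

def page_urls(year, max_page):
    """Build the search URLs for pages 0..max_page-1 of a given year."""
    return [BASE.format(year=year, page=i) for i in range(max_page)]

def polite_get(session, url, delay=2.0):
    """Fetch a URL, then pause so we don't hammer the server."""
    response = session.get(url)
    time.sleep(delay)  # rate limit between requests
    return response

print(page_urls(2018, 2)[1])
```

Passing a `requests.Session` into `polite_get` also lets the connection be reused across the many page requests.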

In [6]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import seaborn as sns

platform = "all"  # "all", "ps4", "xboxone", "pc", "ps3", "wii-u", etc.
year = 2018
max_page = 1

meta = pd.DataFrame(columns=["title", "platform", "developer", "critic_score", "critic_score_no", "user_score", "user_score_no", "genre", "release_date", "rating", "url"])

for i in range(0, max_page):
    URL = 'http://www.metacritic.com/browse/games/score/metascore/year/all/filtered?year_selected=' + str(year) + '&sort=desc&page=' + str(i)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    request = requests.get(URL, headers=headers)
    soup_main = BeautifulSoup(request.content.decode(), 'html.parser')

# Write the requested page to a text file for analysis of html structure
#file = open(r'metacritic_output_main.txt', 'w', encoding='utf-8')
#file.write(soup_main.prettify())
#file.close()

# For demonstration purposes, let's just print the first few lines..
print(soup_main.prettify()[:180])
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "https://www.w3.org/TR/html4/strict.dtd">

<html xml:lang="en">
<head>
<title>Best Video Games for 2018 - Metacritic</title>

Parsing

While the search page gives summary information for each of the games, the individual game pages have additional useful content. So the first step is to pull the url for each game; we'll then request each of those pages and extract the relevant information.

Writing the parsing commands is time consuming and requires some trial and error, but hopefully this gives a good overview of a varied range of data types. Some exception catching is required where data isn't present, something you'll inevitably come across during testing.
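The repeated try/except blocks below could also be factored into a small helper that runs a scraping expression and falls back to a default when an element is missing. The name `safe_extract` is hypothetical, my own naming rather than anything from BeautifulSoup:

```python
def safe_extract(extract, default=None):
    """Call a zero-argument scraping function; return `default` if the
    chained find()/get_text() calls hit a missing element."""
    try:
        return extract()
    except AttributeError:
        return default

# Usage would look like:
# developer = safe_extract(
#     lambda: soup_page.find('li', class_='summary_detail developer')
#                      .find('span', class_='data').get_text(strip=True))
```

Wrapping the lookup in a lambda delays evaluation until inside the try block, so the AttributeError from a failed `find` (which returns None) is caught in one place.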

Finally the data is written to the pandas dataframe, after some definition of data types and conditioning. In this demonstration code, I'm just writing the first three games from the search to the dataframe.

In [7]:
link = []
    
for game in soup_main.find('div', class_='body').find_all('div', class_='product_wrap'):
    url = game.find('div', class_='basic_stat product_title').find('a').get("href")
    link.append(url)

for url in link[0:3]:  # Remove the indexing to add all of the games to the database
    URL = 'http://www.metacritic.com' + url
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    request = requests.get(URL, headers=headers)
    soup_page = BeautifulSoup(request.content.decode(), 'html.parser')
    title = soup_page.find('div', class_='product_title').find('h1').contents[0]
    platform = soup_page.find('div', class_='product_title').find('span', class_='platform').get_text(strip=True)
    try: developer = soup_page.find('li', class_='summary_detail developer').find('span', class_='data').get_text(strip=True)
    except AttributeError: developer = None
    critic_score = soup_page.find('span', itemprop='ratingValue').contents[0]
    if critic_score == 'tbd': critic_score = -990  # sentinel for missing scores
    try: critic_score_no = soup_page.find('div', class_='details main_details').find('span', class_='count').find('a').find('span').get_text(strip=True)
    except AttributeError: critic_score_no = -990
    user_score = soup_page.find('div', re.compile('metascore_w user large')).contents[0]
    if user_score == 'tbd': user_score = -99  # becomes the -990 sentinel after the *10 scaling below
    try: user_score_no = soup_page.find('div', class_='details side_details').find('div', class_='summary').find('a').contents[0].split(' ')[0]
    except AttributeError: user_score_no = -990
    genre = soup_page.find('li', class_='summary_detail product_genre').find_all('span', class_='data')
    for i in range(0, len(genre)):  genre[i] = genre[i].contents[0]  # Convert genre tags to text; the primary entry is selected below
    release_date = soup_page.find('li', class_='summary_detail release_data').find('span', class_='data').contents[0]        
    try: rating = soup_page.find('li', class_='summary_detail product_rating').find('span', class_='data').contents[0]
    except AttributeError: rating = None

    meta = meta.append({"title": str(title),
                    "platform": str(platform),
                    "developer": str(developer),
                    "critic_score": int(critic_score),
                    "critic_score_no": int(critic_score_no),
                    "user_score": int(float(user_score)*10),
                    "user_score_no": int(user_score_no),
                    "genre": str(genre[0]),
                    "release_date": str(release_date),
                    "rating": str(rating),
                    "url": url},
        ignore_index=True)

#print('Output: ', len(meta), ' of up to', max_page*100)

print(meta.head(3))
                   title       platform         developer critic_score  \
0  Red Dead Redemption 2       Xbox One    Rockstar Games           97   
1  Red Dead Redemption 2  PlayStation 4    Rockstar Games           97   
2             God of War  PlayStation 4  SCE Santa Monica           94   

  critic_score_no user_score user_score_no             genre  release_date  \
0              33         74          1912  Action Adventure  Oct 26, 2018   
1              98         79          7163  Action Adventure  Oct 26, 2018   
2             118         91          9437  Action Adventure  Apr 20, 2018   

  rating                                        url  
0      M       /game/xbox-one/red-dead-redemption-2  
1      M  /game/playstation-4/red-dead-redemption-2  
2      M             /game/playstation-4/god-of-war  

Analysing the data

I'm still keen to get to grips with the dataset properly, and I have a few theories I'd like to test out (for example, do the stats really back up the idea that Nintendo fanboys view games through rose-tinted spectacles). But just to give a quick impression, let's load the full data, do a quick clean-up and take a quick look at the relationship between critic and user scores. Feel free to download the data and take a look yourself.
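Before plotting, a quick numerical check of the critic/user relationship is worthwhile. A minimal sketch with made-up numbers standing in for the scraped data (the real values come from meta_2018.csv), showing the sentinel clean-up and a Pearson correlation:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the scraped data
meta = pd.DataFrame({'critic_score': [97, 94, -990, 80],
                     'user_score': [74, 91, 65, -990]})

# The -990 sentinel marks missing scores; drop those rows before correlating
clean = meta.replace(-990, np.nan).dropna()
print(clean['critic_score'].corr(clean['user_score']))
```

`Series.corr` defaults to the Pearson coefficient, which is what `sns.regplot`'s fitted line visualises below.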

In [8]:
meta = pd.read_csv('data/meta_2018.csv')

# Set datetime data type and create year feature
meta['release_date'] = pd.to_datetime(meta['release_date'])
meta['year'] = meta['release_date'].dt.year
#meta['release_date'] = meta['release_date'].dt.date

# Set feature data types
for col in ['platform', 'developer', 'genre', 'year']:
    meta[col] = meta[col].astype('category')
    
for col in ['critic_score', 'critic_score_no', 'user_score', 'user_score_no']:
    # Cast to int, then set the -990 null sentinel to np.nan
    meta[col] = meta[col].astype('int').replace(-990, np.nan)

sns.regplot(meta.critic_score, meta.user_score)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x192df45b198>