Artificially Reducing Intelligence (Pt. 1)
Posted on Sun 06 June 2021 in python
Carrying on with the fermentation theme, my long-term goal is to use machine learning to design a (possibly) awesome beer recipe. However, in order to train a model, we need some data linking recipes to consumer opinions. While there are lots of online resources for opinions, commercial beer recipes are often closely guarded secrets, and it's especially rare for them to be shared in a structured data format. One (sort of) exception to this rule is the BrewDog DIY Dog: a catalogue of all BrewDog recipes, published annually in PDF format so that homebrewers can have a go themselves. While the recipes are largely written in a consistent way, parsing the data into a machine-readable format posed some challenges - and a couple of different libraries (PyPDF2 and tabula-py) were required.
Putting together this dataset was fairly quick and dirty, and there are cases where the data isn't perfectly formatted. For example, consistent names for categorical variables (eg. malt and yeast names) and information on hop attributes (eg. bittering, flavour) had to be input manually. In the next part of this series I want to look at ways to improve the data further using some new techniques I'm keen to try out.
import pandas as pd
import PyPDF2
import tabula
# first part of code shown only for reference
# pypdf2 used to extract any text available
# tabula required for name extraction due to metadata formatting (top, left, bottom, right)
# file available from https://www.brewdog.com/uk/community/diy-dog
#pdfFileObj = open(diy_dog, 'rb')
#pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
def return_name(x):
    try:
        return tabula.read_pdf(diy_dog, pages=x+21, area=[[60, 0, 130, 800]],
                               relative_area=False, multiple_tables=False)[0].columns.tolist()[0]
    except Exception:
        return None
def str_to_digit(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return 0

def strlist_to_list(x):
    return list(x.strip('[]').replace("'", '').split(', '))
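These helpers come into play once the manually corrected spreadsheet is loaded back in: Excel stores the parsed hop lists as strings, so they need converting back into Python lists, and the weight tokens need coercing to numbers. A quick sketch of how they behave (the cell value here is made up for illustration):

```python
def str_to_digit(x):
    """Coerce a token to float, falling back to 0 for non-numeric values."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return 0

def strlist_to_list(x):
    """Parse a stringified list like "['a', 'b']" back into a Python list."""
    return list(x.strip('[]').replace("'", '').split(', '))

cell = "['Simcoe', '25', 'Start', 'Bitter']"   # hypothetical spreadsheet cell
tokens = strlist_to_list(cell)
print(tokens)                                   # ['Simcoe', '25', 'Start', 'Bitter']
print(sum(str_to_digit(t) for t in tokens))     # 25.0 - only the numeric token counts
```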
# text formatting
# id (1-415)
#bd = pd.DataFrame({'id':range(1,416,1)})
# extract text
#bd['pdf'] = bd['id'].map(lambda x: pdfReader.getPage(x+20).extractText().replace('\n',''))
# name
#bd['name'] = bd['id'].map(return_name, na_action=None)
# volume
#bd['vol_L'] = bd['pdf'].str.extract('VOLUME([\.0-9]+)')
#bd['boil_vol_L'] = bd['pdf'].str.extract('BOIL VOLUME([0-9]+)')
# stats
#bd['avb'] = bd['pdf'].str.extract('ABV([\.0-9]+)%')
#bd['target_og'] = bd['pdf'].str.extract('TARGET OG([0-9]+)')
#bd['target_fg'] = bd['pdf'].str.extract('TARGET FG([0-9]+)')
#bd['ebc'] = bd['pdf'].str.extract('EBC[\sA-Z]*([0-9]+)')
#bd['srm'] = bd['pdf'].str.extract('SRM([0-9]+)')
#bd['ph'] = bd['pdf'].str.extract('PH([\.0-9]+)')
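Each of these `str.extract` calls pulls the first regex capture group out of the flattened page text into a new column. A minimal illustration on a hypothetical snippet of extracted text:

```python
import pandas as pd

# hypothetical flattened page text, mimicking the PDF extraction output
pdf = pd.Series(['VOLUME20BOIL VOLUME25ABV5.6%TARGET OG1044TARGET FG1010EBC15'])
print(pdf.str.extract(r'ABV([\.0-9]+)%'))      # -> '5.6'
print(pdf.str.extract(r'TARGET OG([0-9]+)'))   # -> '1044'
```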
# process
#bd['mash_temp'] = bd['pdf'].str.extract('MASH TEMP\s*([0-9]+)°')
#bd['mash_time'] = bd['pdf'].str.extract('°F\s*([0-9]+)\s*min')
#bd['ferm_temp'] = bd['pdf'].str.extract('FERMENTATION\s*([0-9]+)°')
# grist
#bd['grist'] = bd['pdf'].str.extract('MALT(.*lb)')
#bd.at[184,'grist'] = 'Extra Pale4.38kg9.64lbCrystal 1500.16kg0.34lbDark Crystal0.16kg0.34lbMunich1.25kg2.75lbRye0.63kg1.38lb'
#bd.at[202,'grist'] = 'Pilsner1.69kg3.72lbWheat0.38kg0.83lbFlaked Oats0.13kg0.28lb'
#bd.at[213,'grist'] = 'Propino Pale Malt4.1kg9.04lb'
#bd.at[347,'grist'] = 'Pale Ale5.16kg11.4lbWheat Malt 0.36kg0.8lbFlaked Oat Malt0.60kg1.3lb'
#bd.at[395,'grist'] = 'Pale Ale 3.6kg7.9lbWheat Malt 0.6kg1.3lbFlaked Oat Malt0.24kg0.5lb'
#bd['grist'] = bd['grist'].str.split('kg')
#bd['malt_no'] = bd['grist'].str.len() - 1
#bd['malt_1'] = bd['grist'].map(lambda x: x[0] if type(x) == list else None)
#bd['malt_1_weight'] = bd['malt_1'].str.extract('[a-z0\s](\d+\.*\d*)')
#bd['malt_1'] = bd['malt_1'].str.extract('(.*[a-z\s]0*)\d+\.*\d*')
#bd['malt_2'] = bd['grist'].map(lambda x: x[1] if type(x) == list and len(x) > 2 else None)
#bd['malt_2_weight'] = bd['malt_2'].str.extract('lb.*(\d\.\d+)') # '\d*\.*\d*lb.*(\d+.\d+)'
#bd['malt_2'] = bd['malt_2'].str.extract('lb(.*)\d\.\d+') # '\d\.\d+lb(.*)\d\.\d+'
#bd['malt_3'] = bd['grist'].map(lambda x: x[2] if type(x) == list and len(x) > 3 else None)
#bd['malt_3_weight'] = bd['malt_3'].str.extract('lb.*(\d\.\d+)')
#bd['malt_3'] = bd['malt_3'].str.extract('lb(.*)\d\.\d+')
# hops
# adjuncts (eg. honey/coffee) are sometimes classified as a hop addition or twist
#bd['hops'] = bd['pdf'].str.extract('Attribute(.*(?:Bitter|Flavour|Aroma|Alpha|Twist))') # Only picking up bitter
#bd['hops_parse'] = bd['hops'].str.findall('(?:Start|Middle|End|end|Whirlpool|FV|FWH|HD\d|WHD)|'
# '(?:DryHop|Dry Hop)|'
# '(?:[A-Z][a-z]*[\.\/\s]+){0,3}[A-Z][a-z]*|'
# '\d{1,3}(?:\.\d*){0,1}'
# )
# load manual updates
# ?! hop sections contain adjuncts - should these be removed?
bd = pd.read_excel('data/brewdog_data_manual.xlsx')
bd['hops_parse'] = bd['hops_parse'].fillna('[]').map(strlist_to_list, na_action=None)
bd['hop_no'] = bd['hops_parse'].str.len().floordiv(4)
bd['hop_total_weight'] = bd['hops_parse'].map(lambda x: sum([str_to_digit(i) for i in x]) if type(x) == list else None)
# each hop occupies four tokens in hops_parse: name, weight, timing, type
bd['hop1'] = bd['hops_parse'].map(lambda x: x[0] if type(x) == list and len(x) > 3 else None)
bd['hop1_weight'] = bd['hops_parse'].map(lambda x: x[1] if type(x) == list and len(x) > 3 else None)
bd['hop1_timing'] = bd['hops_parse'].map(lambda x: x[2] if type(x) == list and len(x) > 3 else None)
bd['hop1_type'] = bd['hops_parse'].map(lambda x: x[3] if type(x) == list and len(x) > 3 else None)
bd['hop2'] = bd['hops_parse'].map(lambda x: x[4] if type(x) == list and len(x) > 7 else None)
bd['hop2_weight'] = bd['hops_parse'].map(lambda x: x[5] if type(x) == list and len(x) > 7 else None)
bd['hop2_timing'] = bd['hops_parse'].map(lambda x: x[6] if type(x) == list and len(x) > 7 else None)
bd['hop2_type'] = bd['hops_parse'].map(lambda x: x[7] if type(x) == list and len(x) > 7 else None)
bd['hop3'] = bd['hops_parse'].map(lambda x: x[8] if type(x) == list and len(x) > 11 else None)
bd['hop3_weight'] = bd['hops_parse'].map(lambda x: x[9] if type(x) == list and len(x) > 11 else None)
bd['hop3_timing'] = bd['hops_parse'].map(lambda x: x[10] if type(x) == list and len(x) > 11 else None)
bd['hop3_type'] = bd['hops_parse'].map(lambda x: x[11] if type(x) == list and len(x) > 11 else None)
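The twelve assignments above all follow one pattern: each hop occupies four consecutive tokens (name, weight, timing, type) in the parsed list. As a sketch, the same extraction could be compacted into a loop - the sample rows here are made up for illustration:

```python
import pandas as pd

FIELDS = ['', '_weight', '_timing', '_type']

def hop_columns(tokens, max_hops=3):
    """Flatten a parsed hop token list into hop1..hopN name/weight/timing/type."""
    out = {}
    for h in range(max_hops):
        for f, field in enumerate(FIELDS):
            idx = 4 * h + f
            out[f'hop{h + 1}{field}'] = tokens[idx] if idx < len(tokens) else None
    return out

# hypothetical parsed rows: two hops for the first beer, none for the second
hops_parse = pd.Series([['Simcoe', '25', 'Start', 'Bitter',
                         'Cascade', '12.5', 'End', 'Aroma'],
                        ['']])
hop_cols = pd.DataFrame(hops_parse.map(hop_columns).tolist())
print(hop_cols[['hop1', 'hop2', 'hop2_type']])
```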
# yeast
# some require manual specification; will look to improve consistency of representation in the next part
bd['yeast'] = bd['pdf'].str.extract('YEAST(.*?)(?:Ž|BASICS|MALT|METHOD|HOPS|TWIST|\#\d+)')
bd['attenuation'] = bd['pdf'].str.extract('LEVEL([\.0-9]+)%')
bd.at[16,'yeast'] = 'S189'
bd.at[29,'yeast'] = 'Wyeast-Ardennes 3522'
bd.at[49,'yeast'] = 'Champagne'
bd.at[171,'yeast'] = 'Wyeast American Ale II Strain 1272'
bd.at[235,'yeast'] = 'Wyeast 2007 - Pilsen Lager'
bd.at[306,'yeast'] = 'Wyeast 1056 - American Ale'
bd.at[334,'yeast'] = 'Wyeast 1272'
bd.at[337,'yeast'] = 'Wyeast 1272'
bd.at[339,'yeast'] = 'Wyeast 1272'
bd.at[341,'yeast'] = 'Wyeast 1272'
bd.at[343,'yeast'] = 'Wyeast 1272'
bd.at[345,'yeast'] = 'Wyeast 1272'
bd.at[346,'yeast'] = 'Wyeast 1272'
bd.at[348,'yeast'] = 'Wyeast 1272'
# drop unstructured metadata
bd.drop(columns=['pdf','grist','hops','hops_parse'], inplace=True)
Now we have our dataset, the next challenge is linking the beer names to review aggregation sites. I found that no single site contained all of the beers, so I decided to use two - namely Beeradvocate and RateBeer - to try and expand the coverage of our data. So let's get to some web scraping, then find the most likely candidate name using fuzzy string matching.
import requests
from bs4 import BeautifulSoup
from fuzzywuzzy import process
# beer advocate
url = 'https://www.beeradvocate.com/beer/profile/16315/?view=beers&show=arc'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
request = requests.get(url, headers=headers)
soup = BeautifulSoup(request.content.decode(), 'html.parser')
rows = soup.find_all('table')[2].find_all('td', class_='hr_bottom_light')
rows = '@ '.join([r.get_text(strip=True) for r in rows])
rows = [r.split('@ ') for r in rows.split('@ -@ ')]
ratings_ba = pd.DataFrame(data=rows, columns=['name','style','abv','rating_no','rating_avg','']).drop(columns=[''])
bd_names = bd['name'].tolist()
ratings_ba['name_link'] = None
ratings_ba['name_link'] = ratings_ba['name'].map(lambda x: process.extract(x, bd_names, limit=1)[0], na_action=None)
ratings_ba['name_link'] = ratings_ba['name_link'].map(lambda x: x[0] if x[1] >= 90 else None, na_action=None)
# ratebeer
url = 'https://www.ratebeer.com/Ratings/Beer/ShowBrewerBeers.asp?BrewerID=8534'
request = requests.get(url, headers=headers)
soup = BeautifulSoup(request.content.decode(), 'html.parser')
rows = soup.find_all('table')[0].find_all('tr')
rows = [[e.get_text() for e in r.find_all('td')] for r in rows[1:]]
ratings_rb = pd.DataFrame(data=rows, columns=['name','abv','date_added','rate','rating_avg','style_pct','rating_no',''])
ratings_rb = ratings_rb.drop(columns=['date_added','rate','']).replace({'':None}).dropna()
ratings_rb['name'] = ratings_rb['name'].replace({'BrewDog':'', 'ISA':'IPA'}, regex=True)
ratings_rb['name'] = ratings_rb['name'].str.split('(?: - | // )').apply(lambda x: ' '.join(x[0:2]))
ratings_rb['name_link'] = None
ratings_rb['name_link'] = ratings_rb['name'].map(lambda x: process.extract(x, bd_names, limit=1)[0], na_action=None)
ratings_rb['name_link'] = ratings_rb['name_link'].map(lambda x: x[0] if x[1] >= 90 else None, na_action=None)
# summary stats
print(f"Beeradvocate: {ratings_ba['name_link'].count()}/{len(ratings_ba)} found.")
print(f"RateBeer: {ratings_rb['name_link'].count()}/{len(ratings_rb)} found.")
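The matching above uses fuzzywuzzy's `process.extract`, which returns `(candidate, score)` pairs scored 0-100, keeping a link only when the score clears 90. The same keep-only-confident-matches idea can be sketched with the standard library's `difflib` (a stand-in for fuzzywuzzy here; its ratio is 0-1, so the threshold changes accordingly, and the catalogue names are made up):

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.9):
    """Return the closest candidate name, or None if nothing scores highly enough."""
    scored = [(c, SequenceMatcher(None, name.lower(), c.lower()).ratio())
              for c in candidates]
    match, score = max(scored, key=lambda pair: pair[1])
    return match if score >= threshold else None

catalogue = ['Punk IPA', 'Elvis Juice', 'Dead Pony Club']  # hypothetical recipe names
print(best_match('punk ipa', catalogue))   # 'Punk IPA'
print(best_match('Hazy Jane', catalogue))  # None - nothing close enough
```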
Finally, we combine the data and see if we can spot any trends.
# join the datasets together
ratings_ba.rename(columns={'rating_avg':'rating_ba'}, inplace=True)
ratings_rb.rename(columns={'rating_avg':'rating_rb'}, inplace=True)
bd = pd.merge(bd, ratings_ba[['name_link','rating_ba']], left_on='name', right_on='name_link', how='left')
bd = bd.drop(columns=['name_link']).drop_duplicates(subset=['id'])
bd = pd.merge(bd, ratings_rb[['name_link','rating_rb']], left_on='name', right_on='name_link', how='left')
bd = bd.drop(columns=['name_link']).drop_duplicates(subset=['id'])
import matplotlib.pyplot as plt
import seaborn as sns
# set datatype of ratings to be numeric and create average to extend coverage
bd['rating_ba'] = pd.to_numeric(bd['rating_ba']).replace({0:None})
bd['rating_rb'] = pd.to_numeric(bd['rating_rb']).replace({0:None})
bd['rating_avg'] = bd[['rating_ba','rating_rb']].mean(axis=1)
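One subtlety worth noting: `mean(axis=1)` skips missing values by default, which is exactly what extends the coverage - a beer rated on only one site still gets an average. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# made-up ratings: the second beer only appears on RateBeer
ratings = pd.DataFrame({'rating_ba': [4.0, np.nan], 'rating_rb': [3.0, 3.5]})
# NaN is ignored row-wise, so single-site beers keep their score
print(ratings.mean(axis=1).tolist())  # [3.5, 3.5]
```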
fig, axs = plt.subplots(nrows=4, figsize=(6,12))
sns.scatterplot(x='rating_ba', y='rating_rb', data=bd, ax=axs[0])
sns.scatterplot(x='avb', y='rating_avg', data=bd, ax=axs[1])
sns.scatterplot(x='ebc', y='rating_avg', data=bd, ax=axs[2])
sns.histplot(data=bd, x='hop_no', ax=axs[3])
I was pleasantly surprised to see a positive correlation between the rankings from Beeradvocate and RateBeer. I had always suspected that preference was more subjective, which makes me wonder whether the trend really reflects a more universal sense of overall likeability or quality. The higher ratings for higher abv (alcohol by volume) and ebc (higher = darker colour) were less of a surprise, and it was refreshing to see that simpler hop bills also scored highly, definitely bucking more recent trends.
There's some way to go in cleaning up this dataset for use in machine learning. In the next part we'll be standardising the way categorical variables are represented, transforming the data to better represent the distributions of these variables, and then looking again at better ways of integrating the recipe and rating datasets.