Google Colab
What you basically need is either a Google account to access Google Colab or a local installation of Jupyter Notebook. I will use Google Colab because I want to be able to work on my code from multiple devices - that is just much easier on a hosted platform. Saving is automatically routed to your Google Drive, and mounting the drive in your code is a no-brainer. That is one huge advantage of hosting simple projects on Colab / GDrive.
Let's dive straight into my use case and I'll explain my small project. My initial goal was to get detailed information about football results from the Swiss Super League. The first major task was to find a good page to scrape data from. Transfermarkt would have been my favorite pick, but fbref is much easier to scrape because the page structure is less complex.
The code should be pretty straightforward - I commented where comments are needed. First we need to import a few things: numpy for mathematical functions, BeautifulSoup for parsing HTML data and of course pandas for all the data-analysis parts. The Google Drive mount is also essential for saving purposes.
A special thanks to the Dataquest YouTube tutorial that helped a lot in finding easy solutions for my problems!
Initial steps
# importing the needed libs
from google.colab import drive
from bs4 import BeautifulSoup
import pandas as pd, numpy as np, time, requests
# show up to 999 rows when displaying dataframes
pd.set_option('display.max_rows', 999)
# mount google drive for exports / imports
drive.mount('drive')
Scraping the data
After checking the page, the plan was quite simple: go to the most recent overview page and scrape the team urls and their detailed data. Then jump to the previous season and repeat the steps until you have enough season data. The outer loop is needed for our team url scraping and the inner loop scrapes the detailed team stats. Should be fairly easy, shouldn't it?
# starting url to scrape from
data_url = "https://fbref.com/en/wettbewerbe/57/Swiss-Super-League-Statistiken"
# define seasons to scrape
years = list(range(2022, 2018, -1))
# create object container to append data to while looping through urls
all_matches = []
for year in years:
    data = requests.get(data_url)
    soup = BeautifulSoup(data.text)
    # the first table with class "stats_table" is the standings table we are looking for
    standings_table = soup.select('table.stats_table')[0]
    # extract all team urls from the standings table
    links = [l.get("href") for l in standings_table.find_all('a')]
    links = [l for l in links if l and '/squads/' in l]
    team_urls = [f"https://fbref.com{l}" for l in links]
    # prepare the data_url variable with the previous season for the next loop iteration
    previous_season = soup.select("a.prev")[0].get("href")
    data_url = f"https://fbref.com{previous_season}"
    # loop through all the team urls individually
    for team_url in team_urls:
        # derive the team name from the team url - needs a little bit of cleanup
        team_name = team_url.split("/")[-1].replace("-Stats", "").replace("-", " ")
        data = requests.get(team_url)
        # find the table "Scores & Fixtures" and parse its data
        matches = pd.read_html(data.text, match="Scores & Fixtures")[0]
        soup = BeautifulSoup(data.text)
        # get all relevant "Shooting" infos from the shooting tab
        links = [l.get("href") for l in soup.find_all('a')]
        links = [l for l in links if l and 'all_comps/shooting/' in l]
        data = requests.get(f"https://fbref.com{links[0]}")
        shooting = pd.read_html(data.text, match="Shooting")[0]
        # drop the outer level of the two-row header
        shooting.columns = shooting.columns.droplevel()
        # try to merge the shooting data with the "Scores & Fixtures" data on the date
        try:
            team_data = matches.merge(shooting[["Date", "Gls", "Sh", "SoT", "SoT%", "G/Sh", "G/SoT", "Dist", "PK", "PKatt"]], on="Date")
        # possible that there are no shooting stats for a given team, so this exception handling is needed
        except ValueError:
            continue
        # we are only interested in Swiss Super League games - no cup or international games needed
        # no clue why the league name changed in 2022 ...
        team_data = team_data[team_data["Comp"].isin(["Super Lg", "Swiss Super League"])]
        # we need to add the season and the team's name to the table
        team_data["season"] = year
        team_data["team"] = team_name
        # append the infos to our list
        all_matches.append(team_data)
        # wait 1 sec before scraping the next team - just to prevent too many requests at once
        time.sleep(1)
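A quick note on the droplevel() call above: pd.read_html returns the "Shooting" table with a two-row header, i.e. a column MultiIndex, and we only want the inner level for the merge. A toy example (with hypothetical header labels, just to illustrate the step):

```python
import pandas as pd

# toy two-level header, mimicking the structure read_html returns for the
# "Shooting" table - the outer level labels here are made up for the demo
cols = pd.MultiIndex.from_tuples([("Standard", "Date"), ("Standard", "Gls")])
shooting = pd.DataFrame([["2021-07-24", 2]], columns=cols)

# drop the outer level, keeping only the inner column names
shooting.columns = shooting.columns.droplevel()
print(list(shooting.columns))  # -> ['Date', 'Gls']
```

After this step the "Date" column is a plain label, which is what the merge with the "Scores & Fixtures" table relies on.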
In the end, we just need to concat our dataframes and export the result to our Google Drive as a csv (OK, I don't like upper-case headers, which is why I lowercase everything before the actual export).
# concat our final list
results_df = pd.concat(all_matches)
# lower case all the headers
results_df.columns = [c.lower() for c in results_df.columns]
# export the data to your google drive folder as a csv file
results_df.to_csv("drive/My Drive/Colab Notebooks/matches.csv")
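One thing to watch: to_csv also writes the dataframe index as an extra unnamed first column unless you pass index=False. A quick local round trip (with a toy frame standing in for results_df) shows how to read the export back cleanly:

```python
import pandas as pd

# toy stand-in for results_df - just to demonstrate the csv round trip
toy = pd.DataFrame({"date": ["2021-07-24"], "team": ["Basel"], "season": [2022]})

toy.to_csv("matches_demo.csv")                       # index written as first column
back = pd.read_csv("matches_demo.csv", index_col=0)  # index_col=0 restores it

print(back.equals(toy))  # -> True
```

The same read works for the file on your mounted drive; just point read_csv at the export path instead of the demo file.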
Enjoy the code! Use it as a basis for your own project and improve it - let me know in the comments if it was useful.