
Web scraping with Google Colab

Scraping data from the web with Google Colab and pandas is amazingly easy. Thanks to dataquest for the video and the tutorial.

Posted by Fabio on Oct. 7, 2022

Google Colab

What you basically need is either a Google account to access Google Colab or a local installation of Jupyter Notebook. I will use Google Colab because I want to be able to work on my code from multiple devices - it is just way easier on a hosted platform.
Saving is automatically routed to your Google Drive, and mounting the drive in your code is a no-brainer - one huge advantage of hosting simple projects on Colab / Google Drive.
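
To see that the mapping really works, you can mount the drive and simply list the folder - a minimal sketch using the same relative mountpoint as the export further down:

# mount google drive and list its contents as a quick check that the mapping works
from google.colab import drive
import os

drive.mount('drive')
print(os.listdir('drive/My Drive'))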

Let's dive directly into my use case and I'll explain my small project. My initial goal was to get detailed information about football results from the Swiss Super League. The first major task was to find a good page to scrape data from. transfermarkt would have been my favorite pick, but fbref is way easier to scrape because the page structure is less complex.
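
If you want to take a quick look at the page structure yourself before writing the scraper, a small sketch like the following helps - the URL is the same one used further down, everything else is just for inspection:

# quick look at which tables fbref exposes on the league overview page
import requests
from bs4 import BeautifulSoup

page = requests.get("https://fbref.com/en/wettbewerbe/57/Swiss-Super-League-Statistiken")
soup = BeautifulSoup(page.text, "html.parser")

# print the id and the css classes of every table on the page
# the standings table we are after carries the class "stats_table"
for table in soup.find_all("table"):
    print(table.get("id"), table.get("class"))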

The code should be pretty straightforward - I commented where comments are needed. First we need to import a few things: requests to fetch the pages, BeautifulSoup for parsing the HTML data, numpy for mathematical functions and of course pandas for all the data analysis parts. Mounting Google Drive is also essential for saving purposes.

A special thanks to the dataquest YouTube tutorial, which helped a lot in finding simple solutions to my problems!

Initial steps

# importing the needed libs
from google.colab import drive
from bs4 import BeautifulSoup
import pandas as pd, numpy as np, time, requests

# show up to 999 rows when displaying dataframes
pd.set_option('display.max_rows', 999)

# mount google drive for exports / imports
drive.mount('drive')

Scraping the data

After having checked the page, the plan was quite simple: go to the most recent overview page and scrape the team urls and their detailed data. Then jump to the previous season and redo the steps until you have enough season data. The outer loop is needed for the team url scraping and the inner loop scrapes the detailed team stats. Sounds fairly easy, doesn't it?

# starting url to scrape from
data_url = "https://fbref.com/en/wettbewerbe/57/Swiss-Super-League-Statistiken"
# define seasons to scrape
years = list(range(2022, 2018, -1))
# create object container to append data to while looping through urls
all_matches = []

for year in years:
    data = requests.get(data_url)
    soup = BeautifulSoup(data.text)
    # the table with the css class "stats_table" is the one we are looking for
    # if multiple entries exist, pick the first one - this is the one we need
    standings_table = soup.select('table.stats_table')[0]

    # extract all team urls from the standings table
    links = [l.get("href") for l in standings_table.find_all('a')]
    links = [l for l in links if l and '/squads/' in l]
    team_urls = [f"https://fbref.com{l}" for l in links]
    
    # prepare the data_url variable for the previous year for the next loop
    previous_season = soup.select("a.prev")[0].get("href")
    data_url = f"https://fbref.com{previous_season}"
    
    # loop through all the team urls individually
    for team_url in team_urls:
        # create the team name based on the team url - needs a little bit of work
        team_name = team_url.split("/")[-1].replace("-Stats", "").replace("-", " ")
        data = requests.get(team_url)
        # find the table "Scores & Fixtures" and parse data
        matches = pd.read_html(data.text, match="Scores & Fixtures")[0]
        soup = BeautifulSoup(data.text)
        # get all relevant "Shooting" infos from the shooting tab
        links = [l.get("href") for l in soup.find_all('a')]
        links = [l for l in links if l and 'all_comps/shooting/' in l]
        data = requests.get(f"https://fbref.com{links[0]}")
        shooting = pd.read_html(data.text, match="Shooting")[0]
        # drop the top level of the multi-index column header
        shooting.columns = shooting.columns.droplevel()

        # try to merge the shooting data with the "Scores & Fixtures" table on the Date column
        try:
            team_data = matches.merge(shooting[["Date", "Gls", "Sh", "SoT", "SoT%", "G/Sh", "G/SoT", "Dist", "PK", "PKatt"]], on="Date")
        # it is possible that there are no shooting stats for some games, so this exception is needed
        except ValueError:
            continue

        # we are only interested in Swiss Super League games - no cup or international games needed
        # no clue why the league name changed in 2022 ... 
        team_data = team_data[team_data["Comp"].isin(["Super Lg", "Swiss Super League"])]
        # we need to add the season year and the team's name to the table
        team_data["season"] = year
        team_data["team"] = team_name
        # append the team's dataframe to our result list
        all_matches.append(team_data)
        # wait 1 sec to scrape the next team - just to prevent too many connections at once
        time.sleep(1)

In the end, we just need to concatenate our list of dataframes and export the result to Google Drive as a csv file (OK, I don't like upper-case headers, which is why I lower-case everything before the actual export).

# concat our final list
results_df = pd.concat(all_matches)
# lower case all the headers
results_df.columns = [c.lower() for c in results_df.columns]
# export the data to your google drive folder as a csv file
results_df.to_csv("drive/My Drive/Colab Notebooks/matches.csv")
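
If you want to pick up the data again in a later session, just mount the drive and read the csv back in - a minimal sketch that only uses the columns the loop above added itself:

# read the exported csv back in from google drive (mount the drive again first, as shown above)
matches = pd.read_csv("drive/My Drive/Colab Notebooks/matches.csv", index_col=0)
# quick sanity check: how many league games were scraped per season and team
print(matches.groupby(["season", "team"]).size())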

Enjoy the code! Use it as a basis for your project and improve it - let me know in the comments if it was useful.
