Is Bayern München the laziest team in the German Bundesliga?

Building an easy-to-use Bundesliga data analysis app with Python and Streamlit

Tim Denzler
9 min read · May 7, 2021
Veltins Arena in Gelsenkirchen by Dominik Kuhn on Unsplash

As the latest Bundesliga season is coming to an end and Bayern München is close to winning its 9th consecutive championship, I recently wondered whether football in Germany has changed at all in the last few years. I often spend my Saturdays watching Bundesliga matches and think to myself: “I could swear matches were more interesting last season”. More goals, more shots on goal, more fouls. However, I am never able to find adequate data or analyses that could confirm my gut feeling. This is quite a paradox, as we live in a time where every move by a football player on the field is tracked, remote AWS servers calculate obscure match stats in real-time (I am looking at you xGoals), and commentators seem to fill the void caused by the 100th pass to the goalie by rambling about how this is the third match in a row in which Lewandowski has worn blue boxer shorts.

Obviously, I am exaggerating. Nevertheless, popular sports websites oftentimes only provide single match facts without any possibility for comprehensive analysis. As someone passionate about all things football and data, I decided to tackle this problem by leveraging Python and Streamlit [1] to build an easy-to-use and intuitive Bundesliga Analysis (BuLiAn) web application.

In this article, I will try to provide a top-level overview of the application and touch upon the following aspects:

  1. Extracting the data
  2. Cleaning the data
  3. Building and deploying the Streamlit application
  4. Generating interesting Bundesliga insights

If you want to jump right in and explore Bundesliga data yourself, click right here: BuLiAn web application hosted by Streamlit

For more details on the actual implementation, make sure to take a look at the GitHub repository: BuLiAn GitHub repository

1. Extracting the data

The first step is to gather as many relevant match facts as possible: goals, shots, fouls, corners, offsides…you name it. All this information is available on the internet and can be extracted for analysis purposes. My primary source of information was a popular German football website. Using the BeautifulSoup library [2] and Jupyter Notebooks, I scraped the website for data ranging from Season 2013/2014 to Season 2019/2020. This is a reasonable range, as Season 2013/2014 is the first season for which data on the distance covered by teams was published. Furthermore, Season 2019/2020 is the most recent completed season. In total, 2,142 matches were scraped and subsequently stored in CSV format.
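To illustrate the general idea, here is a minimal sketch of turning one scraped stats table into a CSV row with BeautifulSoup. The HTML snippet, the `match-stats` class, and the resulting column names are all hypothetical, since the structure of the real website is not described here; only the BeautifulSoup calls themselves are standard.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for one match page; the real site looks different.
SAMPLE_HTML = """
<table class="match-stats">
  <tr><th>Stat</th><th>Home</th><th>Away</th></tr>
  <tr><td>Goals</td><td>2</td><td>1</td></tr>
  <tr><td>Shots on goal</td><td>14</td><td>9</td></tr>
  <tr><td>Fouls</td><td>11</td><td>16</td></tr>
</table>
"""


def parse_match_stats(html: str) -> dict:
    """Flatten one stats table into a dict such as {'home_goals': 2, ...}."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="match-stats")
    record = {}
    for row in table.find_all("tr")[1:]:  # skip the header row
        stat, home, away = (cell.get_text(strip=True) for cell in row.find_all("td"))
        key = stat.lower().replace(" ", "_")
        record[f"home_{key}"] = int(home)
        record[f"away_{key}"] = int(away)
    return record


def to_csv(records: list) -> str:
    """Serialize a list of flat dicts to CSV text (one row per match)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


record = parse_match_stats(SAMPLE_HTML)
csv_text = to_csv([record])
```

In a real scraper, the HTML would come from an HTTP request per match page, and the records of all matches would be collected before writing the file.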

2. Cleaning the data

As is the case with most data analysis projects, the next step is to examine the data and find potential irregularities. As the German Bundesliga has a relatively low number of 18 teams when compared to other European leagues, only 9 matches take place on each of the 34 matchdays. In total, each season should therefore consist of 306 matches. However, when grouping our matches by season, we see that Season 2013/2014 (307 matches) and 2019/2020 (308 matches) seem to have some irregularities in their data.
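The expected match count and the per-season check can be sketched in a few lines of pandas. The `season` column name and the toy data frame are assumptions made for illustration; only the row counts per season matter here.

```python
import pandas as pd

TEAMS = 18
MATCHES_PER_MATCHDAY = TEAMS // 2   # 9 matches, every team plays once per matchday
MATCHDAYS = 2 * (TEAMS - 1)         # 34, every team meets every other team home and away
EXPECTED_MATCHES = MATCHES_PER_MATCHDAY * MATCHDAYS  # 306


def seasons_with_wrong_count(matches: pd.DataFrame) -> pd.Series:
    """Return the seasons whose number of rows differs from the expected 306."""
    counts = matches.groupby("season").size()
    return counts[counts != EXPECTED_MATCHES]


# Toy data: one season with a surplus row, one season with the correct count.
matches = pd.DataFrame({"season": ["2013/2014"] * 307 + ["2014/2015"] * 306})
irregular = seasons_with_wrong_count(matches)
```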

Upon closer investigation of Season 2013/2014, we see that VfB Stuttgart had two home matches against Bayern München according to our data. While I am sure VfB Stuttgart would have been more than happy to get a second chance that season, a quick Google search reveals: Bayern München won the Champions League in the previous season and as such was participating in the FIFA Club World Cup. As a consequence, their match against VfB Stuttgart had to be postponed, and the fixture ended up in the data twice. I guess that is what you call suffering from success! When looking at Season 2019/2020, we see that Werder Bremen and Borussia Mönch…en…glad…that German team had similar issues. Therefore, all duplicate instances are removed. Et voilà, the number of matches per season is correct.
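Finding and removing such duplicates could look roughly like the sketch below. It assumes one row per match with hypothetical `season`, `home_team`, and `away_team` columns, and it keeps the first copy of each fixture, which is an arbitrary choice for this example.

```python
import pandas as pd

# Toy data: the Stuttgart home match against Bayern appears twice.
matches = pd.DataFrame({
    "season": ["2013/2014", "2013/2014", "2013/2014"],
    "home_team": ["VfB Stuttgart", "VfB Stuttgart", "Werder Bremen"],
    "away_team": ["Bayern München", "Bayern München", "VfB Stuttgart"],
})

# A fixture is identified by season, home team, and away team.
key = ["season", "home_team", "away_team"]

# Show every row that belongs to a duplicated fixture (useful for inspection).
duplicates = matches[matches.duplicated(subset=key, keep=False)]

# Keep only the first occurrence of each fixture.
deduplicated = matches.drop_duplicates(subset=key, keep="first").reset_index(drop=True)
```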

After some more data cleaning (such as removing spaces in strings and converting strings to numerical data types), we can further check the data quality. It seems like there is an issue with three matches from Season 2013/2014, as the distance covered during the match was not captured. This leaves three options:

  • Manually search and enter the correct values
  • Remove complete rows (matches)
  • Estimate the values based on previous data

The first strategy did not succeed, as the data was nowhere to be found on the internet (it seems the data for these three matches is forever lost in the void). Removing complete matches seems to be a radical approach, as it would lead to a lack of comparability when looking at absolute values. As such, the most elegant way to solve this issue was to calculate the average distance covered per match by each team and fill the empty cells accordingly.
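The cleaning and imputation steps can be sketched as follows. The example assumes one row per team and match with hypothetical `team` and `distance_km` columns, and all values are made up for illustration.

```python
import pandas as pd

# Toy data with stray spaces, numbers stored as strings, and one missing distance.
raw = pd.DataFrame({
    "team": [" Bayern München", "Bayern München ", "Bayern München",
             "Werder Bremen", "Werder Bremen"],
    "distance_km": ["112.0", "114.0", "", "116.5 ", "118.5"],
})

clean = raw.copy()

# Remove surrounding spaces so that team names group correctly.
clean["team"] = clean["team"].str.strip()

# Convert strings to floats; anything that cannot be parsed becomes NaN.
clean["distance_km"] = pd.to_numeric(clean["distance_km"].str.strip(), errors="coerce")

# Fill each missing value with the average distance of the same team.
team_mean = clean.groupby("team")["distance_km"].transform("mean")
clean["distance_km"] = clean["distance_km"].fillna(team_mean)
```

Because `transform("mean")` returns a series aligned with the original rows, each gap is filled with the mean of its own team rather than the league-wide mean.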

A quick fun fact (or sad, depending on your team affiliation): While checking the data, it came to my attention that on Matchday 8 of Season 2014/2015 Werder Bremen had 0 recorded shots on Bayern München’s goal. However, I quickly realized that this was not an error in the data, but rather a really really bad day for Werder Bremen.

3. Building and deploying the Streamlit application

In order to enable football fans to interact and quickly perform their very own Bundesliga analysis, I decided to develop my application using the Streamlit library. Streamlit not only allows for easy development and sharing of web applications but also provides a variety of well-designed and customizable widgets. In addition, I used the pandas library’s data frames [3] to manipulate the data, and the seaborn library [4] to plot appealing visualizations. In order to improve code readability and minimize the lines of code required, I leveraged Python dictionaries, for example to link labels (selection options displayed by the widgets) with column names.
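The dictionary idea can be sketched like this: widget labels are the keys, data frame columns are the values. The labels, column names, and the `render_stat_picker` helper are hypothetical and not taken from the BuLiAn code; `st.selectbox` and `st.write` are regular Streamlit calls.

```python
# Hypothetical mapping from human-readable widget labels to data frame columns.
LABEL_TO_COLUMN = {
    "Goals": "goals",
    "Shots on goal": "shots_on_goal",
    "Distance covered (km)": "distance_km",
}


def column_for(label: str) -> str:
    """Translate the label a user picked into the matching column name."""
    return LABEL_TO_COLUMN[label]


def render_stat_picker(df):
    """Show a select box with the labels and display the chosen column.

    Streamlit is imported inside the function so that the mapping itself
    can be used and tested without Streamlit being installed.
    """
    import streamlit as st

    label = st.selectbox("Which stat do you want to analyze?", list(LABEL_TO_COLUMN))
    st.write(df[["team", column_for(label)]])
```

Adding a new stat to the widget then only requires one more dictionary entry instead of another `if`/`elif` branch.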

The filtering sidebar is essential, as it allows for selecting seasons, matchdays, and teams to be included in the analysis. Triggering new filter selections will modify the pandas data frame accordingly. If you want to see the data that is used for the analysis, you can either take a look at the quick fact section on top or unhide the complete data frame.
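A minimal sketch of such a filter, assuming one row per team and match with hypothetical `season`, `matchday`, and `team` columns. In a Streamlit app, the three arguments would typically come from sidebar widgets such as `st.sidebar.multiselect` and `st.sidebar.slider`; here they are passed in directly.

```python
import pandas as pd


def filter_matches(df: pd.DataFrame, seasons: list, matchdays: tuple, teams: list) -> pd.DataFrame:
    """Keep only rows that match the selected seasons, matchday range, and teams."""
    first_day, last_day = matchdays
    mask = (
        df["season"].isin(seasons)
        & df["matchday"].between(first_day, last_day)  # inclusive on both ends
        & df["team"].isin(teams)
    )
    return df[mask]


# Toy data for illustration only.
df = pd.DataFrame({
    "season": ["2013/2014", "2013/2014", "2019/2020", "2019/2020"],
    "matchday": [1, 34, 1, 20],
    "team": ["Bayern München", "Werder Bremen", "Bayern München", "Bayern München"],
})

first_half_of_season = filter_matches(
    df, seasons=["2019/2020"], matchdays=(1, 17), teams=["Bayern München"]
)
```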

The match finder enables users to investigate interesting match facts. As can be seen in the screenshot below, three selection widgets can be modified to form a sentence and search a match with corresponding attributes. In addition, match data for both teams is provided.

The match finder allows for finding trivia based on sentences posed in natural language.
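One way to think about the match finder is as a lookup of the row with the highest or lowest value in a chosen column. The sketch below assumes that reading; the `find_match` helper, the column names, and all rows except Werder Bremen's 0 shots on goal are made up for illustration.

```python
import pandas as pd


def find_match(df: pd.DataFrame, extreme: str, column: str) -> pd.Series:
    """Return the row with the highest or lowest value in the given column."""
    if extreme not in ("highest", "lowest"):
        raise ValueError("extreme must be 'highest' or 'lowest'")
    idx = df[column].idxmax() if extreme == "highest" else df[column].idxmin()
    return df.loc[idx]


# Toy data: only the Werder Bremen row reflects a value mentioned in this article.
df = pd.DataFrame({
    "season": ["2014/2015", "2014/2015", "2015/2016"],
    "team": ["Werder Bremen", "Bayern München", "Borussia Dortmund"],
    "shots_on_goal": [0, 20, 15],
})

fewest_shots = find_match(df, "lowest", "shots_on_goal")
most_shots = find_match(df, "highest", "shots_on_goal")
```

If several matches share the same extreme value, `idxmax` and `idxmin` return the first one, so a real implementation may want to show all ties.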

The analysis per team section allows for analyzing various aspects on a team level. For this purpose, the data is grouped by teams and aggregated based on five different measures: mean (or average) values, absolute values (may lack comparability due to some teams being relegated), median values (ideal for removing outliers), maximum values, and minimum values. In addition, a team-specific color scheme may be toggled.

BuLiAn allows for analyzing team-level data
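Grouping and aggregating with the five measures can be sketched with a single `groupby` call. The labels in `MEASURES`, the column names, and the numbers are assumptions for illustration; "absolute values" is interpreted here as the sum over all selected matches.

```python
import pandas as pd

# Widget label -> pandas aggregation name.
MEASURES = {
    "Mean": "mean",
    "Absolute": "sum",
    "Median": "median",
    "Maximum": "max",
    "Minimum": "min",
}


def aggregate(df: pd.DataFrame, group_by: str, column: str, measure: str) -> pd.Series:
    """Group by one column and aggregate another with the chosen measure."""
    result = df.groupby(group_by)[column].agg(MEASURES[measure])
    return result.sort_values(ascending=False)


# Toy data for illustration only.
df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "passes": [700, 690, 400, 500],
})

mean_passes = aggregate(df, "team", "passes", "Mean")
total_passes = aggregate(df, "team", "passes", "Absolute")
```

Passing a different grouping column, for example a season or matchday column, reuses the same function for other levels of analysis.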

The analysis per season section and analysis per matchday section are developed similarly. First, the data is grouped by season or matchday and then aggregated based on one of the five previously mentioned measures.

The season analysis section allows for observing trends and developments in recent years
The matchday analysis section allows for analyzing the course of one or multiple seasons

The correlation of game stats section enables users to investigate and visualize correlations between different match aspects. However, it is important to keep in mind: correlation does not imply causation! What the data does say is that teams that have more passes in a match than their opponents tend to also score more goals (hence winning). What it does not say is that an increase in passes will lead to a team scoring more goals (sorry Pep).

Dive deep into Bundesliga data with a correlation analysis
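A correlation analysis of this kind boils down to a correlation matrix. The sketch below uses pandas' `DataFrame.corr`, which computes Pearson correlation by default; the column names and numbers are made up for illustration.

```python
import pandas as pd

# Toy per-match stats for one team; the values are invented.
stats = pd.DataFrame({
    "passes": [400, 450, 500, 550, 600],
    "goals": [0, 1, 1, 2, 3],
    "fouls": [18, 16, 15, 13, 10],
})

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = stats.corr()

passes_vs_goals = corr_matrix.loc["passes", "goals"]
passes_vs_fouls = corr_matrix.loc["passes", "fouls"]
```

A matrix like this can be handed to seaborn's `heatmap` function for plotting. A value close to 1 or -1 only describes how strongly two stats move together in the data, not which one drives the other.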

Finally, deploying and hosting an application with Streamlit is as easy as it gets. All you need to do is sign up for Streamlit sharing and upload your application to a public GitHub repository. For more information, check out the official Streamlit documentation on deployment.

4. Generating interesting Bundesliga insights

Now let’s get down to business. What exciting facts and entertaining trivia can actually be discovered with the application? Below you can find some of my personal highlights:

  • Lazy Bayern München: While arguably being the best team in recent years, the Bavarians are third from last when it comes to distance covered during games, with an average of 113.68 km. A possible explanation: Bayern far exceeds its competition when it comes to passing, with an average of 695.07 passes per game. Second-place Borussia Dortmund does not even come close (581.85 passes per game). Hence, Bayern München prefers to let the opposing team do the running.
  • Don’t mess with the “Dino”: You do not want to play against the Hamburger SV. Not because chances are you may lose, but rather because it will most likely hurt. With an average of 16.46 fouls per game, HSV is leading in this category. On the other hand, it seems that more successful teams (e.g., Bayern München and Borussia Dortmund) tend to foul less.
  • The “Christkind” is not the only one handing out presents: While Bayern München receives an average of 8.31 shots on their own goal during a game, 1. FC Nürnberg almost doubles this number with a staggering 16.43 shots on their own goal. No wonder each of their two seasons in the Bundesliga ended after just one year.
  • Bayer Leverkusen’s turbo halftime: In Season 2018/2019, Bayer 04 Leverkusen’s match against Eintracht Frankfurt resulted in a 6:1 (6:1) win for Bayer 04 Leverkusen. During the match, Bayer 04 Leverkusen scored 6 first-half goals, which is the maximum value for any team in the available data. Frankfurt certainly needed an Aspirin after that first half.
  • Harbor brawl: In Season 2014/2015 Hamburger SV won 1:0 (1:0) against Bayer 04 Leverkusen. However, it seems football was of minor importance that day, as both teams committed a total of 54 fouls.
  • Could you please pass the ball to me at least once? In Season 2015/2016 Eintracht Frankfurt’s match against Borussia Dortmund seemed terribly one-sided. Over the course of the match, Eintracht Frankfurt recorded a mere 16% ball possession ratio. More surprisingly, the match resulted in a 1:0 (1:0) win for Eintracht Frankfurt!
  • More passing, fewer fouls: When looking at past seasons, a clear and steady trend towards a more passing-dominated game can be observed, with an average of 436.43 passes per game in Season 2013/2014 and an average of 455.33 passes per game in Season 2019/2020. In addition, there seems to be a trend towards fewer and fewer fouls committed per season.
  • Performing under pressure: While the number of shots on goal remains steady, there is a significant increase in goals scored on the last matchday. Whether this has to do with the potentially increased relevance of goal differences or opposing defenders’ minds slowly moving towards summer vacation is up for discussion.
  • Fair teams that tend to win: It seems that teams with a higher pass success ratio tend to score more goals than opposing teams. In addition, passes, distance covered, and ball possession seem to be higher for successful teams. In contrast, teams that lose tend to foul more frequently than their counterparts (not a big surprise there).
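Comparisons between winning and losing teams, as in the last bullet, can be sketched by labeling each row with its result and averaging per label. This assumes one row per team and match with hypothetical `goals`, `goals_against`, `fouls`, and `pass_success_pct` columns; the numbers are toy values, not results from the app.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
rows = pd.DataFrame({
    "goals": [2, 0, 1, 1, 3, 1],
    "goals_against": [0, 2, 1, 1, 1, 3],
    "fouls": [10, 17, 14, 13, 9, 16],
    "pass_success_pct": [88, 76, 81, 82, 90, 79],
})

# The sign of the goal difference tells us whether the team won, drew, or lost.
goal_difference = rows["goals"] - rows["goals_against"]
rows["result"] = np.sign(goal_difference).map({1: "win", 0: "draw", -1: "loss"})

# Average stats per result.
by_result = rows.groupby("result")[["fouls", "pass_success_pct"]].mean()
```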

What are your thoughts on these insights? Make sure to try out the app and leave your personal favorite facts or insights in the comments below. If you have any suggestions for improvement or extensions of the app, feel free to reach out to me!

Web Application: https://share.streamlit.io/tdenzl/bulian/main/BuLiAn.py

GitHub Repository: https://github.com/tdenzl/BuLiAn

Libraries:

[1] Streamlit

[2] BeautifulSoup

[3] pandas

[4] seaborn

Tim Denzler

Data Engineer focusing on NLP, Research, Data Analytics, and the Semantic Web.