<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://saugki1773.github.io/data-engineering-blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://saugki1773.github.io/data-engineering-blog/" rel="alternate" type="text/html" /><updated>2026-04-26T16:50:44+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/feed.xml</id><title type="html">Building Superligaen Analytics</title><subtitle>A data engineer&apos;s diary — from raw API calls to a live football dashboard. Everything that went wrong, and how we fixed it.</subtitle><author><name>Salih Ugur Kımıllı</name></author><entry><title type="html">What’s Next — The Road Ahead</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/20/whats-next.html" rel="alternate" type="text/html" title="What’s Next — The Road Ahead" /><published>2026-04-20T00:00:00+00:00</published><updated>2026-04-20T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/20/whats-next</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/20/whats-next.html"><![CDATA[<p>This project started as a personal challenge: build a real end-to-end data engineering system using only free tools, on a dataset I actually care about. It shipped. It runs nightly. It has real users.</p>

<p>But there is a lot more to build.</p>

<p>Here is what is on the roadmap.</p>

<h2 id="dbt-semantic-layer">dbt Semantic Layer</h2>

<p>Right now the gold layer exposes raw dimensional tables and fact tables. The dashboard queries them directly with hand-written SQL. This works, but it means business logic lives in two places — the transformation layer and the dashboard queries.</p>

<p>The dbt Semantic Layer would centralise all metric definitions in one place. <code class="language-plaintext highlighter-rouge">total_goals</code>, <code class="language-plaintext highlighter-rouge">win_rate</code>, <code class="language-plaintext highlighter-rouge">xg_overperformance</code> — defined once in dbt, queryable everywhere. The dashboard would consume metrics rather than writing joins. No more drift between how a metric is calculated in one page versus another.</p>
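
<p>To make the drift concrete, here is the failure mode in miniature: two hypothetical dashboard queries that both claim to compute <code class="language-plaintext highlighter-rouge">win_rate</code> but disagree on the population. The column names are illustrative, not the project’s actual schema.</p>

<pre><code class="language-sql">-- Page A: win rate over completed matches only
select count(*) filter (where match_result = 'Win') * 1.0 / count(*) as win_rate
from fct_match_results
where match_status = 'Finished';

-- Page B: the "same" metric without the status filter;
-- unplayed fixtures silently deflate the rate
select count(*) filter (where match_result = 'Win') * 1.0 / count(*) as win_rate
from fct_match_results;
</code></pre>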

<h2 id="data-quality-tests">Data Quality Tests</h2>

<p>The pipeline runs nightly and the dashboard is public. If bad data makes it through, real users see wrong numbers — and there is currently no automated check stopping that from happening.</p>

<p>dbt has a built-in testing framework that fits naturally into the existing setup. Tests live alongside the models and run as part of the same pipeline. The basics are straightforward: uniqueness and not-null constraints on keys, accepted value checks on categorical columns, referential integrity between the fact table and every dimension. These catch the obvious failures — a venue ID that resolves to nothing, a match result outside the expected set, a duplicate surrogate key.</p>

<p>Beyond the built-in tests, the dbt-expectations package brings a richer set of statistical checks: row count thresholds, value range assertions, column distribution checks. These are useful for catching subtler issues — a round where suspiciously few goals were recorded, a team with negative possession, a season where no matches were flagged as complete.</p>
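
<p>dbt also supports singular tests: plain SQL files that fail if they return any rows. A sketch of the “suspiciously few goals in a round” check, with illustrative column names (the real fact table may differ):</p>

<pre><code class="language-sql">-- tests/assert_round_goal_totals_plausible.sql
-- A dbt singular test: any row returned here fails the run.
select
    season,
    round,
    sum(goals_for) as total_goals
from {{ ref('fct_match_results') }}
group by season, round
having sum(goals_for) &lt; 5  -- threshold is a guess; tune it against history
</code></pre>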

<p>The goal is for every nightly run to either produce correct data or fail loudly. Silent corruption is the worst outcome in a pipeline like this.</p>

<h2 id="player-analytics">Player Analytics</h2>

<p>The bronze layer already ingests player-level data — appearances, goals, assists, shots, passes, cards, ratings — for every fixture. None of it surfaces in the dashboard yet.</p>

<p>The plan is to build a full player analytics layer on top of what is already there: top scorers, top assisters, player form over time, contribution per 90 minutes. A player profile page in the dashboard. Head-to-head comparisons.</p>

<p>The data is sitting in the warehouse. It just needs to be modelled and served.</p>
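
<p>As a taste of the modelling, contribution per 90 minutes is a single aggregation away. This sketch assumes a flattened player-match table with goals, assists, and minutes columns; the names are illustrative:</p>

<pre><code class="language-sql">select
    player_name,
    sum(goals)   as goals,
    sum(assists) as assists,
    -- normalise by playing time so substitutes compare fairly with starters
    round((sum(goals) + sum(assists)) * 90.0 / nullif(sum(minutes_played), 0), 2)
        as contributions_per_90
from silver.fixture_players
group by player_name
having sum(minutes_played) &gt;= 450  -- ignore tiny samples (five full matches)
order by contributions_per_90 desc
</code></pre>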

<h2 id="beyond-the-top-flight">Beyond the Top Flight</h2>

<p>Right now the pipeline only ingests Superligaen — the Danish top division. But the same API covers the full Danish football pyramid: the 1st Division (second tier), the 2nd Division, and the DBU Pokalen cup competition.</p>

<p>The plan is to extend ingestion to cover all of these, model them through the same bronze → silver → gold pipeline, and build dedicated dashboard pages for each competition. Teams moving up and down between divisions, cup upsets, cross-division comparisons — all of it becomes possible once the data is flowing.</p>

<h2 id="discussions-page">Discussions Page</h2>

<p>This is the most experimental idea on the list.</p>

<p>The concept: a page in the dashboard where the data is not just displayed but <em>discussed</em>. Different analytical personalities — a statistician who trusts only the numbers, a football traditionalist who distrusts xG, a fan who reads into every result — analyse the same data and reach different conclusions.</p>

<p>The personas would be generated by a language model, grounded in the actual data from the warehouse, and updated each matchday. It would make the dashboard less of a static report and more of a living conversation about the season.</p>

<p>Whether this is useful or just a novelty is an open question. But it is worth finding out.</p>

<h2 id="advanced-bi-techniques">Advanced BI Techniques</h2>

<p>The current dashboard tells you what happened. The next step is to tell you what it means — and to do that, the visualisations need to work harder.</p>

<p>Right now most charts are single-metric bar charts or line charts. They are readable, but they leave a lot of the data on the table. The plan is to move toward techniques that surface relationships and context that are invisible in a single-axis view.</p>

<p>Scatter plots comparing attacking output versus defensive solidity across teams. Radar charts that give a full performance fingerprint for a team or player in a single glance. Rolling averages that separate a genuine form run from a single good result. These are standard tools in professional football analytics — and the data to drive all of them is already in the warehouse.</p>
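
<p>A rolling form line, for instance, is one window function. Table and column names here are illustrative:</p>

<pre><code class="language-sql">select
    team_name,
    match_date,
    expected_goals,
    -- average over the current match and the four before it
    avg(expected_goals) over (
        partition by team_name
        order by match_date
        rows between 4 preceding and current row
    ) as xg_rolling_5
from gold.fct_match_results
</code></pre>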

<p>On the benchmarking side, the dashboard currently shows a team’s numbers in isolation. A win rate of 60% means something very different depending on whether the league average is 40% or 55%. The plan is to add contextual benchmarks throughout: league averages as reference lines on charts, percentile rankings alongside raw values, and head-to-head comparisons that anchor a team’s performance relative to its peers.</p>
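
<p>The benchmark framing is equally cheap to compute: a league-average reference value and a percentile rank fall out of the same query (again, illustrative names):</p>

<pre><code class="language-sql">with team_season as (
    select
        team_name,
        count(*) filter (where match_result = 'Win') * 1.0 / count(*) as win_rate
    from gold.fct_match_results
    group by team_name
)
select
    team_name,
    win_rate,
    avg(win_rate) over ()                    as league_avg_win_rate,  -- reference line
    percent_rank() over (order by win_rate)  as win_rate_percentile   -- 0 worst, 1 best
from team_season
</code></pre>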

<p>The goal is a dashboard where a casual fan understands the story at a glance, and an analyst can find genuine signal without exporting to a spreadsheet.</p>

<h2 id="closing-thought">Closing Thought</h2>

<p>The original goal was to learn by building something real. That goal was met. But the more interesting discovery is that a project like this does not have a natural end — it just has the next thing to build.</p>

<p>The data keeps arriving. The season keeps moving. The dashboard keeps growing.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="roadmap" /><summary type="html"><![CDATA[This project started as a personal challenge: build a real end-to-end data engineering system using only free tools, on a dataset I actually care about. It shipped. It runs nightly. It has real users.]]></summary></entry><entry><title type="html">Global Launch — A Conclusion</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/19/global-launch.html" rel="alternate" type="text/html" title="Global Launch — A Conclusion" /><published>2026-04-19T00:00:00+00:00</published><updated>2026-04-19T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/19/global-launch</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/19/global-launch.html"><![CDATA[<p>By April 2026 — roughly ten days after the first real commit — the pipeline was stable, the dashboard had seven pages, and the nightly job was running cleanly. It was time to call it launched.</p>

<p>The live dashboard is at <a href="https://superligaanalytics.vercel.app/">superligaanalytics.vercel.app</a>.</p>

<h2 id="what-shipped">What Shipped</h2>

<p>The final state of the project at launch:</p>

<ul>
  <li><strong>Bronze layer</strong> — 21 endpoints ingested nightly from api-football.com into MotherDuck</li>
  <li><strong>Silver layer</strong> — 18 dbt models that flatten and type-cast the raw JSON</li>
  <li><strong>Gold layer</strong> — Kimball star schema: 10 dimension tables and <code class="language-plaintext highlighter-rouge">fct_match_results</code></li>
  <li><strong>Dashboard</strong> — 7 Evidence.dev pages: Home, Standings, Match Results, Upcoming Fixtures, League Analytics, Team Analytics, Referee Analytics</li>
  <li><strong>Orchestration</strong> — GitHub Actions nightly pipeline: bronze → silver → gold → Vercel deploy</li>
  <li><strong>CI</strong> — dbt compile on every pull request to main</li>
  <li><strong>Dev/prod separation</strong> — <code class="language-plaintext highlighter-rouge">superligaen_dev</code> and <code class="language-plaintext highlighter-rouge">superligaen</code> databases, <code class="language-plaintext highlighter-rouge">dev</code> and <code class="language-plaintext highlighter-rouge">prod</code> dbt targets</li>
</ul>

<h2 id="reflections">Reflections</h2>

<p>This project went from initial commit to live dashboard in approximately ten days of active development. That is fast enough that almost every architectural choice was made under time pressure, with incomplete information, and revised at least once.</p>

<p>The tools that delivered exactly what they promised: MotherDuck, DuckDB, dbt, GitHub Actions. No surprises, no unexplained failures.</p>

<p>The tools that required more navigation: Netlify (build limits), Cloudflare Pages (file size limits), Evidence.dev (underdocumented behaviour around layouts and template syntax).</p>

<p>The choices I would make the same way: MotherDuck as the warehouse, dbt for transformations, Evidence.dev for the dashboard, Vercel for hosting, the Kimball star schema, the dev/prod database separation.</p>

<p>The choices I would make differently: evaluate hosting platforms against build frequency before committing; add dbt tests from the beginning rather than deferring them; set up dbt documentation from the start.</p>

<p>The ambitions that did not make it in: multi-league support (blocked by API quota), dbt tests, dbt semantic layer, dbt documentation, real-time match events (requires a paid API tier).</p>

<p>The project is now in maintenance mode. The nightly pipeline runs, the data updates, and the dashboard reflects last night’s results every morning. For a project built entirely on free tiers in ten days, that is a good place to be.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="deployment" /><summary type="html"><![CDATA[By April 2026 — roughly ten days after the first real commit — the pipeline was stable, the dashboard had seven pages, and the nightly job was running cleanly. It was time to call it launched.]]></summary></entry><entry><title type="html">Adding Web Analytics — Vercel and Cloudflare</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/19/launch-and-analytics.html" rel="alternate" type="text/html" title="Adding Web Analytics — Vercel and Cloudflare" /><published>2026-04-19T00:00:00+00:00</published><updated>2026-04-19T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/19/launch-and-analytics</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/19/launch-and-analytics.html"><![CDATA[<p>Once the dashboard was live, the natural question was: is anyone visiting it? We needed analytics.</p>

<p>The options were straightforward: Vercel Analytics (built into the hosting platform), Cloudflare Web Analytics (a separate free service), or Google Analytics (the industry default but heavier and requiring a cookie consent banner under GDPR).</p>

<p>Google Analytics was ruled out immediately. GDPR cookie banners are user-hostile and unnecessary for a project where we genuinely do not need detailed personal data — we just want page view counts and visitor numbers.</p>

<p>Both Vercel Analytics and Cloudflare Web Analytics are <strong>cookieless and privacy-first</strong>. They count visits using aggregated signals rather than tracking individuals. No consent banner required.</p>

<p>We decided to use both — not because we needed redundancy, but because each gives you a slightly different view of traffic data, and running both costs nothing.</p>

<h2 id="the-first-attempt-and-how-it-broke-everything">The First Attempt (And How It Broke Everything)</h2>

<p>Adding Vercel Analytics was not as simple as enabling a toggle in the Vercel dashboard. For SvelteKit apps (which Evidence.dev is built on), you need to install the <code class="language-plaintext highlighter-rouge">@vercel/analytics</code> npm package and call <code class="language-plaintext highlighter-rouge">inject()</code> somewhere in your app.</p>

<p>The natural place for a call that should run on every page is a layout component. Evidence.dev supports a <code class="language-plaintext highlighter-rouge">pages/+layout.svelte</code> file. I created one:</p>

<pre><code class="language-svelte">&lt;script&gt;
  import { onMount } from 'svelte';
  import { inject } from '@vercel/analytics';
  onMount(() =&gt; inject());
&lt;/script&gt;

&lt;slot /&gt;
</code></pre>

<p>This broke the site completely. Every page lost its navigation, sidebar, theming, and layout chrome.</p>

<p>The reason: Evidence.dev has its own built-in <code class="language-plaintext highlighter-rouge">+layout.svelte</code> that imports its stylesheet, loads its default layout component (<code class="language-plaintext highlighter-rouge">EvidenceDefaultLayout</code>), and handles the app shell. When you create a <code class="language-plaintext highlighter-rouge">pages/+layout.svelte</code>, Evidence copies your file into its template directory, <strong>overwriting its own layout</strong>. My file only had <code class="language-plaintext highlighter-rouge">&lt;slot /&gt;</code> — which rendered the page content but none of Evidence’s surrounding UI.</p>

<p>The fix required knowing what Evidence’s own layout looked like. Once that was clear, the correct version wraps Evidence’s layout rather than replacing it:</p>

<pre><code class="language-svelte">&lt;script&gt;
  import '@evidence-dev/tailwind/fonts.css';
  import '../app.css';
  import { EvidenceDefaultLayout } from '@evidence-dev/core-components';
  import { onMount } from 'svelte';
  import { inject } from '@vercel/analytics';

  export let data;

  onMount(() =&gt; inject());
&lt;/script&gt;

&lt;EvidenceDefaultLayout {data}&gt;
  &lt;slot slot="content" /&gt;
&lt;/EvidenceDefaultLayout&gt;
</code></pre>

<p>This correctly extends Evidence’s layout rather than replacing it.</p>

<h2 id="adding-cloudflare-web-analytics">Adding Cloudflare Web Analytics</h2>

<p>Adding the Cloudflare beacon alongside Vercel Analytics required one more trick. The Cloudflare script tag looks like this in standard HTML:</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;script </span><span class="na">defer</span> <span class="na">src=</span><span class="s">"https://static.cloudflareinsights.com/beacon.min.js"</span>
  <span class="na">data-cf-beacon=</span><span class="s">'{"token": "your-token"}'</span><span class="nt">&gt;&lt;/script&gt;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">{</code> and <code class="language-plaintext highlighter-rouge">}</code> characters in the <code class="language-plaintext highlighter-rouge">data-cf-beacon</code> attribute value are Svelte template delimiters. If you put this tag inside a <code class="language-plaintext highlighter-rouge">&lt;svelte:head&gt;</code> block, Svelte’s compiler tries to parse <code class="language-plaintext highlighter-rouge">{"token": "..."}</code> as a template expression and fails with a parse error.</p>

<p>The workaround: inject the script using <code class="language-plaintext highlighter-rouge">document.createElement</code> inside the <code class="language-plaintext highlighter-rouge">onMount</code> callback, where Svelte’s template compiler does not process the string:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">onMount</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="nx">inject</span><span class="p">();</span> <span class="c1">// Vercel Analytics</span>

  <span class="kd">const</span> <span class="nx">script</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nx">createElement</span><span class="p">(</span><span class="dl">'</span><span class="s1">script</span><span class="dl">'</span><span class="p">);</span>
  <span class="nx">script</span><span class="p">.</span><span class="nx">defer</span> <span class="o">=</span> <span class="kc">true</span><span class="p">;</span>
  <span class="nx">script</span><span class="p">.</span><span class="nx">src</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">https://static.cloudflareinsights.com/beacon.min.js</span><span class="dl">'</span><span class="p">;</span>
  <span class="nx">script</span><span class="p">.</span><span class="nx">dataset</span><span class="p">.</span><span class="nx">cfBeacon</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">({</span> <span class="na">token</span><span class="p">:</span> <span class="dl">'</span><span class="s1">your-token</span><span class="dl">'</span> <span class="p">});</span>
  <span class="nb">document</span><span class="p">.</span><span class="nx">head</span><span class="p">.</span><span class="nx">appendChild</span><span class="p">(</span><span class="nx">script</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>

<p>This is less elegant than a <code class="language-plaintext highlighter-rouge">&lt;script&gt;</code> tag in the HTML but works correctly and is straightforward to understand.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="analytics" /><category term="deployment" /><summary type="html"><![CDATA[Once the dashboard was live, the natural question was: is anyone visiting it? We needed analytics.]]></summary></entry><entry><title type="html">Migrating to dbt — When Raw SQL Isn’t Enough</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/18/dbt-migration.html" rel="alternate" type="text/html" title="Migrating to dbt — When Raw SQL Isn’t Enough" /><published>2026-04-18T00:00:00+00:00</published><updated>2026-04-18T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/18/dbt-migration</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/18/dbt-migration.html"><![CDATA[<p>When the silver and gold layers were first built, they ran as plain SQL files executed by Python runner scripts — <code class="language-plaintext highlighter-rouge">run_silver.py</code> and <code class="language-plaintext highlighter-rouge">run_gold.py</code>. Each script would read a directory of <code class="language-plaintext highlighter-rouge">.sql</code> files, connect to MotherDuck, and execute them in a specific order. It worked. The data was correct. But as the number of models grew and the logic became more complex, the cracks in the approach started to show.</p>

<h2 id="the-problem-with-plain-sql-runners">The Problem with Plain SQL Runners</h2>

<p><strong>Order dependency was manual.</strong> If <code class="language-plaintext highlighter-rouge">dim_team</code> needed to run before <code class="language-plaintext highlighter-rouge">fct_match_results</code>, you had to remember that and name or number the files accordingly. When we added a new model, figuring out where it slotted into the execution order was entirely up to the developer.</p>

<p><strong>No incremental logic.</strong> Every run was a full rebuild. For silver models that flatten tens of thousands of fixture records, this was slow and unnecessary. There was no way to say “only process records that arrived since the last run” without writing custom Python logic around each SQL file.</p>

<p><strong>No compilation validation.</strong> SQL syntax errors only appeared at runtime. There was no way to check whether a model was valid without actually running it against the database.</p>

<p><strong>No lineage.</strong> There was no documentation of which model depended on what. Understanding the pipeline required reading the code, not querying a manifest.</p>

<p><strong>Parameterisation was awkward.</strong> The nightly pipeline uses a rolling date window (last 5 days). The full-refresh pipeline uses a full season reload. Passing different variables to the same SQL file required string interpolation in Python, which is fragile and hard to read.</p>

<p>All of these problems have well-known solutions in the data engineering world. They are solved by <strong>dbt</strong>.</p>

<h2 id="what-dbt-brings">What dbt Brings</h2>

<p>dbt (data build tool) is a transformation framework that sits on top of your data warehouse and manages SQL models. You write SQL <code class="language-plaintext highlighter-rouge">SELECT</code> statements, and dbt handles the <code class="language-plaintext highlighter-rouge">CREATE TABLE AS</code>, dependency ordering, incremental logic, and documentation.</p>

<p>The key features we needed:</p>

<p><strong>Dependency resolution</strong> — dbt builds a DAG (directed acyclic graph) of your models by analysing which model references which. You write <code class="language-plaintext highlighter-rouge">{{ ref('silver_fixtures') }}</code> instead of a table name, and dbt knows to run <code class="language-plaintext highlighter-rouge">silver_fixtures</code> before whatever model references it.</p>

<p><strong>Incremental models</strong> — dbt’s <code class="language-plaintext highlighter-rouge">incremental</code> materialisation allows a model to process only new or updated records on each run, using a configurable filter. For silver models that process fixture data, this means a nightly run that touches only the last 5 days of records rather than reprocessing all 200+ fixtures from every season.</p>
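
<p>A minimal sketch of the pattern, assuming the bronze table is declared as a dbt source and carries a <code class="language-plaintext highlighter-rouge">fixture_date</code> column (both assumptions; the post does not show the real model):</p>

<pre><code class="language-sql">-- models/silver/silver_fixtures.sql (illustrative sketch)
{{ config(materialized='incremental', unique_key='fixture_id') }}

select
    fixture_id,
    fixture_date,
    home_team_id,
    away_team_id
from {{ source('bronze', 'fixtures') }}
{% if is_incremental() %}
  -- nightly runs touch only the rolling window instead of every season
  where fixture_date &gt;= current_date - interval '5 days'
{% endif %}
</code></pre>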

<p><strong>Macros</strong> — dbt allows you to write Jinja macros for reusable SQL logic. We created three: <code class="language-plaintext highlighter-rouge">fixture_filter()</code> for filtering by date window, <code class="language-plaintext highlighter-rouge">season_filter()</code> for full-season reloads, and <code class="language-plaintext highlighter-rouge">gold_incremental_filter()</code> for the gold layer’s incremental logic. These macros mean the filter logic lives in one place and is consistent across all models.</p>
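
<p>The real macro bodies are not shown in this post, but a minimal sketch of what <code class="language-plaintext highlighter-rouge">fixture_filter()</code> could look like:</p>

<pre><code class="language-sql">-- macros/fixture_filter.sql (hypothetical sketch, not the project's macro)
{% macro fixture_filter(date_column, lookback_days=5) %}
    {{ date_column }} &gt;= current_date - interval '{{ lookback_days }} days'
{% endmacro %}

-- used in a model as:
--   where {{ fixture_filter('fixture_date') }}
</code></pre>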

<p><strong>CI validation</strong> — dbt’s <code class="language-plaintext highlighter-rouge">compile</code> command parses all SQL and resolves all references without executing anything. Adding <code class="language-plaintext highlighter-rouge">dbt compile --target dev</code> to the CI workflow (run on every pull request) means SQL syntax errors and broken references are caught before they ever reach main.</p>

<p><strong>Schema management</strong> — dbt’s <code class="language-plaintext highlighter-rouge">generate_schema_name</code> macro controls the schema name applied to model outputs. By default, dbt concatenates a model’s custom schema onto the target schema (producing names like <code class="language-plaintext highlighter-rouge">main_silver</code>); overriding the macro ensures models land in exactly the right schema (<code class="language-plaintext highlighter-rouge">silver</code> or <code class="language-plaintext highlighter-rouge">gold</code>) regardless of the dbt target.</p>
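
<p>The override follows the recipe from dbt’s documentation: return the custom schema name verbatim instead of concatenating it onto the target schema. A sketch; the project’s exact macro may differ:</p>

<pre><code class="language-sql">-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {# use the configured schema as-is, so models land in silver/gold #}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
</code></pre>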

<h2 id="the-migration">The Migration</h2>

<p>The migration itself took one day. The process:</p>

<ol>
  <li>Create the <code class="language-plaintext highlighter-rouge">dbt/</code> directory with <code class="language-plaintext highlighter-rouge">dbt_project.yml</code> and <code class="language-plaintext highlighter-rouge">profiles.yml</code>.</li>
  <li>Port each SQL file to a dbt model by wrapping it in the appropriate dbt configuration block.</li>
  <li>Replace hardcoded table names with <code class="language-plaintext highlighter-rouge">{{ ref() }}</code> calls where models depend on each other.</li>
  <li>Extract the date window and season filters into macros.</li>
  <li>Update the GitHub Actions workflows to run <code class="language-plaintext highlighter-rouge">dbt run --select silver.*</code> and <code class="language-plaintext highlighter-rouge">dbt run --select gold.*</code> instead of <code class="language-plaintext highlighter-rouge">python run_silver.py</code>.</li>
  <li>Delete the old Python runner scripts and raw SQL files.</li>
</ol>

<p>Step 6 was satisfying. Deleting code that is no longer needed is one of the better feelings in software development.</p>

<p>There was one technical issue during the migration: DuckDB’s dialect handles some SQL constructs differently from other databases, and certain patterns that work in standard SQL do not work in DuckDB’s dbt adapter. The main culprit was date arithmetic — DuckDB uses <code class="language-plaintext highlighter-rouge">INTERVAL</code> syntax and <code class="language-plaintext highlighter-rouge">epoch()</code> for timestamp operations, which dbt’s generic date macros do not always handle correctly. We needed to write DuckDB-specific macro implementations rather than relying on dbt’s cross-database abstractions.</p>
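
<p>For reference, the DuckDB-flavoured forms look like this (table names illustrative):</p>

<pre><code class="language-sql">-- rolling window filter in DuckDB's dialect
select *
from silver.fixtures
where fixture_date &gt;= current_date - interval '5 days';

-- timestamp difference in seconds via epoch()
select epoch(finished_at) - epoch(kickoff_at) as duration_seconds
from silver.fixtures;
</code></pre>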

<p>The dbt-duckdb adapter version also needed to match the DuckDB version that MotherDuck was running (1.5.1 at the time). Version mismatches between the adapter, the dbt-core version, and the DuckDB version produced cryptic errors that took a while to resolve. Once the versions were pinned and aligned, everything worked cleanly.</p>

<h2 id="two-targets-dev-and-prod">Two Targets: Dev and Prod</h2>

<p>The dbt profiles configure two targets:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">dev</code> — points to <code class="language-plaintext highlighter-rouge">superligaen_dev</code> on MotherDuck. Used for local development and for CI.</li>
  <li><code class="language-plaintext highlighter-rouge">prod</code> — points to <code class="language-plaintext highlighter-rouge">superligaen</code>. Used by the nightly GitHub Actions pipeline.</li>
</ul>

<p>The MotherDuck token is passed via the <code class="language-plaintext highlighter-rouge">MOTHERDUCK_TOKEN</code> environment variable rather than being stored in <code class="language-plaintext highlighter-rouge">profiles.yml</code>. This keeps credentials out of version control and makes rotating the token a matter of updating a GitHub Secret rather than committing a change.</p>

<h2 id="what-we-did-not-get-to">What We Did Not Get To</h2>

<p>dbt has a testing framework (<code class="language-plaintext highlighter-rouge">dbt test</code>) that lets you define assertions about your data — things like “this column has no null values”, “this foreign key always joins successfully”, “this value is always positive”. We did not implement any tests during this project. The right time to add them would be now that the models are stable and the architecture is not changing frequently. Tests would catch data quality regressions from the API before they reach the dashboard.</p>

<p>dbt also has documentation generation (<code class="language-plaintext highlighter-rouge">dbt docs generate</code>) that produces a browsable data catalog with descriptions, lineage graphs, and column-level documentation. Another thing worth adding.</p>

<p>The third item on the list is the <strong>dbt semantic layer</strong>. Rather than defining metrics like “goals per game” or “win rate” directly in dashboard SQL queries, the semantic layer lets you define them once in dbt and expose them as reusable, consistent metrics across any downstream tool. For a project where the same aggregations appear in multiple dashboard pages, this would eliminate duplication and make the numbers trustworthy by construction.</p>

<p>Next: a look at the API limits and infrastructure constraints that shape what this project can and cannot become.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="dbt" /><category term="transformation" /><summary type="html"><![CDATA[When the silver and gold layers were first built, they ran as plain SQL files executed by Python runner scripts — run_silver.py and run_gold.py. Each script would read a directory of .sql files, connect to MotherDuck, and execute them in a specific order. It worked. The data was correct. But as the number of models grew and the logic became more complex, the cracks in the approach started to show.]]></summary></entry><entry><title type="html">The Deployment Saga — Netlify, Cloudflare, and Finally Vercel</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/18/deployment-saga.html" rel="alternate" type="text/html" title="The Deployment Saga — Netlify, Cloudflare, and Finally Vercel" /><published>2026-04-18T00:00:00+00:00</published><updated>2026-04-18T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/18/deployment-saga</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/18/deployment-saga.html"><![CDATA[<p>This is the chapter I wish someone had written before I started. The deployment story is not a story about bad tools — Netlify, Cloudflare Pages, and Vercel are all good products. It is a story about free tier constraints that are easy to overlook until you hit them, and about how a project with an unusual build profile (large data files, Node.js compilation, MotherDuck token handling) does not fit neatly into the assumptions any of these platforms make.</p>

<h2 id="chapter-1-netlify">Chapter 1: Netlify</h2>

<p>I asked Gemini for a recommendation on where to host an Evidence.dev dashboard. The answer was Netlify. Netlify is a well-established static site hosting platform with a generous free tier, good documentation, and a GitHub integration that makes deployment trivially easy — push to main, Netlify rebuilds and deploys automatically.</p>

<p>I set it up. The first build worked. The dashboard was live. Everything looked fine.</p>

<p>The problem appeared five days later.</p>

<p>Netlify’s free tier includes <strong>300 build minutes per month</strong>. An Evidence.dev build — installing npm dependencies, running <code class="language-plaintext highlighter-rouge">evidence sources</code> to query MotherDuck, compiling the dashboard — takes roughly 4 to 5 minutes. If the nightly pipeline triggers a rebuild every night, that is 35 minutes a week, 150 minutes a month. Still within the limit.</p>

<p>Except: during active development, every push to the main branch triggered a rebuild. In those five days, between debugging pipeline issues, iterating on dashboard pages, and fixing configuration problems, I triggered somewhere around 60 to 70 builds. That was essentially the entire monthly quota. On day five, Netlify suspended the site’s builds until the next billing cycle.</p>

<p>I could not deploy. The site was frozen. I could have paid to upgrade, but paying for hosting on a side project did not feel right when the entire rest of the stack was on free tiers.</p>

<p>Netlify’s build limit is reasonable for a normal static marketing site that deploys a few times a week. It is not designed for a project in active development or for a pipeline that rebuilds nightly with data freshness as a feature. In hindsight, the right approach would have been to build Evidence.dev in GitHub Actions and upload the output to Netlify using the CLI, bypassing Netlify’s build system entirely. We eventually did implement this — but by then, there were other reasons to switch.</p>

<p>There were also token handling issues. Evidence.dev requires the MotherDuck token to be base64-encoded and written to a <code class="language-plaintext highlighter-rouge">connection.options.yaml</code> file before the build runs. Getting that into Netlify’s build environment in a way that survived across the <code class="language-plaintext highlighter-rouge">npm run sources</code> step required several attempts and a dedicated CI step to write the file.</p>

<h2 id="chapter-2-cloudflare-pages">Chapter 2: Cloudflare Pages</h2>

<p>Cloudflare’s market position is well-known. It runs a significant portion of the internet’s DNS and CDN infrastructure. Cloudflare Pages is their static site hosting product, and it is genuinely fast — assets are served from Cloudflare’s edge network, which means low latency everywhere. The free tier has <strong>500 builds per month</strong>, which solved the quota problem immediately.</p>

<p>I migrated. The site was up on Cloudflare Pages. The pipeline was working. Things were good.</p>

<p>Then the data grew.</p>

<p>Evidence.dev bundles the query results as Parquet files into the static build output. As we added more dashboard pages — match results going back to 2020, player stats, referee data — the build output got larger. Cloudflare Pages has a <strong>25 MB file size limit</strong> for individual deployable assets.</p>

<p>The bundled Parquet data from the Evidence.dev build was a few megabytes over that limit. Cloudflare would not deploy it. We tried compression options, tried splitting queries to reduce individual file sizes, tried serving some data lazily — nothing moved the needle enough without significantly compromising the dashboard’s performance.</p>

<p>This was a hard limit that Cloudflare was not going to lift on the free tier.</p>

<h2 id="chapter-3-vercel">Chapter 3: Vercel</h2>

<p>At this point I sat down and compared all three platforms properly against what this project actually needs:</p>

<table>
  <thead>
    <tr>
      <th>Requirement</th>
      <th>Netlify</th>
      <th>Cloudflare Pages</th>
      <th>Vercel</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Builds per month</td>
      <td>300</td>
      <td>500</td>
      <td>Unlimited (hobby)</td>
    </tr>
    <tr>
      <td>Max file size</td>
      <td>100 MB total</td>
      <td>25 MB per file</td>
      <td>100 MB per file</td>
    </tr>
    <tr>
      <td>Build timeout</td>
      <td>15 min</td>
      <td>20 min</td>
      <td>45 min</td>
    </tr>
    <tr>
      <td>GitHub integration</td>
      <td>✓</td>
      <td>✓</td>
      <td>✓</td>
    </tr>
    <tr>
      <td>Cost</td>
      <td>Free tier</td>
      <td>Free tier</td>
      <td>Free tier</td>
    </tr>
  </tbody>
</table>

<p>Vercel’s hobby tier has <strong>no monthly build limit</strong> and a <strong>100 MB per file limit</strong>. Both problems solved.</p>

<p>The migration itself was straightforward — connect the GitHub repository, configure the build command (<code class="language-plaintext highlighter-rouge">npm run build</code>) and output directory (<code class="language-plaintext highlighter-rouge">build</code>), add the MotherDuck token as an environment variable. First build succeeded.</p>

<h2 id="the-deploy-hook-debugging-saga">The Deploy Hook Debugging Saga</h2>

<p>The one complication with Vercel was controlling <em>when</em> it deploys. By default, Vercel rebuilds on every push to main. For this project, that would mean rebuilding every time a code change is merged — including changes that have nothing to do with the dashboard data. We only want to rebuild when the nightly pipeline finishes, or when manually triggered.</p>

<p>The initial approach was a Vercel <strong>deploy hook</strong> — a unique URL you POST to, and Vercel queues a build. The GitHub Actions pipeline would curl that URL at the end of the gold step.</p>

<p>This seemed simple. It was not.</p>

<p>The curl call was returning 200 but the build was not triggering. We added verbose logging to the curl command. The response body was correct. We added a separate test workflow that did nothing but curl the hook and report the response. It worked in isolation. When called from the main pipeline, it did not.</p>

<p>The exact reason was never fully isolated. There were several issues layered on top of each other: the Vercel deploy hook behaves differently when called from a GitHub Actions runner on certain network configurations, there were permission issues with the GITHUB_TOKEN in the workflow, and at one point a test commit was made to check whether Vercel was even watching the right branch.</p>

<p>We eventually abandoned the deploy hook approach entirely and replaced it with a <strong>dedicated deployment branch</strong> — <code class="language-plaintext highlighter-rouge">publish_dashboard/vercel</code>. The nightly pipeline’s final step makes an empty commit to that branch, and Vercel is configured to only watch that branch for deployments. The GitHub Actions step needs <code class="language-plaintext highlighter-rouge">contents: write</code> permission to push to a branch, which was another discovery made after the fact.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Push to Vercel deploy branch</span>
  <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
    <span class="s">git config user.email "github-actions@github.com"</span>
    <span class="s">git config user.name "GitHub Actions"</span>
    <span class="s">git checkout -B publish_dashboard/vercel origin/main</span>
    <span class="s">git commit --allow-empty -m "chore: nightly data refresh $(date -u '+%Y-%m-%d %H:%M UTC')"</span>
    <span class="s">git push origin publish_dashboard/vercel --force</span>
</code></pre></div></div>

<p>This approach is more reliable than a webhook because it uses standard git push semantics, which both GitHub Actions and Vercel’s Git integration handle correctly. The empty commit triggers the deploy without cluttering the main branch history.</p>

<h2 id="lessons">Lessons</h2>

<p>If I were starting over, I would evaluate hosting platforms against the specific build profile of the project — build frequency, output size, build time — before committing to anything. The differences between free tiers are not academic; they determine whether your project actually works at the end.</p>

<p>For Evidence.dev specifically, Vercel is the right choice as of today. Unlimited builds, large file support, straightforward integration. The deploy hook is unreliable and the branch-push approach is better.</p>

<p>Next: one of the most impactful refactors in the project — migrating the transformation layer to dbt.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="deployment" /><category term="devops" /><summary type="html"><![CDATA[This is the chapter I wish someone had written before I started. The deployment story is not a story about bad tools — Netlify, Cloudflare Pages, and Vercel are all good products. It is a story about free tier constraints that are easy to overlook until you hit them, and about how a project with an unusual build profile (large data files, Node.js compilation, MotherDuck token handling) does not fit neatly into the assumptions any of these platforms make.]]></summary></entry><entry><title type="html">The Dashboard — Discovering Evidence.dev</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/16/building-the-dashboard.html" rel="alternate" type="text/html" title="The Dashboard — Discovering Evidence.dev" /><published>2026-04-16T00:00:00+00:00</published><updated>2026-04-16T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/16/building-the-dashboard</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/16/building-the-dashboard.html"><![CDATA[<p>I knew from the start that I wanted a live public dashboard, not a static report or a screenshot. The question was which tool to use.</p>

<h2 id="why-not-the-obvious-choices">Why Not the Obvious Choices</h2>

<p><strong>Tableau</strong> and <strong>Power BI</strong> were immediately out — they are expensive, and the free tiers do not allow public sharing without embedding tricks.</p>

<p><strong>Grafana</strong> is excellent for operational metrics but not designed for product analytics dashboards. The user experience for building and sharing analytical views is awkward.</p>

<p><strong>Metabase</strong> was a serious contender. It is open source, has a clean UI, and connects to DuckDB. But self-hosting Metabase adds infrastructure overhead, and the cloud version has a cost.</p>

<p><strong>Streamlit</strong> would have worked, but it requires you to write Python to build UI, and the resulting dashboards do not look polished without significant effort.</p>

<p><strong>Superset</strong> — similar story to Metabase: powerful, but infrastructure-heavy.</p>

<h2 id="evidencedev">Evidence.dev</h2>

<p>Evidence.dev is a different paradigm entirely. You write dashboard pages in Markdown. SQL queries go in fenced code blocks directly in the page. The output of each query is available as a variable in the same file. Charts, tables, and filters are Svelte components that you use inline in the Markdown. The whole thing compiles to a static site — no server, no runtime, just HTML, CSS, and JavaScript.</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">```</span><span class="nl">sql matches
</span><span class="sb">select match_date, home_team, away_team, score
from superligaen.match_results_by_match
order by match_date desc
limit 10
```

&lt;DataTable data={matches} /&gt;
</span></code></pre></div></div>

<p>That is the entire workflow. Write a SQL query, reference its name in a component, and you have a table on your dashboard. It is the fastest path from data to UI that I have found.</p>

<p>Evidence.dev connects to MotherDuck natively via the <code class="language-plaintext highlighter-rouge">@evidence-dev/motherduck</code> plugin. At build time, it runs the SQL queries against MotherDuck and bundles the results as Parquet files into the static build output. The deployed site loads these Parquet files at runtime using DuckDB-WASM — meaning the dashboard runs analytical queries entirely in the browser, with no server. It is genuinely impressive engineering.</p>

<h2 id="dashboard-pages">Dashboard Pages</h2>

<p>We built seven pages:</p>

<p><strong>Home</strong> — a hero banner with the league logo and flag, four KPI tiles (current leader, team count, matches played, goals scored this season), and a navigation grid linking to every other page.</p>

<p><strong>Standings</strong> — three separate tables covering the Championship Round, Relegation Round, and Regular Season standings, with a season selector to browse historical tables.</p>

<p><strong>Match Results</strong> — a filterable table of all historical results with a Goals vs xG chart by round that shows which rounds were overperforming or underperforming expected goals.</p>

<p><strong>Upcoming Fixtures</strong> — a table of the next fixtures sorted by date and kick-off time, plus a match analysis section: select any upcoming fixture from the dropdown and see the head-to-head history between those two clubs and the last five results for each team.</p>

<p><strong>League Analytics</strong> — cross-team benchmarks: top scorers, most disciplined teams, possession rankings, and season-level trends. This is the “zoom out” page.</p>

<p><strong>Team Analytics</strong> — deep dive into a single team: select a team and see their season KPIs, recent form, shooting accuracy, possession stats, and discipline record. This page was the most technically interesting to build because it required combining multiple gold views with different grains.</p>

<p><strong>Referee Analytics</strong> — cards issued, fouls per match, team exposure (which referees are most frequently assigned to which clubs), and a match log for each referee. This one came from a genuine curiosity: in a small league with a small pool of referees, team-referee familiarity is a real thing.</p>

<h2 id="what-evidencedev-is-not">What Evidence.dev Is Not</h2>

<p>Evidence.dev is a static site generator. That means the data is frozen at build time. There is no live query against MotherDuck when a user loads the page — they are loading the data that was baked in during the last build. For a nightly pipeline that updates at midnight, this is perfectly fine. The dashboard is always at most 24 hours stale, which is acceptable for a league that plays two or three times a week.</p>

<p>The implication is that to refresh the data you need to trigger a new build. The nightly GitHub Actions pipeline does this automatically: bronze ingestion → silver dbt run → gold dbt run → trigger Vercel deploy. The deploy takes about two minutes and the site is updated.</p>

<h2 id="quirks-and-fixes">Quirks and Fixes</h2>

<p>A few things about Evidence.dev that were not obvious:</p>

<p><strong>The <code class="language-plaintext highlighter-rouge">%</code> sign in SQL</strong> — Evidence.dev uses <code class="language-plaintext highlighter-rouge">%</code> as a template delimiter internally, which conflicts with the SQL <code class="language-plaintext highlighter-rouge">LIKE</code> operator pattern <code class="language-plaintext highlighter-rouge">'%value%'</code>. We ran into this when formatting percentage values. The fix was to use the <code class="language-plaintext highlighter-rouge">pct0</code> format specifier that Evidence provides for formatting numbers as percentages, rather than concatenating a <code class="language-plaintext highlighter-rouge">%</code> symbol in SQL.</p>

<p><strong>Sidebar and TOC</strong> — By default, Evidence.dev renders a table of contents and a sidebar on every page. For a dashboard that is supposed to look like a product, these are in the way. The <code class="language-plaintext highlighter-rouge">sidebar: never</code> and <code class="language-plaintext highlighter-rouge">hide_toc: true</code> frontmatter options suppress them. This was not in the main documentation but buried in a GitHub issue.</p>

<p><strong>Mobile responsiveness</strong> — The default Evidence.dev layout is desktop-first. Getting the home page hero banner and the KPI grid to look right on mobile required custom CSS in the Markdown using Tailwind utility classes (which Evidence.dev ships with). Nothing dramatic, but it needed explicit work.</p>

<p><strong>The <code class="language-plaintext highlighter-rouge">evidence.config.yaml</code> layout key</strong> — An early version of the config file had an invalid <code class="language-plaintext highlighter-rouge">layout:</code> key that emitted a warning on every build. We removed it once the noise became distracting.</p>

<p>Next: the part of the project that took the most calendar time relative to its apparent simplicity — getting the dashboard deployed.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="dashboard" /><category term="frontend" /><summary type="html"><![CDATA[I knew from the start that I wanted a live public dashboard, not a static report or a screenshot. The question was which tool to use.]]></summary></entry><entry><title type="html">Silver and Gold — Transforming Data into a Star Schema</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/14/silver-and-gold-layers.html" rel="alternate" type="text/html" title="Silver and Gold — Transforming Data into a Star Schema" /><published>2026-04-14T00:00:00+00:00</published><updated>2026-04-14T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/14/silver-and-gold-layers</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/14/silver-and-gold-layers.html"><![CDATA[<p>With 21 tables of raw JSON sitting in MotherDuck, the next step was to make the data actually usable. That meant two more layers: silver (clean, structured tables) and gold (a Kimball dimensional model designed for analytics).</p>

<h2 id="silver-flattening-the-json">Silver: Flattening the JSON</h2>

<p>The silver layer’s job is to take each bronze table and turn it into a proper relational table — extract columns from the JSON, cast types, handle nulls, and normalise nested structures. Every bronze endpoint gets a corresponding silver model.</p>

<p>DuckDB’s JSON handling is one of its best features. You can navigate nested JSON with the <code class="language-plaintext highlighter-rouge">-&gt;</code> and <code class="language-plaintext highlighter-rouge">-&gt;&gt;</code> arrow operators, and <code class="language-plaintext highlighter-rouge">UNNEST</code> explodes arrays into rows. For a fixture statistics row that looks like this in bronze:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"fixture"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="mi">12345</span><span class="p">},</span><span class="w">
  </span><span class="nl">"statistics"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Shots on Goal"</span><span class="p">,</span><span class="w"> </span><span class="nl">"value"</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Ball Possession"</span><span class="p">,</span><span class="w"> </span><span class="nl">"value"</span><span class="p">:</span><span class="w"> </span><span class="s2">"55%"</span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The silver transformation pivots this into a proper row with typed columns — <code class="language-plaintext highlighter-rouge">shots_on_goal INTEGER</code>, <code class="language-plaintext highlighter-rouge">ball_possession_pct DECIMAL</code>, and so on.</p>
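
<p>A sketch of one way to express that pivot, assuming the statistics array has already been exploded into one row per fixture per statistic (the unnesting itself is covered below):</p>

<pre><code class="language-sql">select
    fixture_id,
    max(case when stat_type = 'Shots on Goal'
             then try_cast(stat_value as integer) end) as shots_on_goal,
    max(case when stat_type = 'Ball Possession'
             then try_cast(replace(stat_value, '%', '') as decimal(5, 2)) end)
        as ball_possession_pct
from unnested_fixture_statistics
group by fixture_id
</code></pre>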

<p>One rule we followed throughout: <strong>keep all columns in silver</strong>. When in doubt, keep it. Storage is cheap and you can always choose not to expose a column in gold or in the dashboard, but you cannot recreate data you threw away. Silver models kept logos, flags, URLs, internal IDs, everything — even things that looked useless at the time.</p>

<h2 id="the-motherduck-memory-limit">The MotherDuck Memory Limit</h2>

<p>The fixture_players silver model was the most complex one to write. Each fixture returns a deeply nested JSON structure: a list of teams, each containing a list of players, each containing a list of statistics. Getting all of that into a flat table required multiple levels of <code class="language-plaintext highlighter-rouge">UNNEST</code>.</p>

<p>The initial version used nested UNNESTs — unnesting teams, then unnesting players within the same query, then unnesting statistics:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="p">...</span>
<span class="k">FROM</span> <span class="n">bronze</span><span class="p">.</span><span class="n">fixture_players</span><span class="p">,</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">response</span><span class="o">-&gt;</span><span class="s1">'response'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">team</span><span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">team</span><span class="o">-&gt;</span><span class="s1">'players'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">p</span><span class="p">(</span><span class="n">player</span><span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">player</span><span class="o">-&gt;</span><span class="s1">'statistics'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">s</span><span class="p">(</span><span class="n">stat</span><span class="p">)</span>
</code></pre></div></div>

<p>This worked fine in development on a local DuckDB instance with no memory constraints. When we ran it in production on MotherDuck’s free tier, it hit the <strong>953 MB memory cap</strong> on the Pulse compute plan and crashed.</p>

<p>The fix was to stop doing all the unnesting in a single query and instead use <strong>sequential CTEs</strong> — unpack one level per CTE, materialise it, then unpack the next level from that:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">teams</span> <span class="k">AS</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">fixture_id</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">response</span><span class="o">-&gt;</span><span class="s1">'response'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">team</span>
    <span class="k">FROM</span> <span class="n">bronze</span><span class="p">.</span><span class="n">fixture_players</span>
<span class="p">),</span>
<span class="n">players</span> <span class="k">AS</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">fixture_id</span><span class="p">,</span> <span class="n">team</span><span class="o">-&gt;&gt;</span><span class="s1">'id'</span> <span class="k">AS</span> <span class="n">team_id</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">team</span><span class="o">-&gt;</span><span class="s1">'players'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">player</span>
    <span class="k">FROM</span> <span class="n">teams</span>
<span class="p">),</span>
<span class="n">stats</span> <span class="k">AS</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">fixture_id</span><span class="p">,</span> <span class="n">team_id</span><span class="p">,</span> <span class="n">player</span><span class="o">-&gt;&gt;</span><span class="s1">'id'</span> <span class="k">AS</span> <span class="n">player_id</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">player</span><span class="o">-&gt;</span><span class="s1">'statistics'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">stat</span>
    <span class="k">FROM</span> <span class="n">players</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="p">...</span>
<span class="k">FROM</span> <span class="n">stats</span>
</code></pre></div></div>

<p>Each CTE processes a smaller set of intermediate results rather than holding the entire explosion in memory at once. After this refactor the query ran cleanly within the memory budget.</p>

<p>This was one of the more interesting problems in the whole project — not because the fix was complicated, but because the failure mode was invisible in development. Local DuckDB has no memory cap. The bug only appeared in production, and the error message (<code class="language-plaintext highlighter-rouge">Out of Memory: cannot allocate</code>) was not immediately helpful in pointing to nested UNNESTs as the culprit.</p>

<h2 id="gold-kimball-dimensional-modelling">Gold: Kimball Dimensional Modelling</h2>

<p>Once silver tables were clean and stable, I built the gold layer as a <strong>Kimball star schema</strong>. The fact grain is one row per team per match — meaning each fixture produces two rows in the fact table, one for the home team and one for the away team. This grain was chosen because most analytical questions in football are team-centric: how many goals has this team scored? What is their xG differential at home?</p>

<p>The fact table, <code class="language-plaintext highlighter-rouge">fct_match_results</code>, contains all the measurable numeric values — goals, shots, possession, passes, fouls, cards, expected goals, and points earned. Everything else is pushed into dimensions.</p>

<p>We ended up with 10 dimension tables:</p>

<table>
  <thead>
    <tr>
      <th>Dimension</th>
      <th>What it represents</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_date</code></td>
      <td>Calendar attributes of the match date</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_time</code></td>
      <td>Hour of kick-off and period of day (Morning / Afternoon / Evening / Night)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_team</code></td>
      <td>Club identity — name, code, country, logo</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_opponent_team</code></td>
      <td>Role-playing dimension — same structure as <code class="language-plaintext highlighter-rouge">dim_team</code>, aliased to represent the opposing club</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_match</code></td>
      <td>Match metadata — round, season, names, status</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_league</code></td>
      <td>League identity — name, country, logo, flag</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_stadium</code></td>
      <td>Venue — name, city, capacity, surface</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_referee</code></td>
      <td>Referee name</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_team_side</code></td>
      <td>Home or Away</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_match_result</code></td>
      <td>Win, Draw, or Loss</td>
    </tr>
  </tbody>
</table>

<p>Having <code class="language-plaintext highlighter-rouge">dim_team</code> and <code class="language-plaintext highlighter-rouge">dim_opponent_team</code> as separate dimensions makes self-join queries much cleaner. A query like “show me all home results where the opponent was a top-four side” is a simple join rather than a correlated subquery.</p>
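
<p>As a sketch, assuming the gold tables live in a <code class="language-plaintext highlighter-rouge">gold</code> schema and with a hardcoded, purely illustrative top-four list (the model has no standings dimension to derive one from):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import duckdb

con = duckdb.connect("md:superligaen")

# Club names below are placeholders for whatever "top four" means to you.
rows = con.execute("""
    SELECT t.team_name, o.team_name AS opponent, r.match_result
    FROM gold.fct_match_results f
    JOIN gold.dim_team          t ON f.team_sk          = t.team_sk
    JOIN gold.dim_opponent_team o ON f.opponent_team_sk = o.team_sk
    JOIN gold.dim_team_side     s ON f.team_side_sk     = s.team_side_sk
    JOIN gold.dim_match_result  r ON f.match_result_sk  = r.match_result_sk
    WHERE s.team_side = 'Home'
      AND o.team_name IN ('FC Copenhagen', 'FC Midtjylland', 'Brøndby', 'AGF')
""").fetchall()
</code></pre></div></div>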

<h2 id="surrogate-keys-and-sentinel-rows">Surrogate Keys and Sentinel Rows</h2>

<p>Every dimension uses an integer <strong>surrogate key</strong> as its primary key — <code class="language-plaintext highlighter-rouge">team_sk</code>, <code class="language-plaintext highlighter-rouge">match_sk</code>, and so on. These are stable across runs: when a new referee appears, they get a new SK, and existing referees keep theirs. This is the standard Kimball pattern.</p>
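
<p>A minimal sketch of that assignment logic, with illustrative table and column names (the project's actual dbt models are not shown here):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># New members get keys above the current maximum; existing members are
# left untouched, so their surrogate keys survive every rebuild.
con.execute("""
    INSERT INTO gold.dim_referee (referee_sk, referee_name)
    SELECT COALESCE(
               (SELECT MAX(referee_sk) FROM gold.dim_referee WHERE referee_sk &gt; 0),
               0
           ) + ROW_NUMBER() OVER (ORDER BY src.referee_name),
           src.referee_name
    FROM (SELECT DISTINCT referee_name FROM silver.referees) AS src
    WHERE NOT EXISTS (
        SELECT 1 FROM gold.dim_referee d
        WHERE d.referee_name = src.referee_name
    )
""")
</code></pre></div></div>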

<p>Each dimension also has two <strong>sentinel rows</strong>:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">-1 Unknown [Attribute]</code> — for records where the value exists but is genuinely unknown</li>
  <li><code class="language-plaintext highlighter-rouge">-2 Not Applicable [Attribute]</code> — for records where the dimension does not apply</li>
</ul>

<p>These sentinel rows mean the fact table can always have a valid foreign key, even for fixtures that have no referee assigned yet (common for upcoming matches) or for venues that are listed as TBD. Dashboard queries never need a <code class="language-plaintext highlighter-rouge">LEFT JOIN</code> or a null check — every fact row joins cleanly.</p>

<p>One early version of the sentinel rows had generic labels like <code class="language-plaintext highlighter-rouge">-1 Unknown</code> and <code class="language-plaintext highlighter-rouge">-2 Not Applicable</code>. We later updated them to be attribute-specific: <code class="language-plaintext highlighter-rouge">-1 Unknown Referee</code>, <code class="language-plaintext highlighter-rouge">-2 Not Applicable Stadium</code>, and so on. This makes them instantly readable in query results without having to check which dimension you are looking at.</p>
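
<p>Seeding them is mechanical. A sketch for one dimension, assuming the <code class="language-plaintext highlighter-rouge">dim_referee</code> shape shown in the model below:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Re-seed the sentinel rows idempotently, with attribute-specific labels.
con.execute("DELETE FROM gold.dim_referee WHERE referee_sk IN (-1, -2)")
con.execute("""
    INSERT INTO gold.dim_referee (referee_sk, referee_name) VALUES
        (-1, 'Unknown Referee'),
        (-2, 'Not Applicable Referee')
""")

# The fact build then falls back to a sentinel key when a lookup misses,
# e.g. COALESCE(r.referee_sk, -1) AS referee_sk, so no fact row ever
# carries a NULL foreign key.
</code></pre></div></div>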

<h2 id="the-data-model">The Data Model</h2>

<p>Putting it all together: the star schema centres on a single fact table joined to ten dimensions, at the team-per-match grain described above.</p>

<script type="module">
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
  mermaid.initialize({ startOnLoad: true, theme: 'dark' });
</script>

<div class="mermaid">
erDiagram
    fct_match_results {
        int date_sk FK
        int time_sk FK
        int team_sk FK
        int opponent_team_sk FK
        int league_sk FK
        int stadium_sk FK
        int referee_sk FK
        int match_sk FK
        int team_side_sk FK
        int match_result_sk FK
        int points_earned
        int goals_scored
        int goals_conceded
        int goals_ht_scored
        int goals_ht_conceded
        int shots_on_goal
        int shots_off_goal
        int total_shots
        int blocked_shots
        int shots_insidebox
        int shots_outsidebox
        decimal ball_possession_pct
        int total_passes
        int passes_accurate
        int fouls
        int corner_kicks
        int offsides
        int yellow_cards
        int red_cards
        int goalkeeper_saves
        decimal expected_goals
    }
    dim_date {
        int date_sk PK
        date date
        int year
        varchar quarter
        int month
        varchar month_name
        int week_number
        int day_of_week
        varchar day_name
        varchar is_weekend
    }
    dim_time {
        int time_sk PK
        int hour
        varchar period_of_day
    }
    dim_team {
        int team_sk PK
        int team_id
        varchar team_name
        varchar team_code
        varchar team_country
        int team_founded_year
        varchar team_logo
    }
    dim_opponent_team {
        int team_sk PK
        int team_id
        varchar team_name
        varchar team_code
        varchar team_country
        int team_founded_year
        varchar team_logo
    }
    dim_match {
        int match_sk PK
        int match_id
        varchar season
        varchar match_round_name
        varchar match_round_type
        int match_round_number
        varchar match_status
        varchar match_name
        varchar match_short_name
        varchar match_result
        varchar kick_off_time
    }
    dim_league {
        int league_sk PK
        int league_id
        varchar league_name
        varchar league_type
        varchar league_logo
        varchar league_country
        varchar league_country_code
        varchar league_country_flag
    }
    dim_stadium {
        int stadium_sk PK
        int stadium_id
        varchar stadium_name
        varchar stadium_address
        varchar stadium_city
        varchar stadium_country
        int stadium_capacity
        varchar stadium_surface
    }
    dim_referee {
        int referee_sk PK
        varchar referee_name
    }
    dim_team_side {
        int team_side_sk PK
        varchar team_side
    }
    dim_match_result {
        int match_result_sk PK
        varchar match_result
    }

    dim_date           ||--|{ fct_match_results : "date_sk"
    dim_time           ||--|{ fct_match_results : "time_sk"
    dim_team           ||--|{ fct_match_results : "team_sk"
    dim_opponent_team  ||--|{ fct_match_results : "opponent_team_sk"
    dim_match          ||--|{ fct_match_results : "match_sk"
    dim_league         ||--|{ fct_match_results : "league_sk"
    dim_stadium        ||--|{ fct_match_results : "stadium_sk"
    dim_referee        ||--|{ fct_match_results : "referee_sk"
    dim_team_side      ||--|{ fct_match_results : "team_side_sk"
    dim_match_result   ||--|{ fct_match_results : "match_result_sk"
</div>

<p>Next: building the dashboard on top of this model.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="transformation" /><category term="data-modeling" /><summary type="html"><![CDATA[With 21 tables of raw JSON sitting in MotherDuck, the next step was to make the data actually usable. That meant two more layers: silver (clean, structured tables) and gold (a Kimball dimensional model designed for analytics).]]></summary></entry><entry><title type="html">Building the Bronze Layer — Raw Ingestion</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/11/building-the-bronze-layer.html" rel="alternate" type="text/html" title="Building the Bronze Layer — Raw Ingestion" /><published>2026-04-11T00:00:00+00:00</published><updated>2026-04-11T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/11/building-the-bronze-layer</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/11/building-the-bronze-layer.html"><![CDATA[<p>The bronze layer has one job: pull data from the API and store it in the warehouse exactly as it arrived. No transformation, no business logic. If the API gives you a nested JSON blob, you store a nested JSON blob. The philosophy is that raw data is irreplaceable — once you transform it, you lose the original, and if your transformation logic turns out to be wrong you have nothing to go back to.</p>

<h2 id="the-first-version-was-a-monolith">The First Version Was a Monolith</h2>

<p>The first version of the ingestion code was a single Python script that did everything: built URLs, called the API, handled pagination, and wrote to MotherDuck. It worked, but by the time it covered three or four endpoints it was already hard to follow. The first significant refactor split it into focused modules:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">api.py</code> — the HTTP client, rate limiting, retry/backoff</li>
  <li><code class="language-plaintext highlighter-rouge">db.py</code> — the MotherDuck connection</li>
  <li><code class="language-plaintext highlighter-rouge">config.py</code> — endpoint configuration, environment variables</li>
  <li><code class="language-plaintext highlighter-rouge">ingest_*.py</code> — one file per logical group of endpoints</li>
</ul>

<p>That structure stayed for the rest of the project.</p>

<h2 id="the-rate-limiting-problem">The Rate Limiting Problem</h2>

<p>api-football.com allows 10 requests per minute. The first version of the code just did <code class="language-plaintext highlighter-rouge">time.sleep(6)</code> between calls — six seconds per request sits exactly at the limit if every call returned instantly, which of course they do not. In practice each call's own latency is added on top of the sleep, so the pacing is slower than it needs to be, and a fixed sleep has no way to recover if a 429 arrives anyway.</p>

<p>The proper solution is <strong>retry with exponential backoff</strong>: make the call, and if you get a 429, wait and try again. The wait doubles each retry with a small random jitter to avoid thundering herd problems. Here is the core of what we ended up with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">api_get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">params</span><span class="p">,</span> <span class="n">retries</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">retries</span><span class="p">):</span>
        <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">response</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">response</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">429</span><span class="p">:</span>
            <span class="n">wait</span> <span class="o">=</span> <span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="n">attempt</span><span class="p">)</span> <span class="o">+</span> <span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
            <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="n">wait</span><span class="p">)</span>
    <span class="k">raise</span> <span class="nb">Exception</span><span class="p">(</span><span class="sa">f</span><span class="s">"API call failed after </span><span class="si">{</span><span class="n">retries</span><span class="si">}</span><span class="s"> retries: </span><span class="si">{</span><span class="n">url</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>This was more correct than sleep-based limiting and also faster on days when the API was responding quickly.</p>

<h2 id="idempotency-delete-before-insert">Idempotency: Delete-Before-Insert</h2>

<p>One of the more important early decisions was how to handle re-runs. If the nightly pipeline fails halfway through and you re-run it, you do not want to double-insert yesterday’s data. The pattern we chose was <strong>delete-before-insert</strong>: before inserting any records for a given date window, delete everything for that date window first. If the insert then succeeds, you have exactly one copy of the data. If it fails, the next run will delete and re-insert cleanly.</p>

<p>For full loads the pattern is a full table truncate before reloading. For incremental runs it is a targeted delete by date range — typically a rolling window of recent days to catch any late-arriving corrections from the API.</p>
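
<p>In outline the pattern looks like this, with illustrative table and column names:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def load_window(con, table, start_date, end_date, rows):
    """Idempotent load: wipe the date window, then insert it exactly once."""
    con.execute("BEGIN TRANSACTION")
    con.execute(
        f"DELETE FROM bronze.{table} WHERE match_date BETWEEN ? AND ?",
        [start_date, end_date],
    )
    con.executemany(f"INSERT INTO bronze.{table} VALUES (?, ?, ?)", rows)
    con.execute("COMMIT")
</code></pre></div></div>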

<p>Getting this right took several iterations. One early bug was that the teams endpoint returns a JSON array of team objects, and the initial code was inserting the whole array as a single row rather than unnesting it first. Another was that the venues endpoint needed a <code class="language-plaintext highlighter-rouge">season</code> parameter that was not being passed, so it silently returned empty results for several seasons.</p>

<h2 id="21-endpoints">21 Endpoints</h2>

<p>The final bronze layer covers 21 endpoints from the api-football.com free tier:</p>

<table>
  <thead>
    <tr>
      <th>Group</th>
      <th>Endpoints</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>League</td>
      <td>leagues, seasons, rounds, standings</td>
    </tr>
    <tr>
      <td>Match</td>
      <td>fixtures, fixture events, fixture statistics, fixture lineups, fixture player stats</td>
    </tr>
    <tr>
      <td>Team</td>
      <td>teams, venues</td>
    </tr>
    <tr>
      <td>Player</td>
      <td>players, top scorers, top assisters, top yellow cards, top red cards</td>
    </tr>
    <tr>
      <td>Prediction</td>
      <td>fixture predictions, fixture odds</td>
    </tr>
    <tr>
      <td>Other</td>
      <td>injuries, sidelined, trophies, coaches</td>
    </tr>
  </tbody>
</table>

<p>Each endpoint has its own ingestion script because the API parameters, pagination behaviour, and response shapes vary significantly. Some endpoints require a league ID and season. Some require a fixture ID and can only be fetched one fixture at a time (which is why fixture statistics alone accounts for a large chunk of the daily API call budget). Some, like odds and predictions, return data that changes right up until kick-off, so they need to be re-fetched regularly.</p>
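
<p>Those per-endpoint differences are the sort of thing that ends up in <code class="language-plaintext highlighter-rouge">config.py</code>. A hypothetical sketch of what such a registry could look like (the real module is not shown in this post):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative only: paths, parameter lists, and the refetch flag are
# stand-ins for whatever the real config encodes.
ENDPOINTS = {
    "fixtures":           {"path": "fixtures",            "params": ("league", "season"), "refetch": False},
    "fixture_statistics": {"path": "fixtures/statistics", "params": ("fixture",),         "refetch": False},
    "odds":               {"path": "odds",                "params": ("fixture",),         "refetch": True},  # changes until kick-off
}
</code></pre></div></div>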

<h2 id="incremental-vs-full-load">Incremental vs Full Load</h2>

<p>The ingestion runner supports two modes, controlled by a command-line flag:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">--full-load</code> — truncate and reload everything from 2020 to the current season. Used for initial bootstrap and occasional corrective runs.</li>
  <li>Incremental (default) — fetch only the last N days of data (default 5). Used by the nightly GitHub Actions cron job.</li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">--lookback</code> parameter controls how many days back the incremental run looks. Setting it to 5 rather than 1 gives a buffer for late-arriving data and ensures that matches played over the weekend are picked up reliably even if the cron runs only once per day.</p>
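
<p>The runner's interface, sketched with <code class="language-plaintext highlighter-rouge">argparse</code> (the real script may define more options):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import argparse

parser = argparse.ArgumentParser(description="Bronze ingestion runner")
parser.add_argument("--full-load", action="store_true",
                    help="truncate and reload every season from 2020 onwards")
parser.add_argument("--lookback", type=int, default=5,
                    help="days of history to re-fetch on an incremental run")
args = parser.parse_args()
</code></pre></div></div>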

<p>The nightly schedule runs at <strong>23:00 UTC</strong>, which is midnight Danish time in winter and 1 a.m. in summer. Either way, it is late enough to catch the result of any evening match from the same day.</p>

<h2 id="a-note-on-the-season">A Note on the Season</h2>

<p>One subtlety with football APIs: the “current season” is not always obvious. Superligaen runs on a split-season calendar — the 2025/26 season starts in mid-2025 and finishes in mid-2026. Whether you are in the “2025” season or the “2026” season depends on the calendar and the convention used by the API.</p>

<p>The first version of the code hardcoded <code class="language-plaintext highlighter-rouge">CURRENT_SEASON = 2024</code>, which was wrong. The second version tried to derive it from today’s date using a heuristic (if month &gt;= 7, season = year; else season = year - 1), which was better but still not correct around season transitions. The final version queries the leagues endpoint directly to find which season is currently active. The API knows — you should ask it.</p>
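
<p>A sketch of that lookup, reusing the <code class="language-plaintext highlighter-rouge">api_get</code> helper from earlier; the endpoint URL and the league ID constant are assumptions, not copied from the project:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SUPERLIGAEN_LEAGUE_ID = 119  # assumed ID; verify against the /leagues endpoint

def get_current_season():
    """Return the season year the API flags as currently active."""
    data = api_get("https://v3.football.api-sports.io/leagues",
                   params={"id": SUPERLIGAEN_LEAGUE_ID})
    for league in data["response"]:
        for season in league["seasons"]:
            if season.get("current"):
                return season["year"]
    raise RuntimeError("API returned no current season")
</code></pre></div></div>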

<h2 id="what-lands-in-bronze">What Lands in Bronze</h2>

<p>Every bronze table has a consistent shape: the raw API response JSON is stored in a <code class="language-plaintext highlighter-rouge">response</code> column of type <code class="language-plaintext highlighter-rouge">JSON</code>, alongside metadata columns like <code class="language-plaintext highlighter-rouge">inserted_at</code>, <code class="language-plaintext highlighter-rouge">season</code>, and sometimes <code class="language-plaintext highlighter-rouge">fixture_id</code> or <code class="language-plaintext highlighter-rouge">league_id</code> depending on the endpoint. No type casting, no column extraction — just the raw blob.</p>
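
<p>A representative bronze table, sketched as DDL (the exact metadata columns vary by endpoint):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>con.execute("""
    CREATE TABLE IF NOT EXISTS bronze.fixtures (
        fixture_id  INTEGER,
        league_id   INTEGER,
        season      INTEGER,
        response    JSON,      -- the raw API payload, byte for byte
        inserted_at TIMESTAMP DEFAULT current_timestamp
    )
""")
</code></pre></div></div>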

<p>This means bronze is not useful for direct querying, but it is a perfect foundation for silver. If we ever decide that a silver transformation was wrong, we can drop the silver table and rerun the transformation against the unchanged bronze data. That is the point.</p>

<p>Next: turning the raw JSON blobs into structured, typed tables — and then into a Kimball dimensional model.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="ingestion" /><summary type="html"><![CDATA[The bronze layer has one job: pull data from the API and store it in the warehouse exactly as it arrived. No transformation, no business logic. If the API gives you a nested JSON blob, you store a nested JSON blob. The philosophy is that raw data is irreplaceable — once you transform it, you lose the original, and if your transformation logic turns out to be wrong you have nothing to go back to.]]></summary></entry><entry><title type="html">Choosing a Data Source</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/10/choosing-the-data-source.html" rel="alternate" type="text/html" title="Choosing a Data Source" /><published>2026-04-10T00:00:00+00:00</published><updated>2026-04-10T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/10/choosing-the-data-source</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/10/choosing-the-data-source.html"><![CDATA[<h2 id="choosing-the-data-source">Choosing the Data Source</h2>

<p>The first thing I did was look for a football API. Two candidates came up immediately.</p>

<p><strong>football-data.org</strong> was the first I tried. The documentation is clean, the free tier is usable, and the community around it is solid. I set up an account, read through the available endpoints, and then hit the first wall: the Danish Superligaen is not included in the free tier. The free tier covers the top five European leagues — Premier League, La Liga, Bundesliga, Serie A, and Ligue 1 — plus a handful of cup competitions. Superligaen requires a paid plan. That was the end of that.</p>

<p><strong>api-football.com</strong> was the second option, and it did have Superligaen in the free tier. The free tier gives you access to all leagues but caps you at <strong>100 API calls per day</strong>. That sounds like a lot until you start mapping out what you need to fetch.</p>

<p>Each match needs: fixture metadata, statistics, events (goals, cards), lineups, player stats, and predictions, which works out to five or six calls per fixture. At roughly 200 matches per season, the available seasons (2020–2025) mean well over a thousand fixtures, so thousands of calls for match data alone. Add standings, top scorers, venues, referees, injuries, and odds, and a full bootstrap run adds up to many thousands of calls, spread over months of the daily budget with careful throttling.</p>

<p>The 100-call-per-day limit was also one of the reasons I couldn’t expand the project beyond Superligaen. I wanted to include the Danish Cup, the Danish second division, or even add a comparison league from another country — Turkish Süper Lig would have been interesting. The architecture we built is genuinely scalable: adding a new league is just a config change. But doing so would immediately exceed the daily quota. That is a ceiling I am still bumping against.</p>

<p>There is also a rate limit within each day: <strong>10 requests per minute</strong>. Exceeding it returns a 429, and the API does not give you a Retry-After header — you just have to know the limit and respect it. Early versions of the ingestion code used a naive <code class="language-plaintext highlighter-rouge">sleep(6)</code> between calls, which worked but was fragile. We later replaced it with a retry-with-backoff strategy that is both more correct and more efficient.</p>

<h2 id="choosing-the-data-warehouse-motherduck">Choosing the Data Warehouse: MotherDuck</h2>

<p>Once I knew the data source, I needed a place to store it. The options I considered were:</p>

<ul>
  <li><strong>BigQuery</strong> — the obvious choice for cloud data warehousing. Free tier is generous. But the Python client, the IAM setup, the service account JSON files — it adds friction before you have written a single query.</li>
  <li><strong>Snowflake</strong> — industry standard, but the free trial expires and then you are paying.</li>
  <li><strong>DuckDB local</strong> — fast, zero setup, perfect for development. But the data only lives on your laptop, which rules out a public dashboard.</li>
  <li><strong>MotherDuck</strong> — DuckDB in the cloud. The free tier gives you <strong>10 GB of storage</strong> and a managed DuckDB instance accessible via a token. Zero infrastructure to manage. The Python client is just <code class="language-plaintext highlighter-rouge">duckdb</code> with a <code class="language-plaintext highlighter-rouge">md:</code> prefix on the connection string.</li>
</ul>

<p>MotherDuck won immediately. The developer experience is exceptional: you connect with a token, you write standard DuckDB SQL, and your data persists in the cloud. There is a web UI, a CLI, and it integrates natively with Evidence.dev (which I will get to later). For a side project, it removes every infrastructure concern that would otherwise become a time sink.</p>
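
<p>Connecting really is that small. A sketch, with the token read from the environment (the exact variable name is whatever you configured):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import duckdb

# The md: prefix routes an otherwise ordinary DuckDB connection to MotherDuck.
con = duckdb.connect(
    f"md:superligaen_dev?motherduck_token={os.environ['MOTHERDUCK_TOKEN']}"
)
print(con.execute("SELECT 42").fetchone())
</code></pre></div></div>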

<p>The one thing MotherDuck does not tell you upfront is that the free plan runs on a <strong>Pulse</strong> tier compute node with a 953 MB memory cap. That limit would come back to bite us several weeks later in a way that was not obvious at all, but more on that in a later post.</p>

<h2 id="two-environments-from-the-start">Two Environments from the Start</h2>

<p>One decision I made early that saved a lot of grief: set up two separate databases on MotherDuck — <code class="language-plaintext highlighter-rouge">superligaen</code> for production and <code class="language-plaintext highlighter-rouge">superligaen_dev</code> for development. Every pipeline run, every dbt model, every SQL change would first be tested against <code class="language-plaintext highlighter-rouge">superligaen_dev</code> before being pointed at prod. This is standard practice in professional data engineering but easy to skip on a side project. I am glad I did not skip it.</p>

<p>The GitHub Actions workflows have a <code class="language-plaintext highlighter-rouge">target_db</code> parameter so you can point any run at either database. The dbt profiles have explicit <code class="language-plaintext highlighter-rouge">dev</code> and <code class="language-plaintext highlighter-rouge">prod</code> targets. This separation meant I could break things in <code class="language-plaintext highlighter-rouge">superligaen_dev</code> freely — and I broke things constantly — without ever risking the production data.</p>

<h2 id="what-we-are-building">What We Are Building</h2>

<p>At this point the stack was: <strong>api-football.com</strong> as the source, <strong>MotherDuck</strong> as the warehouse, <strong>Python</strong> for ingestion, and some form of transformation and dashboard yet to be decided. The architecture I had in mind was a <strong>medallion architecture</strong>: three layers.</p>

<ul>
  <li><strong>Bronze</strong> — raw JSON from the API, stored as-is in MotherDuck. One table per API endpoint. No transformation, no validation, just a faithful copy of whatever the API returned.</li>
  <li><strong>Silver</strong> — cleaned, typed, structured relational tables. Each bronze table gets flattened, nulls handled, types cast correctly.</li>
  <li><strong>Gold</strong> — a Kimball dimensional model. A fact table and a set of dimension tables designed for analytical queries and dashboard consumption.</li>
</ul>

<p>That design held throughout the project. The tools used to implement each layer changed significantly, but the three-layer architecture never did.</p>

<p>Next: building the bronze layer.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="architecture" /><summary type="html"><![CDATA[Choosing the Data Source]]></summary></entry><entry><title type="html">The Idea — Why I Built This</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/09/the-idea.html" rel="alternate" type="text/html" title="The Idea — Why I Built This" /><published>2026-04-09T00:00:00+00:00</published><updated>2026-04-09T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/09/the-idea</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/09/the-idea.html"><![CDATA[<p>Every project starts somewhere. This one started with two things that happened to collide at the right moment.</p>

<h2 id="the-constraint">The Constraint</h2>

<p>I wanted to build an end-to-end data engineering project. Not a tutorial, not a sandbox — something real, with actual data, running in production. The full stack: ingestion, transformation, storage, a live dashboard.</p>

<p>The constraint I gave myself was simple: free, open source tools only. I wasn’t going to spend money on a personal project just to prove I could build something. If the tools couldn’t hold their own without a credit card, I’d find different tools.</p>

<h2 id="the-data-problem">The Data Problem</h2>

<p>The harder constraint was the data. I needed something real. Not synthetic, not historical-only, not something I’d lose interest in halfway through. Weather data felt generic. Titanic and movies felt like tutorial territory. I wanted data I’d actually want to look at after the build was done.</p>

<p>Then I remembered that I love football.</p>

<p>Football data from a major European league would mean: a live API with ongoing updates, a season structure that maps cleanly to a data model, and enough statistical depth to make analytics interesting. Fixtures, results, lineups, referee assignments, standings — the shape of the data is almost purpose-built for a star schema.</p>

<h2 id="the-personal-angle">The Personal Angle</h2>

<p>The league was the next question. I’d recently moved to Denmark. I follow the big European leagues like most football fans, but I realised I knew almost nothing about Danish football — the players, the clubs, the rivalries, how the season works.</p>

<p>That’s the moment the project clicked into focus. I wasn’t just going to build a data engineering showcase. I was going to build something I’d actually use: an analytics product for Superligaen, the Danish premier football league, aimed at people like me who want to understand it better.</p>

<p>That’s how this started. A love of data engineering, a constraint around cost, and a new country I wanted to understand through the sport I already loved.</p>

<p>The rest is what went wrong — and eventually right.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><summary type="html"><![CDATA[Every project starts somewhere. This one started with two things that happened to collide at the right moment.]]></summary></entry></feed>