<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://saugki1773.github.io/data-engineering-blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://saugki1773.github.io/data-engineering-blog/" rel="alternate" type="text/html" /><updated>2026-04-26T16:50:44+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/feed.xml</id><title type="html">Building Superligaen Analytics</title><subtitle>A data engineer&apos;s diary — from raw API calls to a live football dashboard. Everything that went wrong, and how we fixed it.</subtitle><author><name>Salih Ugur Kımıllı</name></author><entry><title type="html">What’s Next — The Road Ahead</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/20/whats-next.html" rel="alternate" type="text/html" title="What’s Next — The Road Ahead" /><published>2026-04-20T00:00:00+00:00</published><updated>2026-04-20T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/20/whats-next</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/20/whats-next.html"><![CDATA[<p>This project started as a personal challenge: build a real end-to-end data engineering system using only free tools, on a dataset I actually care about. It shipped. It runs nightly. It has real users.</p>

<p>But there is a lot more to build.</p>

<p>Here is what is on the roadmap.</p>

<h2 id="dbt-semantic-layer">dbt Semantic Layer</h2>

<p>Right now the gold layer exposes raw dimensional tables and fact tables. The dashboard queries them directly with hand-written SQL. This works, but it means business logic lives in two places — the transformation layer and the dashboard queries.</p>

<p>The dbt Semantic Layer would centralise all metric definitions in one place. <code class="language-plaintext highlighter-rouge">total_goals</code>, <code class="language-plaintext highlighter-rouge">win_rate</code>, <code class="language-plaintext highlighter-rouge">xg_overperformance</code> — defined once in dbt, queryable everywhere. The dashboard would consume metrics rather than writing joins. No more drift between how a metric is calculated in one page versus another.</p>
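
<p>To make the drift concrete, here is the failure mode in miniature: two hypothetical dashboard queries that both claim to compute <code class="language-plaintext highlighter-rouge">win_rate</code> but disagree on the population. The column names are illustrative, not the project’s actual schema.</p>

<pre><code class="language-sql">-- Page A: win rate over completed matches only
select count(*) filter (where match_result = 'Win') * 1.0 / count(*) as win_rate
from fct_match_results
where match_status = 'Finished';

-- Page B: the "same" metric without the status filter;
-- unplayed fixtures silently deflate the rate
select count(*) filter (where match_result = 'Win') * 1.0 / count(*) as win_rate
from fct_match_results;
</code></pre>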

<h2 id="data-quality-tests">Data Quality Tests</h2>

<p>The pipeline runs nightly and the dashboard is public. If bad data makes it through, real users see wrong numbers — and there is currently no automated check stopping that from happening.</p>

<p>dbt has a built-in testing framework that fits naturally into the existing setup. Tests live alongside the models and run as part of the same pipeline. The basics are straightforward: uniqueness and not-null constraints on keys, accepted value checks on categorical columns, referential integrity between the fact table and every dimension. These catch the obvious failures — a venue ID that resolves to nothing, a match result outside the expected set, a duplicate surrogate key.</p>

<p>Beyond the built-in tests, the dbt-expectations package brings a richer set of statistical checks: row count thresholds, value range assertions, column distribution checks. These are useful for catching subtler issues — a round where suspiciously few goals were recorded, a team with negative possession, a season where no matches were flagged as complete.</p>
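
<p>dbt also supports singular tests: plain SQL files that fail if they return any rows. A sketch of the “suspiciously few goals in a round” check, with illustrative column names (the real fact table may differ):</p>

<pre><code class="language-sql">-- tests/assert_round_goal_totals_plausible.sql
-- A dbt singular test: any row returned here fails the run.
select
    season,
    round,
    sum(goals_for) as total_goals
from {{ ref('fct_match_results') }}
group by season, round
having sum(goals_for) &lt; 5  -- threshold is a guess; tune it against history
</code></pre>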

<p>The goal is for every nightly run to either produce correct data or fail loudly. Silent corruption is the worst outcome in a pipeline like this.</p>

<h2 id="player-analytics">Player Analytics</h2>

<p>The bronze layer already ingests player-level data — appearances, goals, assists, shots, passes, cards, ratings — for every fixture. None of it surfaces in the dashboard yet.</p>

<p>The plan is to build a full player analytics layer on top of what is already there: top scorers, top assisters, player form over time, contribution per 90 minutes. A player profile page in the dashboard. Head-to-head comparisons.</p>

<p>The data is sitting in the warehouse. It just needs to be modelled and served.</p>
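
<p>As a taste of the modelling, contribution per 90 minutes is a single aggregation away. This sketch assumes a flattened player-match table with goals, assists, and minutes columns; the names are illustrative:</p>

<pre><code class="language-sql">select
    player_name,
    sum(goals)   as goals,
    sum(assists) as assists,
    -- normalise by playing time so substitutes compare fairly with starters
    round((sum(goals) + sum(assists)) * 90.0 / nullif(sum(minutes_played), 0), 2)
        as contributions_per_90
from silver.fixture_players
group by player_name
having sum(minutes_played) &gt;= 450  -- ignore tiny samples (five full matches)
order by contributions_per_90 desc
</code></pre>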

<h2 id="beyond-the-top-flight">Beyond the Top Flight</h2>

<p>Right now the pipeline only ingests Superligaen — the Danish top division. But the same API covers the full Danish football pyramid: the 1st Division (second tier), the 2nd Division, and the DBU Pokalen cup competition.</p>

<p>The plan is to extend ingestion to cover all of these, model them through the same bronze → silver → gold pipeline, and build dedicated dashboard pages for each competition. Teams moving up and down between divisions, cup upsets, cross-division comparisons — all of it becomes possible once the data is flowing.</p>

<h2 id="discussions-page">Discussions Page</h2>

<p>This is the most experimental idea on the list.</p>

<p>The concept: a page in the dashboard where the data is not just displayed but <em>discussed</em>. Different analytical personalities — a statistician who trusts only the numbers, a football traditionalist who distrusts xG, a fan who reads into every result — analyse the same data and reach different conclusions.</p>

<p>The personas would be generated by a language model, grounded in the actual data from the warehouse, and updated each matchday. It would make the dashboard less of a static report and more of a living conversation about the season.</p>

<p>Whether this is useful or just a novelty is an open question. But it is worth finding out.</p>

<h2 id="advanced-bi-techniques">Advanced BI Techniques</h2>

<p>The current dashboard tells you what happened. The next step is to tell you what it means — and to do that, the visualisations need to work harder.</p>

<p>Right now most charts are single-metric bar charts or line charts. They are readable, but they leave a lot of the data on the table. The plan is to move toward techniques that surface relationships and context that are invisible in a single-axis view.</p>

<p>Scatter plots comparing attacking output versus defensive solidity across teams. Radar charts that give a full performance fingerprint for a team or player in a single glance. Rolling averages that separate a genuine form run from a single good result. These are standard tools in professional football analytics — and the data to drive all of them is already in the warehouse.</p>
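
<p>A rolling form line, for instance, is one window function. Table and column names here are illustrative:</p>

<pre><code class="language-sql">select
    team_name,
    match_date,
    expected_goals,
    -- average over the current match and the four before it
    avg(expected_goals) over (
        partition by team_name
        order by match_date
        rows between 4 preceding and current row
    ) as xg_rolling_5
from gold.fct_match_results
</code></pre>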

<p>On the benchmarking side, the dashboard currently shows a team’s numbers in isolation. A win rate of 60% means something very different depending on whether the league average is 40% or 55%. The plan is to add contextual benchmarks throughout: league averages as reference lines on charts, percentile rankings alongside raw values, and head-to-head comparisons that anchor a team’s performance relative to its peers.</p>
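
<p>The benchmark framing is equally cheap to compute: a league-average reference value and a percentile rank fall out of the same query (again, illustrative names):</p>

<pre><code class="language-sql">with team_season as (
    select
        team_name,
        count(*) filter (where match_result = 'Win') * 1.0 / count(*) as win_rate
    from gold.fct_match_results
    group by team_name
)
select
    team_name,
    win_rate,
    avg(win_rate) over ()                    as league_avg_win_rate,  -- reference line
    percent_rank() over (order by win_rate)  as win_rate_percentile   -- 0 worst, 1 best
from team_season
</code></pre>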

<p>The goal is a dashboard where a casual fan understands the story at a glance, and an analyst can find genuine signal without exporting to a spreadsheet.</p>

<h2 id="closing-thought">Closing Thought</h2>

<p>The original goal was to learn by building something real. That goal was met. But the more interesting discovery is that a project like this does not have a natural end — it just has the next thing to build.</p>

<p>The data keeps arriving. The season keeps moving. The dashboard keeps growing.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="roadmap" /><summary type="html"><![CDATA[This project started as a personal challenge: build a real end-to-end data engineering system using only free tools, on a dataset I actually care about. It shipped. It runs nightly. It has real users.]]></summary></entry><entry><title type="html">Global Launch — A Conclusion</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/19/global-launch.html" rel="alternate" type="text/html" title="Global Launch — A Conclusion" /><published>2026-04-19T00:00:00+00:00</published><updated>2026-04-19T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/19/global-launch</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/19/global-launch.html"><![CDATA[<p>By April 2026 — roughly ten days after the first real commit — the pipeline was stable, the dashboard had seven pages, and the nightly job was running cleanly. It was time to call it launched.</p>

<p>The live dashboard is at <a href="https://superligaanalytics.vercel.app/">superligaanalytics.vercel.app</a>.</p>

<h2 id="what-shipped">What Shipped</h2>

<p>The final state of the project at launch:</p>

<ul>
  <li><strong>Bronze layer</strong> — 21 endpoints ingested nightly from api-football.com into MotherDuck</li>
  <li><strong>Silver layer</strong> — 18 dbt models that flatten and type-cast the raw JSON</li>
  <li><strong>Gold layer</strong> — Kimball star schema: 10 dimension tables and <code class="language-plaintext highlighter-rouge">fct_match_results</code></li>
  <li><strong>Dashboard</strong> — 7 Evidence.dev pages: Home, Standings, Match Results, Upcoming Fixtures, League Analytics, Team Analytics, Referee Analytics</li>
  <li><strong>Orchestration</strong> — GitHub Actions nightly pipeline: bronze → silver → gold → Vercel deploy</li>
  <li><strong>CI</strong> — dbt compile on every pull request to main</li>
  <li><strong>Dev/prod separation</strong> — <code class="language-plaintext highlighter-rouge">superligaen_dev</code> and <code class="language-plaintext highlighter-rouge">superligaen</code> databases, <code class="language-plaintext highlighter-rouge">dev</code> and <code class="language-plaintext highlighter-rouge">prod</code> dbt targets</li>
</ul>

<h2 id="reflections">Reflections</h2>

<p>This project went from initial commit to live dashboard in approximately ten days of active development. That is fast enough that almost every architectural choice was made under time pressure, with incomplete information, and revised at least once.</p>

<p>The tools that delivered exactly what they promised: MotherDuck, DuckDB, dbt, GitHub Actions. No surprises, no unexplained failures.</p>

<p>The tools that required more navigation: Netlify (build limits), Cloudflare Pages (file size limits), Evidence.dev (underdocumented behaviour around layouts and template syntax).</p>

<p>The choices I would make the same way: MotherDuck as the warehouse, dbt for transformations, Evidence.dev for the dashboard, Vercel for hosting, the Kimball star schema, the dev/prod database separation.</p>

<p>The choices I would make differently: evaluate hosting platforms against build frequency before committing; add dbt tests from the beginning rather than deferring them; set up dbt documentation from the start.</p>

<p>The ambitions that did not make it in: multi-league support (blocked by API quota), dbt tests, dbt semantic layer, dbt documentation, real-time match events (requires a paid API tier).</p>

<p>The project is now in maintenance mode. The nightly pipeline runs, the data updates, and the dashboard reflects last night’s results every morning. For a project built entirely on free tiers in ten days, that is a good place to be.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="deployment" /><summary type="html"><![CDATA[By April 2026 — roughly ten days after the first real commit — the pipeline was stable, the dashboard had seven pages, and the nightly job was running cleanly. It was time to call it launched.]]></summary></entry><entry><title type="html">Adding Web Analytics — Vercel and Cloudflare</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/19/launch-and-analytics.html" rel="alternate" type="text/html" title="Adding Web Analytics — Vercel and Cloudflare" /><published>2026-04-19T00:00:00+00:00</published><updated>2026-04-19T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/19/launch-and-analytics</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/19/launch-and-analytics.html"><![CDATA[<p>Once the dashboard was live, the natural question was: is anyone visiting it? We needed analytics.</p>

<p>The options were straightforward: Vercel Analytics (built into the hosting platform), Cloudflare Web Analytics (a separate free service), or Google Analytics (the industry default but heavier and requiring a cookie consent banner under GDPR).</p>

<p>Google Analytics was ruled out immediately. GDPR cookie banners are user-hostile and unnecessary for a project where we genuinely do not need detailed personal data — we just want page view counts and visitor numbers.</p>

<p>Both Vercel Analytics and Cloudflare Web Analytics are <strong>cookieless and privacy-first</strong>. They count visits using aggregated signals rather than tracking individuals. No consent banner required.</p>

<p>We decided to use both — not because we needed redundancy, but because each gives you a slightly different view of traffic data, and running both costs nothing.</p>

<h2 id="the-first-attempt-and-how-it-broke-everything">The First Attempt (And How It Broke Everything)</h2>

<p>Adding Vercel Analytics was not as simple as enabling a toggle in the Vercel dashboard. For SvelteKit apps (which Evidence.dev is built on), you need to install the <code class="language-plaintext highlighter-rouge">@vercel/analytics</code> npm package and call <code class="language-plaintext highlighter-rouge">inject()</code> somewhere in your app.</p>

<p>The natural place for a call that should run on every page is a layout component. Evidence.dev supports a <code class="language-plaintext highlighter-rouge">pages/+layout.svelte</code> file. I created one:</p>

<pre><code class="language-svelte">&lt;script&gt;
  import { onMount } from 'svelte';
  import { inject } from '@vercel/analytics';
  onMount(() =&gt; inject());
&lt;/script&gt;

&lt;slot /&gt;
</code></pre>

<p>This broke the site completely. Every page lost its navigation, sidebar, theming, and layout chrome.</p>

<p>The reason: Evidence.dev has its own built-in <code class="language-plaintext highlighter-rouge">+layout.svelte</code> that imports its stylesheet, loads its default layout component (<code class="language-plaintext highlighter-rouge">EvidenceDefaultLayout</code>), and handles the app shell. When you create a <code class="language-plaintext highlighter-rouge">pages/+layout.svelte</code>, Evidence copies your file into its template directory, <strong>overwriting its own layout</strong>. My file only had <code class="language-plaintext highlighter-rouge">&lt;slot /&gt;</code> — which rendered the page content but none of Evidence’s surrounding UI.</p>

<p>The fix required knowing what Evidence’s own layout looked like. Once that was clear, the correct version wraps Evidence’s layout rather than replacing it:</p>

<pre><code class="language-svelte">&lt;script&gt;
  import '@evidence-dev/tailwind/fonts.css';
  import '../app.css';
  import { EvidenceDefaultLayout } from '@evidence-dev/core-components';
  import { onMount } from 'svelte';
  import { inject } from '@vercel/analytics';

  export let data;

  onMount(() =&gt; inject());
&lt;/script&gt;

&lt;EvidenceDefaultLayout {data}&gt;
  &lt;slot slot="content" /&gt;
&lt;/EvidenceDefaultLayout&gt;
</code></pre>

<p>This correctly extends Evidence’s layout rather than replacing it.</p>

<h2 id="adding-cloudflare-web-analytics">Adding Cloudflare Web Analytics</h2>

<p>Adding the Cloudflare beacon alongside Vercel Analytics required one more trick. The Cloudflare script tag looks like this in standard HTML:</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;script </span><span class="na">defer</span> <span class="na">src=</span><span class="s">"https://static.cloudflareinsights.com/beacon.min.js"</span>
  <span class="na">data-cf-beacon=</span><span class="s">'{"token": "your-token"}'</span><span class="nt">&gt;&lt;/script&gt;</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">{</code> and <code class="language-plaintext highlighter-rouge">}</code> characters in the <code class="language-plaintext highlighter-rouge">data-cf-beacon</code> attribute value are Svelte template delimiters. If you put this tag inside a <code class="language-plaintext highlighter-rouge">&lt;svelte:head&gt;</code> block, Svelte’s compiler tries to parse <code class="language-plaintext highlighter-rouge">{"token": "..."}</code> as a template expression and fails with a parse error.</p>

<p>The workaround: inject the script using <code class="language-plaintext highlighter-rouge">document.createElement</code> inside the <code class="language-plaintext highlighter-rouge">onMount</code> callback, where Svelte’s template compiler does not process the string:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">onMount</span><span class="p">(()</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="nx">inject</span><span class="p">();</span> <span class="c1">// Vercel Analytics</span>

  <span class="kd">const</span> <span class="nx">script</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nx">createElement</span><span class="p">(</span><span class="dl">'</span><span class="s1">script</span><span class="dl">'</span><span class="p">);</span>
  <span class="nx">script</span><span class="p">.</span><span class="nx">defer</span> <span class="o">=</span> <span class="kc">true</span><span class="p">;</span>
  <span class="nx">script</span><span class="p">.</span><span class="nx">src</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">https://static.cloudflareinsights.com/beacon.min.js</span><span class="dl">'</span><span class="p">;</span>
  <span class="nx">script</span><span class="p">.</span><span class="nx">dataset</span><span class="p">.</span><span class="nx">cfBeacon</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">({</span> <span class="na">token</span><span class="p">:</span> <span class="dl">'</span><span class="s1">your-token</span><span class="dl">'</span> <span class="p">});</span>
  <span class="nb">document</span><span class="p">.</span><span class="nx">head</span><span class="p">.</span><span class="nx">appendChild</span><span class="p">(</span><span class="nx">script</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>

<p>This is less elegant than a <code class="language-plaintext highlighter-rouge">&lt;script&gt;</code> tag in the HTML but works correctly and is straightforward to understand.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="analytics" /><category term="deployment" /><summary type="html"><![CDATA[Once the dashboard was live, the natural question was: is anyone visiting it? We needed analytics.]]></summary></entry><entry><title type="html">Migrating to dbt — When Raw SQL Isn’t Enough</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/18/dbt-migration.html" rel="alternate" type="text/html" title="Migrating to dbt — When Raw SQL Isn’t Enough" /><published>2026-04-18T00:00:00+00:00</published><updated>2026-04-18T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/18/dbt-migration</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/18/dbt-migration.html"><![CDATA[<p>When the silver and gold layers were first built, they ran as plain SQL files executed by Python runner scripts — <code class="language-plaintext highlighter-rouge">run_silver.py</code> and <code class="language-plaintext highlighter-rouge">run_gold.py</code>. Each script would read a directory of <code class="language-plaintext highlighter-rouge">.sql</code> files, connect to MotherDuck, and execute them in a specific order. It worked. The data was correct. But as the number of models grew and the logic became more complex, the cracks in the approach started to show.</p>

<h2 id="the-problem-with-plain-sql-runners">The Problem with Plain SQL Runners</h2>

<p><strong>Order dependency was manual.</strong> If <code class="language-plaintext highlighter-rouge">dim_team</code> needed to run before <code class="language-plaintext highlighter-rouge">fct_match_results</code>, you had to remember that and name or number the files accordingly. When we added a new model, figuring out where it slotted into the execution order was entirely up to the developer.</p>

<p><strong>No incremental logic.</strong> Every run was a full rebuild. For silver models that flatten tens of thousands of fixture records, this was slow and unnecessary. There was no way to say “only process records that arrived since the last run” without writing custom Python logic around each SQL file.</p>

<p><strong>No compilation validation.</strong> SQL syntax errors only appeared at runtime. There was no way to check whether a model was valid without actually running it against the database.</p>

<p><strong>No lineage.</strong> There was no documentation of which model depended on what. Understanding the pipeline required reading the code, not querying a manifest.</p>

<p><strong>Parameterisation was awkward.</strong> The nightly pipeline uses a rolling date window (last 5 days). The full-refresh pipeline uses a full season reload. Passing different variables to the same SQL file required string interpolation in Python, which is fragile and hard to read.</p>

<p>All of these problems have well-known solutions in the data engineering world. They are solved by <strong>dbt</strong>.</p>

<h2 id="what-dbt-brings">What dbt Brings</h2>

<p>dbt (data build tool) is a transformation framework that sits on top of your data warehouse and manages SQL models. You write SQL <code class="language-plaintext highlighter-rouge">SELECT</code> statements, and dbt handles the <code class="language-plaintext highlighter-rouge">CREATE TABLE AS</code>, dependency ordering, incremental logic, and documentation.</p>

<p>The key features we needed:</p>

<p><strong>Dependency resolution</strong> — dbt builds a DAG (directed acyclic graph) of your models by analysing which model references which. You write <code class="language-plaintext highlighter-rouge">{{ ref('silver_fixtures') }}</code> instead of a table name, and dbt knows to run <code class="language-plaintext highlighter-rouge">silver_fixtures</code> before whatever model references it.</p>

<p><strong>Incremental models</strong> — dbt’s <code class="language-plaintext highlighter-rouge">incremental</code> materialisation allows a model to process only new or updated records on each run, using a configurable filter. For silver models that process fixture data, this means a nightly run that touches only the last 5 days of records rather than reprocessing all 200+ fixtures from every season.</p>
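
<p>A minimal sketch of the pattern, assuming the bronze table is declared as a dbt source and carries a <code class="language-plaintext highlighter-rouge">fixture_date</code> column (both assumptions; the post does not show the real model):</p>

<pre><code class="language-sql">-- models/silver/silver_fixtures.sql (illustrative sketch)
{{ config(materialized='incremental', unique_key='fixture_id') }}

select
    fixture_id,
    fixture_date,
    home_team_id,
    away_team_id
from {{ source('bronze', 'fixtures') }}
{% if is_incremental() %}
  -- nightly runs touch only the rolling window instead of every season
  where fixture_date &gt;= current_date - interval '5 days'
{% endif %}
</code></pre>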

<p><strong>Macros</strong> — dbt allows you to write Jinja macros for reusable SQL logic. We created three: <code class="language-plaintext highlighter-rouge">fixture_filter()</code> for filtering by date window, <code class="language-plaintext highlighter-rouge">season_filter()</code> for full-season reloads, and <code class="language-plaintext highlighter-rouge">gold_incremental_filter()</code> for the gold layer’s incremental logic. These macros mean the filter logic lives in one place and is consistent across all models.</p>
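
<p>The real macro bodies are not shown in this post, but a minimal sketch of what <code class="language-plaintext highlighter-rouge">fixture_filter()</code> could look like:</p>

<pre><code class="language-sql">-- macros/fixture_filter.sql (hypothetical sketch, not the project's macro)
{% macro fixture_filter(date_column, lookback_days=5) %}
    {{ date_column }} &gt;= current_date - interval '{{ lookback_days }} days'
{% endmacro %}

-- used in a model as:
--   where {{ fixture_filter('fixture_date') }}
</code></pre>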

<p><strong>CI validation</strong> — dbt’s <code class="language-plaintext highlighter-rouge">compile</code> command parses all SQL and resolves all references without executing anything. Adding <code class="language-plaintext highlighter-rouge">dbt compile --target dev</code> to the CI workflow (run on every pull request) means SQL syntax errors and broken references are caught before they ever reach main.</p>

<p><strong>Schema management</strong> — dbt’s <code class="language-plaintext highlighter-rouge">generate_schema_name</code> macro controls the schema name applied to model outputs. By default, dbt concatenates a model’s custom schema onto the target schema (producing names like <code class="language-plaintext highlighter-rouge">main_silver</code>); overriding the macro ensures models land in exactly the right schema (<code class="language-plaintext highlighter-rouge">silver</code> or <code class="language-plaintext highlighter-rouge">gold</code>) regardless of the dbt target.</p>
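
<p>The override follows the recipe from dbt’s documentation: return the custom schema name verbatim instead of concatenating it onto the target schema. A sketch; the project’s exact macro may differ:</p>

<pre><code class="language-sql">-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {# use the configured schema as-is, so models land in silver/gold #}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
</code></pre>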

<h2 id="the-migration">The Migration</h2>

<p>The migration itself took one day. The process:</p>

<ol>
  <li>Create the <code class="language-plaintext highlighter-rouge">dbt/</code> directory with <code class="language-plaintext highlighter-rouge">dbt_project.yml</code> and <code class="language-plaintext highlighter-rouge">profiles.yml</code>.</li>
  <li>Port each SQL file to a dbt model by wrapping it in the appropriate dbt configuration block.</li>
  <li>Replace hardcoded table names with <code class="language-plaintext highlighter-rouge">{{ ref() }}</code> calls where models depend on each other.</li>
  <li>Extract the date window and season filters into macros.</li>
  <li>Update the GitHub Actions workflows to run <code class="language-plaintext highlighter-rouge">dbt run --select silver.*</code> and <code class="language-plaintext highlighter-rouge">dbt run --select gold.*</code> instead of <code class="language-plaintext highlighter-rouge">python run_silver.py</code>.</li>
  <li>Delete the old Python runner scripts and raw SQL files.</li>
</ol>

<p>Step 6 was satisfying. Deleting code that is no longer needed is one of the better feelings in software development.</p>

<p>There was one technical issue during the migration: DuckDB’s dialect handles some SQL constructs differently from other databases, and certain patterns that work in standard SQL do not work in DuckDB’s dbt adapter. The main culprit was date arithmetic — DuckDB uses <code class="language-plaintext highlighter-rouge">INTERVAL</code> syntax and <code class="language-plaintext highlighter-rouge">epoch()</code> for timestamp operations, which dbt’s generic date macros do not always handle correctly. We needed to write DuckDB-specific macro implementations rather than relying on dbt’s cross-database abstractions.</p>
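
<p>For reference, the DuckDB-flavoured forms look like this (table names illustrative):</p>

<pre><code class="language-sql">-- rolling window filter in DuckDB's dialect
select *
from silver.fixtures
where fixture_date &gt;= current_date - interval '5 days';

-- timestamp difference in seconds via epoch()
select epoch(finished_at) - epoch(kickoff_at) as duration_seconds
from silver.fixtures;
</code></pre>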

<p>The dbt-duckdb adapter version also needed to match the DuckDB version that MotherDuck was running (1.5.1 at the time). Version mismatches between the adapter, the dbt-core version, and the DuckDB version produced cryptic errors that took a while to resolve. Once the versions were pinned and aligned, everything worked cleanly.</p>

<h2 id="two-targets-dev-and-prod">Two Targets: Dev and Prod</h2>

<p>The dbt profiles configure two targets:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">dev</code> — points to <code class="language-plaintext highlighter-rouge">superligaen_dev</code> on MotherDuck. Used for local development and for CI.</li>
  <li><code class="language-plaintext highlighter-rouge">prod</code> — points to <code class="language-plaintext highlighter-rouge">superligaen</code>. Used by the nightly GitHub Actions pipeline.</li>
</ul>

<p>The MotherDuck token is passed via the <code class="language-plaintext highlighter-rouge">MOTHERDUCK_TOKEN</code> environment variable rather than being stored in <code class="language-plaintext highlighter-rouge">profiles.yml</code>. This keeps credentials out of version control and makes rotating the token a matter of updating a GitHub Secret rather than committing a change.</p>

<h2 id="what-we-did-not-get-to">What We Did Not Get To</h2>

<p>dbt has a testing framework (<code class="language-plaintext highlighter-rouge">dbt test</code>) that lets you define assertions about your data — things like “this column has no null values”, “this foreign key always joins successfully”, “this value is always positive”. We did not implement any tests during this project. The right time to add them would be now that the models are stable and the architecture is not changing frequently. Tests would catch data quality regressions from the API before they reach the dashboard.</p>

<p>dbt also has documentation generation (<code class="language-plaintext highlighter-rouge">dbt docs generate</code>) that produces a browsable data catalog with descriptions, lineage graphs, and column-level documentation. Another thing worth adding.</p>

<p>The third item on the list is the <strong>dbt semantic layer</strong>. Rather than defining metrics like “goals per game” or “win rate” directly in dashboard SQL queries, the semantic layer lets you define them once in dbt and expose them as reusable, consistent metrics across any downstream tool. For a project where the same aggregations appear in multiple dashboard pages, this would eliminate duplication and make the numbers trustworthy by construction.</p>

<p>Next: a look at the API limits and infrastructure constraints that shape what this project can and cannot become.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="dbt" /><category term="transformation" /><summary type="html"><![CDATA[When the silver and gold layers were first built, they ran as plain SQL files executed by Python runner scripts — run_silver.py and run_gold.py. Each script would read a directory of .sql files, connect to MotherDuck, and execute them in a specific order. It worked. The data was correct. But as the number of models grew and the logic became more complex, the cracks in the approach started to show.]]></summary></entry><entry><title type="html">The Deployment Saga — Netlify, Cloudflare, and Finally Vercel</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/18/deployment-saga.html" rel="alternate" type="text/html" title="The Deployment Saga — Netlify, Cloudflare, and Finally Vercel" /><published>2026-04-18T00:00:00+00:00</published><updated>2026-04-18T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/18/deployment-saga</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/18/deployment-saga.html"><![CDATA[<p>This is the chapter I wish someone had written before I started. The deployment story is not a story about bad tools — Netlify, Cloudflare Pages, and Vercel are all good products. It is a story about free tier constraints that are easy to overlook until you hit them, and about how a project with an unusual build profile (large data files, Node.js compilation, MotherDuck token handling) does not fit neatly into the assumptions any of these platforms make.</p>

<h2 id="chapter-1-netlify">Chapter 1: Netlify</h2>

<p>I asked Gemini for a recommendation on where to host an Evidence.dev dashboard. The answer was Netlify. Netlify is a well-established static site hosting platform with a generous free tier, good documentation, and a GitHub integration that makes deployment trivially easy — push to main, Netlify rebuilds and deploys automatically.</p>

<p>I set it up. The first build worked. The dashboard was live. Everything looked fine.</p>

<p>The problem appeared five days later.</p>

<p>Netlify’s free tier includes <strong>300 build minutes per month</strong>. An Evidence.dev build — installing npm dependencies, running <code class="language-plaintext highlighter-rouge">evidence sources</code> to query MotherDuck, compiling the dashboard — takes roughly 4 to 5 minutes. If the nightly pipeline triggers a rebuild every night, that is 35 minutes a week, 150 minutes a month. Still within the limit.</p>

<p>Except: during active development, every push to the main branch triggered a rebuild. In those five days, between debugging pipeline issues, iterating on dashboard pages, and fixing configuration problems, I triggered somewhere around 60 to 70 builds. That was essentially the entire monthly quota. On day five, Netlify suspended the site’s builds until the next billing cycle.</p>

<p>I could not deploy. The site was frozen. I could have paid to upgrade, but paying for hosting on a side project did not feel right when the entire rest of the stack was on free tiers.</p>

<p>Netlify’s build limit is reasonable for a normal static marketing site that deploys a few times a week. It is not designed for a project in active development or for a pipeline that rebuilds nightly with data freshness as a feature. In hindsight, the right approach would have been to build Evidence.dev in GitHub Actions and upload the output to Netlify using the CLI, bypassing Netlify’s build system entirely. We eventually did implement this — but by then, there were other reasons to switch.</p>

<p>There were also token handling issues. Evidence.dev requires the MotherDuck token to be base64-encoded and written to a <code class="language-plaintext highlighter-rouge">connection.options.yaml</code> file before the build runs. Getting that into Netlify’s build environment in a way that survived across the <code class="language-plaintext highlighter-rouge">npm run sources</code> step required several attempts and a dedicated CI step to write the file.</p>

<h2 id="chapter-2-cloudflare-pages">Chapter 2: Cloudflare Pages</h2>

<p>Cloudflare’s market position is well-known. It runs a significant portion of the internet’s DNS and CDN infrastructure. Cloudflare Pages is their static site hosting product, and it is genuinely fast — assets are served from Cloudflare’s edge network, which means low latency everywhere. The free tier has <strong>500 builds per month</strong>, which solved the quota problem immediately.</p>

<p>I migrated. The site was up on Cloudflare Pages. The pipeline was working. Things were good.</p>

<p>Then the data grew.</p>

<p>Evidence.dev bundles the query results as Parquet files into the static build output. As we added more dashboard pages — match results going back to 2020, player stats, referee data — the build output got larger. Cloudflare Pages has a <strong>25 MB file size limit</strong> for individual deployable assets.</p>

<p>The bundled Parquet data from the Evidence.dev build was a few megabytes over that limit. Cloudflare would not deploy it. We tried compression options, tried splitting queries to reduce individual file sizes, tried serving some data lazily — nothing moved the needle enough without significantly compromising the dashboard’s performance.</p>

<p>This was a hard limit that Cloudflare was not going to lift on the free tier.</p>

<h2 id="chapter-3-vercel">Chapter 3: Vercel</h2>

<p>At this point I sat down and compared all three platforms properly against what this project actually needs:</p>

<table>
  <thead>
    <tr>
      <th>Requirement</th>
      <th>Netlify</th>
      <th>Cloudflare Pages</th>
      <th>Vercel</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Builds per month</td>
      <td>300</td>
      <td>500</td>
      <td>Unlimited (hobby)</td>
    </tr>
    <tr>
      <td>Max file size</td>
      <td>100 MB total</td>
      <td>25 MB per file</td>
      <td>100 MB per file</td>
    </tr>
    <tr>
      <td>Build timeout</td>
      <td>15 min</td>
      <td>20 min</td>
      <td>45 min</td>
    </tr>
    <tr>
      <td>GitHub integration</td>
      <td>✓</td>
      <td>✓</td>
      <td>✓</td>
    </tr>
    <tr>
      <td>Cost</td>
      <td>Free tier</td>
      <td>Free tier</td>
      <td>Free tier</td>
    </tr>
  </tbody>
</table>

<p>Vercel’s hobby tier has <strong>no monthly build limit</strong> and a <strong>100 MB per file limit</strong>. Both problems solved.</p>

<p>The migration itself was straightforward — connect the GitHub repository, configure the build command (<code class="language-plaintext highlighter-rouge">npm run build</code>) and output directory (<code class="language-plaintext highlighter-rouge">build</code>), add the MotherDuck token as an environment variable. First build succeeded.</p>

<h2 id="the-deploy-hook-debugging-saga">The Deploy Hook Debugging Saga</h2>

<p>The one complication with Vercel was controlling <em>when</em> it deploys. By default, Vercel rebuilds on every push to main. For this project, that would mean rebuilding every time a code change is merged — including changes that have nothing to do with the dashboard data. We only want to rebuild when the nightly pipeline finishes, or when manually triggered.</p>

<p>The initial approach was a Vercel <strong>deploy hook</strong> — a unique URL you POST to, and Vercel queues a build. The GitHub Actions pipeline would curl that URL at the end of the gold step.</p>

<p>This seemed simple. It was not.</p>

<p>The curl call was returning 200 but the build was not triggering. We added verbose logging to the curl command. The response body was correct. We added a separate test workflow that did nothing but curl the hook and report the response. It worked in isolation. When called from the main pipeline, it did not.</p>

<p>The exact reason was never fully isolated. There were several issues layered on top of each other: the Vercel deploy hook behaves differently when called from a GitHub Actions runner on certain network configurations, there were permission issues with the GITHUB_TOKEN in the workflow, and at one point a test commit was made to check whether Vercel was even watching the right branch.</p>

<p>We eventually abandoned the deploy hook approach entirely and replaced it with a <strong>dedicated deployment branch</strong> — <code class="language-plaintext highlighter-rouge">publish_dashboard/vercel</code>. The nightly pipeline’s final step makes an empty commit to that branch, and Vercel is configured to only watch that branch for deployments. The GitHub Actions step needs <code class="language-plaintext highlighter-rouge">contents: write</code> permission to push to a branch, which was another discovery made after the fact.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Push to Vercel deploy branch</span>
  <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
    <span class="s">git config user.email "github-actions@github.com"</span>
    <span class="s">git config user.name "GitHub Actions"</span>
    <span class="s">git checkout -B publish_dashboard/vercel origin/main</span>
    <span class="s">git commit --allow-empty -m "chore: nightly data refresh $(date -u '+%Y-%m-%d %H:%M UTC')"</span>
    <span class="s">git push origin publish_dashboard/vercel --force</span>
</code></pre></div></div>

<p>This approach is more reliable than a webhook because it uses standard git push semantics, which both GitHub Actions and Vercel’s Git integration handle correctly. The empty commit triggers the deploy without cluttering the main branch history.</p>

<h2 id="lessons">Lessons</h2>

<p>If I were starting over, I would evaluate hosting platforms against the specific build profile of the project — build frequency, output size, build time — before committing to anything. The differences between free tiers are not academic; they determine whether your project actually works at the end.</p>

<p>For Evidence.dev specifically, Vercel is the right choice as of today. Unlimited builds, large file support, straightforward integration. The deploy hook is unreliable and the branch-push approach is better.</p>

<p>Next: one of the most impactful refactors in the project — migrating the transformation layer to dbt.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="deployment" /><category term="devops" /><summary type="html"><![CDATA[This is the chapter I wish someone had written before I started. The deployment story is not a story about bad tools — Netlify, Cloudflare Pages, and Vercel are all good products. It is a story about free tier constraints that are easy to overlook until you hit them, and about how a project with an unusual build profile (large data files, Node.js compilation, MotherDuck token handling) does not fit neatly into the assumptions any of these platforms make.]]></summary></entry><entry><title type="html">The Dashboard — Discovering Evidence.dev</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/16/building-the-dashboard.html" rel="alternate" type="text/html" title="The Dashboard — Discovering Evidence.dev" /><published>2026-04-16T00:00:00+00:00</published><updated>2026-04-16T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/16/building-the-dashboard</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/16/building-the-dashboard.html"><![CDATA[<p>I knew from the start that I wanted a live public dashboard, not a static report or a screenshot. The question was which tool to use.</p>

<h2 id="why-not-the-obvious-choices">Why Not the Obvious Choices</h2>

<p><strong>Tableau</strong> and <strong>Power BI</strong> were immediately out — they are expensive, and the free tiers do not allow public sharing without embedding tricks.</p>

<p><strong>Grafana</strong> is excellent for operational metrics but not designed for product analytics dashboards. The user experience for building and sharing analytical views is awkward.</p>

<p><strong>Metabase</strong> was a serious contender. It is open source, has a clean UI, and connects to DuckDB. But self-hosting Metabase adds infrastructure overhead, and the cloud version has a cost.</p>

<p><strong>Streamlit</strong> would have worked, but it requires you to write Python to build UI, and the resulting dashboards do not look polished without significant effort.</p>

<p><strong>Superset</strong> — similar story to Metabase: powerful, but infrastructure-heavy.</p>

<h2 id="evidencedev">Evidence.dev</h2>

<p>Evidence.dev is a different paradigm entirely. You write dashboard pages in Markdown. SQL queries go in fenced code blocks directly in the page. The output of each query is available as a variable in the same file. Charts, tables, and filters are Svelte components that you use inline in the Markdown. The whole thing compiles to a static site — no server, no runtime, just HTML, CSS, and JavaScript.</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">```</span><span class="nl">sql matches
</span><span class="sb">select match_date, home_team, away_team, score
from superligaen.match_results_by_match
order by match_date desc
limit 10
```

&lt;DataTable data={matches} /&gt;
</span></code></pre></div></div>

<p>That is the entire workflow. Write a SQL query, reference its name in a component, and you have a table on your dashboard. It is the fastest path from data to UI that I have found.</p>

<p>Evidence.dev connects to MotherDuck natively via the <code class="language-plaintext highlighter-rouge">@evidence-dev/motherduck</code> plugin. At build time, it runs the SQL queries against MotherDuck and bundles the results as Parquet files into the static build output. The deployed site loads these Parquet files at runtime using DuckDB-WASM — meaning the dashboard runs analytical queries entirely in the browser, with no server. It is genuinely impressive engineering.</p>

<h2 id="dashboard-pages">Dashboard Pages</h2>

<p>We built seven pages:</p>

<p><strong>Home</strong> — a hero banner with the league logo and flag, four KPI tiles (current leader, team count, matches played, goals scored this season), and a navigation grid linking to every other page.</p>

<p><strong>Standings</strong> — three separate tables covering the Championship Round, Relegation Round, and Regular Season standings, with a season selector to browse historical tables.</p>

<p><strong>Match Results</strong> — a filterable table of all historical results with a Goals vs xG chart by round that shows which rounds were overperforming or underperforming expected goals.</p>

<p><strong>Upcoming Fixtures</strong> — a table of the next fixtures sorted by date and kick-off time, plus a match analysis section: select any upcoming fixture from the dropdown and see the head-to-head history between those two clubs and the last five results for each team.</p>

<p><strong>League Analytics</strong> — cross-team benchmarks: top scorers, most disciplined teams, possession rankings, and season-level trends. This is the “zoom out” page.</p>

<p><strong>Team Analytics</strong> — deep dive into a single team: select a team and see their season KPIs, recent form, shooting accuracy, possession stats, and discipline record. This page was the most technically interesting to build because it required combining multiple gold views with different grains.</p>

<p><strong>Referee Analytics</strong> — cards issued, fouls per match, team exposure (which referees are most frequently assigned to which clubs), and a match log for each referee. This one came from a genuine curiosity: in a small league with a small pool of referees, team-referee familiarity is a real thing.</p>

<h2 id="what-evidencedev-is-not">What Evidence.dev Is Not</h2>

<p>Evidence.dev is a static site generator. That means the data is frozen at build time. There is no live query against MotherDuck when a user loads the page — they are loading the data that was baked in during the last build. For a nightly pipeline that updates at midnight, this is perfectly fine. The dashboard is always at most 24 hours stale, which is acceptable for a league that plays two or three times a week.</p>

<p>The implication is that to refresh the data you need to trigger a new build. The nightly GitHub Actions pipeline does this automatically: bronze ingestion → silver dbt run → gold dbt run → trigger Vercel deploy. The deploy takes about two minutes and the site is updated.</p>

<h2 id="quirks-and-fixes">Quirks and Fixes</h2>

<p>A few things about Evidence.dev that were not obvious:</p>

<p><strong>The <code class="language-plaintext highlighter-rouge">%</code> sign in SQL</strong> — Evidence.dev uses <code class="language-plaintext highlighter-rouge">%</code> as a template delimiter internally, which conflicts with the SQL <code class="language-plaintext highlighter-rouge">LIKE</code> operator pattern <code class="language-plaintext highlighter-rouge">'%value%'</code>. We ran into this when formatting percentage values. The fix was to use the <code class="language-plaintext highlighter-rouge">pct0</code> format specifier that Evidence provides for formatting numbers as percentages, rather than concatenating a <code class="language-plaintext highlighter-rouge">%</code> symbol in SQL.</p>

<p><strong>Sidebar and TOC</strong> — By default, Evidence.dev renders a table of contents and a sidebar on every page. For a dashboard that is supposed to look like a product, these are in the way. The <code class="language-plaintext highlighter-rouge">sidebar: never</code> and <code class="language-plaintext highlighter-rouge">hide_toc: true</code> frontmatter options suppress them. This was not in the main documentation but buried in a GitHub issue.</p>

<p><strong>Mobile responsiveness</strong> — The default Evidence.dev layout is desktop-first. Getting the home page hero banner and the KPI grid to look right on mobile required custom CSS in the Markdown using Tailwind utility classes (which Evidence.dev ships with). Nothing dramatic, but it needed explicit work.</p>

<p><strong>The <code class="language-plaintext highlighter-rouge">evidence.config.yaml</code> layout key</strong> — An early version of the config file had an invalid <code class="language-plaintext highlighter-rouge">layout:</code> key that emitted a warning on every build. We removed it once the noise became distracting.</p>

<p>Next: the part of the project that took the most calendar time relative to its apparent simplicity — getting the dashboard deployed.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="dashboard" /><category term="frontend" /><summary type="html"><![CDATA[I knew from the start that I wanted a live public dashboard, not a static report or a screenshot. The question was which tool to use.]]></summary></entry><entry><title type="html">Silver and Gold — Transforming Data into a Star Schema</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/14/silver-and-gold-layers.html" rel="alternate" type="text/html" title="Silver and Gold — Transforming Data into a Star Schema" /><published>2026-04-14T00:00:00+00:00</published><updated>2026-04-14T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/14/silver-and-gold-layers</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/14/silver-and-gold-layers.html"><![CDATA[<p>With 21 tables of raw JSON sitting in MotherDuck, the next step was to make the data actually usable. That meant two more layers: silver (clean, structured tables) and gold (a Kimball dimensional model designed for analytics).</p>

<h2 id="silver-flattening-the-json">Silver: Flattening the JSON</h2>

<p>The silver layer’s job is to take each bronze table and turn it into a proper relational table — extract columns from the JSON, cast types, handle nulls, and normalise nested structures. Every bronze endpoint gets a corresponding silver model.</p>

<p>DuckDB’s JSON handling is one of its best features. You can navigate nested JSON with the <code class="language-plaintext highlighter-rouge">-&gt;</code> and <code class="language-plaintext highlighter-rouge">-&gt;&gt;</code> arrow operators, and <code class="language-plaintext highlighter-rouge">UNNEST</code> explodes arrays into rows. For a fixture statistics row that looks like this in bronze:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"fixture"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="mi">12345</span><span class="p">},</span><span class="w">
  </span><span class="nl">"statistics"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Shots on Goal"</span><span class="p">,</span><span class="w"> </span><span class="nl">"value"</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Ball Possession"</span><span class="p">,</span><span class="w"> </span><span class="nl">"value"</span><span class="p">:</span><span class="w"> </span><span class="s2">"55%"</span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The silver transformation pivots this into a proper row with typed columns — <code class="language-plaintext highlighter-rouge">shots_on_goal INTEGER</code>, <code class="language-plaintext highlighter-rouge">ball_possession_pct DECIMAL</code>, and so on.</p>
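
<p>A sketch of one way to express that pivot, assuming the statistics array has already been exploded into one row per fixture per statistic (the unnesting itself is covered below):</p>

<pre><code class="language-sql">select
    fixture_id,
    max(case when stat_type = 'Shots on Goal'
             then try_cast(stat_value as integer) end) as shots_on_goal,
    max(case when stat_type = 'Ball Possession'
             then try_cast(replace(stat_value, '%', '') as decimal(5, 2)) end)
        as ball_possession_pct
from unnested_fixture_statistics
group by fixture_id
</code></pre>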

<p>One rule we followed throughout: <strong>keep all columns in silver</strong>. When in doubt, keep it. Storage is cheap and you can always choose not to expose a column in gold or in the dashboard, but you cannot recreate data you threw away. Silver models kept logos, flags, URLs, internal IDs, everything — even things that looked useless at the time.</p>

<h2 id="the-motherduck-memory-limit">The MotherDuck Memory Limit</h2>

<p>The fixture_players silver model was the most complex one to write. Each fixture returns a deeply nested JSON structure: a list of teams, each containing a list of players, each containing a list of statistics. Getting all of that into a flat table required multiple levels of <code class="language-plaintext highlighter-rouge">UNNEST</code>.</p>

<p>The initial version used nested UNNESTs — unnesting teams, then unnesting players within the same query, then unnesting statistics:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="p">...</span>
<span class="k">FROM</span> <span class="n">bronze</span><span class="p">.</span><span class="n">fixture_players</span><span class="p">,</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">response</span><span class="o">-&gt;</span><span class="s1">'response'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">team</span><span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">team</span><span class="o">-&gt;</span><span class="s1">'players'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">p</span><span class="p">(</span><span class="n">player</span><span class="p">),</span>
<span class="k">UNNEST</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">player</span><span class="o">-&gt;</span><span class="s1">'statistics'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">s</span><span class="p">(</span><span class="n">stat</span><span class="p">)</span>
</code></pre></div></div>

<p>This worked fine in development on a local DuckDB instance with no memory constraints. When we ran it in production on MotherDuck’s free tier, it hit the <strong>953 MB memory cap</strong> on the Pulse compute plan and crashed.</p>

<p>The fix was to stop doing all the unnesting in a single query and instead use <strong>sequential CTEs</strong> — unpack one level per CTE, materialise it, then unpack the next level from that:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="n">teams</span> <span class="k">AS</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">fixture_id</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">response</span><span class="o">-&gt;</span><span class="s1">'response'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">team</span>
    <span class="k">FROM</span> <span class="n">bronze</span><span class="p">.</span><span class="n">fixture_players</span>
<span class="p">),</span>
<span class="n">players</span> <span class="k">AS</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">fixture_id</span><span class="p">,</span> <span class="n">team</span><span class="o">-&gt;&gt;</span><span class="s1">'id'</span> <span class="k">AS</span> <span class="n">team_id</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">team</span><span class="o">-&gt;</span><span class="s1">'players'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">player</span>
    <span class="k">FROM</span> <span class="n">teams</span>
<span class="p">),</span>
<span class="n">stats</span> <span class="k">AS</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">fixture_id</span><span class="p">,</span> <span class="n">team_id</span><span class="p">,</span> <span class="n">player</span><span class="o">-&gt;&gt;</span><span class="s1">'id'</span> <span class="k">AS</span> <span class="n">player_id</span><span class="p">,</span> <span class="k">UNNEST</span><span class="p">(</span><span class="n">player</span><span class="o">-&gt;</span><span class="s1">'statistics'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">stat</span>
    <span class="k">FROM</span> <span class="n">players</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="p">...</span>
<span class="k">FROM</span> <span class="n">stats</span>
</code></pre></div></div>

<p>Each CTE processes a smaller set of intermediate results rather than holding the entire explosion in memory at once. After this refactor the query ran cleanly within the memory budget.</p>

<p>This was one of the more interesting problems in the whole project — not because the fix was complicated, but because the failure mode was invisible in development. Local DuckDB has no memory cap. The bug only appeared in production, and the error message (<code class="language-plaintext highlighter-rouge">Out of Memory: cannot allocate</code>) was not immediately helpful in pointing to nested UNNESTs as the culprit.</p>

<h2 id="gold-kimball-dimensional-modelling">Gold: Kimball Dimensional Modelling</h2>

<p>Once silver tables were clean and stable, I built the gold layer as a <strong>Kimball star schema</strong>. The fact grain is one row per team per match — meaning each fixture produces two rows in the fact table, one for the home team and one for the away team. This grain was chosen because most analytical questions in football are team-centric: how many goals has this team scored? What is their xG differential at home?</p>

<p>The fact table, <code class="language-plaintext highlighter-rouge">fct_match_results</code>, contains all the measurable numeric values — goals, shots, possession, passes, fouls, cards, expected goals, and points earned. Everything else is pushed into dimensions.</p>

<p>We ended up with 10 dimension tables:</p>

<table>
  <thead>
    <tr>
      <th>Dimension</th>
      <th>What it represents</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_date</code></td>
      <td>Calendar attributes of the match date</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_time</code></td>
      <td>Hour of kick-off and period of day (Morning / Afternoon / Evening / Night)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_team</code></td>
      <td>Club identity — name, code, country, logo</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_opponent_team</code></td>
      <td>Role-playing dimension — same structure as <code class="language-plaintext highlighter-rouge">dim_team</code>, aliased to represent the opposing club</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_match</code></td>
      <td>Match metadata — round, season, names, status</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_league</code></td>
      <td>League identity — name, country, logo, flag</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_stadium</code></td>
      <td>Venue — name, city, capacity, surface</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_referee</code></td>
      <td>Referee name</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_team_side</code></td>
      <td>Home or Away</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">dim_match_result</code></td>
      <td>Win, Draw, or Loss</td>
    </tr>
  </tbody>
</table>

<p>Having <code class="language-plaintext highlighter-rouge">dim_team</code> and <code class="language-plaintext highlighter-rouge">dim_opponent_team</code> as separate dimensions makes self-join queries much cleaner. A query like “show me all home results where the opponent was a top-four side” is a simple join rather than a correlated subquery.</p>
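
<p>As a sketch, assuming the gold tables live in a <code class="language-plaintext highlighter-rouge">gold</code> schema and with a hardcoded, purely illustrative top-four list (the model has no standings dimension to derive one from):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import duckdb

con = duckdb.connect("md:superligaen")

# Club names below are placeholders for whatever "top four" means to you.
rows = con.execute("""
    SELECT t.team_name, o.team_name AS opponent, r.match_result
    FROM gold.fct_match_results f
    JOIN gold.dim_team          t ON f.team_sk          = t.team_sk
    JOIN gold.dim_opponent_team o ON f.opponent_team_sk = o.team_sk
    JOIN gold.dim_team_side     s ON f.team_side_sk     = s.team_side_sk
    JOIN gold.dim_match_result  r ON f.match_result_sk  = r.match_result_sk
    WHERE s.team_side = 'Home'
      AND o.team_name IN ('FC Copenhagen', 'FC Midtjylland', 'Brøndby', 'AGF')
""").fetchall()
</code></pre></div></div>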

<h2 id="surrogate-keys-and-sentinel-rows">Surrogate Keys and Sentinel Rows</h2>

<p>Every dimension uses an integer <strong>surrogate key</strong> as its primary key — <code class="language-plaintext highlighter-rouge">team_sk</code>, <code class="language-plaintext highlighter-rouge">match_sk</code>, and so on. These are stable across runs: when a new referee appears, they get a new SK, and existing referees keep theirs. This is the standard Kimball pattern.</p>
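
<p>A minimal sketch of that assignment logic, with illustrative table and column names (the project's actual dbt models are not shown here):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># New members get keys above the current maximum; existing members are
# left untouched, so their surrogate keys survive every rebuild.
con.execute("""
    INSERT INTO gold.dim_referee (referee_sk, referee_name)
    SELECT COALESCE(
               (SELECT MAX(referee_sk) FROM gold.dim_referee WHERE referee_sk &gt; 0),
               0
           ) + ROW_NUMBER() OVER (ORDER BY src.referee_name),
           src.referee_name
    FROM (SELECT DISTINCT referee_name FROM silver.referees) AS src
    WHERE NOT EXISTS (
        SELECT 1 FROM gold.dim_referee d
        WHERE d.referee_name = src.referee_name
    )
""")
</code></pre></div></div>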

<p>Each dimension also has two <strong>sentinel rows</strong>:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">-1 Unknown [Attribute]</code> — for records where the value exists but is genuinely unknown</li>
  <li><code class="language-plaintext highlighter-rouge">-2 Not Applicable [Attribute]</code> — for records where the dimension does not apply</li>
</ul>

<p>These sentinel rows mean the fact table can always have a valid foreign key, even for fixtures that have no referee assigned yet (common for upcoming matches) or for venues that are listed as TBD. Dashboard queries never need a <code class="language-plaintext highlighter-rouge">LEFT JOIN</code> or a null check — every fact row joins cleanly.</p>

<p>One early version of the sentinel rows had generic labels like <code class="language-plaintext highlighter-rouge">-1 Unknown</code> and <code class="language-plaintext highlighter-rouge">-2 Not Applicable</code>. We later updated them to be attribute-specific: <code class="language-plaintext highlighter-rouge">-1 Unknown Referee</code>, <code class="language-plaintext highlighter-rouge">-2 Not Applicable Stadium</code>, and so on. This makes them instantly readable in query results without having to check which dimension you are looking at.</p>
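
<p>Seeding them is mechanical. A sketch for one dimension, assuming the <code class="language-plaintext highlighter-rouge">dim_referee</code> shape shown in the model below:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Re-seed the sentinel rows idempotently, with attribute-specific labels.
con.execute("DELETE FROM gold.dim_referee WHERE referee_sk IN (-1, -2)")
con.execute("""
    INSERT INTO gold.dim_referee (referee_sk, referee_name) VALUES
        (-1, 'Unknown Referee'),
        (-2, 'Not Applicable Referee')
""")

# The fact build then falls back to a sentinel key when a lookup misses,
# e.g. COALESCE(r.referee_sk, -1) AS referee_sk, so no fact row ever
# carries a NULL foreign key.
</code></pre></div></div>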

<h2 id="the-data-model">The Data Model</h2>

<p>Putting it all together: the star schema centres on a single fact table joined to ten dimensions, at the team-per-match grain described above.</p>

<script type="module">
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
  mermaid.initialize({ startOnLoad: true, theme: 'dark' });
</script>

<div class="mermaid">
erDiagram
    fct_match_results {
        int date_sk FK
        int time_sk FK
        int team_sk FK
        int opponent_team_sk FK
        int league_sk FK
        int stadium_sk FK
        int referee_sk FK
        int match_sk FK
        int team_side_sk FK
        int match_result_sk FK
        int points_earned
        int goals_scored
        int goals_conceded
        int goals_ht_scored
        int goals_ht_conceded
        int shots_on_goal
        int shots_off_goal
        int total_shots
        int blocked_shots
        int shots_insidebox
        int shots_outsidebox
        decimal ball_possession_pct
        int total_passes
        int passes_accurate
        int fouls
        int corner_kicks
        int offsides
        int yellow_cards
        int red_cards
        int goalkeeper_saves
        decimal expected_goals
    }
    dim_date {
        int date_sk PK
        date date
        int year
        varchar quarter
        int month
        varchar month_name
        int week_number
        int day_of_week
        varchar day_name
        varchar is_weekend
    }
    dim_time {
        int time_sk PK
        int hour
        varchar period_of_day
    }
    dim_team {
        int team_sk PK
        int team_id
        varchar team_name
        varchar team_code
        varchar team_country
        int team_founded_year
        varchar team_logo
    }
    dim_opponent_team {
        int team_sk PK
        int team_id
        varchar team_name
        varchar team_code
        varchar team_country
        int team_founded_year
        varchar team_logo
    }
    dim_match {
        int match_sk PK
        int match_id
        varchar season
        varchar match_round_name
        varchar match_round_type
        int match_round_number
        varchar match_status
        varchar match_name
        varchar match_short_name
        varchar match_result
        varchar kick_off_time
    }
    dim_league {
        int league_sk PK
        int league_id
        varchar league_name
        varchar league_type
        varchar league_logo
        varchar league_country
        varchar league_country_code
        varchar league_country_flag
    }
    dim_stadium {
        int stadium_sk PK
        int stadium_id
        varchar stadium_name
        varchar stadium_address
        varchar stadium_city
        varchar stadium_country
        int stadium_capacity
        varchar stadium_surface
    }
    dim_referee {
        int referee_sk PK
        varchar referee_name
    }
    dim_team_side {
        int team_side_sk PK
        varchar team_side
    }
    dim_match_result {
        int match_result_sk PK
        varchar match_result
    }

    dim_date           ||--|{ fct_match_results : "date_sk"
    dim_time           ||--|{ fct_match_results : "time_sk"
    dim_team           ||--|{ fct_match_results : "team_sk"
    dim_opponent_team  ||--|{ fct_match_results : "opponent_team_sk"
    dim_match          ||--|{ fct_match_results : "match_sk"
    dim_league         ||--|{ fct_match_results : "league_sk"
    dim_stadium        ||--|{ fct_match_results : "stadium_sk"
    dim_referee        ||--|{ fct_match_results : "referee_sk"
    dim_team_side      ||--|{ fct_match_results : "team_side_sk"
    dim_match_result   ||--|{ fct_match_results : "match_result_sk"
</div>

<p>Next: building the dashboard on top of this model.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="transformation" /><category term="data-modeling" /><summary type="html"><![CDATA[With 21 tables of raw JSON sitting in MotherDuck, the next step was to make the data actually usable. That meant two more layers: silver (clean, structured tables) and gold (a Kimball dimensional model designed for analytics).]]></summary></entry><entry><title type="html">Building the Bronze Layer — Raw Ingestion</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/11/building-the-bronze-layer.html" rel="alternate" type="text/html" title="Building the Bronze Layer — Raw Ingestion" /><published>2026-04-11T00:00:00+00:00</published><updated>2026-04-11T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/11/building-the-bronze-layer</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/11/building-the-bronze-layer.html"><![CDATA[<p>The bronze layer has one job: pull data from the API and store it in the warehouse exactly as it arrived. No transformation, no business logic. If the API gives you a nested JSON blob, you store a nested JSON blob. The philosophy is that raw data is irreplaceable — once you transform it, you lose the original, and if your transformation logic turns out to be wrong you have nothing to go back to.</p>

<h2 id="the-first-version-was-a-monolith">The First Version Was a Monolith</h2>

<p>The first version of the ingestion code was a single Python script that did everything: built URLs, called the API, handled pagination, and wrote to MotherDuck. It worked, but by the time it covered three or four endpoints it was already hard to follow. The first significant refactor split it into focused modules:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">api.py</code> — the HTTP client, rate limiting, retry/backoff</li>
  <li><code class="language-plaintext highlighter-rouge">db.py</code> — the MotherDuck connection</li>
  <li><code class="language-plaintext highlighter-rouge">config.py</code> — endpoint configuration, environment variables</li>
  <li><code class="language-plaintext highlighter-rouge">ingest_*.py</code> — one file per logical group of endpoints</li>
</ul>

<p>That structure stayed for the rest of the project.</p>

<h2 id="the-rate-limiting-problem">The Rate Limiting Problem</h2>

<p>api-football.com allows 10 requests per minute. The first version of the code just did <code class="language-plaintext highlighter-rouge">time.sleep(6)</code> between calls — six seconds per request sits exactly at the limit if every call returned instantly, which of course they do not. In practice each call's own latency is added on top of the sleep, so the pacing is slower than it needs to be, and a fixed sleep has no way to recover if a 429 arrives anyway.</p>

<p>The proper solution is <strong>retry with exponential backoff</strong>: make the call, and if you get a 429, wait and try again. The wait doubles each retry with a small random jitter to avoid thundering herd problems. Here is the core of what we ended up with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">api_get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">params</span><span class="p">,</span> <span class="n">retries</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">retries</span><span class="p">):</span>
        <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">HEADERS</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="n">params</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">response</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">response</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">response</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">429</span><span class="p">:</span>
            <span class="n">wait</span> <span class="o">=</span> <span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="n">attempt</span><span class="p">)</span> <span class="o">+</span> <span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
            <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="n">wait</span><span class="p">)</span>
    <span class="k">raise</span> <span class="nb">Exception</span><span class="p">(</span><span class="sa">f</span><span class="s">"API call failed after </span><span class="si">{</span><span class="n">retries</span><span class="si">}</span><span class="s"> retries: </span><span class="si">{</span><span class="n">url</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>This was more correct than sleep-based limiting and also faster on days when the API was responding quickly.</p>

<h2 id="idempotency-delete-before-insert">Idempotency: Delete-Before-Insert</h2>

<p>One of the more important early decisions was how to handle re-runs. If the nightly pipeline fails halfway through and you re-run it, you do not want to double-insert yesterday’s data. The pattern we chose was <strong>delete-before-insert</strong>: before inserting any records for a given date window, delete everything for that date window first. If the insert then succeeds, you have exactly one copy of the data. If it fails, the next run will delete and re-insert cleanly.</p>

<p>For full loads the pattern is a full table truncate before reloading. For incremental runs it is a targeted delete by date range — typically a rolling window of recent days to catch any late-arriving corrections from the API.</p>
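
<p>In outline the pattern looks like this, with illustrative table and column names:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def load_window(con, table, start_date, end_date, rows):
    """Idempotent load: wipe the date window, then insert it exactly once."""
    con.execute("BEGIN TRANSACTION")
    con.execute(
        f"DELETE FROM bronze.{table} WHERE match_date BETWEEN ? AND ?",
        [start_date, end_date],
    )
    con.executemany(f"INSERT INTO bronze.{table} VALUES (?, ?, ?)", rows)
    con.execute("COMMIT")
</code></pre></div></div>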

<p>Getting this right took several iterations. One early bug was that the teams endpoint returns a JSON array of team objects, and the initial code was inserting the whole array as a single row rather than unnesting it first. Another was that the venues endpoint needed a <code class="language-plaintext highlighter-rouge">season</code> parameter that was not being passed, so it silently returned empty results for several seasons.</p>

<h2 id="21-endpoints">21 Endpoints</h2>

<p>The final bronze layer covers 21 endpoints from the api-football.com free tier:</p>

<table>
  <thead>
    <tr>
      <th>Group</th>
      <th>Endpoints</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>League</td>
      <td>leagues, seasons, rounds, standings</td>
    </tr>
    <tr>
      <td>Match</td>
      <td>fixtures, fixture events, fixture statistics, fixture lineups, fixture player stats</td>
    </tr>
    <tr>
      <td>Team</td>
      <td>teams, venues</td>
    </tr>
    <tr>
      <td>Player</td>
      <td>players, top scorers, top assisters, top yellow cards, top red cards</td>
    </tr>
    <tr>
      <td>Prediction</td>
      <td>fixture predictions, fixture odds</td>
    </tr>
    <tr>
      <td>Other</td>
      <td>injuries, sidelined, trophies, coaches</td>
    </tr>
  </tbody>
</table>

<p>Each endpoint has its own ingestion script because the API parameters, pagination behaviour, and response shapes vary significantly. Some endpoints require a league ID and season. Some require a fixture ID and can only be fetched one fixture at a time (which is why fixture statistics alone accounts for a large chunk of the daily API call budget). Some, like odds and predictions, return data that changes right up until kick-off, so they need to be re-fetched regularly.</p>
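
<p>Those per-endpoint differences are the sort of thing that ends up in <code class="language-plaintext highlighter-rouge">config.py</code>. A hypothetical sketch of what such a registry could look like (the real module is not shown in this post):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative only: paths, parameter lists, and the refetch flag are
# stand-ins for whatever the real config encodes.
ENDPOINTS = {
    "fixtures":           {"path": "fixtures",            "params": ("league", "season"), "refetch": False},
    "fixture_statistics": {"path": "fixtures/statistics", "params": ("fixture",),         "refetch": False},
    "odds":               {"path": "odds",                "params": ("fixture",),         "refetch": True},  # changes until kick-off
}
</code></pre></div></div>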

<h2 id="incremental-vs-full-load">Incremental vs Full Load</h2>

<p>The ingestion runner supports two modes, controlled by a command-line flag:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">--full-load</code> — truncate and reload everything from 2020 to the current season. Used for initial bootstrap and occasional corrective runs.</li>
  <li>Incremental (default) — fetch only the last N days of data (default 5). Used by the nightly GitHub Actions cron job.</li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">--lookback</code> parameter controls how many days back the incremental run looks. Setting it to 5 rather than 1 gives a buffer for late-arriving data and ensures that matches played over the weekend are picked up reliably even if the cron runs only once per day.</p>
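
<p>The runner's interface, sketched with <code class="language-plaintext highlighter-rouge">argparse</code> (the real script may define more options):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import argparse

parser = argparse.ArgumentParser(description="Bronze ingestion runner")
parser.add_argument("--full-load", action="store_true",
                    help="truncate and reload every season from 2020 onwards")
parser.add_argument("--lookback", type=int, default=5,
                    help="days of history to re-fetch on an incremental run")
args = parser.parse_args()
</code></pre></div></div>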

<p>The nightly schedule runs at <strong>23:00 UTC</strong>, which is midnight Danish time in winter and 1 a.m. in summer. Either way, it is late enough to catch the result of any evening match from the same day.</p>

<h2 id="a-note-on-the-season">A Note on the Season</h2>

<p>One subtlety with football APIs: the “current season” is not always obvious. Superligaen runs on a split-season calendar — the 2025/26 season starts in mid-2025 and finishes in mid-2026. Whether you are in the “2025” season or the “2026” season depends on the calendar and the convention used by the API.</p>

<p>The first version of the code hardcoded <code class="language-plaintext highlighter-rouge">CURRENT_SEASON = 2024</code>, which was wrong. The second version tried to derive it from today’s date using a heuristic (if month &gt;= 7, season = year; else season = year - 1), which was better but still not correct around season transitions. The final version queries the leagues endpoint directly to find which season is currently active. The API knows — you should ask it.</p>
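
<p>A sketch of that lookup, reusing the <code class="language-plaintext highlighter-rouge">api_get</code> helper from earlier; the endpoint URL and the league ID constant are assumptions, not copied from the project:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SUPERLIGAEN_LEAGUE_ID = 119  # assumed ID; verify against the /leagues endpoint

def get_current_season():
    """Return the season year the API flags as currently active."""
    data = api_get("https://v3.football.api-sports.io/leagues",
                   params={"id": SUPERLIGAEN_LEAGUE_ID})
    for league in data["response"]:
        for season in league["seasons"]:
            if season.get("current"):
                return season["year"]
    raise RuntimeError("API returned no current season")
</code></pre></div></div>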

<h2 id="what-lands-in-bronze">What Lands in Bronze</h2>

<p>Every bronze table has a consistent shape: the raw API response JSON is stored in a <code class="language-plaintext highlighter-rouge">response</code> column of type <code class="language-plaintext highlighter-rouge">JSON</code>, alongside metadata columns like <code class="language-plaintext highlighter-rouge">inserted_at</code>, <code class="language-plaintext highlighter-rouge">season</code>, and sometimes <code class="language-plaintext highlighter-rouge">fixture_id</code> or <code class="language-plaintext highlighter-rouge">league_id</code> depending on the endpoint. No type casting, no column extraction — just the raw blob.</p>
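
<p>A representative bronze table, sketched as DDL (the exact metadata columns vary by endpoint):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>con.execute("""
    CREATE TABLE IF NOT EXISTS bronze.fixtures (
        fixture_id  INTEGER,
        league_id   INTEGER,
        season      INTEGER,
        response    JSON,      -- the raw API payload, byte for byte
        inserted_at TIMESTAMP DEFAULT current_timestamp
    )
""")
</code></pre></div></div>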

<p>This means bronze is not useful for direct querying, but it is a perfect foundation for silver. If we ever decide that a silver transformation was wrong, we can drop the silver table and rerun the transformation against the unchanged bronze data. That is the point.</p>

<p>Next: turning the raw JSON blobs into structured, typed tables — and then into a Kimball dimensional model.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="ingestion" /><summary type="html"><![CDATA[The bronze layer has one job: pull data from the API and store it in the warehouse exactly as it arrived. No transformation, no business logic. If the API gives you a nested JSON blob, you store a nested JSON blob. The philosophy is that raw data is irreplaceable — once you transform it, you lose the original, and if your transformation logic turns out to be wrong you have nothing to go back to.]]></summary></entry><entry><title type="html">Choosing a Data Source</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/10/choosing-the-data-source.html" rel="alternate" type="text/html" title="Choosing a Data Source" /><published>2026-04-10T00:00:00+00:00</published><updated>2026-04-10T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/10/choosing-the-data-source</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/10/choosing-the-data-source.html"><![CDATA[<h2 id="choosing-the-data-source">Choosing the Data Source</h2>

<p>The first thing I did was look for a football API. Two candidates came up immediately.</p>

<p><strong>football-data.org</strong> was the first I tried. The documentation is clean, the free tier is usable, and the community around it is solid. I set up an account, read through the available endpoints, and then hit the first wall: the Danish Superligaen is not included in the free tier. The free tier covers the top five European leagues — Premier League, La Liga, Bundesliga, Serie A, and Ligue 1 — plus a handful of cup competitions. Superligaen requires a paid plan. That was the end of that.</p>

<p><strong>api-football.com</strong> was the second option, and it did have Superligaen in the free tier. The free tier gives you access to all leagues but caps you at <strong>100 API calls per day</strong>. That sounds like a lot until you start mapping out what you need to fetch.</p>

<p>Each match needs: fixture metadata, statistics, events (goals, cards), lineups, player stats, and predictions, which works out to five or six calls per fixture. At roughly 200 matches per season, the available seasons (2020–2025) mean well over a thousand fixtures, so thousands of calls for match data alone. Add standings, top scorers, venues, referees, injuries, and odds, and a full bootstrap run adds up to many thousands of calls, spread over months of the daily budget with careful throttling.</p>

<p>The 100-call-per-day limit was also one of the reasons I couldn’t expand the project beyond Superligaen. I wanted to include the Danish Cup, the Danish second division, or even add a comparison league from another country — Turkish Süper Lig would have been interesting. The architecture we built is genuinely scalable: adding a new league is just a config change. But doing so would immediately exceed the daily quota. That is a ceiling I am still bumping against.</p>

<p>There is also a rate limit within each day: <strong>10 requests per minute</strong>. Exceeding it returns a 429, and the API does not give you a Retry-After header — you just have to know the limit and respect it. Early versions of the ingestion code used a naive <code class="language-plaintext highlighter-rouge">sleep(6)</code> between calls, which worked but was fragile. We later replaced it with a retry-with-backoff strategy that is both more correct and more efficient.</p>

<h2 id="choosing-the-data-warehouse-motherduck">Choosing the Data Warehouse: MotherDuck</h2>

<p>Once I knew the data source, I needed a place to store it. The options I considered were:</p>

<ul>
  <li><strong>BigQuery</strong> — the obvious choice for cloud data warehousing. Free tier is generous. But the Python client, the IAM setup, the service account JSON files — it adds friction before you have written a single query.</li>
  <li><strong>Snowflake</strong> — industry standard, but the free trial expires and then you are paying.</li>
  <li><strong>DuckDB local</strong> — fast, zero setup, perfect for development. But the data only lives on your laptop, which rules out a public dashboard.</li>
  <li><strong>MotherDuck</strong> — DuckDB in the cloud. The free tier gives you <strong>10 GB of storage</strong> and a managed DuckDB instance accessible via a token. Zero infrastructure to manage. The Python client is just <code class="language-plaintext highlighter-rouge">duckdb</code> with a <code class="language-plaintext highlighter-rouge">md:</code> prefix on the connection string.</li>
</ul>

<p>MotherDuck won immediately. The developer experience is exceptional: you connect with a token, you write standard DuckDB SQL, and your data persists in the cloud. There is a web UI, a CLI, and it integrates natively with Evidence.dev (which I will get to later). For a side project, it removes every infrastructure concern that would otherwise become a time sink.</p>
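
<p>Connecting really is that small. A sketch, with the token read from the environment (the exact variable name is whatever you configured):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import duckdb

# The md: prefix routes an otherwise ordinary DuckDB connection to MotherDuck.
con = duckdb.connect(
    f"md:superligaen_dev?motherduck_token={os.environ['MOTHERDUCK_TOKEN']}"
)
print(con.execute("SELECT 42").fetchone())
</code></pre></div></div>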

<p>The one thing MotherDuck does not tell you upfront is that the free plan runs on a <strong>Pulse</strong> tier compute node with a 953 MB memory cap. That limit would come back to bite us several weeks later in a way that was not obvious at all, but more on that in a later post.</p>

<h2 id="two-environments-from-the-start">Two Environments from the Start</h2>

<p>One decision I made early that saved a lot of grief: set up two separate databases on MotherDuck — <code class="language-plaintext highlighter-rouge">superligaen</code> for production and <code class="language-plaintext highlighter-rouge">superligaen_dev</code> for development. Every pipeline run, every dbt model, every SQL change would first be tested against <code class="language-plaintext highlighter-rouge">superligaen_dev</code> before being pointed at prod. This is standard practice in professional data engineering but easy to skip on a side project. I am glad I did not skip it.</p>

<p>The GitHub Actions workflows have a <code class="language-plaintext highlighter-rouge">target_db</code> parameter so you can point any run at either database. The dbt profiles have explicit <code class="language-plaintext highlighter-rouge">dev</code> and <code class="language-plaintext highlighter-rouge">prod</code> targets. This separation meant I could break things in <code class="language-plaintext highlighter-rouge">superligaen_dev</code> freely — and I broke things constantly — without ever risking the production data.</p>

<h2 id="what-we-are-building">What We Are Building</h2>

<p>At this point the stack was: <strong>api-football.com</strong> as the source, <strong>MotherDuck</strong> as the warehouse, <strong>Python</strong> for ingestion, and some form of transformation and dashboard yet to be decided. The architecture I had in mind was a <strong>medallion architecture</strong>: three layers.</p>

<ul>
  <li><strong>Bronze</strong> — raw JSON from the API, stored as-is in MotherDuck. One table per API endpoint. No transformation, no validation, just a faithful copy of whatever the API returned.</li>
  <li><strong>Silver</strong> — cleaned, typed, structured relational tables. Each bronze table gets flattened, nulls handled, types cast correctly.</li>
  <li><strong>Gold</strong> — a Kimball dimensional model. A fact table and a set of dimension tables designed for analytical queries and dashboard consumption.</li>
</ul>

<p>That design held throughout the project. The tools used to implement each layer changed significantly, but the three-layer architecture never did.</p>

<p>Next: building the bronze layer.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><category term="architecture" /><summary type="html"><![CDATA[Choosing the Data Source]]></summary></entry><entry><title type="html">The Idea — Why I Built This</title><link href="https://saugki1773.github.io/data-engineering-blog/2026/04/09/the-idea.html" rel="alternate" type="text/html" title="The Idea — Why I Built This" /><published>2026-04-09T00:00:00+00:00</published><updated>2026-04-09T00:00:00+00:00</updated><id>https://saugki1773.github.io/data-engineering-blog/2026/04/09/the-idea</id><content type="html" xml:base="https://saugki1773.github.io/data-engineering-blog/2026/04/09/the-idea.html"><![CDATA[<p>Every project starts somewhere. This one started with two things that happened to collide at the right moment.</p>

<h2 id="the-constraint">The Constraint</h2>

<p>I wanted to build an end-to-end data engineering project. Not a tutorial, not a sandbox — something real, with actual data, running in production. The full stack: ingestion, transformation, storage, a live dashboard.</p>

<p>The constraint I gave myself was simple: free, open source tools only. I wasn’t going to spend money on a personal project just to prove I could build something. If the tools couldn’t hold their own without a credit card, I’d find different tools.</p>

<h2 id="the-data-problem">The Data Problem</h2>

<p>The harder constraint was the data. I needed something real. Not synthetic, not historical-only, not something I’d lose interest in halfway through. Weather data felt generic. Titanic and movies felt like tutorial territory. I wanted data I’d actually want to look at after the build was done.</p>

<p>Then I remembered that I love football.</p>

<p>Football data from a major European league would mean: a live API with ongoing updates, a season structure that maps cleanly to a data model, and enough statistical depth to make analytics interesting. Fixtures, results, lineups, referee assignments, standings — the shape of the data is almost purpose-built for a star schema.</p>

<h2 id="the-personal-angle">The Personal Angle</h2>

<p>The league was the next question. I’d recently moved to Denmark. I follow the big European leagues like most football fans, but I realised I knew almost nothing about Danish football — the players, the clubs, the rivalries, how the season works.</p>

<p>That’s the moment the project clicked into focus. I wasn’t just going to build a data engineering showcase. I was going to build something I’d actually use: an analytics product for Superligaen, the Danish premier football league, aimed at people like me who want to understand it better.</p>

<p>That’s how this started. A love of data engineering, a constraint around cost, and a new country I wanted to understand through the sport I already loved.</p>

<p>The rest is what went wrong — and eventually right.</p>]]></content><author><name>Salih Ugur Kımıllı</name></author><category term="data-engineering" /><summary type="html"><![CDATA[Every project starts somewhere. This one started with two things that happened to collide at the right moment.]]></summary></entry></feed>