Deployment intelligence platform

Heimdall

The dashboard the platform team checks every morning. Answers one question — 'where is my ticket right now?' — across 17 services and four environments.

2025 → ongoing
20+ engineers, daily
// the problem

Five tabs, one question

Across 17 services and a dev → QA → preprod → prod pipeline, the state of any given ticket is scattered. The commit's in Bitbucket. The desired state is in the GitOps repo. The pods are in Kubernetes. The tests are in Sentry. The ticket is in JIRA.

Heimdall started life as a small Python service that exposed DORA counters to Prometheus — handy for leadership, but it didn't help anyone shipping a feature on a Tuesday afternoon. So I built a UI on top, and kept building until it was the first tab people opened.

It's now used daily by the engineering team and runs the morning stand-up.

// the product

A short tour

Six pages. Each one answers a question someone's about to ask in Slack.

dashboard
Heimdall dashboard with pipeline stages, last-24h deploys and 30-day rollup
The pipeline at the top — how many tickets are at each stage, and how long each handover takes. Underneath, the last 24 hours of deploys and a 30-day rollup. DORA metrics in a glance, no Grafana detour required.
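The 30-day rollup is two headline DORA numbers: deployment frequency and lead time from merge to deploy. A minimal sketch of that computation, assuming a hypothetical event shape (`merged_at`/`deployed_at` dicts), not Heimdall's actual schema:

```python
from datetime import datetime, timedelta
from statistics import median

def thirty_day_rollup(deploys, now):
    """Summarise deploy events into the two headline DORA numbers.

    `deploys` is a list of dicts with `merged_at` and `deployed_at`
    datetimes -- a hypothetical shape for illustration only.
    """
    window_start = now - timedelta(days=30)
    recent = [d for d in deploys if d["deployed_at"] >= window_start]
    lead_times = [
        (d["deployed_at"] - d["merged_at"]).total_seconds() / 3600
        for d in recent
    ]
    return {
        "deploys_per_day": round(len(recent) / 30, 2),
        "median_lead_time_h": round(median(lead_times), 1) if lead_times else None,
    }
```

Keeping the rollup as a pure function over already-collected events is what lets the page render it without touching Grafana or any upstream API.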
tickets
Heimdall tickets view grouped by environment with stuck callouts
Every open ticket grouped by environment, stuck ones first. The "PRs ready" card surfaces the PRs with approval and green CI just waiting on a merge — usually two or three a day.
environments
Heimdall environments view with per-env activity and SHA matrix
Per-environment cards on top. The matrix below is one row per service, one column per environment — a green cell means the env is on the latest commit, red means it's drifted. This view replaced about five recurring Slack threads.
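The matrix itself is a simple comparison: each cell is green when the environment runs the service's latest commit. A sketch under assumed inputs (the `latest` and `deployed` mappings are hypothetical shapes, not the real data model):

```python
def drift_matrix(latest, deployed):
    """Build the env matrix: one row per service, one cell per environment.

    `latest` maps service -> newest commit SHA; `deployed` maps
    service -> {env: SHA currently running}. Both shapes are assumptions.
    A cell is "green" when the env runs the latest commit, "red" otherwise.
    """
    return {
        service: {
            env: "green" if sha == latest[service] else "red"
            for env, sha in envs.items()
        }
        for service, envs in deployed.items()
    }
```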
environment detail
Heimdall environment detail with promotion-ready services and per-service health
Drilldown for one environment. "Ready to promote" lists the services where the next env can safely take the new commit. Below that, per-service health, error rate, p95, and pod resource pressure.
pull requests
Heimdall pull request triage sorted ready, CI failing, needs review, stale
The same PRs Bitbucket has, but sorted by what unblocks shipping rather than what's most recent. Ready first, then CI failing, then needs review, then stale.
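That ordering is just a fixed priority over PR states instead of a timestamp sort. A minimal sketch, assuming each PR carries a precomputed `state` field (a hypothetical shape; the real triage derives the state from approvals and CI results):

```python
# Priority of each triage bucket: what unblocks shipping comes first.
TRIAGE_ORDER = {"ready": 0, "ci_failing": 1, "needs_review": 2, "stale": 3}

def triage(prs):
    """Sort PRs by triage bucket rather than recency."""
    return sorted(prs, key=lambda pr: TRIAGE_ORDER[pr["state"]])
```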
activity
Heimdall activity feed of deploys, syncs and promotions
A chronological feed of every deploy, sync, and promotion. The first place you look during an incident, and a useful way to open stand-up.
// architecture

How it's built

One Python service. A background job pulls from the upstream sources every ten minutes and writes everything down — once into a database, once into an in-memory cache the web app reads from. The web app itself does no fetching, no joins, no slow work. That's the whole trick. Pages stay fast under load because the work happens elsewhere.
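The read/write split can be sketched in a few lines. This is a simplification under assumed names (`snapshots` table, `fetch_all` standing in for the Bitbucket/GitOps/Kubernetes/JIRA pulls), not the actual implementation:

```python
import sqlite3
import threading
import time

CACHE = {}                    # in-memory snapshot the web app reads
CACHE_LOCK = threading.Lock()

def collect_once(db, fetch_all):
    """One collector pass: fetch upstream state, persist it, refresh the cache.

    `fetch_all` stands in for the upstream pulls; the schema is a
    hypothetical simplification.
    """
    snapshot = fetch_all()
    # Write it down twice: once durably, once for fast reads.
    db.execute(
        "INSERT INTO snapshots (taken_at, payload) VALUES (?, ?)",
        (time.time(), repr(snapshot)),
    )
    db.commit()
    with CACHE_LOCK:
        CACHE.clear()
        CACHE.update(snapshot)

def read_view(key):
    """What the web app does on every request: read the cache, nothing else."""
    with CACHE_LOCK:
        return CACHE.get(key)
```

Because request handlers only ever touch the cache, a slow or flaky upstream degrades data freshness, never page latency.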

Heimdall — system overview

sources: Bitbucket (commits · PRs) · GitOps repo (desired state) · Kubernetes (what's running) · JIRA (tickets)
background work: Background collector (runs every 10 minutes, does all the heavy lifting)
stores: Database (durable history of everything) · In-memory cache (what the UI reads)
web: Web app (UI · API · metrics; only ever reads)

The data model thinks of a deployment as a lifecycle, not an event: PR merged → tag updated → pods healthy → tests pass. A database view joins them all into one queryable thing, which is what powers the pages above.
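The lifecycle join can be illustrated with a toy version of that view. Table and column names here are hypothetical, and SQLite stands in for whatever database the real service uses:

```python
import sqlite3

# One table per lifecycle stage; the view stitches them into one
# row per deployment. All names are illustrative, not Heimdall's schema.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pr_merges   (ticket TEXT, merged_at  TEXT);
CREATE TABLE tag_updates (ticket TEXT, tagged_at  TEXT);
CREATE TABLE pod_health  (ticket TEXT, healthy_at TEXT);
CREATE TABLE test_runs   (ticket TEXT, passed_at  TEXT);

CREATE VIEW deployments AS
SELECT m.ticket, m.merged_at, t.tagged_at, p.healthy_at, r.passed_at
FROM pr_merges m
JOIN tag_updates t ON t.ticket = m.ticket
JOIN pod_health  p ON p.ticket = m.ticket
JOIN test_runs   r ON r.ticket = m.ticket;
""")
```

Every page then queries one shape, regardless of which upstream system each column originally came from.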

// design

A few decisions worth flagging

Treat it as a product, not a script

The original DORA collector was a back-end service. Useful, but nobody opened it. The lesson I keep coming back to: if a tool doesn't have a place to look, it doesn't get used. The UI is what made the work matter.

Trust pods, not abstractions

ArgoCD will happily report a service as healthy while its new pods are crashlooping behind the scenes. Heimdall reads pod state directly, which means the dashboard stays honest in the cases that matter most.
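A minimal sketch of that idea: derive a status from pod-level facts instead of a sync-level summary. The input shape and the restart threshold are assumptions for illustration, not what Heimdall actually reads from the Kubernetes API:

```python
def service_status(pods):
    """Derive health from pod state rather than a sync-level summary.

    `pods` is a list of dicts with `ready` (bool) and `restarts` (int) --
    a simplified stand-in for pod status from the Kubernetes API.
    The restart threshold of 3 is an illustrative choice.
    """
    if not pods:
        return "unknown"
    if any(p["restarts"] >= 3 and not p["ready"] for p in pods):
        return "crashlooping"
    if all(p["ready"] for p in pods):
        return "healthy"
    return "degraded"
```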

Make it operable

The README opens with "is it healthy?" and answers it in one curl. Anyone on-call can diagnose Heimdall without reading the code. That's the bar I aim for whenever I hand work to a team.
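The shape of that one-curl answer might look like this. A sketch under assumptions: the 15-minute staleness threshold (one missed 10-minute cycle plus slack) and the payload fields are illustrative, not Heimdall's actual endpoint:

```python
import time

def health(last_run_at, now=None, stale_after=900):
    """Healthy iff the background collector ran recently.

    `stale_after` of 900 s is a hypothetical threshold; the payload
    shape is likewise an assumption for illustration.
    """
    now = time.time() if now is None else now
    age = now - last_run_at
    return {
        "status": "ok" if age < stale_after else "stale",
        "last_collect_age_s": round(age),
    }
```

Tying health to collector freshness means the endpoint catches the failure mode that matters here: a quietly stalled background job serving old data.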

// impact

What changed

17

services tracked

20+

engineers using it daily

10 min

data freshness

1 curl

to know if it's healthy

The numbers I care about most aren't in the table. The team stopped pasting kubectl output into Slack to ask if a deploy worked. Stand-up got shorter. Release management started using the same view as the engineers, which meant fewer dropped tickets at the seams. I'm still finding things to improve.

Thanks for reading.

If any of this resonates — or you want to dig into the parts I didn't write up — drop me a note. Always happy to talk shop.