From Mic Blip to Wiki Page: Meet Hubert, Our Self-Writing Meeting Machine
Hubert is our self-writing meeting machine, inspired by the legendary OSS 117. Like Hubert Bonisseur de la Bath, it navigates chaos with effortless charm, turning audio into structured notes. No buttons, no fuss. Just results, delivered with the panache of Jean Dujardin’s iconic wink.
As a proud French person, I’ll admit I have strong opinions about meetings. It’s not that the people in them are the issue. My colleagues are wonderful. But let’s be honest: meetings are often an inefficient way to share information. You gather, discuss, and someone promises to “send a recap later.” The recap gets lost in the shuffle, and two weeks later, you’re revisiting the same topic because no one can recall the decisions made. It’s a cycle that can leave even the most patient among us feeling drained.

We build software at Kalvad. That means we talk to clients, argue among ourselves, break things, fix things, and make a lot of decisions out loud. Those decisions are valuable. They evaporate the second the call ends. For years we tried the classic remedies: assigned note-takers, shared docs, post-meeting summaries on Slack. Every single one of them relies on a human who is motivated, disciplined, and not too tired after a long call to actually write things down. Spoiler: that human does not exist.
So we built Hubert.
What Hubert Actually Is

Hubert is a pipeline that runs quietly on our machines, notices when we're in a meeting, records it, transcribes it, asks an LLM to extract the important bits, and publishes everything to our internal wiki. Titled, dated, structured, and linked to the original audio. No buttons. No "start recording." No "hey AI, summarize this for me." We open a call, we have the call, we close the call, and a few minutes later a new page appears in our Outline wiki with a title, a summary, a todo list, minutes, and a playable MP3 attached.
That's the whole pitch. The point of Hubert is that the user does nothing. If you have to click anything, the tool has already failed.
The Architecture, in One Picture

Hubert is two services talking to each other through a message queue:
[ bonisseur ] ───▶ AMQP ───▶ [ hubert (4 workers) ]
  mic watcher                  convert → transcribe → analyze → publish
bonisseur runs on your laptop/desktop. It watches your microphone via PulseAudio/PipeWire, and when any application grabs the mic, it starts an ffmpeg recording in the background. When the mic goes idle, it stops ffmpeg and drops a message on the queue.
hubert is a daemon with four workers. Each worker owns one stage: MP3 conversion, transcription, LLM analysis, and Outline publishing. They talk to each other exclusively through AMQP queues. Nothing is shared between them except files on disk.
Both are written in Crystal. Yes, the Ruby-looking language with real types and a real compiler. We'll come back to that choice, because it's more interesting than it sounds.
Step One: Catching the Meeting

The first problem is the dumbest and also the hardest: how do you know a meeting is happening?
We tried several approaches. Calendar integration is brittle. People move meetings, reschedule them multiple times, impromptu calls happen, and anything "critical" never makes it onto the calendar until after the fact. Detecting specific apps (Meet, Zoom, Jitsi, Teams) means maintaining a list forever. Global hotkeys rely on the human being disciplined. None of that works.
Then we realized the answer was already on the system. On Linux, PulseAudio (and PipeWire, via the Pulse compatibility layer) exposes a subscribe API that tells you every time an application opens, closes, or changes a source input. In shell terms:
pactl subscribe
That command streams events forever. One of them is Event 'new' on source-output. The moment Google Meet grabs your mic to record you, bonisseur sees it, spawns an ffmpeg process, and starts writing a WAV file. When the source-output disappears, bonisseur stops ffmpeg.
That's it. That's the meeting detector. No calendar hooks, no browser extensions, no "please install this Zoom plugin." The mic itself is the source of truth, because it's the thing that cannot lie: if something is recording you, you are in a meeting.
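The real watcher is written in Crystal, but the core logic fits in a few lines of any language. Here is a minimal Python sketch of the state machine: it parses `pactl subscribe` lines (the `Event 'new' on source-output #42` format that pactl actually emits), tracks which source-outputs are live, and reports the start/stop transitions where the recorder would be spawned or killed. The class and method names are mine, for illustration.

```python
import re

# Matches lines like: Event 'new' on source-output #42
EVENT_RE = re.compile(r"Event '(\w+)' on source-output #(\d+)")

class MicWatcher:
    """Tracks live source-outputs; the mic is 'in use' while any app holds one."""

    def __init__(self):
        self.active = set()
        self.recording = False

    def handle(self, line):
        """Feed one pactl subscribe line; returns "start"/"stop" on a transition."""
        m = EVENT_RE.search(line)
        if not m:
            return None
        event, idx = m.group(1), int(m.group(2))
        if event == "new":
            self.active.add(idx)
        elif event == "remove":
            self.active.discard(idx)
        if self.active and not self.recording:
            self.recording = True
            return "start"   # this is where bonisseur spawns ffmpeg
        if not self.active and self.recording:
            self.recording = False
            return "stop"    # and where it stops it, then publishes to AMQP
        return None
```

In practice you would pipe the stdout of `subprocess.Popen(["pactl", "subscribe"], ...)` into `handle` line by line.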
The ffmpeg invocation looks roughly like this:
ffmpeg -y \
  -f pulse -i default \
  -f pulse -i @DEFAULT_MONITOR@ \
  -filter_complex "amix=inputs=2" \
  -ac 1 -ar 16000 \
  -metadata "recording_id=$RECORDING_ID" \
  -metadata "start_date=$START_DATE" \
  output.wav

Notice the -metadata flags. We embed the recording UUID and the start timestamp directly into the WAV file. This is a recent and very deliberate change. Previously, we passed all of that context through AMQP messages, which meant every worker had to agree on a message schema, every change meant a migration, and every typo in that schema corrupted a recording's identity. Now the recording is its own metadata. The WAV file knows who it is and when it was born. Every downstream step just runs ffprobe and asks the file directly.
This is one of those "obvious in hindsight" design choices. The media file is the single source of truth. The queue is just a notification: something happened in this directory, go look.
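"Ask the file directly" looks like this in practice. The sketch below (Python, not the Crystal original) shells out to `ffprobe -print_format json -show_format`, whose JSON output nests the embedded tags under `format.tags`, and normalizes the tag keys, since different containers can return them in different cases:

```python
import json
import subprocess

def extract_tags(ffprobe_json):
    """Pull the embedded tags out of ffprobe's JSON output."""
    fmt = json.loads(ffprobe_json).get("format", {})
    # Tag keys can come back in different cases depending on the container.
    return {k.lower(): v for k, v in fmt.get("tags", {}).items()}

def probe_metadata(path):
    """Ask a media file who it is: recording_id, start_date, etc."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", path],
        capture_output=True, check=True,
    ).stdout
    return extract_tags(out)
```

Every worker in the pipeline can call something like `probe_metadata(...)` on whatever file it finds in the job directory and recover the recording's identity with no message schema involved.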
Note: none of this removes the obligation to obtain the explicit consent of all participants before recording a meeting.
Step Two: The Pipeline

Once bonisseur has a WAV file, it publishes a single AMQP message containing one thing: the directory path. That's all the message carries. The worker that picks it up knows what to do by knowing which queue it lives on.
Hubert runs four workers in a single process (four fibers, thanks to Crystal's cheap concurrency):
- convert writes MP3 from WAV
- transcribe turns MP3 into text with speakers
- analyze turns text into a title, a summary, todos, and minutes
- publish ships everything to Outline
Each worker consumes from one queue and publishes to the next. Between them, the filesystem is the database. Each stage writes its artifact into the same ~/hubert/YYYY/MM/DD/<uuid>/ directory. If a worker restarts mid-job, it reads the directory, sees which files already exist, and either resumes or skips. Idempotency for free, no external state store, no locks, no Redis, no Postgres. Just files and messages.
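The "existence is the completion marker" rule can be sketched in a few lines. This is a Python illustration, not the Crystal code; the stage-to-artifact mapping is my reading of the directory listing shown later in the post:

```python
from pathlib import Path

# Stage -> the artifact it must produce. If the artifact exists, the stage is done.
STAGES = [
    ("convert",    "recording.mp3"),
    ("transcribe", "transcription.md"),
    ("analyze",    "summary.md"),
    ("publish",    "outline_id"),
]

def pending_stages(job_dir):
    """A restarted worker resumes from the first missing artifact."""
    job_dir = Path(job_dir)
    return [name for name, artifact in STAGES if not (job_dir / artifact).exists()]
```

No row in a database ever says "transcription in progress"; either `transcription.md` is on disk or it isn't.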
convert

The convert worker is the boring one, and boring is good. It runs ffmpeg with libmp3lame -q 2, which gets us roughly 190 kbps VBR. Plenty for speech, small enough to upload. Crucially, it re-maps the WAV metadata onto the MP3 so the recording_id and start_date survive the transcoding. The ffprobe-based metadata contract is preserved end to end.
That's literally the whole worker. It's forty lines of Crystal plus error handling. Boring. Good.
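For the curious, the heart of that worker is just assembling an ffmpeg argv. A hedged Python sketch (the real worker is Crystal; the function name is mine) — the important flag is `-map_metadata 0`, which is what carries `recording_id` and `start_date` from the WAV onto the MP3:

```python
def convert_argv(wav_path, mp3_path):
    """ffmpeg argv for WAV -> MP3 at VBR quality 2, keeping the embedded tags."""
    return [
        "ffmpeg", "-y",
        "-i", wav_path,
        "-map_metadata", "0",      # copy recording_id / start_date onto the MP3
        "-codec:a", "libmp3lame",
        "-q:a", "2",               # ~190 kbps VBR, plenty for speech
        mp3_path,
    ]
```

Hand that list to your process spawner of choice and the metadata contract survives the transcoding.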
Transcribe, where we met Gladia

Now we get to the interesting bit. For a long time we were fine with nothing. We'd just ship MP3s to the wiki and call it a day. The problem is that a meeting recording you can't grep is worse than no recording at all. Nobody opens a 50-minute MP3 to find "what did we decide about the staging env last Tuesday." You need text, and the text has to be good enough that you can search it and trust it.
We went shopping for transcription APIs. We had three hard requirements:
- Diarization. Speaker labels are non-negotiable. A transcript without "who said what" is just a wall of text with no narrative.
- Multilingual with code-switching. Our team speaks English, French, and Arabic, and we switch between them mid-sentence. An engine that picks one language at the start of the call and sticks with it will mangle 70% of our meetings.
- Quality good enough to actually trust. If we have to re-listen to the audio to double-check every transcript, the transcript has failed. It has to be correct often enough that we can skim it.
Now, if you've read any of our previous posts, you already know where our instincts live: we love self-hosting. We run our own CI, our own dashboards, our own logs, our own everything we reasonably can. Our default answer to "should we use a SaaS for this?" is "no, let's host it." So the honest first move was to try every open-source speech-to-text model we could get our hands on. Whisper in all its sizes, the faster-whisper variants, WhisperX with pyannote diarization bolted on, a couple of the newer open models that were getting noisy on Hugging Face, and a few self-hosted wrappers around them. We genuinely wanted one of these to win. Our GPU was sitting right there, already paid for, already running Qwen 3.5. Transcription would have been the obvious next tenant.
It didn't work out. Not for lack of trying. The monolingual English benchmarks are lovely, and for a single clean speaker reading a podcast script, most of these models are excellent. Put them in a room with three of us switching between French, English, and Arabic on a slightly bad line, and the wheels come off. Diarization bolted on as a separate stage is fragile and disagrees with itself between runs. Code-switching mid-sentence confuses language ID. Arabic, in particular, is where the gap between "impressive demo" and "usable in production" is the widest. We ended up with transcripts that looked almost right, which is the worst possible outcome, because "almost right" is what hallucinations look like, and you can't skim a transcript you don't trust.
So we did the thing we normally grumble about and went shopping for a hosted API. After a few trials, we landed on Gladia. Gladia does diarization natively, handles our three languages with code-switching enabled, and, crucially, the transcripts it returns are actually correct often enough to be useful without a re-listen. The API is exactly the kind of thing you want when you're building a pipeline like this one: you POST an audio URL (or upload the file), you poll for a result, you get back a structured JSON blob with speakers, timestamps, and text. No drama. No surprise billing. No "please contact sales to unlock this feature." It just works.
To be very clear, this is not us giving up on self-hosting. It's us being honest about where the state of the art currently sits for the specific, unglamorous problem of multilingual code-switched meeting audio. We will absolutely revisit this the moment an open model crosses the quality bar, and we'll evict Gladia from the pipeline the same way we eventually evicted OpenRouter from the analyze worker. Until then, we'd rather use the tool that tells us the truth about what was said in the room than cling to a principle that leaves us with transcripts we can't trust.
Our transcribe worker is basically:
response = Gladia.transcribe(
  audio_url: url,
  diarization: true,
  enhanced: true,
  languages: ["en", "fr", "ar"],
  code_switching: true,
  sentiment_analysis: true,
)
We enable sentiment analysis because it's cheap and occasionally hilarious. Seeing a meeting tagged as "frustrated (93%)" is a surprisingly honest retrospective. We also tell Gladia to favor enhanced diarization, which is slower but noticeably better when four people on a bad line talk over each other.
The worker polls every five seconds with a sixty-minute cap. If Gladia doesn't come back in 60 minutes, the message becomes a DLQ candidate. More on that in a second. Output is written as a markdown file with one line per speaker segment:
**Speaker 0** [00:12]: alors on reprend le point sur l'infra
**Speaker 1** [00:15]: yes, le staging is still down
That's a real line from a real call, and yes, that's how we talk. Gladia handles it without complaint, which was the main thing we were looking for.
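The polling logic itself is deliberately dumb. Here is a Python sketch of the shape of it (the Crystal worker does the same dance); `fetch` stands in for "ask Gladia for the job status", and the sleep/clock parameters exist only so the loop can be exercised without waiting an hour:

```python
import time

def poll_until_done(fetch, interval=5.0, timeout=3600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll a status callable every `interval` seconds until it returns a result.
    After `timeout` seconds, give up so the message can head toward retry/DLQ."""
    deadline = clock() + timeout
    while clock() < deadline:
        result = fetch()
        if result is not None:
            return result
        sleep(interval)
    raise TimeoutError("transcription did not finish within the time cap")
```

With the defaults this is exactly the "every five seconds, sixty-minute cap" behavior described above; the `TimeoutError` is what turns the message into a DLQ candidate.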
Analyze, the Qwen 3.5 part

A transcript is raw material. To be useful, it has to become a title, a summary, a list of action items, and, for the more formal meetings, minutes. This is a classic LLM job, and it's where we have the most fun.
We use Qwen 3.5 for this stage, and, importantly, we run it locally on a single NVIDIA card sitting in a machine in our office. That was not where we started. The first version of the analyze worker pointed at OpenRouter, because OpenRouter is by far the easiest way to try ten models on a Tuesday afternoon without reading ten sets of API docs. We spent a couple of weeks there, hopping between Claude, GPT, Llama variants, a few Mistral models, and eventually Qwen. OpenRouter is genuinely great for that kind of shopping trip. We are never going back to the days of a separate SDK per provider.
Once the shopping was done, though, the numbers started pointing in an uncomfortable direction. We were running Hubert on dozens of meetings a week, each meeting triggered four LLM calls, and the transcripts for long calls are not small. Even at OpenRouter's pleasant pricing, the bill had a trajectory we did not love, and more importantly, every one of those calls sent a full meeting transcript, sometimes with client names, internal project names, and candid opinions, to a third party. We trust the providers involved. We trust our own hardware more.
So we bought a GPU. Nothing exotic, just a single NVIDIA card, the kind of machine you can put in a corner and forget about. We pulled Qwen 3.5 with Ollama, which is honestly the most boring and the most correct way to run a local model in 2026: ollama pull, ollama serve, done. Ollama speaks the OpenAI-compatible chat completions dialect out of the box, so the analyze worker just had its base URL pointed at the Ollama endpoint and kept going. The only line in the code that had to change was that URL. Everything else, the prompts, the temperature, the retries, the four separate calls, stayed exactly the same. That is the quiet superpower of the OpenAI-compatible protocol: even if you never touch the OpenAI API, half the ecosystem still talks to you.

Qwen 3.5 won the final bake-off for two reasons. First, its French and Arabic handling is genuinely strong, better than we expected from an open-weights model, and the code-switched transcripts Gladia produces do not confuse it. Second, it runs comfortably on the hardware we actually have, with enough headroom to stay responsive even when a transcription lands at the same time as someone else is asking the same GPU to do something else. When you're processing tens of meetings a week on your own silicon, those two properties together matter more than any benchmark.
The worker makes four separate calls to Qwen 3.5, one per artifact. We explicitly do not try to do it all in one prompt. The reason is simple: a focused prompt beats a clever prompt every time. If you ask a model for "a title, a summary, a todo list, and formal minutes, in JSON, in this exact schema," you will spend half your life fighting JSON repairs. If you ask it for "a short descriptive title, maximum ten words, no quotes, no explanation," you get back a short descriptive title.
Here's a trimmed version of the title prompt:
You are a meeting assistant. Read the transcript below and generate
a short, descriptive title. Max 10 words. Return ONLY the title, no
quotes, no markdown, no explanation. If the transcript is empty or
off-topic, return an empty string.
Three things to notice. First, we're extremely bossy about format. We don't politely ask for "just the title please"; we tell the model what it is not allowed to do. Second, we pre-commit to an escape hatch: "return an empty string." That single line avoids an entire class of hallucinations, because the model is now explicitly allowed to say "nothing." Third, we set the temperature to 0.3. Low enough to be consistent across reruns, high enough to not strangle the model's phrasing.
The summary prompt is longer but built the same way: a role, a list of dos, an aggressive list of don'ts, and an escape hatch. The todos prompt asks for a markdown checklist and explicitly forbids inventing action items that weren't actually assigned. The minutes prompt is the most formal and is only used for longer, multi-speaker recordings. Short one-on-ones don't get minutes, because turning a five-minute chat into "formal meeting minutes" makes the model invent ceremony that never happened.
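Bossy prompts still deserve a paranoid receiver. A small Python sketch of the kind of post-validation that pairs with the title prompt above (this helper is my illustration, not the Crystal code): enforce the contract after the model answers, and treat the escape hatch as a first-class result.

```python
def clean_title(raw, max_words=10):
    """Enforce the title contract: strip stray quotes and markdown, honor the
    empty-string escape hatch, reject anything over the word budget."""
    title = raw.strip().strip('"').strip("'").lstrip("#").strip()
    if not title:
        return None   # the model took the escape hatch: no title for this page
    if len(title.split()) > max_words:
        return None   # contract violated; the caller can simply re-ask
    return title
```

The point is symmetry: the prompt tells the model what it is not allowed to do, and the validator refuses to accept it anyway. Cheap insurance against the one run in fifty where the model gets chatty.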
We talk to Qwen 3.5 directly now, not through a router, because the router lives on our desk. OpenRouter was the right tool for the exploration phase, when we needed to try fifty models a day without pain. Local inference is the right tool for the steady state, when we know exactly which model we want and we'd rather keep the transcripts, the latency, and the bill under our own roof. The analyze worker does not know the difference. It just POSTs a chat completion and gets one back. Sometimes boring is better.
Publish

By the time a message reaches the publish worker, the directory on disk looks like this:
~/hubert/2026/04/13/a3f9.../
├── recording.wav
├── recording.mp3
├── transcription.md
├── title.txt
├── summary.md
├── todos.md
└── minutes.md
The publish worker reads these files, extracts the start date from the MP3's embedded metadata, and creates a hierarchical document tree in our Outline wiki:
hubert (collection)
└── 2026
    └── 04
        └── 13
            └── "Infra sync: staging rollout and DNS cleanup"
                ├── Transcription
                ├── Todo List
                └── Meeting Minutes

The parent documents (year, month, day) are created on demand if they don't already exist. The main document carries the AI-generated title, the summary as its body, and a markdown link to the MP3, which is uploaded as an Outline attachment. The three sub-documents hold the transcription, todos, and minutes.
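"Created on demand" is a classic get-or-create walk. The sketch below is deliberately abstract Python, not Outline's actual API surface: `api` stands in for any client exposing a lookup and a create call, and the method names are mine.

```python
def ensure_path(api, collection_id, parts):
    """Walk year/month/day, creating each parent document only if missing.
    Returns the id of the deepest document (the day), ready to parent the page."""
    parent = collection_id
    for title in parts:
        parent = api.get_child(parent, title) or api.create(parent, title)
    return parent

class FakeAPI:
    """In-memory stand-in for a wiki client, to show the walk is idempotent."""
    def __init__(self):
        self.children = {}   # (parent_id, title) -> doc_id
        self.creates = 0
    def get_child(self, parent, title):
        return self.children.get((parent, title))
    def create(self, parent, title):
        self.creates += 1
        doc_id = f"doc{self.creates}"
        self.children[(parent, title)] = doc_id
        return doc_id

api = FakeAPI()
day_a = ensure_path(api, "root", ["2026", "04", "13"])
day_b = ensure_path(api, "root", ["2026", "04", "13"])  # second meeting, same day
```

Two meetings on the same day land under the same `13` document instead of spawning a duplicate tree.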
The last thing the publish worker writes, and this is important, is a file called outline_id. That file is how we know the job is done. If anything crashes before outline_id is written, the job is retried from whichever stage failed, and idempotency kicks in: workers see the existing artifacts on disk and don't redo work. If outline_id exists, publishing is skipped entirely. The filesystem is the ledger.
Retries, Dead Letters, and the Art of Not Panicking

Every main queue has two friends: a .retry queue and a .dlq queue. When a worker fails, the message is republished to the retry queue with a TTL. When the TTL expires, it bounces back to the main queue with an incremented x-retry-count header. After five tries, at 5s, 10s, 20s, 40s, 80s intervals, the message lands in the DLQ with the last error embedded in its headers and the original queue recorded for context.
Why bother? Because APIs are flaky, and "local" is not a synonym for "reliable." Gladia occasionally times out. The Qwen 3.5 box occasionally gets saturated and answers slowly, or reboots itself after a kernel update we forgot about. Outline occasionally decides it doesn't recognize your token for four seconds. If your pipeline explodes every time any of those things happen, you'll spend all day babysitting it instead of doing your job. The retry/DLQ pattern means 99% of failures heal themselves, and the 1% that don't sit quietly in a queue waiting for a human to look at them. We check the DLQ roughly once a week. It's usually empty.
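The routing decision behind all of that fits in one function. A Python sketch of the policy described above (the Crystal workers implement the same schedule; the function name is mine): exponential TTLs of 5s, 10s, 20s, 40s, 80s, then the DLQ.

```python
MAX_RETRIES = 5
BASE_DELAY_S = 5

def route_failure(retry_count):
    """Given the x-retry-count header of a failed message, return where it goes:
    ("retry", ttl_seconds) for another bounce, or ("dlq", 0) after five tries."""
    if retry_count >= MAX_RETRIES:
        return ("dlq", 0)
    return ("retry", BASE_DELAY_S * 2 ** retry_count)
```

The republish then sets the TTL on the retry queue and increments the header; everything else is plain AMQP plumbing.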
The Sentry integration bolts on top of all this. Every exception a worker throws, including the ones that are going to be retried, gets reported with the directory path attached, so we can find the offending recording on disk in two clicks. That has been disproportionately useful during debugging, especially in the early weeks when we were still discovering the edge cases of our own transcription pipeline.
Why Crystal?

I can already hear the objections. Crystal? In 2026? Why not Go, Python, Rust, TypeScript, anything with an ecosystem?
Because Crystal is genuinely lovely to write in, and because this project is exactly the right size for it. The whole of Hubert (both services, all four workers, the API clients, the retry logic, the notification layer) is a few thousand lines of code. There's no framework. There's a standard library, a compiler that catches your mistakes, and a syntax that looks like Ruby if Ruby had grown up and gotten a type system. The binaries are single files. Deployment is scp plus a systemd unit. There is no runtime to install on the target machine, no virtualenv to manage, no node_modules folder quietly eating your SSD.
For a pipeline this size, that's the entire selling point. We did not need distributed tracing, a plugin system, or a "Hubert Enterprise." We needed two daemons that start on boot, don't leak memory, and don't fall over when an API returns 502. Crystal gave us that without making us write it in a language we'd rather not write in.
Would I pick Crystal for a web app with fifty contributors? Probably not. Would I pick it again for the next little daemon we need to write? Probably yes. But I also know that "probably yes" is not quite "definitely yes," and we'll come back to that at the end.
Things We Learned the Hard Way

A few things we did not expect.
Silent recordings are a real category. About 5% of our WAV files are completely silent, because the mic was opened but nobody actually spoke. An aborted call, a muted participant, a "checking your hardware" moment in Google Meet. We now skip any recording under a minimum duration and size threshold before it even hits Gladia. We used to pay to transcribe empty files. We don't anymore.
Titles are the product. People don't read the summaries. People don't read the minutes. People skim the sidebar of the wiki, read the titles, and click on the one that looks relevant. If the title is bad, the whole page might as well not exist. We spent more time tuning the title prompt than any other part of the system, by an embarrassing margin.
Filesystem-as-state is underrated. Every time we started sketching a "proper" state store for this, the design got worse. Files on disk, with existence as the completion marker, is simple, debuggable, and survives restarts with zero ceremony. For a single-machine or small-cluster pipeline, it's great. For a global SaaS, obviously not. But we are not building a global SaaS. We are building a butler.
Embedding metadata in the media file is a superpower. Being able to ffprobe any WAV or MP3 in our pipeline and recover its identity has saved us from data loss more than once. It's the single best architectural decision we made, and it cost us about ten lines of code.
Where Hubert Goes Next

We have a shortlist. Automatic detection of "private" meetings that shouldn't be published at all. Per-speaker voice identification, so the wiki says "Loïc" instead of "Speaker 0" (this is harder than it sounds and we're not rushing it). A "last week in meetings" digest posted to our internal chat every Monday morning. Maybe a small frontend so people can browse and search without opening Outline. Maybe none of those, if they turn out to be features nobody actually wants.
The point of Hubert is that it does its job and gets out of the way. Every feature we add has to pass the same test: does this make the butler more useful, or does it turn the butler into a project I have to maintain? If it's the second one, we don't build it.
Meetings are never going to stop. That's fine. We just refuse to let them disappear anymore.
A Word on the Code

One thing I have not mentioned yet: at the moment, Hubert is closed source. It started as an internal Kalvad tool and grew organically, which means there are still a few Loïc-shaped assumptions baked into it that we'd want to sand down before throwing it at strangers. Configuration paths assume Arch Linux. The desktop notifications assume notify-send. The Outline collection is hardcoded to hubert. None of that is hard to fix, it's just that we never had to.
We're also seriously considering a full rewrite in Zig before publishing it. Not because Crystal has failed us, it absolutely has not, but because Zig has been nagging at us for a while and a project this size is exactly the right sandbox to learn it properly. Zig gives us the same "single static binary, no runtime, no surprises" story that Crystal does, with tighter control over memory and a cross-compilation story that is frankly a little embarrassing for everyone else. It also forces us to be honest about every allocation, which for a long-running daemon that shuffles audio files around is more of a feature than a chore.
There's a less noble reason too: we just want to write some Zig. If you've ever rewritten a working tool in a new language purely because the new language is fun, you already know exactly what I mean.
When that rewrite lands, and when we've scrubbed out the Loïc-shaped assumptions, we plan to open it up. Until then, if you want your own Hubert, you can absolutely build it yourself. The moving parts are all standard: a mic watcher, a queue, a transcription API, an LLM, a wiki with an API. The hard part is not the code. The hard part is deciding that the human in the loop should not be the one taking notes.
Bring your own Gladia keys and Qwen 3.5. Bring patience for the first week. Bring a mic.
Where Do Those Names Come From?
Hubert Bonisseur de la Bath, OSS 117, and Jean Dujardin: A Cultural Icon
Hubert Bonisseur de la Bath is the fictional secret agent known as OSS 117, created by French writer Jean Bruce in 1949. Originally an American agent of French descent, the character was reimagined in the 2006 film OSS 117: Cairo, Nest of Spies as a French agent working for the SDECE (France’s external intelligence service) during the Cold War. This transformation was spearheaded by director Michel Hazanavicius and actor Jean Dujardin, who brought a unique blend of humor, charm, and cluelessness to the role, turning OSS 117 into a cult figure of French comedy.
In short, Hubert Bonisseur de la Bath is a symbol of French comedic brilliance, and Jean Dujardin’s portrayal cements his status as one of the greatest actors of his generation, rightfully earning his place alongside legends like Louis de Funès.