Can you build your own meeting transcription app?

Yes. Apple's SpeechAnalyzer API provides on-device speech-to-text, and Deepgram offers cloud-based transcription with speaker diarization. Both can be integrated into a macOS app using ScreenCaptureKit for system audio and AVAudioEngine for microphone input.

How much does it cost to run your own transcription service?

With Deepgram's Nova-2 model and speaker diarization, roughly $16-32 per month for a typical meeting schedule of 8 hours per week. Compare that to $14-18 per month for Granola or Notion AI, which include transcription, diarization, and AI summarization.

Is Apple's on-device speech recognition good enough for meeting transcription?

Apple's SpeechAnalyzer produces decent transcriptions but does not support speaker diarization. You cannot run two instances simultaneously, which makes it difficult to separately transcribe microphone and system audio for speaker identification.

What is speaker diarization and why does it matter?

Speaker diarization identifies who said what in a conversation. Without it, every line of your meeting transcript is attributed to a single speaker, which makes the notes much less useful for reference.

I built my own Granola (and why you shouldn't)

Building your own meeting transcription app is a great learning exercise and a terrible financial decision. Granola and Notion AI win on volume pricing that a solo builder cannot match.

I have a habit of building my own versions of apps I use. I built my own Vivino for tracking wine. I built my own MyFitnessPal for tracking food. This week, I built my own Granola.

Granola, if you haven't used it, is a meeting transcription app. It listens to your meetings, transcribes the conversation, identifies who said what, and generates AI-powered notes. It costs $14 a month. I actually use Notion AI, which is similar and bundled with Notion's Business plan. As is the curse of many a nerd, I spend every day searching for a new note-taking app and a new task app, and now, thanks to Claude Code, I am cursed with the ability to build my own.

In any event, my version of Granola does the same thing. Sort of. It took me about two evenings to build, with significant help from Claude Code. It captures microphone audio, captures system audio from Zoom or Google Meet, transcribes everything using Apple's on-device SpeechAnalyzer, and saves the result as a Markdown file.

And I should not have built it.

The project was genuinely fun

I want to be clear about this. I do not regret the time I spent. I learned things I would not have learned any other way.

Apple shipped a new SpeechAnalyzer API in macOS 26 that I had never touched. I learned how ScreenCaptureKit captures system audio, which requires pretending to capture video at 1 frame per second because the API was designed for screen recording, not audio. I got deeper into Swift concurrency, @MainActor isolation, and the specific ways nonisolated(unsafe) lets you share state with the audio render thread without crashing.

I also learned that Apple's SpeechAnalyzer cannot run two instances simultaneously. I discovered this the hard way, after writing all the code to run parallel transcription engines for microphone and system audio. The second instance never produces a transcript. The only clue is a "Registration timer expired" error that appears nowhere in Apple's documentation.

These are the kinds of things you only learn by building. Reading the documentation would not have taught me any of it.

I also reinforced how much I loathe Apple's developer tools and the abysmal state of everything in the Apple developer ecosystem.

Where it fell apart

The first version worked fine for basic transcription. One audio stream, one speaker, text saved to a Markdown file. The problems started when I tried to add speaker diarization.

In a real meeting, you need to know who said what. "Let's push the launch to Q3" means very different things depending on whether your VP said it or the intern suggested it. Without diarization, every line of the transcript just says "Me." Not useful.

Apple's SpeechAnalyzer does not support diarization. It has no concept of multiple speakers. With Claude's help, I tried an energy-based approach: capture system audio separately, measure its energy level, and when the energy is high, assume the current transcript segment came from someone else.

This works when people take turns speaking. It falls apart completely when two people talk at the same time, which in my experience is most of every meeting. The heuristic only sees one number, the energy on the system audio channel, and that number is high whenever anyone on the call is talking. It cannot tell overlapping speakers apart, because they all look the same to it.
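A minimal Python sketch of that heuristic shows where it breaks. This is not the app's Swift code: the function names, the sample values, and the 0.02 threshold are all assumptions chosen for illustration.

```python
import math

def rms(samples):
    """Root-mean-square energy of one block of audio samples (floats in [-1, 1])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def attribute_segment(system_samples, threshold=0.02):
    """Label a transcript segment using system-audio energy alone.

    The threshold is an assumed value, not a tuned one.
    """
    return "Them" if rms(system_samples) > threshold else "Me"

# Synthetic blocks standing in for the system audio channel.
quiet_system = [0.001] * 480       # nobody on the call is talking
loud_system = [0.2, -0.2] * 240    # someone on the call is talking

print(attribute_segment(quiet_system))  # only I am talking
print(attribute_segment(loud_system))   # only the other side is talking
# Overlap: I am talking while the other side talks too. The system channel
# is just as loud as before, so the segment gets the same label either way.
print(attribute_segment(loud_system))
```

The last two calls receive identical input, which is the whole problem: one energy reading carries no information about how many people are speaking or which of them the transcribed words belong to.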

The real solution is Deepgram. Their Nova-2 model supports native speaker diarization. You send audio, they return text with speaker labels. It works well. But it costs money.
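To make "text with speaker labels" concrete, here is a small Python sketch that turns a list of diarized words into speaker turns. The input shape (a list of dicts with `word` and `speaker` keys) is an assumption for illustration, loosely modeled on a per-word speaker index. Check the provider's response documentation for the real field names before relying on it.

```python
from itertools import groupby

def words_to_turns(words):
    """Collapse consecutive words from the same speaker into one turn.

    `words` is an assumed shape: [{"word": str, "speaker": int}, ...].
    """
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["word"] for w in group)))
    return turns

def turns_to_markdown(turns):
    """Render turns as simple Markdown lines, one per speaker turn."""
    return "\n".join(f"**Speaker {speaker}:** {text}" for speaker, text in turns)

# Hand-written sample data, not real API output.
sample_words = [
    {"word": "Let's", "speaker": 0},
    {"word": "push", "speaker": 0},
    {"word": "the", "speaker": 0},
    {"word": "launch.", "speaker": 0},
    {"word": "To", "speaker": 1},
    {"word": "Q3?", "speaker": 1},
    {"word": "Yes.", "speaker": 0},
]

sample_turns = words_to_turns(sample_words)
print(turns_to_markdown(sample_turns))
```

Numbered speakers still have to be mapped to names by hand or by some other signal. The sketch only shows the grouping step.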

The math does not work

I have about 16 meetings per week, each roughly 30 minutes. That is 480 minutes per week, or about 2,080 minutes per month.

My Deepgram setup uses two audio streams (microphone and system audio), each sent over a separate WebSocket connection. That doubles the billable minutes to 4,160 per month. At Nova-2 pricing ($0.0058/min) plus the diarization add-on ($0.0020/min), that comes to about $32 per month. Just for transcription.
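The arithmetic is easy to check. This Python sketch uses only the figures quoted above. The one added assumption is converting weeks to months with 52 weeks / 12 months, which is what reproduces the 2,080-minute figure.

```python
MEETINGS_PER_WEEK = 16
MINUTES_PER_MEETING = 30
NOVA2_PER_MIN = 0.0058        # Nova-2 rate quoted above, USD per minute
DIARIZATION_PER_MIN = 0.0020  # diarization add-on quoted above, USD per minute

def monthly_minutes():
    """Meeting minutes per month, assuming 52 weeks spread over 12 months."""
    weekly = MEETINGS_PER_WEEK * MINUTES_PER_MEETING  # 480
    return weekly * 52 / 12                           # 2080.0

def monthly_cost(streams):
    """Transcription cost per month when every minute is billed once per stream."""
    billable = monthly_minutes() * streams
    return billable * (NOVA2_PER_MIN + DIARIZATION_PER_MIN)

print(f"minutes per month: {monthly_minutes():.0f}")
print(f"one stream:  ${monthly_cost(1):.2f}")
print(f"two streams: ${monthly_cost(2):.2f}")
```

One stream comes to about $16.22 and two streams to about $32.45, which matches the $16-32 range once rounded.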

Add OpenAI for AI summarization and you are north of $35 per month for a meeting notes tool.

Granola costs $14 per month. Notion AI runs $18 per month, bundled with everything else Notion Business does.

How do they charge so much less? Volume pricing. Granola processes millions of minutes of audio per month across their user base. Their per-minute cost from Deepgram (or whatever provider they use) is a fraction of what I pay on the Pay As You Go tier. They spread infrastructure costs across thousands of paying users. The economies of scale are working in their favor and against mine.

What I actually wanted

My use case was specific. I wanted meeting transcripts saved as Markdown files in a local directory that I use as a knowledge base with Claude Code. I call it my "brain." It is a repo full of notes, documents, and reference material that gives Claude context about my work.

I built the whole pipeline: transcribe the meeting, summarize it with AI, format it as Markdown with YAML frontmatter, save it to the right folder based on which calendar the meeting came from.
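The last two steps of that pipeline (format as Markdown with YAML frontmatter, route by calendar) can be sketched in a few lines of Python. This is an illustration, not the app's code: the folder names, frontmatter keys, and file-naming scheme are all hypothetical.

```python
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical mapping from calendar name to a folder inside the notes repo.
CALENDAR_FOLDERS = {"Work": "meetings/work", "Personal": "meetings/personal"}

def render_note(title, meeting_date, calendar, summary, transcript):
    """Build the Markdown body with a YAML frontmatter header.

    Quoting here is naive: a title containing double quotes would need escaping.
    """
    frontmatter = "\n".join([
        "---",
        f'title: "{title}"',
        f"date: {meeting_date.isoformat()}",
        f"calendar: {calendar}",
        "---",
    ])
    return f"{frontmatter}\n\n## Summary\n\n{summary}\n\n## Transcript\n\n{transcript}\n"

def save_note(root, title, meeting_date, calendar, summary, transcript):
    """Write the note under the folder chosen by calendar and return its path."""
    folder = Path(root) / CALENDAR_FOLDERS.get(calendar, "meetings/other")
    folder.mkdir(parents=True, exist_ok=True)
    slug = "-".join(title.lower().split())
    path = folder / f"{meeting_date.isoformat()}-{slug}.md"
    path.write_text(
        render_note(title, meeting_date, calendar, summary, transcript),
        encoding="utf-8",
    )
    return path

# Demo against a throwaway directory instead of a real notes repo.
demo_root = tempfile.mkdtemp()
demo_path = save_note(
    demo_root, "Launch Sync", date(2025, 1, 15), "Work",
    "Launch moves to Q3.", "Me: Let's push the launch to Q3.",
)
print(demo_path.relative_to(demo_root))
```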

Then I remembered that I already have a utility that runs every night and exports my Notion pages to the same Markdown files in the same directory. If I use Notion AI for transcription, the notes end up in Notion, get exported overnight, and land in my brain repo by morning. Same result. Less code. Someone else maintains the transcription infrastructure.

The real lesson

I wrote about this pattern in Picks and Shovels. When everyone is rushing to build their own AI-powered tools, the companies selling the infrastructure make the reliable money. The same logic applies to using those tools. When Granola or Notion has already negotiated volume pricing, hired audio engineers who specialize in diarization, and built a polished product around the transcription workflow, the math is hard to beat by going solo.

Build your own version if you want to learn. I genuinely recommend it. The two nights I spent on this project were just really, really fun. This builder era is an absolute joy. There's no substitute for learning.

But if what you're trying to do is save money, think about your unit costs. Building my own Vivino and my own MyFitnessPal made obvious sense. I no longer have to pay an absurd amount of money to remove ads or track my food intake, only to receive an abysmal, complicated, and just enshittified product in return.

On the other hand, Granola and Notion AI are both great products. They are not perfect, but they are good enough for me and they both make the math work.