Voice notes in Shoal — encrypted audio with optional on-device transcription

Talk when typing is the wrong shape

Some things are easier to say than to type. A bedtime story for a grandchild who’s away for the week. A quick walk-and-talk on the way home. The right tone of voice for something a typed sentence wouldn’t carry. Voice notes are for those.

Shoal voice notes sit inside the same conversations as text messages, with the same recipients, the same encryption, and the same admin oversight. Tap and hold to record; release to send. Playback is built in.

Optional on-device transcription

Voice notes are wonderful when you’re free to listen, and a nuisance when you’re not — in a meeting, on a noisy train, looking after a baby who’s just fallen asleep. So Shoal can turn a voice note into text on the receiving device, using the speech recogniser that’s already on your phone or laptop.

A few specifics:

It’s optional. Transcription is off by default for each conversation; you can turn it on for the conversations where it helps and leave it off elsewhere.
It runs locally. Transcription uses your operating system’s on-device speech recognition (the same engine that powers dictation in your keyboard). The audio is decrypted on your device and passed to the local recogniser, and the text comes back without the audio or the text being sent to a recognition service.
Transcripts are encrypted. Once produced, a transcript is encrypted with the conversation key (the same AES-256-GCM key that encrypts the audio and any text in that conversation) and stored alongside the voice note, so it doesn’t have to be regenerated each time. Recipients who weren’t around when the transcript was generated can decrypt it later.
You can re-run it. If a transcript came out garbled, you can re-transcribe locally without sending the audio anywhere.
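
As a rough sketch of the flow described above, and not Shoal's actual code: the device decrypts the audio, hands it to a local recogniser, and encrypts the transcript under the same conversation key before anything is stored. Every name here (`StoredTranscript`, `transcribe_locally`, and the `decrypt`, `recognise`, and `encrypt` callables) is hypothetical; the callables stand in for AES-256-GCM under the conversation key and for the operating system's recogniser.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class StoredTranscript:
    """What gets stored next to the voice note: ciphertext only."""
    voice_note_id: str
    ciphertext: bytes


def transcribe_locally(
    voice_note_id: str,
    encrypted_audio: bytes,
    transcription_enabled: bool,
    decrypt: Callable[[bytes], bytes],   # hypothetical: decrypt with the conversation key
    recognise: Callable[[bytes], str],   # hypothetical: the OS's on-device recogniser
    encrypt: Callable[[bytes], bytes],   # hypothetical: encrypt with the same key
) -> Optional[StoredTranscript]:
    """Produce an encrypted transcript entirely on the device."""
    # Transcription is opt-in per conversation, so the default is to do nothing.
    if not transcription_enabled:
        return None

    audio = decrypt(encrypted_audio)  # plaintext audio exists only in local memory
    text = recognise(audio)           # local recognition, no network request
    # Only the encrypted transcript is returned for storage.
    return StoredTranscript(voice_note_id, encrypt(text.encode("utf-8")))
```

Re-running a garbled transcript would amount to calling the same function again on the same encrypted audio.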

What our servers see

Nothing decrypted. The audio leaves the sending device as ciphertext; the transcript, if there is one, is produced and encrypted on a user’s device before it’s stored. Our servers route encrypted bytes and never see the audio waveform, the spoken words, or the resulting text.

This isn’t a separate privacy story bolted onto voice notes; it’s the same model as text messages. The only personal data we hold in plaintext is your email address, and voice notes don’t change that.

Admin oversight still applies

Voice notes are messages, so the same admin oversight rules apply. When a child is in a conversation, family admins are cryptographic recipients of that conversation’s key, and that key encrypts the voice note’s audio and any transcript. Admins can listen to children’s voice notes and read their transcripts in the same way they can read text messages, and only in those conversations. There’s no separate decryption pathway and nothing for our servers to hand over.
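
One way to picture that rule, as a sketch with made-up names (`Member`, `key_recipients`) rather than Shoal's implementation: the set of people who receive a conversation's key is its members, plus the family admins whenever any member is a child.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    user_id: str
    is_child: bool = False


def key_recipients(members: list[Member], family_admins: set[str]) -> set[str]:
    """Return the user IDs that should receive the conversation key."""
    recipients = {m.user_id for m in members}
    # Admins are added only when a child takes part in the conversation.
    if any(m.is_child for m in members):
        recipients |= family_admins
    return recipients
```

Because voice-note audio and transcripts are encrypted under that same key, no extra recipient list is needed for them.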

Moderation acts on transcripts where a transcript exists, so the word-blocking and word-flagging rules an admin has set apply to spoken words too. Those checks run on the child’s own device, before a voice note is sent, which means the note is transcribed there first. Without a transcript, only the time-limit and link-blocking rules apply, since there’s nothing for a keyword rule to inspect.
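
The keyword part of that split can be sketched as follows. This is an illustration with hypothetical names (`Rules`, `check_voice_note`), not Shoal's rule engine, and it assumes keyword rules match whole words case-insensitively and that a blocked word takes priority over a flagged one.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rules:
    blocked_words: set[str] = field(default_factory=set)
    flagged_words: set[str] = field(default_factory=set)


def check_voice_note(transcript: Optional[str], rules: Rules) -> str:
    """Return "block", "flag", or "allow" for a voice note about to be sent."""
    # Without a transcript there is nothing for a keyword rule to inspect.
    if transcript is None:
        return "allow"

    # Assumption: whole-word, case-insensitive matching.
    words = set(re.findall(r"[\w']+", transcript.lower()))
    if words & {w.lower() for w in rules.blocked_words}:
        return "block"
    if words & {w.lower() for w in rules.flagged_words}:
        return "flag"
    return "allow"
```

The time-limit and link-blocking rules are left out of the sketch; they would run whether or not a transcript exists.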

Why we did it this way

We could have built transcription as a server-side feature — upload encrypted audio, decrypt server-side, run it through a cloud speech model, hand the text back. It would have worked on more devices and produced slightly better transcripts. It would also have meant we needed to read your audio, which would have quietly undone the encryption claim that runs through everything else on this site.

So we didn’t. On-device transcription has limits — recognition quality varies by platform, some older devices don’t ship a recogniser at all, and a few languages aren’t covered yet — but it keeps the architectural property that matters: decrypted message contents, in any form, never reach our servers.