# Models & Engines

> Choose the right transcription model for your use case.

Vowen supports multiple transcription engines — both local (offline) and cloud-based. This guide helps you pick the right one.

## Quick Recommendations

| Use Case               | Recommended Model                             | Why                                |
| ---------------------- | --------------------------------------------- | ---------------------------------- |
| Quick notes (macOS)    | Base.en or Parakeet                           | Fast, good accuracy, works offline |
| Quick notes (Windows)  | Groq Whisper Turbo                            | Fast cloud model, free tier        |
| Professional writing   | Large v3 Turbo + AI Enhancement               | Best accuracy + polished output    |
| Non-English            | Large v3 or Groq Large v3                     | Best multilingual accuracy         |
| Maximum privacy        | Any local model                               | Nothing leaves your device         |
| Real-time preview      | Parakeet, Deepgram, Soniox, or Cartesia Ink 2 | Shows text as you speak            |
| Meetings (diarization) | Deepgram Nova 3 or AssemblyAI                 | Identifies who said what           |

## Local Models (Offline)

These run entirely on your machine, so no internet connection is needed once a model has been downloaded. All local models are downloaded on demand from within the app; nothing is bundled with the installer. The **Base** model is the default and is offered for download during onboarding.

### Whisper Models

Based on OpenAI's Whisper, supporting 99 languages.

| Model              | Size   | Speed       | Accuracy            | Best For               |
| ------------------ | ------ | ----------- | ------------------- | ---------------------- |
| Tiny               | 78 MB  | Fastest     | Basic               | Quick notes, testing   |
| Tiny.en            | 78 MB  | Fastest     | Good (English)      | Fast English dictation |
| Base.en            | 148 MB | Fast        | Good                | General English use    |
| **Base** (default) | 148 MB | Fast        | Good                | Multilingual basics    |
| Small              | 488 MB | Medium      | Great               | Professional work      |
| Small.en           | 488 MB | Medium      | Great (English)     | Detailed English       |
| Medium             | 1.5 GB | Slow        | Excellent           | High-quality output    |
| Medium.en          | 1.5 GB | Slow        | Excellent (English) | Long-form English      |
| Large v3           | 3 GB   | Slowest     | Best                | Maximum accuracy       |
| Large v3 Turbo     | 1.6 GB | Medium-Fast | Excellent           | Best balance           |

<Note>
  Models with the `.en` suffix are English-only and slightly more accurate for English than their multilingual counterparts.
</Note>
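
As a rough illustration of how the size and accuracy columns trade off, the sketch below picks the most accurate Whisper model that fits a download budget. The sizes are copied from the table (1.5 GB, 1.6 GB, and 3 GB are treated as 1500, 1600, and 3000 MB). The numeric accuracy ranks and the `pick_whisper_model` helper are assumptions made for this example; they are not part of Vowen.

```python
# Illustrative only: sizes come from the table above; the accuracy ranks
# (1 = Basic ... 5 = Best) are an assumption made for this sketch.
WHISPER_MODELS = [
    # (name, size_mb, accuracy_rank, english_only)
    ("Tiny", 78, 1, False),
    ("Tiny.en", 78, 2, True),
    ("Base.en", 148, 2, True),
    ("Base", 148, 2, False),
    ("Small", 488, 3, False),
    ("Small.en", 488, 3, True),
    ("Medium", 1500, 4, False),
    ("Medium.en", 1500, 4, True),
    ("Large v3", 3000, 5, False),
    ("Large v3 Turbo", 1600, 4, False),
]


def pick_whisper_model(max_size_mb, english_only_ok=True):
    """Return the most accurate model within the size budget, or None."""
    candidates = [
        m for m in WHISPER_MODELS
        if m[1] <= max_size_mb and (english_only_ok or not m[3])
    ]
    if not candidates:
        return None
    # Highest accuracy rank first; among ties, prefer the smaller download.
    best = max(candidates, key=lambda m: (m[2], -m[1]))
    return best[0]


print(pick_whisper_model(500))                         # budget of about 500 MB
print(pick_whisper_model(100, english_only_ok=False))  # multilingual only
```

The ranking ignores speed, so a real choice would also weigh the Speed column, where Large v3 Turbo is listed as faster than Medium.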

### Parakeet TDT 0.6B

NVIDIA's streaming-capable model. Supports 25 European languages with auto-detection.

|           | macOS  | Windows     |
| --------- | ------ | ----------- |
| Format    | CoreML | ONNX (int8) |
| Size      | \~1 GB | \~478 MB    |
| Streaming | Yes    | Yes         |
| Languages | 25     | 25          |

<Tip>Parakeet is excellent for real-time transcription preview and European languages. The first 1-2 transcriptions after launch may be slower as the model loads into memory.</Tip>

Recent Parakeet improvements:

* **Voice activity detection (VAD)** now drives streaming, for cleaner utterance boundaries and more reliable real-time transcription
* **Auto-translate to English** works with Parakeet (Pro)
* On **Windows**, long recordings are automatically chunked so lengthy audio transcribes reliably
* Accidental empty taps (audio under \~0.3 seconds) are discarded silently — they won't surface an error or leave an entry in your Voice Log
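
The last two behaviors can be pictured with a short sketch. It assumes 16 kHz mono audio held as a list of samples and a 30-second chunk length; both values and the function names are assumptions for illustration. Only the roughly 0.3-second threshold comes from the list above, and this is not a description of Vowen's actual implementation.

```python
SAMPLE_RATE = 16_000   # assumption: 16 kHz mono input
MIN_SECONDS = 0.3      # from the list above: shorter taps are discarded
CHUNK_SECONDS = 30     # assumption: the real chunk length is not documented


def is_empty_tap(samples):
    """True when the recording is shorter than the ~0.3 s threshold."""
    return len(samples) < MIN_SECONDS * SAMPLE_RATE


def split_into_chunks(samples):
    """Split a long recording into consecutive fixed-length chunks."""
    chunk_len = CHUNK_SECONDS * SAMPLE_RATE
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


recording = [0] * (SAMPLE_RATE * 65)   # 65 seconds of silence as stand-in audio
if not is_empty_tap(recording):
    print(len(split_into_chunks(recording)), "chunks")
```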

## Cloud Models

These send audio to a third-party API and require an internet connection and an API key.

### Available Cloud Models

With cloud models, your audio is sent to the provider, processed, and the transcription is returned. Some models stream text back as you speak (**Real-time:** Yes); others return the full transcript once you stop (**Real-time:** No).

| Model                 | Provider     | Languages              | Real-time | Diarization  | Free Tier      |
| --------------------- | ------------ | ---------------------- | --------- | ------------ | -------------- |
| Whisper Large v3      | Groq         | 99                     | No        | Yes (macOS)¹ | Yes (generous) |
| Whisper Turbo         | Groq         | 99                     | No        | Yes (macOS)¹ | Yes (generous) |
| gpt-4o-transcribe     | OpenAI       | 99                     | Yes       | Yes          | Paid           |
| Gemini (Live + Flash) | Google       | Many                   | Yes       | Yes          | Free API       |
| Nova 2/3              | Deepgram     | 99                     | Yes       | Yes          | \$200 credit   |
| Scribe v2             | ElevenLabs   | 99                     | Yes       | Yes          | Limited        |
| Universal             | AssemblyAI   | 6 streaming / 99 batch | Yes       | Yes          | \$50 credit    |
| Voxtral Mini          | Mistral      | 13                     | Yes       | Yes          | Free API       |
| Saaras v3             | Sarvam AI    | 22+                    | Yes       | No           | Limited        |
| Soniox (STT v5)       | Soniox       | 60+                    | Yes       | No           | Paid           |
| Ink 2                 | Cartesia     | English                | Yes       | No           | Paid           |
| Aurora                | xAI          | Various                | Yes       | Yes          | Limited        |
| Speechmatics          | Speechmatics | 39                     | Yes       | Yes          | Paid           |

<Note>**OpenAI** (`gpt-4o-transcribe`) and **Google Gemini** are now real-time streaming providers, so they can power live transcription preview and [Ask AI live during a meeting](/meeting-notes/recording#ask-ai-live-during-a-meeting), not just file transcription. Gemini's live sessions are capped at 15 minutes each, while its file transcription handles audio up to \~9.5 hours.</Note>

<Note>AssemblyAI Universal supports 6 languages in real-time streaming (English, Spanish, French, German, Italian, Portuguese) and 99 languages when used for batch transcription of pre-recorded files.</Note>

<Note>¹ **Groq diarization** runs on-device through a built-in Pyannote pipeline and is available on **macOS only**. See [Diarization](/meeting-notes/diarization#on-device-diarization-macos).</Note>

<Note>**Cartesia Ink 2** is a streaming-first model tuned for the lowest word error rate and fastest live preview. It is **English-only** (more languages coming) and does not support diarization. Custom vocabulary is applied as post-processing rather than natively. It is recommended for real-time dictation.</Note>

<Note>**Soniox** runs its v5 models — real-time streaming for live dictation and an async model for file transcription — covering 60+ languages with low latency. Soniox does not provide speaker diarization; use a diarization-capable model (or on-device diarization on macOS) for meetings where you need to identify who said what.</Note>
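
To narrow the table down by capability, it can be handy to treat it as data. The sketch below copies the Real-time and Diarization columns into a list and filters on them. The `find_models` helper and the `"macos"` marker for Groq's on-device diarization are conventions invented for this example, not a Vowen API.

```python
# Capability flags copied from the table above.
# diarization: "yes", "macos" (on-device, macOS only), or "no".
CLOUD_MODELS = [
    # (model, provider, real_time, diarization)
    ("Whisper Large v3", "Groq", False, "macos"),
    ("Whisper Turbo", "Groq", False, "macos"),
    ("gpt-4o-transcribe", "OpenAI", True, "yes"),
    ("Gemini (Live + Flash)", "Google", True, "yes"),
    ("Nova 2/3", "Deepgram", True, "yes"),
    ("Scribe v2", "ElevenLabs", True, "yes"),
    ("Universal", "AssemblyAI", True, "yes"),
    ("Voxtral Mini", "Mistral", True, "yes"),
    ("Saaras v3", "Sarvam AI", True, "no"),
    ("Soniox (STT v5)", "Soniox", True, "no"),
    ("Ink 2", "Cartesia", True, "no"),
    ("Aurora", "xAI", True, "yes"),
    ("Speechmatics", "Speechmatics", True, "yes"),
]


def find_models(realtime=None, diarization=False, platform="macos"):
    """Return (provider, model) pairs matching the requested capabilities."""
    matches = []
    for model, provider, is_realtime, diar in CLOUD_MODELS:
        if realtime is not None and is_realtime != realtime:
            continue
        if diarization:
            supported = diar == "yes" or (diar == "macos" and platform == "macos")
            if not supported:
                continue
        matches.append((provider, model))
    return matches


# Meeting use on Windows: live text plus speaker labels.
for provider, model in find_models(realtime=True, diarization=True, platform="windows"):
    print(provider, model)
```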

### Setting Up Cloud Models

1. Go to **Settings > Models**
2. Select a cloud model from the list
3. Enter your API key when prompted
4. The model is ready to use immediately

<div className="my-6 flex gap-3 rounded-xl border border-violet-500/30 bg-violet-500/10 p-4">
  <div className="shrink-0 pt-0.5 text-violet-500 dark:text-violet-400">
    <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" strokeWidth="2" strokeLinecap="round" strokeLinejoin="round">
      <path d="M15 14c.2-1 .7-1.7 1.5-2.5 1-.9 1.5-2.2 1.5-3.5A6 6 0 0 0 6 8c0 1 .2 2.2 1.5 3.5.7.7 1.3 1.5 1.5 2.5" />
      <path d="M9 18h6" />
      <path d="M10 22h4" />
    </svg>
  </div>

  <div className="text-sm leading-relaxed text-zinc-700 dark:text-zinc-300">
    <strong className="text-violet-700 dark:text-violet-300">Pro tip:</strong> Groq is the most popular choice among Vowen users. It's fast, accurate, and has a generous free tier that covers most daily use.
  </div>
</div>

## GPU Acceleration (Windows)

If you have an NVIDIA GPU, you can dramatically speed up local model transcription:

1. Go to **Settings > Models**
2. Scroll down to find "GPU Acceleration"
3. Download the CUDA acceleration module
4. Restart Vowen (or your system if needed)

With GPU acceleration, even the Large v3 model responds in 1-2 seconds on modern NVIDIA GPUs.
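
Before downloading the module, it can help to confirm that an NVIDIA driver is installed. The sketch below is a generic check run outside Vowen, not something the app exposes. It looks for the `nvidia-smi` utility that ships with NVIDIA drivers and, if it is found, asks it to list the GPUs with the `-L` flag. The `list_nvidia_gpus` name is made up for this example.

```python
import shutil
import subprocess


def list_nvidia_gpus():
    """Return GPU description lines from `nvidia-smi -L`, or [] if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # utility not on PATH: the driver is probably not installed
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


print(list_nvidia_gpus() or "No NVIDIA GPU detected")
```

An empty result only means the utility was not found or failed to run; it does not prove that no GPU is present.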

## Choosing Between Local and Cloud

| Factor          | Local                        | Cloud                  |
| --------------- | ---------------------------- | ---------------------- |
| Privacy         | Data never leaves device     | Audio sent to provider |
| Speed (macOS)   | Fast for small/medium models | Always fast            |
| Speed (Windows) | Slow without GPU             | Always fast            |
| Accuracy        | Good to excellent            | Excellent              |
| Internet        | Not required                 | Required               |
| Cost            | Free                         | Free tier or paid API  |
| Languages       | 99 (Whisper) / 25 (Parakeet) | Varies by provider     |
