CloviTranscribe

CloviTranscribe — private, self-hosted transcription that turns audio and video into searchable, speaker-labeled text

Transcription that stays on your side, by CloviTek

TL;DR


At a glance

Best for
  • Podcast and video producers who deliver transcripts and captions
  • Operations and compliance teams documenting calls and meetings
  • SaaS builders embedding transcription inside their own product
Integrations
  • REST API (Tier 3 and above)
  • Webhooks with HMAC signing (Tier 3 and above)
  • CloviTek SSO
  • Automation platforms via webhook
  • Slack notifications
  • Direct file upload (MP3, MP4, WAV, M4A, OGG, WEBM)
Alternative to
  • Cloud per-minute transcription and AI note-taking services

Overview

Audio and video pile up faster than any team can process by hand — and the usual answers either charge by the minute until costs compound or require shipping private recordings to someone else's cloud. Not ideal. CloviTranscribe takes a different path: a self-hosted transcription engine with a clean web interface, where recordings are processed on infrastructure you control and exported in the formats you actually use. No metered billing. No third-party servers handling your sensitive material.

CloviTranscribe is a self-hosted transcription workspace with file and link ingestion, a job queue with live status, 99-language detection, word-level timestamps, and TXT, SRT, and VTT export. Tier 2 and above add speaker labels, JSON export, full-text search, and AI summaries.

Private, self-hosted transcription

Recordings are processed by a self-hosted Whisper engine running on the server you control — audio never leaves your infrastructure and isn't handed off to a separate third-party transcription cloud. That makes CloviTranscribe a natural fit for teams handling vendor negotiations, privileged conversations, or any sensitive material where data location actually matters. Because the engine runs on hardware you've already provisioned, there's no per-minute compute charge stacking up behind every job. No usage creep. One tradeoff worth naming up front: CPU-based processing is thorough rather than instant, so longer files take longer to return than a paid cloud API would.

Upload or link, then transcribe

Submit a recording in one of two ways: upload a file directly in a common format (MP3, MP4, WAV, M4A, OGG, or WEBM), or paste a link to a hosted recording. Each submission becomes a tracked job in the queue, and the dashboard polls status so progress stays visible without manual refreshing. Language is auto-detected across 99 languages, with an optional hint when the source language is already known. The flow is intentionally plain: choose a source, start the job, and collect the transcript when it finishes.
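
From a client's point of view, the submit-and-poll flow looks roughly like the sketch below. It is an illustration under stated assumptions, not the CloviTranscribe API: the status names (`queued`, `processing`, `done`, `failed`) and the `fetch_status` callable are hypothetical stand-ins for whatever the dashboard or API actually returns.

```python
import time
from typing import Callable

# Hypothetical status values; the real job states are not documented here.
TERMINAL_STATES = {"done", "failed"}


def wait_for_job(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until the job reaches a terminal state.

    fetch_status is any callable that returns the job's current status
    string, for example a function that wraps an HTTP GET.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Usage with a fake status source, so the sketch runs without a server.
_fake_statuses = iter(["queued", "processing", "processing", "done"])
result = wait_for_job(lambda: next(_fake_statuses), interval_s=0.0)
print(result)
```

A fixed interval keeps the sketch short. Because CPU-based jobs can run for a long time, a real client might prefer a longer interval or a webhook over tight polling.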

Clean exports with timestamps

Finished transcripts export as plain TXT for documents, SRT and VTT for captions and web video, and, from Tier 2 up, JSON for downstream processing. Word-level timestamps keep captions synced with playback, and structured output stays machine-readable. The export set covers the common destinations: a show-notes paragraph, a caption track, a JSON payload feeding another system. Each completed job also stays in the library for re-export within its tier's retention window, so switching formats never requires a re-run.
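
To show how timestamped output maps onto a caption track, here is a minimal sketch that renders segments as SRT. The segment shape (`start`, `end`, and `text`, with times in seconds) is an assumption for illustration; the actual JSON export schema is not documented here.

```python
from typing import TypedDict


class Segment(TypedDict):
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


# Made-up sample data, for illustration only.
example = [
    {"start": 0.0, "end": 2.5, "text": "Welcome back to the show."},
    {"start": 2.5, "end": 5.0, "text": "Today we cover pricing."},
]
print(to_srt(example))
```

VTT is close to the same layout: the file starts with a `WEBVTT` header line, and timestamps use a period instead of a comma before the milliseconds.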

Speaker labels and AI summaries

On Tier 2 and above, transcripts are segmented by speaker turn, so a conversation reads as a back-and-forth rather than a wall of text, and speakers can be renamed in the viewer. The same tiers add an AI summary that distills a recording into a short overview with action items and key points, which makes a long call skimmable in seconds. One thing worth knowing about speaker separation: it uses sequential turn labels rather than acoustic voice fingerprinting. It marks who spoke when by turn boundaries, not by voice recognition, so treat the labels as an editable starting point, not a forensic identity match.

Search, library, and a developer API

Every completed job lands in a library, and from Tier 2 up, full-text search spans stored transcripts, so a phrase from weeks ago is one query away. Important jobs can be starred for quick return, and retention scales with your tier, from 30 days on Tier 1 to 365 days on Tier 3. For builders, Tier 3 and above expose a REST API and HMAC-signed webhooks, so a finished transcript gets pushed straight into another product or automation. That's the path for embedding transcription inside your own application rather than operating it in a separate browser tab.
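
On the receiving side, an HMAC-signed webhook is typically checked by recomputing the signature over the raw request body and comparing it in constant time. The sketch below assumes HMAC-SHA256 with a hex-encoded digest and a shared secret; the hash algorithm, encoding, header name, and payload fields CloviTranscribe actually uses are not stated here, so treat all of them as placeholders.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Check a webhook signature using a constant-time comparison.

    Assumption: the sender signs the exact raw bytes of the body with
    HMAC-SHA256 and sends the hex digest in a request header.
    """
    expected = sign_body(secret, body)
    return hmac.compare_digest(expected, received_signature)


# Usage with made-up values; the payload fields are illustrative only.
secret = b"example-shared-secret"
body = b'{"job_id": "abc123", "status": "done"}'
signature = sign_body(secret, body)

print(verify_webhook(secret, body, signature))         # untouched body
print(verify_webhook(secret, body + b" ", signature))  # tampered body
```

Verifying against the raw bytes, before any JSON parsing, matters because re-serializing the payload can change whitespace or key order and invalidate an otherwise genuine signature.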

CloviTranscribe brings transcription back under your own roof: private processing, predictable lifetime pricing, and exports that drop straight into captions, documents, or code. No middleman, no recurring fees eating into your margin. It suits producers shipping transcripts, operations teams documenting calls, and developers who need a transcription endpoint they can actually embed. The roadmap ships on a public cadence, roughly monthly, and founder responses to reviews and questions are part of how the product improves. Stack a tier to raise your monthly minutes and retention, and reach Tier 3 to unlock the API and webhooks. Grab a lifetime code and start with your next recording.

Plans & features

$59 one-time Tier 1 (entry code)
Lifetime access to the Starter-equivalent plan: 300 transcription minutes refreshed monthly, file upload and link transcription, word-level timestamps, TXT/SRT/VTT export, 30-day retention, 99-language detection, and a single concurrent job. Includes a 60-day refund window.
$129 one-time Tier 2
Everything in Tier 1 plus a higher monthly minute cap, speaker turn labels and renaming, JSON export, AI summary with action items, full-text search across your library, 90-day retention, and 2 concurrent jobs.
$249 one-time Tier 3
Everything in Tier 2 plus the developer layer: REST API access, HMAC-signed webhooks, custom vocabulary hints, a higher monthly minute cap, 365-day retention, 5 concurrent jobs, and priority queue placement.
$399 one-time Tier 4
Everything in Tier 3 plus white-label embed with configurable subdomain and logo, multiple API keys and webhook endpoints, and a larger monthly minute pool for agencies running transcription across several client accounts.
$599 one-time Tier 5
Everything in Tier 4 plus the largest monthly minute pool, custom-domain configuration, expanded concurrency for batch workflows, and priority support with a defined response window.

FAQs

Where is my audio processed?
Audio is processed by a self-hosted Whisper engine on the server that runs CloviTranscribe rather than being routed to a separate third-party transcription cloud. That is the core privacy and data-location benefit of the product.
How fast is transcription?
Processing runs on CPU, which is reliable but not instant. Short clips return quickly, while longer recordings take proportionally longer than a paid cloud API would. Jobs run through a queue with live status so you can submit and walk away.
How does speaker separation work?
Transcripts are split into speaker turns and you can rename each speaker in the viewer. The current approach labels turns sequentially rather than matching voices by acoustic fingerprint, so treat the labels as an editable starting point. Acoustic diarization is on the public roadmap.
Is the pricing really lifetime, and are there usage limits?
Yes. Each tier is a one-time purchase for lifetime access. There are usage limits: transcription minutes are capped per tier and refresh every month, so no tier is unlimited, but there is no per-minute billing that compounds. Higher tiers raise the monthly cap and retention.
Which export formats are supported?
TXT, SRT, VTT, and JSON, with word-level timestamps. SRT and VTT cover captions and web video, TXT covers documents and show notes, and JSON covers structured downstream processing. JSON export starts at Tier 2.
Can I embed transcription in my own product?
Yes, on Tier 3 and above. A REST API and HMAC-signed webhooks let you submit jobs and receive completed transcript JSON in your own application or automation. White-label embed with your own subdomain and logo is available on Tier 4 and above.
What does the refund policy look like?
Every purchase includes a 60-day refund window, no questions asked, so you can transcribe real files and confirm the tool fits your workflow before committing.

CloviTranscribe vs. the alternatives

Feature | CloviTranscribe | Otter.ai / Fireflies | Rev / Sonix (cloud)
Audio stays on your server | Yes (self-hosted) | No (their cloud) | No (their cloud)
Pricing model | One-time LTD, minutes renew monthly | Monthly subscription per seat | Per-minute or monthly subscription
Export formats | TXT, SRT, VTT, JSON + timestamps | TXT, DOCX, PDF | TXT, SRT, DOCX, JSON
Speaker labels | Yes (turn-based, renameable) | Yes (acoustic ID) | Yes (acoustic ID)
AI meeting summary | Yes (Tier 2 and above) | Yes | Add-on / higher plan
REST API + webhooks | Yes (Tier 3 and above) | Yes (paid plan) | Yes (enterprise)
White-label / embed | Yes (Tier 4 and above) | No | No
99-language detection | Yes (Whisper auto-detect) | Limited languages | Select languages
Full-text search across library | Yes | Yes | Partial / plan-gated
Recurring cost after purchase | None (LTD) | $16–$40/mo per seat | $0.02–$0.25 per minute

Key tradeoff worth knowing: CloviTranscribe runs on CPU rather than a paid cloud GPU, so longer files take proportionally longer than a cloud API would. You trade raw speed for privacy and flat pricing.
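
The flat-pricing side of that trade can be turned into a rough break-even estimate using only figures quoted in this listing: the $59 Tier 1 price, its 300 monthly minutes, and the $0.02 to $0.25 per-minute range from the comparison. The sketch assumes the full monthly allowance is used and ignores the cost of running your own server, so read it as a back-of-the-envelope estimate rather than a pricing guarantee.

```python
import math

# Figures quoted in this listing.
TIER1_PRICE_USD = 59.0
TIER1_MINUTES_PER_MONTH = 300
CLOUD_RATE_LOW_USD = 0.02   # per minute, low end of the quoted range
CLOUD_RATE_HIGH_USD = 0.25  # per minute, high end of the quoted range


def break_even_months(
    one_time_usd: float, minutes_per_month: int, rate_per_minute: float
) -> int:
    """Whole months of per-minute billing needed to match the one-time price."""
    monthly_cloud_cost = minutes_per_month * rate_per_minute
    return math.ceil(one_time_usd / monthly_cloud_cost)


slowest = break_even_months(TIER1_PRICE_USD, TIER1_MINUTES_PER_MONTH, CLOUD_RATE_LOW_USD)
fastest = break_even_months(TIER1_PRICE_USD, TIER1_MINUTES_PER_MONTH, CLOUD_RATE_HIGH_USD)
print(f"Break-even after {fastest} to {slowest} months at full usage")
```

At lighter usage the break-even point moves out proportionally: half the minutes per month means roughly twice as many months.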

Why CloviTek built CloviTranscribe

I kept running into the same wall — piles of recorded calls, interviews, and product walkthroughs that needed to become text, and fast. The transcription tools I tried? They either charged by the minute until the bill made my stomach turn, or they asked me to upload private recordings to a cloud I didn't control. Neither felt right.

So CloviTranscribe runs on hardware you already own. Simple as that. It slots into the same CloviTek productivity suite I lean on every day — documents, slides, automation, all of it in one place.

Is it the fastest option? Running on a CPU, no. And I'd rather tell you that now than oversell it. What it does give you is private transcription you can trust, and pricing that doesn't punish you for actually using it. That matters more than shaving off a few seconds, at least to me.

by CloviTek · Vitaly Kirkpatrick