Speech Intelligence API

Speech Processing API: language and soft-skill scoring from one recording

The ZenHire speech analysis API for hiring scores spoken English from a 4-minute audio or video clip in a single REST call. It returns a CEFR A1-C2 level, 0-100 dimensional scores for fluency, vocabulary, and pronunciation, and built-in fraud signals.

Get API access | Book a demo

CEFR A1-C2 in ~4 minutes | 90-96% alignment with PhD linguists | JSON response

POST /v1/speech-analyze

{
  "cefr_level": "B2",
  "overall_score": 78,
  "dimensions": {
    "fluency": 81,
    "vocabulary": 74,
    "pronunciation": 79,
    "accent_intelligibility": 83
  },
  "fraud_signals": {
    "multiple_speakers": false,
    "reading_detected": false
  }
}
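
As a minimal sketch of consuming this response, the Python below parses a body shaped like the sample above into a small typed record. The `SpeechResult` class and `parse_speech_result` helper are illustrative names, not part of any ZenHire SDK, and the code assumes every key in the sample is present.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechResult:
    """Typed view of the sample response (hypothetical helper class)."""
    cefr_level: str
    overall_score: int
    dimensions: dict[str, int]
    fraud_signals: dict[str, bool]


def parse_speech_result(payload: str) -> SpeechResult:
    """Parse a JSON body shaped like the sample response."""
    data = json.loads(payload)
    return SpeechResult(
        cefr_level=data["cefr_level"],
        overall_score=data["overall_score"],
        dimensions=dict(data["dimensions"]),
        fraud_signals=dict(data["fraud_signals"]),
    )


# The sample response from this page, verbatim.
SAMPLE = """
{
  "cefr_level": "B2",
  "overall_score": 78,
  "dimensions": {
    "fluency": 81,
    "vocabulary": 74,
    "pronunciation": 79,
    "accent_intelligibility": 83
  },
  "fraud_signals": {
    "multiple_speakers": false,
    "reading_detected": false
  }
}
"""

result = parse_speech_result(SAMPLE)
# Lowest-scoring dimension, useful for a one-line recruiter summary.
weakest = min(result.dimensions, key=result.dimensions.get)
print(result.cefr_level, result.overall_score, weakest)  # B2 78 vocabulary
```

Keeping the dimensions as a plain dictionary means the same parser keeps working if a response carries additional dimension keys.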

What does the speech analysis API score from a recording?

Each call reads an MP3, WAV, OGG, FLAC, M4A, or AIFF file, or a video clip, and returns one CEFR band from A1 to C2 for the recording plus a CEFR level per dimension, 0-100 sub-scores, an overall number, and fraud signals. A recruiter sees a standardized verdict instead of replaying audio.

The endpoint expects at least 3 minutes of audio to score reliably. Where a clip is too short or too noisy to judge, the API returns a low-confidence flag rather than a guessed band, so a thin two-second answer never inflates a score. If only some dimensions can be scored, credits are charged proportionally, and failed runs are never charged. Pair it with the AI Interview API and the same recording feeds both the language band and the structured interview score.

The same trained model that powers this endpoint reaches 97% accuracy on CV field extraction and 91% on scripted or AI-generated response detection elsewhere in the ZenHire stack. The speech scores are not a bolt-on; they share one feature-engineered backbone.

Field	What it carries
cefr_level	Standardized A1-C2 band derived from the dimensional scores
fluency	0-100 speech-rate and conversational-flow score
hesitation	0-100 score for micro-pauses and filler frequency
vocabulary	0-100 score for lexical diversity and contextual fit
pronunciation	0-100 clarity and intelligibility score
vocabularyCefr, fluencyCefr, accentCefr	A separate CEFR level per dimension alongside the overall band
transcript	Full transcript with speaker diarization, a speakerCount, and automatic candidate detection flagged high, medium, or low confidence
fraud_signals	Per-call multiple-speaker, reading, and diarization flags

Figure: Output-field manifest. One recording (MP3, WAV, or video clip) feeds a grid of six JSON output fields: cefr_level (A1-C2 band); fluency, hesitation, vocabulary, and pronunciation (0-100 score each); and fraud_signals (flags). The manifest maps each JSON key the speech analysis API returns to the range it reports on, so integrators know exactly which standardized value every output carries.

How does the speech analysis API measure fluency and accent?

This language scoring API reads the speech signal itself: words per minute, pause length, filler density, lexical range, and how cleanly sounds resolve. It is the same read behind ZenHire's interview analysis layer. Accent feeds one axis only, intelligibility, which asks whether a listener can follow the speaker, not where they grew up.

Some buyers fear an automated voice assessment API will quietly mark down regional or non-native accents. It does not. Accent is scored for clarity alone, and sensitive attributes like race, gender, and age are architecturally excluded from the model. Every dimension also ships with two plain-language sentences of explanation, written to be safe to surface directly to hiring managers, candidates, and audit reviewers, as part of ZenHire's approach to ethical hiring.

Signal, not stereotype

Scores pace, pauses, and clarity from the audio, with race, gender, and age architecturally excluded.

Accent for clarity only

Rates whether speech is intelligible, never penalizing a candidate for sounding non-native.

Explainable bands

Returns a per-dimension breakdown behind every CEFR level, so a reviewer can audit and override it.

Figure: Admittance membrane diagram. Admitted acoustic signals (words per minute, pause length, filler density, lexical range, how cleanly sounds resolve, and accent for intelligibility only) pass through a central glass-box membrane to the score, while the sensitive attributes race, gender, and age are struck out and never cross it. The membrane makes the exclusion architectural rather than promised, showing which signals reach the score and which sensitive attributes are kept from ever touching it.

Can the speech analysis API rate soft and domain skills too?

Yes. The same call captures communication signals that map to soft skills, and you can pair it with the Soft Skills Assessment API and role-weighted scorecards to grade domain answers, since the speech analysis API is context-agnostic across question types. It also scores 16 languages, with a dedicated, most-tuned pipeline for assessing spoken English and a multilingual pipeline covering Spanish, French, German, Hindi, Japanese, Arabic, and more, so one integration covers multilingual hiring programs.

A spoken-language mismatch that slips past screening surfaces on the first live customer call, and by then the cost is sunk: replacing one frontline hire runs $5,000 to $20,000 (industry estimates), with roughly half of frontline leavers gone inside 90 days. Enforcing a hard CEFR floor on the recording itself, before anyone reaches a queue, catches that mismatch for the price of one API call.

1. Send any answer
POST a structured, situational, or open-ended response to /v1/speech-analyze, then poll for the result, typically ready in 2-5 minutes.
2. Read communication signals
Parse fluency, hesitation, and vocabulary as proxies for clarity, confidence, and articulation.
3. Layer a scorecard
Weight domain competencies with the assessments engine for a combined, role-specific verdict.
4. Set the threshold
Require B2 for customer-facing roles or C1 for managerial, then route at scale across your high-volume hiring pipeline.
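
The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than ZenHire's client code: the page does not give the base URL, authentication scheme, or polling route, so the HTTP call is left to a caller-supplied `fetch` function (hypothetical), and the role keys in `ROLE_FLOORS` are made-up labels for the B2 and C1 floors named in step 4.

```python
import time
from typing import Callable, Optional

# CEFR bands in ascending order of proficiency.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Floors quoted in step 4; the role keys are hypothetical labels.
ROLE_FLOORS = {"customer_facing": "B2", "managerial": "C1"}


def meets_floor(level: str, floor: str) -> bool:
    """True when `level` is at or above `floor` on the CEFR scale."""
    return CEFR_ORDER.index(level) >= CEFR_ORDER.index(floor)


def poll_for_result(
    fetch: Callable[[], Optional[dict]],
    interval_s: float = 15.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until it returns a result dict or the time budget runs out.

    `fetch` is supplied by the caller and should return None while the
    analysis is still running. The budget counts requested sleep time,
    not wall-clock time.
    """
    waited = 0.0
    while True:
        result = fetch()
        if result is not None:
            return result
        if waited >= timeout_s:
            raise TimeoutError("speech analysis result not ready in time")
        sleep(interval_s)
        waited += interval_s


def route(result: dict, role: str) -> str:
    """Apply the role's CEFR floor to a scored result."""
    floor = ROLE_FLOORS[role]
    return "advance" if meets_floor(result["cefr_level"], floor) else "hold"


# Simulated polling: two "not ready" responses, then a scored result.
responses = iter([None, None, {"cefr_level": "B2"}])
scored = poll_for_result(lambda: next(responses), sleep=lambda s: None)
print(route(scored, "customer_facing"))  # advance
print(route(scored, "managerial"))       # hold
```

Injecting `fetch` and `sleep` keeps the threshold logic testable without network access; in a real integration `fetch` would wrap whatever status request the API documentation specifies.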

Speech Analysis API questions, answered

What is a speech analysis API for hiring?

A speech analysis API for hiring is a REST endpoint that scores a candidate's spoken English from an audio or video clip and returns a CEFR A1-C2 level plus 0-100 dimensional scores as JSON. ZenHire's runs in about 4 minutes per response and adds built-in fraud signals on every call, so a phone screen becomes a single API request: the core need in high-volume BPO hiring.

How accurate is the language scoring API against human reviewers?

Scored against a panel of five PhD linguists, the language scoring API lands within their band 90-96% of the time. Hand the same clip to two untrained recruiters and they agree with each other only 68-75% of the time, because ear fatigue and personal accent bias pull the verdict around. The API does not drift: it is a glass-box model, so a re-run returns the identical band, and each band arrives with its evidence attached.
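
For teams that want to run a similar check on their own recordings, here is one way an agreement rate like this could be computed. It is a sketch under assumptions, not ZenHire's published methodology: it takes the panel's verdict to be the median band across raters and counts a clip as a match only when the API band equals that median. The ratings in the example are toy values for illustration, not measured results.

```python
from statistics import median_low

# CEFR bands in ascending order of proficiency.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def panel_band(ratings: list[str]) -> str:
    """Median band across the panel's ratings (lower median if even count)."""
    middle = median_low(CEFR_ORDER.index(r) for r in ratings)
    return CEFR_ORDER[middle]


def agreement_rate(api_bands: list[str], panel_ratings: list[list[str]]) -> float:
    """Share of clips where the API band equals the panel's median band."""
    if not api_bands or len(api_bands) != len(panel_ratings):
        raise ValueError("need one non-empty list of panel ratings per clip")
    hits = sum(
        api == panel_band(ratings)
        for api, ratings in zip(api_bands, panel_ratings)
    )
    return hits / len(api_bands)


# Toy data: four clips, five raters each (illustrative only).
api = ["B2", "C1", "B1", "B2"]
panel = [
    ["B2", "B2", "B1", "B2", "C1"],
    ["C1", "C1", "B2", "C1", "C2"],
    ["B2", "B1", "B2", "B2", "B1"],
    ["B2", "B2", "B2", "C1", "B2"],
]
print(agreement_rate(api, panel))  # 0.75
```

A stricter or looser definition (for example, matching within one adjacent band) would change the number, so any comparison should state which definition it uses.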

Does the voice assessment API penalize non-native accents?

The voice assessment API does not penalize non-native accents. Accent is rated for intelligibility only, meaning whether the speaker can be understood, and sensitive attributes like race, gender, and age are architecturally excluded from the inputs and the model.

How does the speech analysis API detect cheating?

The speech analysis API runs integrity checks on every call and returns them in a fraud_signals object: multiple-speaker detection catches proxy interviews, reading detection flags scripted or AI-generated answers at 91% accuracy, and diarization isolates the candidate's voice in a two-way conversation for clean scoring.
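
A short sketch of how an integrator might act on the fraud_signals object. The flag names come from the sample response on this page; the policy of sending any raised flag to human review is an assumption for illustration, and both helper functions are hypothetical.

```python
def raised_signals(fraud_signals: dict[str, bool]) -> list[str]:
    """Names of every flag set to true, in alphabetical order."""
    return sorted(name for name, raised in fraud_signals.items() if raised)


def needs_human_review(result: dict) -> bool:
    """True when at least one integrity flag is raised on the result."""
    return bool(raised_signals(result.get("fraud_signals", {})))


clean = {"fraud_signals": {"multiple_speakers": False, "reading_detected": False}}
flagged = {"fraud_signals": {"multiple_speakers": True, "reading_detected": False}}

print(needs_human_review(clean))                 # False
print(raised_signals(flagged["fraud_signals"]))  # ['multiple_speakers']
```

Iterating over whatever keys the object carries, rather than hard-coding two names, means new flags are picked up without a code change.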

What inputs and outputs does the speech analysis API support?

The speech analysis API accepts MP3, WAV, and other audio formats plus video, which it auto-transcribes. It returns a JSON response with the CEFR level, fluency, vocabulary, pronunciation, and accent-intelligibility scores, an overall number, a fraud_signals object, and optional transcript, candidate-only audio, and per-dimension explanations.

The speech analysis API quickstart and scoring spec

A developer spec sheet covering the /v1/speech-analyze request and JSON schema, every dimension and the fraud_signals object, sample payloads, and the CEFR thresholds to set for customer-facing and managerial roles.

Written by

Vladimir Božović

CEO & Founder, ZenHire

Reviewed by

Sara Dojčinović

HR & Recruitment Expert

Score spoken English in four minutes, not a phone screen.

Drop the speech analysis API into your stack and grade language, soft skills, and integrity from one recording.