Candidate detection
Most interview recordings contain at least two speakers: an interviewer and a candidate. The scoring pipeline must know which speaker to score — otherwise you'd be scoring your recruiter's English instead of the candidate's.
How it works
After transcription and speaker diarization, the API inspects each speaker's segments and picks the candidate using:
- Total speaking time (candidates typically speak more)
- Nature of the speech (answering questions vs. asking them)
- Word count and segment distribution
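To make these signals concrete, here is a toy sketch of how they could be combined. It is not the API's actual algorithm, which is not documented here: the `Segment` type, the `guess_candidate` function, and the weighting are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized segment (hypothetical shape, for illustration only)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def guess_candidate(segments: list[Segment]) -> str:
    """Toy heuristic: reward speaking time and word count, penalise question-asking."""
    if not segments:
        raise ValueError("no segments to inspect")

    stats: dict[str, dict[str, float]] = {}
    for seg in segments:
        s = stats.setdefault(
            seg.speaker,
            {"seconds": 0.0, "words": 0, "questions": 0, "segments": 0},
        )
        s["seconds"] += seg.end - seg.start
        s["words"] += len(seg.text.split())
        s["questions"] += 1 if "?" in seg.text else 0
        s["segments"] += 1

    def score(s: dict[str, float]) -> float:
        # Fraction of this speaker's segments that contain a question.
        question_rate = s["questions"] / s["segments"]
        # Arbitrary weighting: a speaker who only asks questions loses half their score.
        return (s["seconds"] + s["words"]) * (1.0 - 0.5 * question_rate)

    return max(stats, key=lambda speaker: score(stats[speaker]))
```

In a typical interview, the speaker with long answers and few questions comes out on top, which matches the intuition behind the list above.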
The decision is returned on every successful poll response:
{
  "candidateDetection": {
    "speaker": "speaker_1",
    "speakerLabel": "Speaker 2",
    "confidence": "high",
    "reason": "Speaker answers questions and describes experience",
    "wordCount": 432
  }
}
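If you want typed access to this object on the client side, a minimal sketch could look like the following. The `CandidateDetection` class and `parse_candidate_detection` function are hypothetical names; only the JSON field names come from the sample above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateDetection:
    speaker: str
    speaker_label: str
    confidence: str
    reason: str
    word_count: int


def parse_candidate_detection(body: str) -> CandidateDetection:
    """Extract the candidateDetection object from a poll response body."""
    data = json.loads(body)["candidateDetection"]
    return CandidateDetection(
        speaker=data["speaker"],
        speaker_label=data["speakerLabel"],
        confidence=data["confidence"],
        reason=data["reason"],
        word_count=data["wordCount"],
    )
```

A missing field raises `KeyError`, which is usually preferable to silently scoring against an incomplete detection result.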
Confidence levels
| Confidence | When you see it |
|---|---|
| high | One speaker clearly dominates in candidate-like behavior. |
| medium | Less clear; for example, the interview was very balanced between speakers. |
| low | Ambiguous. Consider reviewing the transcript manually. |
Scores are always produced regardless of confidence level, but low-confidence
results deserve extra human review before being used in hiring decisions.
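One way to act on this is to route low-confidence results to a manual review queue. The sketch below assumes you already have the `candidateDetection` object as a dict; the `needs_human_review` helper is a hypothetical name, and treating unknown or missing values as needing review is a conservative choice, not documented API behavior.

```python
KNOWN_LEVELS = {"high", "medium", "low"}


def needs_human_review(candidate_detection: dict) -> bool:
    """Return True when the detection result should be checked by a person."""
    confidence = candidate_detection.get("confidence")
    if confidence not in KNOWN_LEVELS:
        # Assumption: an unexpected or missing value is handled conservatively.
        return True
    return confidence == "low"
```

Whether `medium` results also go to review is a policy decision for your team; tightening the check to `confidence != "high"` is a one-line change.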
Single-speaker recordings
If the recording contains only one speaker (e.g., a pre-recorded monologue
or screening answer), that speaker is scored and speakerCount: 1 is returned.
Minimum useful duration
The pipeline needs at least 3 minutes of total audio and a reasonable
amount of candidate speech to produce reliable scores. Submissions under
3 minutes are rejected with AUDIO_TOO_SHORT.
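To avoid a round trip that ends in AUDIO_TOO_SHORT, you can check the duration before submitting. The sketch below uses Python's standard `wave` module, so it only covers uncompressed WAV files; other formats would need a different reader. The helper names are hypothetical, and the 180-second threshold is taken from the 3-minute rule above.

```python
import wave

MIN_DURATION_SECONDS = 3 * 60  # the 3-minute minimum stated above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(duration_seconds: float) -> bool:
    """True when the audio meets the minimum total duration."""
    return duration_seconds >= MIN_DURATION_SECONDS
```

This only checks total length. It cannot tell you whether the candidate speaks enough within that time, so a submission that passes this check may still produce weak scores.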
See in the API reference
The candidateDetection object is returned on every successful poll response:
- GET /api/v1/speech/analyze/{id}: see the candidateDetection field in the 200 response schema
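As a rough sketch of reading the field from that endpoint, the snippet below builds the poll URL and pulls out `candidateDetection`. The base URL is a placeholder, authentication is omitted because it is not covered on this page, and the function names are hypothetical. The `fetch` parameter is injectable so the logic can be exercised without a network call.

```python
import json
import urllib.request
from typing import Callable, Optional
from urllib.parse import quote


def build_poll_url(base_url: str, analysis_id: str) -> str:
    """Build the GET /api/v1/speech/analyze/{id} URL for an analysis."""
    return f"{base_url.rstrip('/')}/api/v1/speech/analyze/{quote(analysis_id, safe='')}"


def http_get_json(url: str) -> dict:
    """Plain GET returning parsed JSON (no auth headers; add your own)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def get_candidate_detection(
    base_url: str,
    analysis_id: str,
    fetch: Callable[[str], dict] = http_get_json,
) -> Optional[dict]:
    """Return the candidateDetection object, or None if it is absent."""
    body = fetch(build_poll_url(base_url, analysis_id))
    return body.get("candidateDetection")
```

Returning `None` when the field is absent lets the caller decide whether to poll again or treat the response as an error.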