Candidate detection

Most interview recordings contain at least two speakers: an interviewer and a candidate. The scoring pipeline must know which speaker to score — otherwise you'd be scoring your recruiter's English instead of the candidate's.

How it works

After transcription and speaker diarization, the API inspects each speaker's segments and picks the candidate using:

  • Total speaking time (candidates typically speak more)
  • Nature of the speech (answering questions vs. asking them)
  • Word count and segment distribution
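The API's actual selection logic is not documented here. Purely as an illustration, the three signals above could be combined along the following lines. The Segment shape, the "ends with a question mark" heuristic, and the weights are all hypothetical and are not the service's implementation.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized segment (hypothetical shape, for illustration only)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def pick_candidate(segments: list[Segment]) -> str:
    """Return the speaker id that looks most like the candidate."""
    if not segments:
        raise ValueError("no segments to inspect")

    stats: dict[str, dict[str, float]] = {}
    for seg in segments:
        s = stats.setdefault(
            seg.speaker, {"time": 0.0, "words": 0, "segments": 0, "questions": 0}
        )
        s["time"] += seg.end - seg.start
        s["words"] += len(seg.text.split())
        s["segments"] += 1
        # Crude stand-in for "nature of the speech": count segments ending in "?"
        if seg.text.strip().endswith("?"):
            s["questions"] += 1

    total_time = sum(s["time"] for s in stats.values()) or 1.0
    total_words = sum(s["words"] for s in stats.values()) or 1

    def score(s: dict[str, float]) -> float:
        time_share = s["time"] / total_time
        word_share = s["words"] / total_words
        answer_ratio = 1.0 - s["questions"] / s["segments"]
        # Arbitrary illustrative weights, not tuned on real data
        return 0.4 * time_share + 0.2 * word_share + 0.4 * answer_ratio

    return max(stats, key=lambda spk: score(stats[spk]))
```

A speaker who talks longer, uses more words, and rarely asks questions scores highest under this sketch.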

The decision is returned on every successful poll response:

{
  "candidateDetection": {
    "speaker": "speaker_1",
    "speakerLabel": "Speaker 2",
    "confidence": "high",
    "reason": "Speaker answers questions and describes experience",
    "wordCount": 432
  }
}
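A client might read this object into a typed structure, as in the minimal sketch below. The field names come from the sample above. The CandidateDetection class, the parse_candidate_detection helper, and the choice to return None when the key is absent are assumptions made for illustration. In the sample, speaker is speaker_1 while speakerLabel is Speaker 2, which suggests a zero-based id alongside a one-based display label, although this page does not state that explicitly.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CandidateDetection:
    speaker: str
    speaker_label: str
    confidence: str
    reason: str
    word_count: int


def parse_candidate_detection(body: str) -> Optional[CandidateDetection]:
    """Extract candidateDetection from a poll response body, if present."""
    payload = json.loads(body)
    raw = payload.get("candidateDetection")
    if raw is None:
        # Assumption: the key is absent when the job has not succeeded yet
        return None
    return CandidateDetection(
        speaker=raw["speaker"],
        speaker_label=raw["speakerLabel"],
        confidence=raw["confidence"],
        reason=raw["reason"],
        word_count=raw["wordCount"],
    )
```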

Confidence levels

Confidence  When you see it
high        One speaker clearly dominates in candidate-like behavior.
medium      Less clear; maybe the interview was very balanced.
low         Ambiguous. Consider reviewing the transcript manually.

Scores are always produced regardless of confidence level, but results with low confidence deserve extra human review before they are used in hiring decisions.
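One way to act on this guidance is to flag results for review in client code. The policy in this sketch is an assumption, not something the API enforces: it accepts high and medium, and sends low, missing, or unrecognized values to a human. A stricter workflow might also review medium results.

```python
def needs_human_review(poll_response: dict) -> bool:
    """Return True when the detection should be checked by a person."""
    detection = poll_response.get("candidateDetection") or {}
    confidence = detection.get("confidence")
    # Anything other than an explicit "high" or "medium" is treated as
    # reviewable, including a missing or unexpected value.
    return confidence not in ("high", "medium")
```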

Single-speaker recordings

If the recording contains only one speaker (e.g., a pre-recorded monologue or screening answer), that speaker is scored and speakerCount: 1 is returned.

Minimum useful duration

The pipeline needs at least 3 minutes of total audio and a reasonable amount of candidate speech to produce reliable scores. Submissions under 3 minutes are rejected with AUDIO_TOO_SHORT.
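To avoid a round trip that ends in AUDIO_TOO_SHORT, a client could check the duration locally before submitting. The sketch below assumes the recording is an uncompressed WAV file, which Python's standard wave module can read. Other formats would need a different tool, and the helper names are hypothetical.

```python
import wave

MIN_DURATION_SECONDS = 3 * 60  # the 3-minute minimum stated above


def wav_duration_seconds(source) -> float:
    """Duration of a WAV file; source is a path string or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(duration_seconds: float) -> bool:
    """True when the audio meets the documented minimum length."""
    return duration_seconds >= MIN_DURATION_SECONDS
```

This only checks total audio length. It does not measure how much of that audio is candidate speech, which the pipeline also takes into account.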

See in the API reference

The candidateDetection object is returned on every successful poll response: