Computing time-constrained WER

I am thinking of a metric for long-form ASR and segmentation. Consider the following scenario:
* The input is a long recording (either single-speaker or multi-speaker).
* References may have word-level timestamps (CTM) or segment-level timestamps (STM).
* Hypotheses may likewise be word-level or segment-level (CTM or STM).
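
To compare a segment-level transcript against a word-level one, both sides need to be on a common footing. One simple option (an assumption for illustration, not something any existing tool is known to do here) is to spread the words of a segment uniformly over the segment's time span to get approximate word timings. The `Word` type and the `segment_to_words` helper below are hypothetical names.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


def segment_to_words(transcript: str, start: float, end: float) -> List[Word]:
    """Approximate word timings by splitting [start, end] evenly across the words.

    This is a crude heuristic: real word durations are not uniform.
    """
    tokens = transcript.split()
    if not tokens:
        return []
    step = (end - start) / len(tokens)
    return [
        Word(tok, start + i * step, start + (i + 1) * step)
        for i, tok in enumerate(tokens)
    ]
```

For example, `segment_to_words("a b c d", 0.0, 2.0)` assigns each word a 0.5-second slot. Pseudo-timings like these are only as good as the segment boundaries, so a metric built on them would penalize segmentation more loosely than one using a true reference CTM.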

If the reference is STM and the hypothesis is CTM, this may correspond to computing the asclite aWER metric, but we also want to support (i) other kinds of systems that may not provide word-level timestamps, and (ii) a tighter penalty on segmentation by providing a reference CTM.
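
One possible reading of "time-constrained" is a Levenshtein alignment in which a reference word and a hypothesis word may only be paired (as a match or substitution) if their time intervals overlap. The sketch below illustrates that idea under stated assumptions: words are `(text, start, end)` tuples, `collar` is a hypothetical tolerance in seconds, and the function names are made up for this example. It is not claimed to reproduce what asclite computes.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def overlaps(a: Word, b: Word, collar: float = 0.0) -> bool:
    """True if the two intervals overlap, or are closer than `collar` seconds."""
    return a[1] < b[2] + collar and b[1] < a[2] + collar


def time_constrained_errors(ref: List[Word], hyp: List[Word], collar: float = 0.0) -> int:
    """Minimum insertions + deletions + substitutions, where the diagonal move
    (match or substitution) is only allowed for temporally overlapping words."""
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum number of edits to align ref[:i] with hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete all reference words
    for j in range(1, m + 1):
        d[0][j] = j  # insert all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(d[i - 1][j] + 1, d[i][j - 1] + 1)  # deletion, insertion
            if overlaps(ref[i - 1], hyp[j - 1], collar):
                cost = 0 if ref[i - 1][0] == hyp[j - 1][0] else 1
                best = min(best, d[i - 1][j - 1] + cost)
            d[i][j] = best
    return d[n][m]


def time_constrained_wer(ref: List[Word], hyp: List[Word], collar: float = 0.0) -> float:
    """Errors divided by the number of reference words."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    return time_constrained_errors(ref, hyp, collar) / len(ref)
```

With this definition, a hypothesis with the right words at the wrong times is counted as deletions plus insertions rather than as correct, which is the segmentation penalty discussed above. Increasing `collar` relaxes the constraint, and a very large collar reduces it to ordinary WER.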

We would also like to allow multiple possible references (e.g., references may be orthographic or normalized in some way), although I understand that this may be beyond the scope of this toolkit.

I am looking for suggestions about what would be a good metric (if one exists) for this scenario.

(cc @MartinKocour since we were having related discussions.)

Computing time-constrained WER #17

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Computing time-constrained WER #17

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions