I am thinking of a metric for long-form ASR and segmentation. Consider the following scenario:
- The input is a long recording (either single speaker or multi speaker).
- References may have word-level timestamps (CTM file) or segment-level timestamps (STM file).
- Hypotheses may likewise be word-level or segment-level (CTM or STM).
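To make the four reference/hypothesis combinations concrete, here is a minimal sketch that reads CTM and STM lines into one common record. The `Span` class and the two parser names are hypothetical and only for illustration. The field layouts follow the usual NIST conventions (CTM: file, channel, start, duration, word, optional confidence; STM: file, channel, speaker, start, end, optional label, transcript), and comment lines and alternation tags are not handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """A timed stretch of words: one word for CTM, a whole segment for STM."""
    session: str
    start: float
    end: float
    words: List[str]
    speaker: str = ""


def parse_ctm_line(line: str) -> Span:
    # CTM: <file> <channel> <start> <duration> <word> [<confidence>]
    f = line.split()
    start, duration = float(f[2]), float(f[3])
    return Span(session=f[0], start=start, end=start + duration, words=[f[4]])


def parse_stm_line(line: str) -> Span:
    # STM: <file> <channel> <speaker> <start> <end> [<label>] <transcript ...>
    f = line.split()
    words = f[5:]
    # Drop the optional label field, e.g. <o,f0,male>.
    if words and words[0].startswith("<") and words[0].endswith(">"):
        words = words[1:]
    return Span(session=f[0], start=float(f[3]), end=float(f[4]),
                words=words, speaker=f[2])


if __name__ == "__main__":
    print(parse_ctm_line("rec1 1 0.50 0.25 hello 0.98"))
    print(parse_stm_line("rec1 1 spkA 0.00 3.20 <o,f0,male> hello world"))
```

With both formats mapped to the same record, a metric only needs to know whether each span carries one word or many.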
If the reference is an STM and the hypothesis is a CTM, this may correspond to computing the asclite aWER metric. However, we also want to support (i) other kinds of systems that may not provide word-level timestamps, and (ii) a tighter penalty on segmentation errors when a reference CTM is provided.
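As one possible illustration of point (ii), the sketch below computes an edit distance in which a reference word and a hypothesis word may only be matched or substituted if their time intervals overlap, optionally widened by a collar. This is not asclite's algorithm; the function names and the `collar` parameter are assumptions made for this example.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (token, start, end)


def overlaps(a: Word, b: Word, collar: float = 0.0) -> bool:
    """True if the two intervals overlap after widening each by `collar`."""
    return a[1] < b[2] + collar and b[1] < a[2] + collar


def time_constrained_errors(ref: List[Word], hyp: List[Word],
                            collar: float = 0.0) -> int:
    """Levenshtein distance where match/substitution requires time overlap."""
    n, m = len(ref), len(hyp)
    prev = list(range(m + 1))  # row 0: only insertions
    for i in range(1, n + 1):
        cur = [i] + [0] * m    # column 0: only deletions
        for j in range(1, m + 1):
            best = min(prev[j] + 1, cur[j - 1] + 1)  # deletion, insertion
            if overlaps(ref[i - 1], hyp[j - 1], collar):
                cost = 0 if ref[i - 1][0] == hyp[j - 1][0] else 1
                best = min(best, prev[j - 1] + cost)
            cur[j] = best
        prev = cur
    return prev[m]


if __name__ == "__main__":
    ref = [("a", 0.0, 1.0), ("b", 1.0, 2.0)]
    hyp = [("a", 0.0, 1.0), ("b", 5.0, 6.0)]  # "b" is far from its reference
    errors = time_constrained_errors(ref, hyp)
    print(errors, errors / len(ref))
```

For a hypothesis that has only segment-level times, one assumption would be to give every word the span of its segment, which loosens the constraint to "the word must fall in an overlapping segment".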
Additionally, we also want to be able to include multiple possible references (e.g., references may be orthographic or normalized in some way), although I understand that this may be beyond the scope of this toolkit.
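For the multiple-reference case, one simple option (an assumption for discussion, not something the toolkit is known to provide) is to score the hypothesis against each reference variant and keep the lowest error rate. The names `edit_distance` and `multi_reference_wer` are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Plain word-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (r != h))   # match or substitution
        prev = cur
    return prev[-1]


def multi_reference_wer(refs: List[List[str]], hyp: List[str]) -> float:
    """Lowest WER over all reference variants (e.g. orthographic, normalized)."""
    return min(edit_distance(r, hyp) / max(len(r), 1) for r in refs)


if __name__ == "__main__":
    refs = [["it", "is", "ok"], ["it's", "okay"]]
    print(multi_reference_wer(refs, ["it's", "okay"]))
```

Taking the minimum per recording is the coarsest choice; picking the best variant per segment would be more lenient but needs an alignment between the reference variants.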
I am looking for suggestions about what would be a good metric (if one exists) for this scenario.
(cc @MartinKocour since we were having related discussions.)