Computing time-constrained WER #17

@desh2608

I am thinking of a metric for long-form ASR and segmentation. Consider the following scenario:

  • The input is a long recording (either single-speaker or multi-speaker).
  • References may have word-level timestamps (CTM file) or segment-level timestamps (STM file).
  • Hypotheses may likewise be word-level or segment-level (CTM or STM).

If the reference is an STM and the hypothesis is a CTM, this may correspond to computing the asclite aWER metric. However, we also want to (i) support other kinds of systems that may not provide word-level timestamps, and (ii) impose a tighter penalty on segmentation errors by providing a reference CTM.
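To make the "tighter penalty" idea concrete, here is a rough sketch of what a time-constrained error count could look like when both sides have word-level timestamps. It is only an illustration under my own assumptions, not an existing API of this toolkit or of asclite: the `(word, start, end)` tuple format, the function names, and the `collar` parameter are all hypothetical. The only change from a plain Levenshtein distance is that a match or substitution is allowed only between words whose time spans overlap.

```python
from typing import List, Tuple

# Hypothetical word format: (word, start_seconds, end_seconds), e.g. parsed from a CTM.
Word = Tuple[str, float, float]


def overlaps(ref: Word, hyp: Word, collar: float = 0.0) -> bool:
    """True if the spans overlap after widening the reference span by `collar` on each side."""
    return ref[1] - collar < hyp[2] and hyp[1] < ref[2] + collar


def time_constrained_errors(ref: List[Word], hyp: List[Word], collar: float = 0.0) -> int:
    """Levenshtein distance in which matches and substitutions require temporal overlap."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = minimum number of edits to align ref[:i] with hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # only deletions
    for j in range(1, m + 1):
        dist[0][j] = j  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Deletion or insertion is always allowed.
            best = min(dist[i - 1][j], dist[i][j - 1]) + 1
            # Match/substitution only if the two words overlap in time.
            if overlaps(ref[i - 1], hyp[j - 1], collar):
                cost = 0 if ref[i - 1][0] == hyp[j - 1][0] else 1
                best = min(best, dist[i - 1][j - 1] + cost)
            dist[i][j] = best
    return dist[n][m]


def time_constrained_wer(ref: List[Word], hyp: List[Word], collar: float = 0.0) -> float:
    """Error count normalized by the number of reference words."""
    return time_constrained_errors(ref, hyp, collar) / len(ref)


ref = [("hello", 0.0, 0.5), ("world", 0.6, 1.0)]
hyp = [("hello", 0.1, 0.4), ("world", 5.0, 5.5)]  # second word is badly misplaced in time
strict = time_constrained_wer(ref, hyp)              # misplaced word counts as del + ins
lenient = time_constrained_wer(ref, hyp, collar=5.0) # large collar forgives the offset
```

With a collar of zero, the misplaced word cannot be aligned to its reference and is counted as one deletion plus one insertion; a large collar recovers the ordinary WER. How to handle segment-level (STM) input in this scheme is left open here.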

Additionally, we want to be able to include multiple possible references (e.g., orthographic references or references normalized in some way), although I understand that this may be beyond the scope of this toolkit.

I am looking for suggestions about what would be a good metric (if one exists) for this scenario.

(cc @MartinKocour since we were having related discussions.)
