feat: Add subtitle generation with speaker and timing information #1124
Summary
This pull request adds a new feature to the inference CLI: the ability to generate a structured subtitle file (in JSON format) alongside the synthesized audio.
This provides users with precise timing, text, and speaker information for each audio segment, which is essential for applications like video production, transcription alignment, and content creation.
New Feature Details
A new command-line argument, --output_subtitle_file (short alias -j), has been added. When a filename is provided (e.g., subtitle.json), the script generates a JSON file containing an array of subtitle entries.
Each entry in the JSON file is an object with the following fields:
"text": The cleaned-up text content of the audio segment.
"speaker": The speaker tag associated with the segment (e.g., "speaker1").
"time_begin": The start time of the segment in milliseconds.
"time_end": The end time of the segment in milliseconds.
"text_begin": The starting character offset of the segment's text within the full input text.
"text_end": The ending character offset of the segment's text.
How It Works
Calculate Duration: The script precisely calculates the duration of each synthesized audio chunk.
Maintain Timeline: It maintains a cumulative timeline to ensure the "time_begin" and "time_end" values for each consecutive segment are accurate.
Track Text Offsets: It also tracks character offsets to map the generated audio segments directly and accurately back to the source text.
Write to File: The final list of subtitle data is written to the specified JSON file using UTF-8 encoding to ensure proper handling of all character types.
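The steps above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: the segment tuple shape, the `build_subtitles` helper name, and the fixed sample rate are all assumptions made for the example.

```python
import json

def build_subtitles(segments, sample_rate, out_path):
    """Build subtitle entries from synthesized audio segments.

    `segments` is assumed to be a list of
    (speaker, text, num_samples, text_begin, text_end) tuples.
    """
    entries = []
    time_cursor_ms = 0.0  # cumulative timeline across segments
    for speaker, text, num_samples, text_begin, text_end in segments:
        # Duration of this audio chunk, derived from its sample count.
        duration_ms = num_samples / sample_rate * 1000.0
        entries.append({
            "text": text,
            "speaker": speaker,
            "time_begin": int(round(time_cursor_ms)),
            "time_end": int(round(time_cursor_ms + duration_ms)),
            "text_begin": text_begin,
            "text_end": text_end,
        })
        time_cursor_ms += duration_ms

    # UTF-8 output with ensure_ascii=False so non-ASCII text is
    # written verbatim rather than escaped.
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)
    return entries

# Example: two segments synthesized at a 24 kHz sample rate.
subs = build_subtitles(
    [("speaker1", "Hello there.", 24000, 0, 12),
     ("speaker2", "Nice to meet you.", 36000, 13, 30)],
    sample_rate=24000,
    out_path="subtitle.json",
)
```

Because each segment's start time is the running sum of all previous durations, consecutive entries tile the timeline with no gaps or overlaps.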
By providing metadata that was previously unavailable, this feature significantly enhances the project's utility.