
feat: Add subtitle generation with speaker and timing information #1124


Open
wants to merge 5 commits into main

Conversation

ypwcharles

Summary

This pull request adds a new feature to the inference CLI: the ability to generate a structured subtitle file (in JSON format) alongside the synthesized audio.

This provides users with precise timing, text, and speaker information for each audio segment, which is essential for applications like video production, transcription alignment, and content creation.

New Feature Details

A new command-line argument, --output_subtitle_file (short alias -j), has been added. When a filename is provided (e.g., subtitle.json), the script generates a JSON file containing an array of subtitle entries.
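For illustration, an invocation might look like the following; the script name and the other flags are placeholders, since the PR text does not show a full command:

```bash
# Hypothetical invocation; only --output_subtitle_file / -j is from this PR.
python infer_cli.py --text input.txt --output out.wav \
    --output_subtitle_file subtitle.json
```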

Each entry in the JSON file is an object with the following fields:

"text": The cleaned-up text content of the audio segment.

"speaker": The speaker tag associated with the segment (e.g., "speaker1").

"time_begin": The start time of the segment in milliseconds.

"time_end": The end time of the segment in milliseconds.

"text_begin": The starting character offset of the segment's text within the full input text.

"text_end": The ending character offset of the segment's text.

How It Works

Calculate Duration: The script precisely calculates the duration of each synthesized audio chunk.

Maintain Timeline: It maintains a cumulative timeline to ensure the "time_begin" and "time_end" values for each consecutive segment are accurate.

Track Text Offsets: It also tracks character offsets to map the generated audio segments directly and accurately back to the source text.

Write to File: The final list of subtitle data is written to the specified JSON file using UTF-8 encoding to ensure proper handling of all character types.
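A minimal sketch of that flow, assuming each synthesized chunk arrives as a sequence of audio samples at a known sample rate; all names below are illustrative, not the PR's actual identifiers:

```python
import json

def build_subtitles(segments, sample_rate):
    """Build subtitle entries from synthesized segments.

    `segments` is assumed to be an iterable of (speaker, text, text_begin,
    samples) tuples, where `samples` is the waveform for that segment.
    """
    entries = []
    time_cursor_ms = 0  # cumulative timeline in milliseconds
    for speaker, text, text_begin, samples in segments:
        # Duration of this chunk, derived from its sample count.
        duration_ms = round(len(samples) * 1000 / sample_rate)
        entries.append({
            "text": text.strip('"'),  # drop leading/trailing quotation marks
            "speaker": speaker,
            "time_begin": time_cursor_ms,
            "time_end": time_cursor_ms + duration_ms,
            "text_begin": text_begin,
            "text_end": text_begin + len(text),
        })
        time_cursor_ms += duration_ms
    return entries

def write_subtitle_file(path, entries):
    # UTF-8 with ensure_ascii=False keeps non-ASCII text readable on disk.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)
```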

By exposing timing and speaker metadata that was previously unavailable, this feature makes the CLI's output directly usable by downstream tools.

ypwcharles and others added 5 commits July 8, 2025 18:35
This commit introduces the capability to generate a JSON-formatted subtitle file alongside the audio synthesis.

- Added the `--output_subtitle_file` argument to the inference script to specify the output subtitle file name.
- The feature is also supported through configuration files by adding the corresponding key.
- Updated the documentation and examples to cover the new feature.
Merge updates from the main branch
The previous text-splitting logic, which split on a lookahead assertion, was unreliable: if the file did not start with a voice tag, the text failed to split correctly and the entire input was synthesized with the default voice.

This commit replaces the splitting mechanism with a more robust method that splits on a capturing group. This correctly tokenizes the text on voice-tag markers, ensuring each segment is associated with the intended voice.
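For context, here is a minimal sketch of the capturing-group approach; the `[speakerN]` tag syntax and the `"main"` default are assumptions based on the PR description, not the project's confirmed marker format:

```python
import re

# Hypothetical voice-tag syntax; the project's actual markers may differ.
TAG_PATTERN = re.compile(r"(\[speaker\d+\])")

def tokenize(text):
    """Split text into (speaker, segment) pairs using a capturing group.

    re.split with a capturing group keeps the matched tags in the result,
    so text before the first tag is preserved instead of being lost.
    """
    segments = []
    speaker = "main"  # default voice for text before the first tag
    for part in TAG_PATTERN.split(text):
        if TAG_PATTERN.fullmatch(part):
            speaker = part.strip("[]")
        elif part.strip():
            segments.append((speaker, part.strip()))
    return segments

print(tokenize("Hello. [speaker1] Hi there. [speaker2] Bye."))
# [('main', 'Hello.'), ('speaker1', 'Hi there.'), ('speaker2', 'Bye.')]
```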
This commit improves the subtitle generation feature by:
1.  Adding a "speaker" field to each subtitle entry, indicating which voice (e.g., "main", "town") synthesized the text.
2.  Cleaning the subtitle text by removing leading/trailing quotation marks from the original text, providing a cleaner output.