feat: Add subtitle generation with speaker and timing information #1124
Summary
This pull request adds a new feature to the inference CLI: the ability to generate a structured subtitle file (in JSON format) alongside the synthesized audio.
This provides users with precise timing, text, and speaker information for each audio segment, which is essential for applications like video production, transcription alignment, and content creation.
New Feature Details
A new command-line argument, --output_subtitle_file (short alias -j), has been added. When a filename is provided (e.g., subtitle.json), the script generates a JSON file containing an array of subtitle entries.
Each entry in the JSON file is an object with the following fields:
"text": The cleaned-up text content of the audio segment.
"speaker": The speaker tag associated with the segment (e.g., "speaker1").
"time_begin": The start time of the segment in milliseconds.
"time_end": The end time of the segment in milliseconds.
"text_begin": The starting character offset of the segment's text within the full input text.
"text_end": The ending character offset of the segment's text.
How It Works
Calculate Duration: The script precisely calculates the duration of each synthesized audio chunk.
Maintain Timeline: It maintains a cumulative timeline to ensure the "time_begin" and "time_end" values for each consecutive segment are accurate.
Track Text Offsets: It also tracks character offsets to map the generated audio segments directly and accurately back to the source text.
Write to File: The final list of subtitle data is written to the specified JSON file using UTF-8 encoding to ensure proper handling of all character types.
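The steps above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: the segment tuple shape, the `build_subtitles` helper name, and the fixed sample rate are all assumptions made for the example.

```python
import json

def build_subtitles(segments, sample_rate, out_path):
    """Build subtitle entries from synthesized audio segments.

    `segments` is assumed to be a list of
    (speaker, text, num_samples, text_begin, text_end) tuples.
    """
    entries = []
    time_cursor_ms = 0.0  # cumulative timeline across segments
    for speaker, text, num_samples, text_begin, text_end in segments:
        # Duration of this audio chunk, derived from its sample count.
        duration_ms = num_samples / sample_rate * 1000.0
        entries.append({
            "text": text,
            "speaker": speaker,
            "time_begin": int(round(time_cursor_ms)),
            "time_end": int(round(time_cursor_ms + duration_ms)),
            "text_begin": text_begin,
            "text_end": text_end,
        })
        time_cursor_ms += duration_ms

    # UTF-8 output with ensure_ascii=False so non-ASCII text is
    # written verbatim rather than escaped.
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)
    return entries

# Example: two segments synthesized at a 24 kHz sample rate.
subs = build_subtitles(
    [("speaker1", "Hello there.", 24000, 0, 12),
     ("speaker2", "Nice to meet you.", 36000, 13, 30)],
    sample_rate=24000,
    out_path="subtitle.json",
)
```

Because each segment's start time is the running sum of all previous durations, consecutive entries tile the timeline with no gaps or overlaps.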
By providing metadata that was previously unavailable, this feature significantly enhances the project's utility.