feat: Add subtitle generation with speaker and timing information #1124

Open: wants to merge 5 commits into base: main
57 changes: 57 additions & 0 deletions .gemini/gemini.md
@@ -0,0 +1,57 @@
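## F5-TTS Project Guide

### 1. Project Overview

This project is F5-TTS, a deep-learning text-to-speech (TTS) model based on Flow Matching. It aims to generate fluent, faithful speech. The main source code lives in the `src/f5_tts` directory.

### 2. Tech Stack and Conventions

- **Language**: Python 3.10
- **Package management**: Uses `pip` and `pyproject.toml`. Install locally in editable mode with `pip install -e .`.
- **Code style and checks**:
  - **Tools**: The project uses `ruff` for code formatting, import sorting, and linting.
  - **Configuration**: Rules are defined in `ruff.toml`. The line length limit is 120 characters.
  - **Commit workflow**: The project uses `pre-commit` to run `ruff` checks automatically before each commit. Every code change must pass these checks.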

### 3. Key Commands

- **Install dependencies**:
```bash
pip install -e .
```
- **Install the pre-commit hooks**:
```bash
pre-commit install
```
- **Run the code checks manually**:
```bash
# Run all pre-commit hooks
pre-commit run --all-files

# Or run ruff on its own
ruff format .
ruff check --fix .
```
- **Run the Gradio app (inference)**:
```bash
f5-tts_infer-gradio
```
- **Run command-line inference**:
```bash
f5-tts_infer-cli -c <path_to_config.toml>
```
- **Run the Gradio app (fine-tuning)**:
```bash
f5-tts_finetune-gradio
```

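### 4. Code Modification Guidelines

- **Follow the existing style**: All code changes should strictly follow the existing code style and formatting defined by `ruff`.
- **Use `pre-commit`**: Before committing, always run `pre-commit run --all-files` to ensure code quality.
- **Command-line tools**: The project defines several command-line entry points, such as `f5-tts_infer-cli`, under `[project.scripts]` in `pyproject.toml`. Follow this pattern when adding new scripts.
- **Configuration files**: The project makes extensive use of `.toml` and `.yaml` files for configuration (for example, inference and model configuration). Mind their structure when editing them.

### 5. Commit Messages

- Commit messages should be clear and concise and should accurately describe the change. Use the existing messages in `git log` as a style reference.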
3 changes: 3 additions & 0 deletions README.md
@@ -185,6 +185,9 @@ f5-tts_infer-cli --model F5TTS_v1_Base \
 --ref_text "The content, subtitle or transcription of reference audio." \
 --gen_text "Some text you want TTS model generate for you."

+# Generate a subtitle file along with the audio
+f5-tts_infer-cli --output_subtitle_file subtitle.json --gen_text "This is a test."
+
 # Run with default setting. src/f5_tts/infer/examples/basic/basic.toml
 f5-tts_infer-cli
 # Or with your own .toml file
2 changes: 2 additions & 0 deletions src/f5_tts/infer/README.md
@@ -103,6 +103,8 @@ gen_text = "I don't really care what you call me. I've been a silent spectator,
 gen_file = ""
 remove_silence = false
 output_dir = "tests"
+# To generate a subtitle file along with the audio
+output_subtitle_file = "subtitle.json"
 ```

 You can also leverage `.toml` file to do multi-style generation, refer to `src/f5_tts/infer/examples/multi/story.toml`.
2 changes: 1 addition & 1 deletion src/f5_tts/infer/examples/multi/story.txt
@@ -1 +1 @@
-A Town Mouse and a Country Mouse were acquaintances, and the Country Mouse one day invited his friend to come and see him at his home in the fields. The Town Mouse came, and they sat down to a dinner of barleycorns and roots, the latter of which had a distinctly earthy flavour. The fare was not much to the taste of the guest, and presently he broke out with [town] "My poor dear friend, you live here no better than the ants! Now, you should just see how I fare! My larder is a regular horn of plenty. You must come and stay with me, and I promise you you shall live on the fat of the land." [main] So when he returned to town he took the Country Mouse with him, and showed him into a larder containing flour and oatmeal and figs and honey and dates. The Country Mouse had never seen anything like it, and sat down to enjoy the luxuries his friend provided: but before they had well begun, the door of the larder opened and someone came in. The two Mice scampered off and hid themselves in a narrow and exceedingly uncomfortable hole. Presently, when all was quiet, they ventured out again; but someone else came in, and off they scuttled again. This was too much for the visitor. [country] "Goodbye," [main] said he, [country] "I'm off. You live in the lap of luxury, I can see, but you are surrounded by dangers; whereas at home I can enjoy my simple dinner of roots and corn in peace."
+[main] A Town Mouse and a Country Mouse were acquaintances, and the Country Mouse one day invited his friend to come and see him at his home in the fields. The Town Mouse came, and they sat down to a dinner of barleycorns and roots, the latter of which had a distinctly earthy flavour. The fare was not much to the taste of the guest, and presently he broke out with [town] "My poor dear friend, you live here no better than the ants! Now, you should just see how I fare! My larder is a regular horn of plenty. You must come and stay with me, and I promise you you shall live on the fat of the land." [main] So when he returned to town he took the Country Mouse with him, and showed him into a larder containing flour and oatmeal and figs and honey and dates. The Country Mouse had never seen anything like it, and sat down to enjoy the luxuries his friend provided: but before they had well begun, the door of the larder opened and someone came in. The two Mice scampered off and hid themselves in a narrow and exceedingly uncomfortable hole. Presently, when all was quiet, they ventured out again; but someone else came in, and off they scuttled again. This was too much for the visitor. [country] "Goodbye," [main] said he, [country] "I'm off. You live in the lap of luxury, I can see, but you are surrounded by dangers; whereas at home I can enjoy my simple dinner of roots and corn in peace."
59 changes: 55 additions & 4 deletions src/f5_tts/infer/infer_cli.py
@@ -108,6 +108,12 @@
     type=str,
     help="The name of output file",
 )
+parser.add_argument(
+    "-j",
+    "--output_subtitle_file",
+    type=str,
+    help="The name of output subtitle file, e.g. subtitle.json",
+)
 parser.add_argument(
     "--save_chunk",
     action="store_true",
@@ -201,6 +207,7 @@
 output_file = args.output_file or config.get(
     "output_file", f"infer_cli_{datetime.now().strftime(r'%Y%m%d_%H%M%S')}.wav"
 )
+output_subtitle_file = args.output_subtitle_file or config.get("output_subtitle_file", "")

 save_chunk = args.save_chunk or config.get("save_chunk", False)
 use_legacy_text = args.no_legacy_text or config.get("no_legacy_text", False)  # no_legacy_text is a store_false arg
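
Note on precedence: the new option follows the same `args.X or config.get(...)` pattern as the neighbouring options, so a value given on the command line wins over the `.toml` value, and an empty string disables subtitle output. A minimal sketch of that behaviour, assuming a plain dict stands in for the parsed config; `resolve_subtitle_file` is a hypothetical helper for illustration, not a function in the PR:

```python
import argparse


def resolve_subtitle_file(argv, config):
    """Return the subtitle filename, preferring the CLI flag over the config value.

    `config` is a plain dict standing in for the parsed .toml file (assumption).
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("-j", "--output_subtitle_file", type=str)
    args = parser.parse_args(argv)
    # An unset flag parses to None, so `or` falls through to the config value
    # and finally to "" (falsy, which leaves subtitle output disabled).
    return args.output_subtitle_file or config.get("output_subtitle_file", "")
```

One consequence of using `or`: passing an empty string on the command line cannot override a non-empty config value, because the empty string is falsy.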
@@ -315,16 +322,30 @@ def main():
         print("ref_audio_", voices[voice]["ref_audio"], "\n\n")

     generated_audio_segments = []
-    reg1 = r"(?=\[\w+\])"
-    chunks = re.split(reg1, gen_text)
+    subtitle_data = []
+    cumulative_time_ms = 0
+    text_offset = 0
+
+    reg1 = r"(\[\w+\])"
+    parts = re.split(reg1, gen_text)
+    chunks = []
+    # The first part may not have a tag, default to [main]
+    if parts[0].strip():
+        chunks.append("[main]" + parts[0])
+    # Combine tags with their corresponding text
+    for i in range(1, len(parts), 2):
+        tag = parts[i]
+        text = parts[i + 1]
+        if text.strip():
+            chunks.append(tag + text)
+
     reg2 = r"\[(\w+)\]"
     for text in chunks:
-        if not text.strip():
-            continue
         match = re.match(reg2, text)
         if match:
             voice = match[1]
         else:
+            # This else block should ideally not be reached with the new logic
             print("No voice tag found, using main.")
             voice = "main"
         if voice not in voices:
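
Note on the new splitting logic: with a capturing group, `re.split` keeps the tags in the result, so the list alternates between text and tag and always has odd length. That is why `parts[i + 1]` is safe for every odd `i`. A standalone sketch of the same idea, assuming it is acceptable to return `(voice, text)` pairs rather than re-joined `"[tag]text"` strings as the PR does; `split_by_voice` is a hypothetical name:

```python
import re


def split_by_voice(gen_text, default_voice="main"):
    """Split tagged text into (voice, text) pairs, skipping empty chunks."""
    # The capturing group makes re.split return [text, tag, text, tag, text, ...].
    parts = re.split(r"(\[\w+\])", gen_text)
    chunks = []
    # Text before the first tag has no speaker, so it gets the default voice.
    if parts[0].strip():
        chunks.append((default_voice, parts[0].strip()))
    # Tags sit at odd indices; the text each tag governs follows it.
    for i in range(1, len(parts), 2):
        voice = parts[i][1:-1]  # drop the surrounding brackets
        text = parts[i + 1].strip()
        if text:
            chunks.append((voice, text))
    return chunks
```

For example, `split_by_voice('Hello [town] "Hi" [main] said he.')` yields three pairs, the first attributed to `main` because the leading text carries no tag.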
@@ -354,6 +375,28 @@ def main():
         )
         generated_audio_segments.append(audio_segment)

+        # Subtitle generation
+        if output_subtitle_file:
+            # Clean up text for subtitle and add speaker tag
+            clean_text = gen_text_
+            if clean_text.startswith('"') and clean_text.endswith('"'):
+                clean_text = clean_text[1:-1]
+
+            segment_duration_ms = (len(audio_segment) / final_sample_rate) * 1000
+            text_begin = text_offset
+            text_end = text_offset + len(gen_text_)  # Use original length for offset
+            subtitle_entry = {
+                "text": clean_text,
+                "speaker": voice,
+                "time_begin": cumulative_time_ms,
+                "time_end": cumulative_time_ms + segment_duration_ms,
+                "text_begin": text_begin,
+                "text_end": text_end,
+            }
+            subtitle_data.append(subtitle_entry)
+            cumulative_time_ms += segment_duration_ms
+            text_offset = text_end
+
         if save_chunk:
             if len(gen_text_) > 200:
                 gen_text_ = gen_text_[:200] + " ... "
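
Note on the timing arithmetic: each segment's duration is its sample count divided by the sample rate, converted to milliseconds, and the begin/end times are running sums of those durations. A minimal sketch of that accumulation, assuming segments are given as `(speaker, text, num_samples)` tuples and leaving out the quote stripping; `build_subtitle_entries` is a hypothetical helper, not part of the PR:

```python
def build_subtitle_entries(segments, sample_rate):
    """Build subtitle entries from (speaker, text, num_samples) tuples."""
    entries = []
    time_ms = 0.0  # running end time of the audio generated so far
    offset = 0  # running character offset over the chunk texts
    for speaker, text, num_samples in segments:
        duration_ms = num_samples / sample_rate * 1000
        entries.append(
            {
                "text": text,
                "speaker": speaker,
                "time_begin": time_ms,
                "time_end": time_ms + duration_ms,
                "text_begin": offset,
                "text_end": offset + len(text),
            }
        )
        time_ms += duration_ms
        offset += len(text)
    return entries
```

Two things follow from this scheme. The character offsets count positions in the concatenated, stripped chunk texts, not in the original `gen_text` with its tags and whitespace. And the timings describe the concatenated segments as generated, so they may drift if silence is removed from the final file afterwards.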
@@ -378,6 +421,14 @@ def main():
                 remove_silence_for_generated_wav(f.name)
             print(f.name)

+        if output_subtitle_file and subtitle_data:
+            import json
+
+            subtitle_path = Path(output_dir) / output_subtitle_file
+            with open(subtitle_path, "w", encoding="utf-8") as f:
+                json.dump(subtitle_data, f, indent=3, ensure_ascii=False)
+            print(f"Subtitle file saved to: {subtitle_path}")
+

 if __name__ == "__main__":
     main()
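
Note on consuming the output: the JSON file is a list of entries with `text`, `speaker`, `time_begin` and `time_end` (milliseconds), plus character offsets. The PR itself only writes JSON. As an illustration of how a downstream script might use it, here is a sketch that renders such entries as SubRip (`.srt`) text; both function names are hypothetical, and the `[speaker]` prefix is just one possible presentation:

```python
def ms_to_srt_time(ms):
    """Format a millisecond value as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(ms))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def subtitle_entries_to_srt(entries):
    """Render subtitle entries (as written by this PR) as SRT text."""
    blocks = []
    for index, entry in enumerate(entries, start=1):
        start = ms_to_srt_time(entry["time_begin"])
        end = ms_to_srt_time(entry["time_end"])
        # Each SRT cue is: counter, time range, text, then a blank line.
        blocks.append(f"{index}\n{start} --> {end}\n[{entry['speaker']}] {entry['text']}\n")
    return "\n".join(blocks)
```

The entries would typically come from `json.load` on the generated subtitle file.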