diff --git a/.gemini/gemini.md b/.gemini/gemini.md
new file mode 100644
index 000000000..2e3cb40d8
--- /dev/null
+++ b/.gemini/gemini.md
@@ -0,0 +1,57 @@
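+## F5-TTS Project Guide
+
+### 1. Project Overview
+
+This project is F5-TTS, a deep-learning text-to-speech (TTS) model based on Flow Matching. It aims to generate fluent, faithful speech. The main source code lives in the `src/f5_tts` directory.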
+
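+### 2. Tech Stack and Conventions
+
+- **Language**: Python 3.10
+- **Package management**: Uses `pip` and `pyproject.toml`. Install locally in editable mode with `pip install -e .`.
+- **Code style and checks**:
+    - **Tooling**: The project uses `ruff` for code formatting, import sorting, and linting.
+    - **Configuration**: Rules are defined in `ruff.toml`. The line length limit is 120 characters.
+    - **Commit workflow**: The project uses `pre-commit` to run the `ruff` checks automatically before each commit. Every code change must pass these checks.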
+
+### 3. Key Commands
+
+- **Install dependencies**:
+  ```bash
+  pip install -e .
+  ```
+- **Install the pre-commit hooks**:
+  ```bash
+  pre-commit install
+  ```
+- **Run the code checks manually**:
+  ```bash
+  # Run all pre-commit hooks
+  pre-commit run --all-files
+
+  # Or run ruff on its own
+  ruff format .
+  ruff check --fix .
+  ```
+- **Run the Gradio app (inference)**:
+  ```bash
+  f5-tts_infer-gradio
+  ```
+- **Run command-line inference**:
+  ```bash
+  f5-tts_infer-cli -c <path_to_config.toml>
+  ```
+- **Run the Gradio app (fine-tuning)**:
+  ```bash
+  f5-tts_finetune-gradio
+  ```
+
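+### 4. Code Modification Guidelines
+
+- **Follow the existing style**: All code changes must strictly follow the existing code style and formatting defined by `ruff`.
+- **Use `pre-commit`**: Before committing, always run `pre-commit run --all-files` to ensure code quality.
+- **Command-line tools**: The project defines several command-line entry points, such as `f5-tts_infer-cli`, under `[project.scripts]` in `pyproject.toml`. Follow this pattern when adding new scripts.
+- **Configuration files**: The project makes extensive use of `.toml` and `.yaml` files for configuration (for example, inference and model configuration). Take care to preserve their structure when editing them.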
+
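+### 5. Commit Messages
+
+- Commit messages should be clear and concise, and should accurately describe the change. Refer to the existing messages in `git log` for style.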
diff --git a/README.md b/README.md
index 96a226d57..7b53332b2 100644
--- a/README.md
+++ b/README.md
@@ -185,6 +185,9 @@ f5-tts_infer-cli --model F5TTS_v1_Base \
 --ref_text "The content, subtitle or transcription of reference audio." \
 --gen_text "Some text you want TTS model generate for you."
 
+# Generate a subtitle file along with the audio
+f5-tts_infer-cli --output_subtitle_file subtitle.json --gen_text "This is a test."
+
 # Run with default setting. src/f5_tts/infer/examples/basic/basic.toml
 f5-tts_infer-cli
 # Or with your own .toml file
diff --git a/src/f5_tts/infer/README.md b/src/f5_tts/infer/README.md
index 9de47aa75..37c67a2b2 100644
--- a/src/f5_tts/infer/README.md
+++ b/src/f5_tts/infer/README.md
@@ -103,6 +103,8 @@ gen_text = "I don't really care what you call me. I've been a silent spectator,
 gen_file = ""
 remove_silence = false
 output_dir = "tests"
+# To generate a subtitle file along with the audio
+output_subtitle_file = "subtitle.json"
 ```
 
 You can also leverage `.toml` file to do multi-style generation, refer to `src/f5_tts/infer/examples/multi/story.toml`.
diff --git a/src/f5_tts/infer/examples/multi/story.txt b/src/f5_tts/infer/examples/multi/story.txt
index 844e521f4..691acea02 100644
--- a/src/f5_tts/infer/examples/multi/story.txt
+++ b/src/f5_tts/infer/examples/multi/story.txt
@@ -1 +1 @@
-A Town Mouse and a Country Mouse were acquaintances, and the Country Mouse one day invited his friend to come and see him at his home in the fields. The Town Mouse came, and they sat down to a dinner of barleycorns and roots, the latter of which had a distinctly earthy flavour. The fare was not much to the taste of the guest, and presently he broke out with [town] "My poor dear friend, you live here no better than the ants! Now, you should just see how I fare! My larder is a regular horn of plenty. You must come and stay with me, and I promise you you shall live on the fat of the land." [main] So when he returned to town he took the Country Mouse with him, and showed him into a larder containing flour and oatmeal and figs and honey and dates. The Country Mouse had never seen anything like it, and sat down to enjoy the luxuries his friend provided: but before they had well begun, the door of the larder opened and someone came in. The two Mice scampered off and hid themselves in a narrow and exceedingly uncomfortable hole. Presently, when all was quiet, they ventured out again; but someone else came in, and off they scuttled again. This was too much for the visitor. [country] "Goodbye," [main] said he, [country] "I'm off. You live in the lap of luxury, I can see, but you are surrounded by dangers; whereas at home I can enjoy my simple dinner of roots and corn in peace."
\ No newline at end of file
+[main] A Town Mouse and a Country Mouse were acquaintances, and the Country Mouse one day invited his friend to come and see him at his home in the fields. The Town Mouse came, and they sat down to a dinner of barleycorns and roots, the latter of which had a distinctly earthy flavour. The fare was not much to the taste of the guest, and presently he broke out with [town] "My poor dear friend, you live here no better than the ants! Now, you should just see how I fare! My larder is a regular horn of plenty. You must come and stay with me, and I promise you you shall live on the fat of the land." [main] So when he returned to town he took the Country Mouse with him, and showed him into a larder containing flour and oatmeal and figs and honey and dates. The Country Mouse had never seen anything like it, and sat down to enjoy the luxuries his friend provided: but before they had well begun, the door of the larder opened and someone came in. The two Mice scampered off and hid themselves in a narrow and exceedingly uncomfortable hole. Presently, when all was quiet, they ventured out again; but someone else came in, and off they scuttled again. This was too much for the visitor. [country] "Goodbye," [main] said he, [country] "I'm off. You live in the lap of luxury, I can see, but you are surrounded by dangers; whereas at home I can enjoy my simple dinner of roots and corn in peace."
\ No newline at end of file
diff --git a/src/f5_tts/infer/infer_cli.py b/src/f5_tts/infer/infer_cli.py
index 7d511706c..3e6114f16 100644
--- a/src/f5_tts/infer/infer_cli.py
+++ b/src/f5_tts/infer/infer_cli.py
@@ -108,6 +108,12 @@
     type=str,
     help="The name of output file",
 )
+parser.add_argument(
+    "-j",
+    "--output_subtitle_file",
+    type=str,
+    help="The name of the output subtitle file (JSON, written to the output directory), e.g. subtitle.json",
+)
 parser.add_argument(
     "--save_chunk",
     action="store_true",
@@ -201,6 +207,7 @@
 output_file = args.output_file or config.get(
     "output_file", f"infer_cli_{datetime.now().strftime(r'%Y%m%d_%H%M%S')}.wav"
 )
+output_subtitle_file = args.output_subtitle_file or config.get("output_subtitle_file", "")
 
 save_chunk = args.save_chunk or config.get("save_chunk", False)
 use_legacy_text = args.no_legacy_text or config.get("no_legacy_text", False)  # no_legacy_text is a store_false arg
@@ -315,16 +322,30 @@ def main():
         print("ref_audio_", voices[voice]["ref_audio"], "\n\n")
 
     generated_audio_segments = []
-    reg1 = r"(?=\[\w+\])"
-    chunks = re.split(reg1, gen_text)
+    subtitle_data = []
+    cumulative_time_ms = 0
+    text_offset = 0
+
+    reg1 = r"(\[\w+\])"
+    parts = re.split(reg1, gen_text)
+    chunks = []
+    # The first part may not have a tag, default to [main]
+    if parts[0].strip():
+        chunks.append("[main]" + parts[0])
+    # Combine tags with their corresponding text
+    for i in range(1, len(parts), 2):
+        tag = parts[i]
+        text = parts[i + 1]
+        if text.strip():
+            chunks.append(tag + text)
+
     reg2 = r"\[(\w+)\]"
     for text in chunks:
-        if not text.strip():
-            continue
         match = re.match(reg2, text)
         if match:
             voice = match[1]
         else:
+            # Defensive fallback: every chunk built above starts with a tag, so this should not be reached
             print("No voice tag found, using main.")
             voice = "main"
         if voice not in voices:
@@ -354,6 +375,28 @@ def main():
         )
         generated_audio_segments.append(audio_segment)
 
+        # Subtitle generation
+        if output_subtitle_file:
+            # Strip one pair of surrounding double quotes from the subtitle text
+            clean_text = gen_text_
+            if clean_text.startswith('"') and clean_text.endswith('"'):
+                clean_text = clean_text[1:-1]
+
+            segment_duration_ms = (len(audio_segment) / final_sample_rate) * 1000
+            text_begin = text_offset
+            text_end = text_offset + len(gen_text_)  # Length before quote stripping
+            subtitle_entry = {
+                "text": clean_text,
+                "speaker": voice,
+                "time_begin": cumulative_time_ms,
+                "time_end": cumulative_time_ms + segment_duration_ms,
+                "text_begin": text_begin,
+                "text_end": text_end,
+            }
+            subtitle_data.append(subtitle_entry)
+            cumulative_time_ms += segment_duration_ms
+            text_offset = text_end
+
         if save_chunk:
             if len(gen_text_) > 200:
                 gen_text_ = gen_text_[:200] + " ... "
@@ -378,6 +421,14 @@ def main():
                 remove_silence_for_generated_wav(f.name)
             print(f.name)
 
+    if output_subtitle_file and subtitle_data:
+        import json
+
+        subtitle_path = Path(output_dir) / output_subtitle_file
+        with open(subtitle_path, "w", encoding="utf-8") as f:
+            json.dump(subtitle_data, f, indent=3, ensure_ascii=False)
+        print(f"Subtitle file saved to: {subtitle_path}")
+
 
 if __name__ == "__main__":
     main()
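
Review note: the tag splitting and subtitle bookkeeping added to `main()` above can be tried in isolation with the sketch below. It is a minimal illustration, not the code in `infer_cli.py`: the function names `split_tagged_text` and `build_subtitles` are hypothetical, the segment durations are made-up numbers standing in for `len(audio_segment) / final_sample_rate * 1000`, and each segment's text is assumed to be whitespace-stripped before its length is counted.

```python
import json
import re


def split_tagged_text(gen_text, default_voice="main"):
    """Split text on [voice] tags into (voice, text) pairs (hypothetical helper)."""
    # The capturing group makes re.split keep the tags in the result:
    # [text0, tag1, text1, tag2, text2, ...], always an odd number of items.
    parts = re.split(r"(\[\w+\])", gen_text)
    pairs = []
    # Untagged leading text falls back to the default voice.
    if parts[0].strip():
        pairs.append((default_voice, parts[0].strip()))
    for i in range(1, len(parts), 2):
        voice = parts[i][1:-1]  # drop the square brackets
        text = parts[i + 1].strip()
        if text:
            pairs.append((voice, text))
    return pairs


def build_subtitles(pairs, durations_ms):
    """Accumulate time and character offsets per segment (hypothetical helper)."""
    entries = []
    time_ms = 0.0
    offset = 0
    for (voice, text), duration in zip(pairs, durations_ms):
        # Strip one pair of surrounding double quotes for display only.
        clean = text[1:-1] if text.startswith('"') and text.endswith('"') else text
        entries.append(
            {
                "text": clean,
                "speaker": voice,
                "time_begin": time_ms,
                "time_end": time_ms + duration,
                "text_begin": offset,
                "text_end": offset + len(text),  # length before quote stripping
            }
        )
        time_ms += duration
        offset += len(text)
    return entries


pairs = split_tagged_text('Hello there. [town] "Hi!" [main] said he.')
# Durations are invented for the example; the patch derives them from the audio.
subtitles = build_subtitles(pairs, [1000.0, 500.0, 750.0])
print(json.dumps(subtitles, indent=3, ensure_ascii=False))
```

Note that `text_begin` and `text_end` count characters across the concatenated segment texts, so under these assumptions they do not index into the original tagged `gen_text`.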