MMDiT streams difference

### Checks

- [x] This template is only for research question, not usage problems, feature requests or bug reports.
- [x] I have thoroughly reviewed the project documentation and read the related paper(s).
- [x] I have searched for existing issues, including closed ones, no similar questions.
- [x] I am using English to submit this issue to facilitate community communication.

### Question details

According to your description of MMDiT [here](https://github.com/SWivid/F5-TTS/blob/main/src/f5_tts/model/backbones/README.md), the implementation follows the SD3/Flux convention, where the left stream handles text embeddings and the right stream handles noise. However, in the MMDiT block code at lines [608-610](https://github.com/SWivid/F5-TTS/blob/main/src/f5_tts/model/modules.py#L608-L610), the concatenation appears to place the noise stream's attention tensors (`query`, `key`, `value`) on the left and the context/text tensors (`c_query`, `c_key`, `c_value`) on the right:
```python
query = torch.cat([query, c_query], dim=2)
key = torch.cat([key, c_key], dim=2)
value = torch.cat([value, c_value], dim=2)
```
This seems to be the opposite of what's described and differs from the SD3 implementation. Is this intentional behavior specific to F5-TTS?
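
To make the question concrete, here is a minimal sketch of the two concatenation orders. It is not the F5-TTS code: it uses NumPy instead of PyTorch to stay dependency-light, and the `joint_attention` helper, the tensor shapes, and the sequence lengths are all my own assumptions. It also assumes that no mask or position-dependent operation is applied after the concatenation. Under those assumptions, both orders should give the same per-stream outputs as long as the split after attention matches the concatenation order, which is why I am asking whether the order here is only a naming convention.

```python
import numpy as np


def joint_attention(q, k, v):
    """Plain softmax attention on tensors shaped (batch, heads, seq, head_dim)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
batch, heads, head_dim = 1, 2, 8
n_noise, n_text = 5, 3  # hypothetical sequence lengths

# Noise-stream projections (query/key/value) and text-stream projections (c_*).
query, key, value = (
    rng.standard_normal((batch, heads, n_noise, head_dim)) for _ in range(3)
)
c_query, c_key, c_value = (
    rng.standard_normal((batch, heads, n_text, head_dim)) for _ in range(3)
)

# Order A (as in the snippet above): noise first, text second.
out_a = joint_attention(
    np.concatenate([query, c_query], axis=2),
    np.concatenate([key, c_key], axis=2),
    np.concatenate([value, c_value], axis=2),
)
x_a, c_a = out_a[:, :, :n_noise], out_a[:, :, n_noise:]

# Order B (text first, noise second): the split must be mirrored.
out_b = joint_attention(
    np.concatenate([c_query, query], axis=2),
    np.concatenate([c_key, key], axis=2),
    np.concatenate([c_value, value], axis=2),
)
c_b, x_b = out_b[:, :, :n_text], out_b[:, :, n_text:]

same_noise = np.allclose(x_a, x_b)
same_text = np.allclose(c_a, c_b)
print(same_noise, same_text)
```

If the real blocks apply a mask or positional encoding over the concatenated sequence, this equivalence may not hold, and the order would then matter.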

MMDiT streams difference #1107

Checks

Question details

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

MMDiT streams difference #1107

Description

Checks

Question details

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions