Closed
Description
Checks
- This template is only for research questions, not usage problems, feature requests, or bug reports.
- I have thoroughly reviewed the project documentation and read the related paper(s).
- I have searched for existing issues, including closed ones, and found no similar questions.
- I am using English to submit this issue to facilitate community communication.
Question details
According to your description of MMDiT here, the implementation follows the SD3/Flux convention, in which the left stream handles text embeddings and the right stream handles noise. However, in the MMDiT block code at lines 609-610, the concatenation appears to place the noise-stream tensors (query, key, value) on the left and the context/text tensors (c_query, c_key, c_value) on the right:
query = torch.cat([query, c_query], dim=2)
key = torch.cat([key, c_key], dim=2)
value = torch.cat([value, c_value], dim=2)
This seems to be the opposite of what's described and differs from the SD3 implementation. Is this intentional behavior specific to F5-TTS?
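To make the question concrete, here is a minimal toy sketch (my own example, not the F5-TTS or SD3 code) that checks whether concatenation order changes the result of plain joint attention. It assumes unmasked scaled dot-product attention, tensors shaped (batch, heads, seq, head_dim), positional information already applied before the concatenation, and an output split that uses the same order as the concatenation. The sizes and the `joint_attention` helper are hypothetical.

```python
import torch


def joint_attention(q, k, v):
    # Plain unmasked scaled dot-product attention over (batch, heads, seq, head_dim).
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
b, h, d = 1, 2, 8          # hypothetical batch size, head count, head dim
n_noise, n_text = 5, 3     # hypothetical sequence lengths


def rand(n):
    # float64 keeps the comparison below insensitive to summation order.
    return torch.randn(b, h, n, d, dtype=torch.float64)


query, key, value = rand(n_noise), rand(n_noise), rand(n_noise)
c_query, c_key, c_value = rand(n_text), rand(n_text), rand(n_text)

# Order A: noise first, text second (the order in the snippet quoted above).
out_a = joint_attention(
    torch.cat([query, c_query], dim=2),
    torch.cat([key, c_key], dim=2),
    torch.cat([value, c_value], dim=2),
)
noise_a, text_a = out_a[:, :, :n_noise], out_a[:, :, n_noise:]

# Order B: text first, noise second.
out_b = joint_attention(
    torch.cat([c_query, query], dim=2),
    torch.cat([c_key, key], dim=2),
    torch.cat([c_value, value], dim=2),
)
text_b, noise_b = out_b[:, :, :n_text], out_b[:, :, n_text:]

# Compare the per-stream outputs of the two orderings.
same_noise = torch.allclose(noise_a, noise_b)
same_text = torch.allclose(text_a, text_b)
print(same_noise, same_text)
```

If both comparisons hold under these assumptions, the left/right order would be a bookkeeping choice rather than a functional difference. That would no longer be the case if an attention mask or a position-dependent step were applied after the concatenation, so I would still like to confirm how the actual code handles this.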