MMDiT streams difference #1107

Closed
@Tera2Space

Checks

  • This template is only for research question, not usage problems, feature requests or bug reports.
  • I have thoroughly reviewed the project documentation and read the related paper(s).
  • I have searched for existing issues, including closed ones, no similar questions.
  • I am using English to submit this issue to facilitate community communication.

Question details

According to your description of MMDiT here, the implementation follows the SD3/Flux convention, where the left stream handles text embeddings and the right stream handles noise. However, in the MMDiT block code at lines 609-610, the concatenation appears to place the noise-stream attention tensors (query, key, value) on the left and the context/text tensors (c_query, c_key, c_value) on the right.

        query = torch.cat([query, c_query], dim=2)
        key = torch.cat([key, c_key], dim=2)
        value = torch.cat([value, c_value], dim=2)

This seems to be the opposite of what's described and differs from the SD3 implementation. Is this intentional behavior specific to F5-TTS?
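
To make the question concrete, here is a minimal sketch that compares the two concatenation orders. It is not the F5-TTS code: `joint_attention` and `context_first` are hypothetical names, the tensors are random, and the sketch assumes plain unmasked scaled dot-product attention with shape (batch, heads, seq, head_dim) and no position-dependent operation applied after the concatenation. Under those assumptions, the per-token outputs should match for either order, as long as the result is split back in the same order it was concatenated.

```python
import math
import torch


def joint_attention(x_qkv, c_qkv, context_first):
    """Unmasked joint attention over two streams; returns (x_out, c_out).

    x_qkv / c_qkv are (query, key, value) tuples for the noise stream and the
    context/text stream. `context_first` is a hypothetical switch that selects
    the concatenation order along the sequence dimension (dim=2).
    """
    n_x = x_qkv[0].shape[2]
    n_c = c_qkv[0].shape[2]
    pairs = zip(c_qkv, x_qkv) if context_first else zip(x_qkv, c_qkv)
    # Concatenate queries, keys and values of both streams along the sequence axis.
    q, k, v = (torch.cat(list(p), dim=2) for p in pairs)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    out = torch.softmax(scores, dim=-1) @ v
    # Split the joint output back in the same order it was concatenated.
    if context_first:
        c_out, x_out = out[:, :, :n_c], out[:, :, n_c:]
    else:
        x_out, c_out = out[:, :, :n_x], out[:, :, n_x:]
    return x_out, c_out


torch.manual_seed(0)
b, h, n_x, n_c, d = 2, 4, 7, 5, 8  # arbitrary illustrative sizes
x_qkv = tuple(torch.randn(b, h, n_x, d, dtype=torch.float64) for _ in range(3))
c_qkv = tuple(torch.randn(b, h, n_c, d, dtype=torch.float64) for _ in range(3))

x_a, c_a = joint_attention(x_qkv, c_qkv, context_first=False)  # [noise, text]
x_b, c_b = joint_attention(x_qkv, c_qkv, context_first=True)   # [text, noise]
same = torch.allclose(x_a, x_b) and torch.allclose(c_a, c_b)
print(same)
```

If an attention mask or a position-dependent step is applied to the concatenated sequence, the order does matter and the mask or positions would have to be permuted to match, so this sketch does not by itself settle whether the ordering in the repository is intentional.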
