Open
Hi,
The reconstruction part of TokLIP (codebook and decoder) was not trained, so why do VQGAN and TokLIP have different FID scores in Table 4? If it had been trained, what would the rFID score be? Also, why are generative capabilities not included in the multimodal model? Could you explain the reasoning behind this choice?
