Description
I noticed that the front/back normal maps are fed into the encoder together with the image to generate the tri-plane features. Why is this done? Does it improve the results?
Reading the code, I found that after the tri-plane feature maps are obtained, they are concatenated with the normal features.
Would it be fine to instead feed only the image through ViTPose's pretrained ViT encoder to get image features, pass those through the three decoders to get the tri-plane features, and then concatenate them with the normal features?
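To make the question concrete, here is a minimal sketch of the variant being proposed: only the image goes through the encoder and the three plane decoders, and the normal-map features are concatenated afterward. All module names, channel sizes, and the simple conv layers are hypothetical stand-ins (the real ViTPose ViT encoder and the repository's decoders are not reproduced here); only the data flow matters.

```python
import torch
import torch.nn as nn


class TriplaneSketch(nn.Module):
    """Hypothetical sketch: image-only encoder, three plane decoders,
    then concatenation with front/back normal-map features."""

    def __init__(self, feat_dim=32, normal_dim=16):
        super().__init__()
        # Stand-in for the pretrained ViT encoder (ViTPose in the question).
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)
        # One decoder per plane (e.g. xy, xz, yz).
        self.decoders = nn.ModuleList(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1) for _ in range(3)
        )
        # Encoder for the front + back normal maps (3 + 3 = 6 channels).
        self.normal_encoder = nn.Conv2d(6, normal_dim, 3, padding=1)

    def forward(self, image, normals):
        img_feat = self.encoder(image)                    # (B, F, H, W)
        planes = [dec(img_feat) for dec in self.decoders]  # 3 x (B, F, H, W)
        normal_feat = self.normal_encoder(normals)         # (B, N, H, W)
        # Concatenate normal features onto each plane feature map.
        return [torch.cat([p, normal_feat], dim=1) for p in planes]


model = TriplaneSketch()
img = torch.randn(1, 3, 64, 64)
normals = torch.randn(1, 6, 64, 64)  # stacked front + back normal maps
planes = model(img, normals)
print(len(planes), planes[0].shape)
```

The question, then, is whether dropping the normals from the encoder input (keeping them only in the late concatenation, as above) loses anything compared to also feeding them into the encoder alongside the image.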