Description
I noticed that the front/back normal maps are fed into the encoder together with the image to generate the tri-plane features. Why is this done? Does it improve the results?
Reading the code, I found that after the tri-plane feature maps are obtained, they are concatenated with the normal features.
Would it be fine to instead feed only the image through ViTPose's pretrained ViT encoder to get image features, pass those through the three decoders to get the tri-plane features, and then concatenate them with the normal features?
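To make the question concrete, here is a minimal sketch of the variant being proposed: only the image goes through the encoder and the three plane decoders, and the normal-map features are concatenated afterward. All module names, channel sizes, and the simple conv layers are hypothetical stand-ins (the real ViTPose ViT encoder and the repository's decoders are not reproduced here); only the data flow matters.

```python
import torch
import torch.nn as nn


class TriplaneSketch(nn.Module):
    """Hypothetical sketch: image-only encoder, three plane decoders,
    then concatenation with front/back normal-map features."""

    def __init__(self, feat_dim=32, normal_dim=16):
        super().__init__()
        # Stand-in for the pretrained ViT encoder (ViTPose in the question).
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)
        # One decoder per plane (e.g. xy, xz, yz).
        self.decoders = nn.ModuleList(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1) for _ in range(3)
        )
        # Encoder for the front + back normal maps (3 + 3 = 6 channels).
        self.normal_encoder = nn.Conv2d(6, normal_dim, 3, padding=1)

    def forward(self, image, normals):
        img_feat = self.encoder(image)                    # (B, F, H, W)
        planes = [dec(img_feat) for dec in self.decoders]  # 3 x (B, F, H, W)
        normal_feat = self.normal_encoder(normals)         # (B, N, H, W)
        # Concatenate normal features onto each plane feature map.
        return [torch.cat([p, normal_feat], dim=1) for p in planes]


model = TriplaneSketch()
img = torch.randn(1, 3, 64, 64)
normals = torch.randn(1, 6, 64, 64)  # stacked front + back normal maps
planes = model(img, normals)
print(len(planes), planes[0].shape)
```

The question, then, is whether dropping the normals from the encoder input (keeping them only in the late concatenation, as above) loses anything compared to also feeding them into the encoder alongside the image.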