I'm baffled by "post-normalized vision transformers don't converge". Thoughts? #745
Unanswered
alexander-soare asked this question in General
Replies: 1 comment · 3 replies
-
I think I'm reading this for the second time now while looking at the CaiT paper. Section 2:
The ViT paper has this line alone:
"Does not converge" is very significant, much more so than "performs slightly worse" would be. If the whole thesis of the vision transformer is this dependent on the placement of the norm layer, it's a BIG deal and needs to be discussed, right?
Moreover, I wish I could ask the ViT authors how they figured this out. So it fails to converge the first time they try it (because, naturally, they follow Vaswani et al.). Then I'd love to know what process led them to figure out that going pre-norm fixes it.
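For anyone following along, here's a minimal sketch of the two block orderings in question (illustrative plain PyTorch, not timm's or the papers' actual code; the dims, head counts, and class names are made up). Post-norm, the Vaswani-style ordering, applies LayerNorm after each residual addition, so the norm sits on the main path; pre-norm, the ViT-style ordering, applies it only at the entry to each sublayer branch, leaving the residual path a clean identity:

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Vaswani-style ordering: sublayer -> residual add -> LayerNorm.
    The norm sits on the main residual path, rescaling it at every block."""
    def __init__(self, dim, heads=6, mlp_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.mlp(x))
        return x

class PreNormBlock(nn.Module):
    """ViT-style ordering: LayerNorm -> sublayer -> residual add.
    The residual path is an untouched identity from input to output."""
    def __init__(self, dim, heads=6, mlp_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```

The only difference is where `norm1`/`norm2` sit relative to the residual add; that one-line reordering is the entire pre-norm vs. post-norm distinction the ViT and CaiT papers are referring to.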
-
@alexander-soare The original NFNet ("Characterizing Signal Propagation") paper might have some insight... perhaps analysing the signal prop would provide clues...
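To make the "analyse the signal prop" suggestion concrete, here's a rough sketch in the spirit of that paper's signal propagation plots: push random data through a freshly initialized stack of each block type from the sketch above and track how the activation variance evolves with depth. This is illustrative code, not the paper's actual SPP tooling, and the depth/width numbers are made up:

```python
import torch

def signal_prop(block_cls, depth=24, dim=384, tokens=197, batch=8):
    """Variance of the residual stream after each block of a freshly initialized stack."""
    torch.manual_seed(0)
    blocks = torch.nn.ModuleList(block_cls(dim) for _ in range(depth)).eval()
    x = torch.randn(batch, tokens, dim)
    variances = []
    with torch.no_grad():
        for blk in blocks:
            x = blk(x)
            variances.append(x.var().item())
    return variances

# Post-norm re-normalizes the main path after every residual add, so its output
# variance is pinned near 1 at every depth; pre-norm leaves the residual stream
# free to grow with depth instead.
for cls in (PostNormBlock, PreNormBlock):
    print(cls.__name__, [round(v, 1) for v in signal_prop(cls)[::6]])
```

Note that the forward statistics alone won't show a failure (post-norm looks perfectly well-behaved by construction); the more telling comparison is probably the per-block gradient norms under the same setup, which the same loop can log by enabling grad and backpropagating a dummy loss.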