I'm baffled by "post-normalized vision transformers don't converge". Thoughts? #745
Unanswered
alexander-soare asked this question in General
Replies: 1 comment · 3 replies
-
I think I'm reading this for the second time now while looking at the CaiT paper. Section 2:
The ViT paper has this line alone:
"Does not converge" is very significant, much more so than "performs slightly worse" would be. If the whole thesis of the vision transformer is this dependent on the placement of the norm layer, it's a BIG deal and needs to be discussed, right?
Moreover, I wish I could ask the ViT authors how they figured this out. So it fails to converge the first time they try it (because, naturally, they follow Vaswani et al.). Then I'd love to know what process led them to figure out that going pre-norm fixes it.
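For anyone following along, here's a minimal sketch of the two block orderings in question (illustrative plain PyTorch, not timm's or the papers' actual code; the dims, head counts, and class names are made up). Post-norm, the Vaswani-style ordering, applies LayerNorm after each residual addition, so the norm sits on the main path; pre-norm, the ViT-style ordering, applies it only at the entry to each sublayer branch, leaving the residual path a clean identity:

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Vaswani-style ordering: sublayer -> residual add -> LayerNorm.
    The norm sits on the main residual path, rescaling it at every block."""
    def __init__(self, dim, heads=6, mlp_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.mlp(x))
        return x

class PreNormBlock(nn.Module):
    """ViT-style ordering: LayerNorm -> sublayer -> residual add.
    The residual path is an untouched identity from input to output."""
    def __init__(self, dim, heads=6, mlp_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```

The only difference is where `norm1`/`norm2` sit relative to the residual add; that one-line reordering is the entire pre-norm vs. post-norm distinction the ViT and CaiT papers are referring to.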
-
@alexander-soare The original NFNet ("Characterizing Signal Propagation") paper might have some insight... perhaps analysing the signal prop would provide clues...
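To make the "analyse the signal prop" suggestion concrete, here's a rough sketch in the spirit of that paper's signal propagation plots: push random data through a freshly initialized stack of each block type from the sketch above and track how the activation variance evolves with depth. This is illustrative code, not the paper's actual SPP tooling, and the depth/width numbers are made up:

```python
import torch

def signal_prop(block_cls, depth=24, dim=384, tokens=197, batch=8):
    """Variance of the residual stream after each block of a freshly initialized stack."""
    torch.manual_seed(0)
    blocks = torch.nn.ModuleList(block_cls(dim) for _ in range(depth)).eval()
    x = torch.randn(batch, tokens, dim)
    variances = []
    with torch.no_grad():
        for blk in blocks:
            x = blk(x)
            variances.append(x.var().item())
    return variances

# Post-norm re-normalizes the main path after every residual add, so its output
# variance is pinned near 1 at every depth; pre-norm leaves the residual stream
# free to grow with depth instead.
for cls in (PostNormBlock, PreNormBlock):
    print(cls.__name__, [round(v, 1) for v in signal_prop(cls)[::6]])
```

Note that the forward statistics alone won't show a failure (post-norm looks perfectly well-behaved by construction); the more telling comparison is probably the per-block gradient norms under the same setup, which the same loop can log by enabling grad and backpropagating a dummy loss.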