Scratching Visual Transformer's Back
with Uniform Attention

NAVER AI Lab, POSTECH, University of Tübingen
International Conference on Computer Vision (ICCV) 2023

Dense attention is hard to learn through softmax. Injecting dense attention by hand splits the responsibility of interactions: the burden on self-attention is reduced, and self-attention becomes more likely to learn the sparse interactions that softmax favors.

Abstract

The favorable performance of Vision Transformers (ViTs) is often attributed to the multi-head self-attention (MSA), which enables global interactions at each layer of a ViT model. Previous works mainly attribute the effectiveness of MSA to this long-range dependency.

In this work, we study the role of MSA along a different axis: density. Our preliminary analyses suggest that the spatial interactions of learned attention maps are closer to dense interactions than to sparse ones. This is a curious phenomenon, because dense attention maps are harder for the model to learn due to softmax. We interpret this behavior, which goes against the preference of softmax, as a strong demand of ViT models for dense interactions. We thus manually insert dense uniform attention into each layer of the ViT models to supply the much-needed dense interactions. We call this method Context Broadcasting (CB).

Our study demonstrates that CB takes over the role of dense attention and thereby reduces the degree of density in the original attention maps, letting MSA comply with the preference of softmax. We also show that, at the negligible cost of CB (one line in your model code and no additional parameters), both the capacity and generalizability of ViT models increase.

Motivation

  • A majority of the attention maps in ViTs have high entropy values, i.e., they are close to dense interactions (see the entropy sketch after this list)
  • Gradients are steeper for MSA layers with denser attention maps
  • Dense attention maps are hard to learn, yet vital to ViTs
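
As a rough illustration of the density measure behind the first point, the sketch below computes the row-wise entropy of attention maps in PyTorch; values near the maximum log N indicate near-uniform, i.e., dense, interactions. The shapes and the ViT-S/16 numbers are our own illustrative choices, not the paper's evaluation code.

    import torch

    def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
        """Row-wise entropy of softmax attention maps.

        attn: (batch, heads, queries, keys), each row sums to 1.
        Returns entropy per query, shape (batch, heads, queries).
        The maximum value is log(num_keys), attained only by uniform attention.
        """
        eps = 1e-12                                   # guard against log(0) for sparse rows
        return -(attn * (attn + eps).log()).sum(dim=-1)

    # Example: uniform attention over 197 tokens (ViT-S/16: 196 patches + CLS).
    N = 197
    uniform = torch.full((1, 6, N, N), 1.0 / N)       # 6 heads in ViT-S
    print(attention_entropy(uniform).mean().item())   # ~5.28
    print(torch.log(torch.tensor(float(N))).item())   # ~5.28, the entropy maximum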

Method

We decide to inject uniform attention because

    (1) uniform attention is the densest attention, yet it is unstable from the gradient point of view (hard for softmax to learn),

    (2) it can, however, be supplied by hand with ease, and

    (3) it requires no additional parameters and only a small computational cost.

We do this by broadcasting the context to every token with the CB module.
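
The abstract advertises CB as one line of code with no additional parameters. The following is a minimal sketch of our reading of that description: add the token mean, which is exactly what uniform attention outputs, back to every token. The module name, the placement inside the block, and the absence of any extra scaling are our assumptions; the official implementation may differ.

    import torch
    import torch.nn as nn

    class ContextBroadcasting(nn.Module):
        """Parameter-free dense interaction: add the token mean to every token.

        The mean over tokens is exactly what uniform attention would output,
        so this hands the dense part of the interaction to a fixed operation
        instead of asking softmax inside MSA to learn it.
        """
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, num_tokens, dim)
            return x + x.mean(dim=1, keepdim=True)    # the advertised one line

    # Assumed placement inside a simplified ViT block (the official code may
    # insert CB elsewhere, e.g., inside the MLP, or apply an extra scaling):
    #   x = x + attn(norm1(x))
    #   x = cb(x)                 # inject uniform attention here
    #   x = x + mlp(norm2(x))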

Characteristics

  • Inserting the CB module significantly lowers the entropy values of the learned attention maps
  • Injecting dense global interactions into ViT does not hurt the range of interactions
  • The upper layers prefer dense interactions more than the lower layers (see the retrofit sketch after this list)
  • CB is more effective with a small number of heads than with a large number
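
Since dense interactions matter most in the upper layers, a quick way to try CB on an off-the-shelf ViT is to append the token mean after the later blocks only. The sketch below uses timm and forward hooks purely for illustration; the layer split and the post-block placement are our assumptions, not the paper's recipe.

    import torch
    import timm

    model = timm.create_model("vit_small_patch16_224", pretrained=False)

    def add_mean_token(module, inputs, output):
        # Block output: (batch, tokens, dim). Adding the token mean is
        # equivalent to one step of uniform attention over the tokens.
        return output + output.mean(dim=1, keepdim=True)

    # Illustrative choice: apply CB only to the upper half of the blocks.
    for block in list(model.blocks)[len(model.blocks) // 2:]:
        block.register_forward_hook(add_mean_token)

    x = torch.randn(1, 3, 224, 224)
    print(model(x).shape)   # torch.Size([1, 1000])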

BibTeX


    @inproceedings{hyeon2022scratching,
      title={Scratching Visual Transformer's Back with Uniform Attention},
      author={Hyeon-Woo, Nam and Yu-Ji, Kim and Heo, Byeongho and Han, Dongyoon and Oh, Seong Joon and Oh, Tae-Hyun},
      booktitle={ICCV},
      year={2023}
    }