We decide to inject uniform attention because:
(1) uniform attention is the densest form of attention, and it is unstable to learn from a gradient point of view;
(2) yet uniform attention is easy for us to supply by hand;
(3) uniform attention requires no additional parameters and only a small computation cost.
We do this by broadcasting the context with the CB module.
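As a rough illustration of the idea, the sketch below blends each token with the mean of all tokens, which is what uniform attention produces. It is a minimal, dependency-free sketch and not the repository's implementation: the function name `broadcast_context`, the `weight` parameter, and its default of 0.5 are assumptions made for this example, and where such a step sits inside the network is not specified here.

```python
def broadcast_context(tokens, weight=0.5):
    """Blend every token with the mean of all tokens.

    `tokens` is a list of equal-length feature vectors (one per token).
    `weight` is a hypothetical mixing factor, assumed here for illustration.
    """
    if not tokens:
        return []
    n = len(tokens)
    dim = len(tokens[0])
    # Uniform attention assigns weight 1/n to every token, so its output
    # is simply the mean token (the shared "context").
    context = [sum(tok[d] for tok in tokens) / n for d in range(dim)]
    # Broadcast the same context to each token; no learnable parameters are used.
    return [
        [(1 - weight) * tok[d] + weight * context[d] for d in range(dim)]
        for tok in tokens
    ]


tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
mixed = broadcast_context(tokens)
```

The extra work is one mean over the tokens plus one element-wise blend, which matches the point above that uniform attention adds no parameters and little computation.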
@inproceedings{hyeon2022scratching,
title={Scratching Visual Transformer's Back with Uniform Attention},
author={Hyeon-Woo, Nam and Yu-Ji, Kim and Heo, Byeongho and Han, Dongyoon and Oh, Seong Joon and Oh, Tae-Hyun},
booktitle={ICCV},
year={2023}
}