Title: Refined Feature-Space Window Attention Vision Transformer for Image Classification
Authors: 유다연 (Dayeon Yoo); 유진우 (Jinwoo Yoo)
DOI: https://doi.org/10.5370/KIEE.2024.73.6.1004
Keywords: Image Classification; Deep learning; Vision Transformer
Abstract:
The window-based self-attention vision transformer (ViT) reduces computational complexity by computing attention within local windows. However, this makes it difficult to capture interactions between pixels in different windows. To address this issue, the Swin transformer, a representative window-based self-attention ViT, introduces shifted window multi-head self-attention (SW-MSA) to capture cross-window information. Even so, tokens that are far apart still cannot be grouped into the same window. This paper proposes a method that clusters tokens by similarity in feature space and computes attention within each cluster, serving as an alternative to the SW-MSA of the existing Swin transformer. Additionally, this paper adopts a convolutional block attention module (CBAM) to refine the feature space and enhance the representational power of the model. In experiments on the ImageNet-1K classification task, the proposed network outperforms existing convolutional neural networks and transformer-based backbones.
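The core idea in the abstract, grouping tokens by feature similarity rather than by spatial window and attending within each group, can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the use of k-means for clustering, identity Q/K/V projections, and all dimensions are assumptions made for brevity.

```python
import numpy as np

def kmeans_assign(x, k, iters=10, seed=0):
    """Naive k-means over token features; returns a cluster index per token."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        # assign each token to its nearest center
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        # update centers (keep the old center if a cluster is empty)
        for c in range(k):
            if (labels == c).any():
                centers[c] = x[labels == c].mean(0)
    return labels

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def clustered_attention(tokens, k=4):
    """Single-head self-attention restricted to feature-space clusters.

    Unlike spatial windowing, tokens are grouped by feature similarity,
    so distant-but-similar pixels can attend to each other.
    """
    n, d = tokens.shape
    labels = kmeans_assign(tokens, k)
    out = np.zeros_like(tokens)
    for c in range(k):
        idx = np.where(labels == c)[0]
        if idx.size == 0:
            continue  # skip empty clusters
        q = key = v = tokens[idx]  # identity Q/K/V projections for brevity
        attn = softmax(q @ key.T / np.sqrt(d))
        out[idx] = attn @ v
    return out

x = np.random.default_rng(1).normal(size=(16, 8)).astype(np.float32)
y = clustered_attention(x, k=4)
print(y.shape)  # (16, 8)
```

In a full transformer block, the attention output would then pass through the usual projection, residual connection, and MLP; this sketch only shows how restricting attention to feature-space clusters replaces the spatial-window grouping.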