Entropy is not Enough for Test-time Adaptation: From the Perspective of Disentangled Factors

1Department of Electrical and Computer Engineering, Seoul National University 2Department of Computer Science and Engineering, Soongsil University 3Interdisciplinary Program in Artificial Intelligence, Seoul National University 4Division of Digital Healthcare, Yonsei University
ICLR 2024 Spotlight

*Equal Contribution

Corresponding Authors

Abstract

Test-time adaptation (TTA) fine-tunes pre-trained deep neural networks for unseen test data. The primary challenge of TTA is limited access to the entire test dataset during online updates, which causes error accumulation. To mitigate this, TTA methods have used the entropy of the model output as a confidence metric, aiming to identify samples that are less likely to cause errors. Through experimental studies, however, we observed that entropy is unreliable as a confidence metric for TTA under biased scenarios, and we theoretically revealed that this unreliability stems from neglecting the influence of latent disentangled factors of data on predictions. Building upon these findings, we introduce a novel TTA method named Destroy Your Object (DeYO), which leverages a newly proposed confidence metric named Pseudo-Label Probability Difference (PLPD). PLPD quantifies the influence of an object's shape on the prediction by measuring the difference between predictions before and after applying an object-destructive transformation. DeYO consists of sample selection and sample weighting, both of which employ entropy and PLPD concurrently. For robust adaptation, DeYO prioritizes samples whose predictions dominantly incorporate shape information. Our extensive experiments demonstrate the consistent superiority of DeYO over baseline methods across various scenarios, including biased and wild ones.

Method: Destroy Your Object


Overview of DeYO. DeYO comprises sample selection and sample weighting mechanisms built on the newly proposed Pseudo-Label Probability Difference (PLPD) score, which accounts for the influence of Commonly Positively-coRrelated with label (CPR) factors, particularly the shape information of objects, on the model's predictions. By integrating the PLPD score, which enforces the consideration of CPR factors while suppressing TRAin-time only Positively correlated with label (TRAP) factors, DeYO alleviates the limitations of relying exclusively on entropy.
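The PLPD score above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: patch shuffling is assumed as the object-destructive transformation (one of the transforms discussed in the paper), and `model`, `patch`, and the input layout are hypothetical placeholders. PLPD is the drop in the pseudo-label's probability after the object's shape is destroyed, so shape-reliant predictions yield high PLPD.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def patch_shuffle(img, patch=4, rng=None):
    """Destroy object shape by shuffling non-overlapping patches.

    img: array of shape (C, H, W); H and W must be divisible by `patch`.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    c, h, w = img.shape
    ph, pw = h // patch, w // patch
    # Split into patch x patch grid of (C, ph, pw) tiles.
    tiles = (img.reshape(c, patch, ph, patch, pw)
                .transpose(1, 3, 0, 2, 4)
                .reshape(patch * patch, c, ph, pw))
    rng.shuffle(tiles)  # permute tiles along the first axis
    # Reassemble the shuffled tiles into an image.
    return (tiles.reshape(patch, patch, c, ph, pw)
                 .transpose(2, 0, 3, 1, 4)
                 .reshape(c, h, w))

def plpd(model, x, patch=4):
    """Pseudo-Label Probability Difference for one sample.

    model: callable mapping an image to a logit vector (hypothetical).
    Returns p(y_hat | x) - p(y_hat | destroyed x), where y_hat is the
    pseudo-label predicted on the original image.
    """
    p = softmax(model(x))
    y_hat = int(p.argmax())
    p_destroyed = softmax(model(patch_shuffle(x, patch)))
    return float(p[y_hat] - p_destroyed[y_hat])
```

In DeYO, samples are then selected when both entropy and PLPD pass their thresholds, and selected samples are weighted by the two scores jointly; the sketch above covers only the PLPD computation.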

Main Experimental Results

The overall results verify that DeYO provides stronger robustness against various distribution shifts.


- Comparisons with baselines on ImageNet-C at severity level 5 under a mild scenario regarding accuracy (%).


- Comparisons with baselines on ImageNet-C at severity level 5 under online imbalanced label shifts (imbalance ratio = infinity) regarding accuracy (%).


- Comparisons with baselines on ImageNet-C at severity level 5 under batch size 1 regarding accuracy (%).


- Comparisons with baselines on ColoredMNIST (left) and Waterbirds (right) regarding accuracy (%).

BibTeX

@inproceedings{lee2024entropy,
    title={Entropy is not Enough for Test-time Adaptation: From the Perspective of Disentangled Factors},
    author={Jonghyun Lee and Dahuin Jung and Saehyung Lee and Junsung Park and Juhyeon Shin and Uiwon Hwang and Sungroh Yoon},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=9w3iw8wDuE}
}