Abstract: Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are ...
Multimodal large language models (MLLMs) have attracted considerable attention for their impressive capabilities in understanding and generating vision-language content, particularly in tasks such as ...