Towards self-improving scene understanding with vision-language knowledge integration

Image captioning has seen immense progress in the last few years. However, general-purpose systems often fail to provide personalised, context-aware captions tailored to individual users or domains. In this work, we investigate the task of personalised and contextualised image captioning by leveraging foundational models, including large language models (LLMs) and Read more…