Abstract
Contemporary deep learning models such as CLIP exhibit strong zero-shot recognition performance on a broad range of tasks, yet they still benefit substantially from limited supervision in few-shot regimes. This thesis investigates ways to inject external knowledge into few-shot adaptation so that the resulting models become more data-efficient and/or more interpretable. We begin by studying the role of shape-only representations in object recognition and show that they are more data-efficient than raw RGB representations. Motivated by these findings and by the texture-versus-shape-bias literature, we propose \emph{v1-shape}, which augments CLIP-based few-shot recognition with an additional shape-conditioned branch and yields modest gains. We also introduce \emph{v1-concept}, a CLIP-based concept bottleneck model encouraged to base its decisions on more general semantic concepts, which improves few-shot accuracy in many settings. Finally, we explore a CLIP adaptation approach that adaptively blends zero-shot and linear-probe logits at inference time.
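As a minimal illustrative sketch (the notation below is ours and is assumed rather than taken from the thesis), such an adaptive blend can be written as
\[
\hat{y}(x) = \arg\max_{c}\,\bigl[(1-\lambda(x))\, s^{\mathrm{zs}}_{c}(x) + \lambda(x)\, s^{\mathrm{lp}}_{c}(x)\bigr],
\]
where $s^{\mathrm{zs}}_{c}(x)$ and $s^{\mathrm{lp}}_{c}(x)$ denote the zero-shot and linear-probe logits for class $c$, and $\lambda(x)\in[0,1]$ is an instance-dependent mixing weight estimated at inference time.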