What Do We Know about Vision-Language Models?

Understanding

  • Attention Sink

  • Logit Lens, e.g. unsupervised segmentation
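
As a rough illustration of the logit-lens idea above, the sketch below projects intermediate hidden states through the language model's unembedding matrix and reads off the top-1 vocabulary id for each image token; reshaping those ids onto the patch grid gives a coarse, unsupervised label map. This is a minimal NumPy sketch on toy data, not any particular model's API: the function names `logit_lens` and `patch_label_map`, the tensor shapes, and the identity unembedding are all assumptions made for illustration, and the final layer norm that is usually applied before unembedding is omitted for brevity.

```python
import numpy as np

def logit_lens(hidden, unembed):
    # hidden:  (num_layers, num_tokens, d_model) intermediate hidden states
    # unembed: (d_model, vocab_size) output projection of the language model
    # Returns the top-1 vocabulary id for every layer and token.
    logits = hidden @ unembed
    return logits.argmax(axis=-1)

def patch_label_map(token_ids, grid_h, grid_w):
    # Lay per-image-token predictions out on the patch grid,
    # giving a coarse segmentation-style map of vocabulary ids.
    return np.asarray(token_ids).reshape(grid_h, grid_w)

# Toy data (hypothetical): 1 layer, 4 image tokens on a 2x2 grid, 3-word vocabulary
unembed = np.eye(3)
hidden = np.array([[[2.0, 0.1, 0.0],
                    [0.0, 3.0, 0.2],
                    [0.1, 0.0, 1.5],
                    [1.0, 0.2, 0.3]]])

ids = logit_lens(hidden, unembed)      # shape (1, 4)
seg = patch_label_map(ids[0], 2, 2)    # shape (2, 2)
print(seg)
```

With a real model, `hidden` would come from the image-token positions of a chosen layer, and the ids would be decoded to words with the tokenizer.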

Efficiency

  • Token Pruning/Merging

  • Connect to videos
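
One simple form of token pruning can be sketched as follows: score each visual token (for example, by the attention it receives from text tokens), keep the top-k, and drop the rest while preserving the original token order. The helper name `prune_tokens`, the scoring choice, and the toy numbers are assumptions for illustration only; published pruning and merging methods differ in how they score and combine tokens.

```python
import numpy as np

def prune_tokens(tokens, scores, keep):
    # tokens: (num_tokens, d_model) visual token embeddings
    # scores: (num_tokens,) importance per token, e.g. attention received
    # keep:   number of tokens to retain
    top = np.argsort(scores)[::-1][:keep]  # indices of the highest-scoring tokens
    kept_idx = np.sort(top)                # restore original (spatial/temporal) order
    return tokens[kept_idx], kept_idx

# Toy data (hypothetical): 5 visual tokens with 2-dimensional embeddings
tokens = np.arange(10, dtype=float).reshape(5, 2)
scores = np.array([0.10, 0.50, 0.05, 0.30, 0.02])

pruned, kept_idx = prune_tokens(tokens, scores, keep=2)
print(kept_idx, pruned)
```

The same routine applies unchanged when the tokens come from several video frames concatenated along the first axis, which is where the token count grows fastest.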

Capabilities

  • Visual Task Ablation (different types of images, e.g. from Berkeley work)

  • Dense image tasks: ineffectiveness of the CLIP encoder compared with the DINO encoder

  • Platonic Representation Hypothesis