What Do We Know about Vision-Language Models?
Understanding
Attention Sink
Logit Lens, e.g. unsupervised segmentation
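As a rough illustration of the logit lens item, the sketch below projects intermediate hidden states of image tokens onto the vocabulary and takes the top word per patch, which gives a coarse label map that could serve as an unsupervised segmentation. Everything here is an assumption for illustration: the function names `logit_lens` and `patch_labels` are hypothetical, the arrays are toy data, and a parameter-free RMS normalization stands in for whatever final norm a real model applies before its output head.

```python
import numpy as np


def logit_lens(hidden, unembed, eps=1e-6):
    """Project hidden states from one intermediate layer onto the vocabulary.

    hidden:  (num_tokens, d_model) hidden states (hypothetical input)
    unembed: (d_model, vocab_size) output embedding matrix (hypothetical input)
    Returns logits of shape (num_tokens, vocab_size).
    """
    # Assumption: a parameter-free RMS norm approximates the model's final norm.
    rms = np.sqrt(np.mean(hidden ** 2, axis=-1, keepdims=True) + eps)
    return (hidden / rms) @ unembed


def patch_labels(hidden, unembed, grid_hw):
    """Assign each image patch the vocabulary id with the highest logit."""
    height, width = grid_hw
    logits = logit_lens(hidden, unembed)
    return logits.argmax(axis=-1).reshape(height, width)


# Toy example: 4 image patches on a 2x2 grid, d_model = 3, vocabulary of 3 ids.
toy_hidden = np.array([[5.0, 0.0, 0.0],
                       [0.0, 2.0, 0.0],
                       [0.0, 0.0, 1.0],
                       [3.0, 0.0, 0.0]])
toy_unembed = np.eye(3)
label_map = patch_labels(toy_hidden, toy_unembed, (2, 2))
```

In a real model the hidden states would come from an intermediate decoder layer at the image-token positions, and the label ids would be decoded with the tokenizer.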
Efficiency
Token Pruning/Merging
Connect to videos
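For the token pruning item, a minimal sketch under stated assumptions: each visual token has an importance score (for example, the attention it receives from text tokens), and only the highest-scoring fraction is kept, in original order. The name `prune_tokens`, the `keep_ratio` parameter, and the scoring choice are hypothetical and not taken from any specific method.

```python
import numpy as np


def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring visual tokens, preserving their original order.

    tokens:     (num_tokens, d_model) visual token embeddings
    scores:     (num_tokens,) importance per token (assumed to be given)
    keep_ratio: fraction of tokens to keep (hypothetical parameter)
    Returns the kept tokens and their original indices.
    """
    num_tokens = tokens.shape[0]
    num_keep = max(1, int(np.ceil(num_tokens * keep_ratio)))
    # Indices of the top-scoring tokens, re-sorted into original sequence order.
    keep = np.sort(np.argsort(scores)[-num_keep:])
    return tokens[keep], keep


# Toy example: 4 tokens with 2 features each; keep the top half.
toy_tokens = np.arange(8, dtype=float).reshape(4, 2)
toy_scores = np.array([0.1, 0.9, 0.3, 0.7])
kept_tokens, kept_idx = prune_tokens(toy_tokens, toy_scores, keep_ratio=0.5)
```

Token merging differs in that dropped tokens are averaged into similar kept tokens instead of being discarded. For video, the same scoring could be applied per frame or across frames.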
Capabilities
Visual Task Ablation (different types of images, e.g. from Berkeley work)
Dense images: ineffectiveness of the CLIP encoder compared with the DINO encoder
Platonic Representation Hypothesis