论文地址 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation 论文思想 其实Clip就相当于只用了ITC