Notes on Some VLP Downstream Tasks

I. Image-Text Retrieval (ITR)

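Task goal:

Retrieve the image that best matches a given text, or the text that best matches a given image.

Cross-modal image-text retrieval (ITR) takes a query expressed in one modality and retrieves the relevant samples from the other modality. It usually comprises two subtasks: image-to-text (i2t) retrieval and text-to-image (t2i) retrieval.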

Dataset format

Take the Flickr8k dataset as an example. Official site 🤠

Its caption annotation file has the following format:

1000268201_693b08cb0e.jpg,A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg,A girl going into a wooden building .
1000268201_693b08cb0e.jpg,A little girl climbing into a wooden playhouse .
1000268201_693b08cb0e.jpg,A little girl climbing the stairs to her playhouse .
1000268201_693b08cb0e.jpg,A little girl in a pink dress going into a wooden cabin .
1001773457_577c3a7d70.jpg,A black dog and a spotted dog are fighting
1001773457_577c3a7d70.jpg,A black dog and a tri-colored dog playing with each other on the road .
1001773457_577c3a7d70.jpg,A black dog and a white dog with brown spots are staring at each other in the street .
1001773457_577c3a7d70.jpg,Two dogs of different breeds looking at each other on the road .
1001773457_577c3a7d70.jpg,Two dogs on pavement moving toward each other .

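As the sample shows, each image comes with five different captions. Here is one image together with its ground-truth captions:

image: 1000268201_693b08cb0e.jpg

caption: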

A child in a pink dress is climbing up a set of stairs in an entry way .
A girl going into a wooden building .
A little girl climbing into a wooden playhouse .
A little girl climbing the stairs to her playhouse .
A little girl in a pink dress going into a wooden cabin .

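During training, the image goes through data augmentation and the caption is cleaned of noisy symbols. Each annotation is then packed into a triple (image, caption, idx), where idx is the index of the image. (idx identifies the image and serves as the model's prediction target when retrieving images from text.)

Note: each sample pairs a single image with a single text, even though every image has five captions. The five captions are not fed in together with their image as one sample; instead they are split into five separate one-to-one sample pairs.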
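The pairing described above can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions: `clean_caption` and `build_samples` are hypothetical helper names, the set of "noisy symbols" removed is a guess, and a real pipeline would load and augment the image instead of carrying only its file name.

```python
import re

RAW_LINES = [
    "1000268201_693b08cb0e.jpg,A child in a pink dress is climbing up a set of stairs in an entry way .",
    "1000268201_693b08cb0e.jpg,A girl going into a wooden building .",
    "1001773457_577c3a7d70.jpg,A black dog and a spotted dog are fighting",
]


def clean_caption(caption, max_words=30):
    """Hypothetical cleaner: lowercase, drop noisy punctuation, squeeze spaces."""
    caption = re.sub(r'[.!"()*#:;~]', " ", caption.lower())
    caption = re.sub(r"\s+", " ", caption).strip()
    return " ".join(caption.split(" ")[:max_words])


def build_samples(lines):
    """Turn 'file,caption' lines into one-to-one (image, caption, idx) samples."""
    image_to_idx = {}
    samples = []
    for line in lines:
        # Split on the first comma only, because captions may contain commas.
        file_name, caption = line.split(",", 1)
        # Every caption of the same image shares that image's idx.
        idx = image_to_idx.setdefault(file_name, len(image_to_idx))
        samples.append((file_name, clean_caption(caption), idx))
    return samples


samples = build_samples(RAW_LINES)
for sample in samples:
    print(sample)
```

The two captions of the first image become two separate samples that share idx 0, which is exactly the one-to-one pairing noted above.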

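Training procedure

1. The caption text is tokenized; the text tokens go to the text encoder and the image goes to the image encoder.

2. Compute the ITC (image-text contrastive) loss. Here idx is used to build the one-hot target that marks the true image-text matches. See here 🐼 for reference.

3. The text and image representations are fed into the multimodal encoder for a fused forward pass.

Then compute the ITM (image-text matching) loss. See here 👿 for reference.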
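To make the role of idx in the two losses concrete, here is a dependency-free sketch. It is not the code behind the links above, only one plausible way to do it, and every function name is my own. Two assumptions are worth stating: when the same image appears more than once in a batch, the ITC target row is spread evenly over all its matches (so it is strictly one-hot only when every idx in the batch is distinct), and ITM negatives are sampled in proportion to their similarity while anything sharing the anchor's idx is excluded. The batch must contain at least two different images, otherwise no negative can be drawn.

```python
import math
import random


def itc_targets(idx):
    """Row i marks every batch entry sharing idx[i]; each row sums to 1."""
    targets = []
    for i in idx:
        row = [1.0 if i == j else 0.0 for j in idx]
        total = sum(row)
        targets.append([v / total for v in row])
    return targets


def log_softmax(row):
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def soft_cross_entropy(sim, targets):
    """Mean over rows of -sum(target * log_softmax(similarity))."""
    losses = []
    for sim_row, tgt_row in zip(sim, targets):
        logp = log_softmax(sim_row)
        losses.append(-sum(t * lp for t, lp in zip(tgt_row, logp)))
    return sum(losses) / len(losses)


def sample_hard_negative(sim_row, idx, anchor, rng):
    """Pick one negative for `anchor`, favouring high similarity.
    Entries sharing the anchor's idx are true matches, so their weight is 0."""
    weights = [
        0.0 if idx[j] == idx[anchor] else math.exp(s)
        for j, s in enumerate(sim_row)
    ]
    return rng.choices(range(len(sim_row)), weights=weights, k=1)[0]


def build_itm_pairs(sim_i2t, idx, rng):
    """Return (image_pos, text_pos) pairs and labels: 1 = match, 0 = mismatch."""
    n = len(idx)
    pairs = [(i, i) for i in range(n)]
    labels = [1] * n
    for i in range(n):  # a hard negative text for every image
        pairs.append((i, sample_hard_negative(sim_i2t[i], idx, i, rng)))
        labels.append(0)
    for t in range(n):  # a hard negative image for every text
        column = [sim_i2t[i][t] for i in range(n)]
        pairs.append((sample_hard_negative(column, idx, t, rng), t))
        labels.append(0)
    return pairs, labels


idx = [0, 1, 0]  # samples 0 and 2 come from the same image
sim_i2t = [[5.0, 0.0, 5.0], [0.0, 5.0, 0.0], [5.0, 0.0, 5.0]]
sim_t2i = [list(col) for col in zip(*sim_i2t)]

targets = itc_targets(idx)
loss_itc = (soft_cross_entropy(sim_i2t, targets)
            + soft_cross_entropy(sim_t2i, targets)) / 2
pairs, labels = build_itm_pairs(sim_i2t, idx, random.Random(0))
print(targets, round(loss_itc, 4), pairs, labels)
```

In a real model the ITM pairs would be passed through the multimodal encoder, and a two-way classifier on the fused output would be trained with cross-entropy against `labels`.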

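Evaluation procedure

1. The text tokens go to the text encoder and the image goes to the image encoder.

2. Compute the similarity matrix, for example:
sims_matrix = image_embeds @ text_embeds.t()
Its main purpose is to pick, for each query, the best-aligned candidates between the image space and the text space; only those pairs are passed to the multimodal encoder to compute matching scores.

This is done twice, once for i2t and once for t2i.

The final output is a matching-score matrix for each direction.

3. Run the evaluation.

The evaluation details are handled by a utility API, so they are not covered here.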
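For readers who want to see the shape of the computation anyway, the sketch below walks through steps 2 and 3 on toy numbers. It rests on several assumptions of mine: `toy_pair_score` stands in for the multimodal encoder plus matching head, pairs outside the top-k keep a low fill value of -100, and the metric is Recall@K, which is the usual way retrieval results are reported. None of the names come from the evaluated code.

```python
def rerank(sims_matrix, pair_score, k):
    """Re-score only each row's top-k candidates; the rest keep a low fill value."""
    fill = -100.0
    score_matrix = [[fill] * len(row) for row in sims_matrix]
    for q, row in enumerate(sims_matrix):
        topk = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        for j in topk:
            score_matrix[q][j] = pair_score(q, j)
    return score_matrix


def recall_at_k(score_matrix, ground_truth, k):
    """Fraction of queries whose k best-scored candidates contain a correct match.
    ground_truth[q] is the set of correct candidate indices for query q."""
    hits = 0
    for q, row in enumerate(score_matrix):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if any(j in ground_truth[q] for j in ranked[:k]):
            hits += 1
    return hits / len(score_matrix)


# Toy setup: 2 images, 3 texts; texts 0 and 2 describe image 0, text 1 describes image 1.
sims_matrix = [[0.9, 0.1, 0.8],
               [0.5, 0.4, 0.3]]
img2txt = [{0, 2}, {1}]
txt2img = [{0}, {1}, {0}]


def toy_pair_score(image, text):
    """Stand-in for the multimodal matching score (an assumption, not a real model)."""
    return sims_matrix[image][text] * 10


# i2t: rows are images, columns are texts.
score_i2t = rerank(sims_matrix, toy_pair_score, k=2)
# t2i: transpose so that rows are texts and columns are images.
sims_t2i = [list(col) for col in zip(*sims_matrix)]
score_t2i = rerank(sims_t2i, lambda text, image: toy_pair_score(image, text), k=1)

r1_i2t = recall_at_k(score_i2t, img2txt, 1)
r2_i2t = recall_at_k(score_i2t, img2txt, 2)
r1_t2i = recall_at_k(score_t2i, txt2img, 1)
print(r1_i2t, r2_i2t, r1_t2i)
```

Image 1 ranks text 0 first, which belongs to image 0, so it misses at K=1 and only hits at K=2.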

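Speculation about real-world use

I did not read any demo code for real-world use this time. Judging from the evaluation code, however, the retrieved result for either direction (image-to-text or text-to-image) can only come from the dataset being searched, because the model selects its prediction by index rather than generating it.

In practice, then, you need a database of image-text pairs to search over. The single-modality query you provide does not have to come from that dataset, but the model returns the best match found in it, unlike GPT, which generates something from the scene you describe. GPT-like systems may of course incorporate retrieval as well: retrieve if the described scene exists, generate if it does not. My guess is that web-search-style tasks, such as finding a solution from a problem description, could be built on this kind of retrieval. When GPT is asked to write code from a problem description, I suspect much of it is done through retrieval over a very large code base (GitHub, LeetCode, and the like); having a language model predict the code itself, token by token, does not feel very reliable to me. (This is only my current line of thought. Learning is gradual and one's view widens over time, so my first impressions may well be wrong. Corrections are welcome.)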

II. Visual Question Answering (VQA)

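Task goal

VQA aims to produce a suitable answer by understanding the content of an image together with the text of a question about it.

In other words, it answers questions about images. Most researchers treat it as a classification task: pick the correct answer from a pool of candidate answers.

Given an image and a natural-language question about it, the task is to provide an accurate natural-language answer. It maps onto real-life scenarios such as assisting visually impaired people, where both the questions and the answers are open-ended.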

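Dataset format

Take the VQA dataset as an example. Official site 🤠

The dataset comes in four parts: annotations (i.e. the answers), questions, images, and the complementary pairs list.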

1. Annotation examples

{"question_type": "what is this", "multiple_choice_answer": "net", "answers": [{"answer": "net", "answer_confidence": "maybe", "answer_id": 1}, {"answer": "net", "answer_confidence": "yes", "answer_id": 2}, {"answer": "net", "answer_confidence": "yes", "answer_id": 3}, {"answer": "netting", "answer_confidence": "yes", "answer_id": 4}, {"answer": "net", "answer_confidence": "yes", "answer_id": 5}, {"answer": "net", "answer_confidence": "yes", "answer_id": 6}, {"answer": "mesh", "answer_confidence": "maybe", "answer_id": 7}, {"answer": "net", "answer_confidence": "yes", "answer_id": 8}, {"answer": "net", "answer_confidence": "yes", "answer_id": 9}, {"answer": "net", "answer_confidence": "yes", "answer_id": 10}], "image_id": 458752, "answer_type": "other", "question_id": 458752000}
=============
{"question_type": "what", "multiple_choice_answer": "pitcher", "answers": [{"answer": "pitcher", "answer_confidence": "yes", "answer_id": 1}, {"answer": "catcher", "answer_confidence": "no", "answer_id": 2}, {"answer": "pitcher", "answer_confidence": "yes", "answer_id": 3}, {"answer": "pitcher", "answer_confidence": "yes", "answer_id": 4}, {"answer": "pitcher", "answer_confidence": "yes", "answer_id": 5}, {"answer": "pitcher", "answer_confidence": "yes", "answer_id": 6}, {"answer": "pitcher", "answer_confidence": "yes", "answer_id": 7}, {"answer": "pitcher", "answer_confidence": "yes", "answer_id": 8}, {"answer": "pitcher", "answer_confidence": "yes", "answer_id": 9}, {"answer": "pitcher", "answer_confidence": "yes", "answer_id": 10}], "image_id": 458752, "answer_type": "other", "question_id": 458752001},

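These are two examples. Each question has 10 answers. Only two annotations are listed here, but judging from the image id, one image has roughly five questions, so in total each image yields about 5 × 10 = 50 question-answer instances.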
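Since the 10 answers to one question often disagree, it helps to see how they can be condensed. The sketch below rebuilds the answer list of the first annotation above (the answer_confidence field is left out for brevity) and counts the votes. Turning the counts into frequency weights is one common option that I am assuming here, not something prescribed by the dataset.

```python
from collections import Counter

# Answers of question 458752000, in answer_id order (1..10).
raw_answers = ["net", "net", "net", "netting", "net",
               "net", "mesh", "net", "net", "net"]
annotation = {
    "question_id": 458752000,
    "image_id": 458752,
    "multiple_choice_answer": "net",
    "answers": [{"answer": a, "answer_id": i + 1} for i, a in enumerate(raw_answers)],
}

counts = Counter(entry["answer"] for entry in annotation["answers"])
# Assumed weighting scheme: each distinct answer is weighted by its share of the votes.
weights = {answer: n / len(annotation["answers"]) for answer, n in counts.items()}
majority_answer = counts.most_common(1)[0][0]

print(counts)
print(weights)
print(majority_answer == annotation["multiple_choice_answer"])
```

Here the majority vote agrees with the multiple_choice_answer field, and the minority answers ("netting", "mesh") keep a small weight instead of being discarded.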

2. Question examples

{"image_id": 458752, "question": "What is this photo taken looking through?", "question_id": 458752000}, {"image_id": 458752, "question": "What position is this man playing?", "question_id": 458752001}, {"image_id": 458752, "question": "What color is the players shirt?", "question_id": 458752002}, {"image_id": 458752, "question": "Is this man a professional baseball player?", "question_id": 458752003},

Each entry holds the image id, the question text, and the question id.
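Questions and annotations live in separate files and are linked only through question_id. A minimal sketch of that join, using the entries shown above with fields trimmed for brevity, might look like this (the variable names are my own):

```python
questions = [
    {"image_id": 458752, "question": "What is this photo taken looking through?",
     "question_id": 458752000},
    {"image_id": 458752, "question": "What position is this man playing?",
     "question_id": 458752001},
]
annotations = [
    {"question_id": 458752000, "image_id": 458752, "multiple_choice_answer": "net"},
    {"question_id": 458752001, "image_id": 458752, "multiple_choice_answer": "pitcher"},
]

# Index the annotations by question_id, then attach each answer to its question.
answer_by_qid = {ann["question_id"]: ann for ann in annotations}
vqa_samples = [
    (q["image_id"], q["question"],
     answer_by_qid[q["question_id"]]["multiple_choice_answer"])
    for q in questions
]

for sample in vqa_samples:
    print(sample)
```

Each resulting triple (image id, question, answer) is one training example; the image itself would be looked up by image_id.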

3. Image examples

{"file_name": "abstract_v002_train2015_000000011779.png", "image_id": 11779, "height": 400, "url": "http://visualqa.org/data/abstract_v002/scene_img/img/11779.png", "width": 700}, {"file_name": "abstract_v002_train2015_000000005536.png", "image_id": 5536, "height": 400, "url": "http://visualqa.org/data/abstract_v002/scene_img/img/5536.png", "width": 700}, {"file_name": "abstract_v002_train2015_000000016949.png", "image_id": 16949, "height": 400, "url": "http://visualqa.org/data/abstract_v002/scene_img/img/16949.png", "width": 700}, {"file_name": "abstract_v002_train2015_000000019949.png", "image_id": 19949, "height": 400, "url": "http://visualqa.org/data/abstract_v002/scene_img/img/19949.png", "width": 700},

Each entry holds the image file name, the image id, the image URL, and the height and width.

4. Complementary pairs list example

[158307014, 254204008], [158307013, 89462005], [472405000, 79224002]

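Each entry is a pair of question ids. The list exists because the same question can be asked about two different images and have a different answer for each.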

Concrete example:

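Training procedure

Take the VQA pipeline of ALBEF as an example.

1. The image goes through the image encoder; its output and the tokenized question are then fed into the multimodal encoder. (The text encoder is included here as well: the split is between the first 6 and the last 6 layers of BERT, and the multimodal part uses cross-attention layers.)

2. The answer tokens and the question states obtained above are fed into the decoder, and the loss is computed in the BERT language-modeling fashion, i.e. a token-level prediction loss over the answer.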
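The loss in step 2 can be illustrated without any model. In the sketch below, the decoder is assumed to have already produced one row of vocabulary logits per answer position (conditioned on the question states), and the loss is the mean negative log-likelihood of the true answer tokens. The four-word vocabulary and the function name are made up for the example.

```python
import math


def answer_lm_loss(step_logits, answer_ids):
    """Mean negative log-likelihood of the answer tokens.
    step_logits[t] holds the vocabulary logits for predicting answer_ids[t]."""
    total = 0.0
    for logits, target in zip(step_logits, answer_ids):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[target]  # equals -log softmax(logits)[target]
    return total / len(answer_ids)


# Toy vocabulary (an assumption): 0 = start token, 1 = "net", 2 = end token, 3 = "mesh".
answer_ids = [1, 2]  # predict "net", then the end token

confident_logits = [[0.0, 8.0, 0.0, 0.0], [0.0, 0.0, 8.0, 0.0]]
uniform_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

loss_confident = answer_lm_loss(confident_logits, answer_ids)
loss_uniform = answer_lm_loss(uniform_logits, answer_ids)
print(loss_confident, loss_uniform)
```

A decoder that puts most of its probability on the right tokens gets a loss near zero, while a decoder that guesses uniformly over four tokens gets log 4.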

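BERT prepends a [CLS] token to the first sentence. The final-layer vector at that position can serve as the semantic representation of the whole sentence (i.e. a sentence embedding) and be used for downstream tasks such as classification. Compared with the actual words in the text, this token has no obvious meaning of its own, so it aggregates the semantic information of all the words more "fairly" and therefore represents the sentence as a whole better.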
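In code, "taking the [CLS] vector" simply means reading position 0 of the final hidden states. A tiny sketch with nested lists standing in for a tensor of shape batch × sequence length × hidden size (the numbers are arbitrary and the function name is my own):

```python
def cls_embeddings(last_hidden_state):
    """Return the position-0 ([CLS]) vector of every sequence in the batch."""
    return [sequence[0] for sequence in last_hidden_state]


# Two sequences, three tokens each, hidden size 2.
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]
sentence_vectors = cls_embeddings(batch)
print(sentence_vectors)
```

With a tensor library the same operation is usually written as a slice along the sequence dimension, and the resulting vectors are what a classification head would consume.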