News

Composing Text and Image to Image Retrieval (CTI-IR) aims to find the target image that matches the query image visually and the query text semantically. However, existing works ignore the ...
Grounding language to visual relations is critical to various language-and-vision applications. In this work, we tackle two fundamental language-and-vision tasks: image-text matching and image ...
Recently, research in machine learning has become more reliant on data-driven approaches. However, understanding the general theory behind optimal neural network architecture is, arguably, just as ...
The objective is to derive a relation prompt within the text embedding space of a pre-trained text-to-image diffusion model, where objects in each exemplar image follow a specific relation. Combining ...
Specifically, CRN includes: 1) a Cross Relation Network that comprehensively captures the relationships of various composed retrieval scenarios caused by two different query text types, allowing a unified ...