FlexCap is a versatile vision-language model (VLM) capable of generating region-specific descriptions of
varying lengths. Trained to produce length-conditioned captions for input bounding boxes, FlexCap empowers
the user to control the information density of its output. This allows for descriptions ranging from
concise object labels to detailed captions. This capability has several valuable applications. Firstly,
FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset.
Additionally, we introduce FlexCap-LLM, a visual question answering (VQA) system. FlexCap-LLM employs
FlexCap to generate localized descriptions which serve as inputs to a large language model (LLM),
achieving state-of-the-art zero-shot performance on some VQA datasets. Finally, we qualitatively
demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition,
and visual dialog.
FlexCap generates controllably rich localized descriptions for any region in an image. By varying the conditioning length, the full spectrum of valid descriptions can be explored, from short object category names to fully detailed captions.
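The conditioning scheme can be sketched as follows. This is a minimal illustration, assuming a FlexCap-style interface that takes a normalized bounding box and a target caption length; `build_conditioning` is a hypothetical helper, not the released API.

```python
# Sketch of length-conditioned caption prompting for a FlexCap-style model.
# The model itself is not shown; this only illustrates how the conditioning
# inputs (box + desired length) would be assembled.

def build_conditioning(box, target_length):
    """Assemble conditioning inputs: a normalized bounding box
    (x1, y1, x2, y2) plus a target caption length in words."""
    x1, y1, x2, y2 = box
    assert 0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0
    assert target_length >= 1
    return {"box": (x1, y1, x2, y2), "length": target_length}

# Sweeping the length for the same region explores the spectrum of valid
# descriptions, from a one-word category name to a detailed caption.
prompts = [build_conditioning((0.1, 0.2, 0.6, 0.9), n) for n in (1, 4, 8)]
```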
For results on length-conditioned captions, click on any image below to inspect it closely.
FlexCap can help in open-world detection by describing salient regions. Unlike prior dense captioning works, FlexCap generates more diverse sentences to describe visual content in controllable detail.
Here we present an interactive showcase of region-captioning results. Click on any image below to inspect it closely.
Training FlexCap on a large dataset leads to an emergent capability: the model can extract desired information for a specific image region using input prefixes. We present below some examples of attributes that FlexCap can generate.
Click on the image to inspect the bounding box and caption closely.
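The prefix-based extraction above can be sketched as follows. This is an illustration, assuming a FlexCap-style model that completes a text prefix for a given region; the prefix strings and `build_attribute_queries` helper are hypothetical, not the paper's exact prompts.

```python
# Sketch of prefix-conditioned attribute extraction. For each attribute, a
# short text prefix is paired with a bounding box; the model would then
# complete the prefix with the attribute value for that region.

ATTRIBUTE_PREFIXES = {
    "color": "the color of the object is",
    "material": "the object is made of",
    "state": "the object appears to be",
}

def build_attribute_queries(box):
    """Pair one bounding box with each attribute prefix."""
    return [
        {"box": box, "attribute": name, "prefix": prefix}
        for name, prefix in ATTRIBUTE_PREFIXES.items()
    ]

queries = build_attribute_queries((0.3, 0.1, 0.8, 0.7))
```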
Rich localized captions generated by FlexCap can be easily passed to Large Language Models (LLMs) to enable zero-shot visual question answering.
Here we present some results of FlexCap-LLM. Click on any of the images to inspect them closely. Note: in the images below, "FlexCap" refers to the FlexCap-LLM system.
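The FlexCap-LLM pipeline can be sketched as below. This is a minimal illustration of the idea only: localized captions are formatted into a text prompt that an LLM answers zero-shot. The example captions and the prompt template are illustrative assumptions, not the paper's exact format.

```python
# Sketch of the FlexCap-LLM prompt assembly: (box, caption) pairs from
# FlexCap become a textual scene description, followed by the question.
# The resulting string would be sent to an LLM for zero-shot VQA.

def build_vqa_prompt(captions, question):
    """Turn (box, caption) pairs into an LLM prompt describing the image."""
    lines = ["The image contains:"]
    for box, caption in captions:
        lines.append(f"- {caption} at region {box}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)

# Illustrative localized captions for one image.
captions = [
    ((0.1, 0.1, 0.4, 0.5), "a red bicycle leaning against a wall"),
    ((0.5, 0.2, 0.9, 0.8), "a man wearing a blue jacket"),
]
prompt = build_vqa_prompt(captions, "What color is the bicycle?")
```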