TL;DR
- Florence-2 unifies captioning, object detection, referring expression segmentation, and OCR inside a single sequence-to-sequence model with no task-specific heads; the task prompt alone selects the behavior.
- All spatial outputs (bounding boxes, polygons, region coordinates) are expressed as location tokens (<loc_0>…<loc_999>) appended to the standard text vocabulary, so the decoder generates spatial predictions identically to how it generates words.
- A DaViT vision encoder produces visual tokens that are concatenated with prompt tokens and fed into a transformer encoder-decoder, trained on FLD-5B: 5.4 billion annotations across nine task categories on 126 million images.
- Florence-2 achieves strong zero-shot transfer across spatial and semantic tasks without fine-tuning, outperforming much larger specialist models on several benchmarks after pre-training alone.
One model, many tasks via prompts
Traditional vision pipelines assign a separate model head to each task: one for detection, one for segmentation, one for captioning. Florence-2 collapses this into a single model where the task prompt token (<CAPTION>, <OD>, <REFERRING_EXPRESSION_SEGMENTATION>, or <OCR>) fully determines the output format. The image encoder, the transformer encoder-decoder, and all weights remain identical across tasks.
This design means the model learns shared representations that generalize across tasks rather than task-specific feature extractors. Captioning encourages semantic understanding; detection forces spatial precision; OCR requires character-level recognition. Each task regularizes the others, producing a backbone that is simultaneously richer and more compact than any individual specialist.
The prompt token is prepended to the user's input text (or left as the only input for tasks with no region specification). The model then generates the appropriate structured output: a sentence for captions, an interleaved sequence of label tokens and location tokens for detection, a polygon token sequence for segmentation, or a text string with region tokens for OCR.
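The prompt assembly described above can be sketched in a few lines. This is a minimal illustration, not the released preprocessing code: `build_prompt` is a hypothetical helper, and the split between tasks that take extra text and tasks that do not is an assumption made for the example.

```python
from typing import Optional

# Task tokens named in the text. Which of them accept extra user text
# is an assumption made for this sketch.
TASKS_WITHOUT_TEXT = {"<CAPTION>", "<OD>", "<OCR>"}
TASKS_WITH_TEXT = {"<REFERRING_EXPRESSION_SEGMENTATION>"}


def build_prompt(task: str, text: Optional[str] = None) -> str:
    """Prepend the task token to optional user text (hypothetical helper)."""
    if task in TASKS_WITHOUT_TEXT:
        if text:
            raise ValueError(f"{task} takes no additional text")
        return task
    if task in TASKS_WITH_TEXT:
        if not text:
            raise ValueError(f"{task} needs a referring expression")
        return task + text
    raise ValueError(f"unknown task token: {task}")


if __name__ == "__main__":
    print(build_prompt("<OD>"))
    print(build_prompt("<REFERRING_EXPRESSION_SEGMENTATION>", "a green car"))
```

Only the string changes between tasks; the model and its weights stay the same.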
Spatial outputs as location tokens
Object detection and segmentation require predicting coordinates: bounding boxes, polygon vertices, reference points. In prior work these were typically produced by task-specific regression heads that output continuous values and are trained with dedicated losses (L1, GIoU, dice). Florence-2 removes this entirely.
The vocabulary is extended with 1,000 special tokens <loc_0> through <loc_999>. Each image dimension (width and height) is divided into 1,000 bins indexed 0–999. Any coordinate is quantized to its nearest bin: <loc_N> with N = round(coord / image_size × 999). A bounding box (x₁, y₁, x₂, y₂) thus becomes the four-token sequence <loc_x1><loc_y1><loc_x2><loc_y2>, which the decoder generates with the same cross-entropy loss used for text.
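As a worked example of the quantization rule above, the sketch below turns a pixel-space box into its four location tokens. The function names are hypothetical, and the rounding rule is the one stated in this article; released implementations may round differently (for example by flooring).

```python
from typing import Sequence

NUM_BINS = 1000  # <loc_0> ... <loc_999>


def coord_to_bin(coord: float, size: float) -> int:
    """Quantize one pixel coordinate using N = round(coord / size * 999)."""
    bin_index = round(coord / size * (NUM_BINS - 1))
    # Clamp so coordinates slightly outside the image still map to a valid token.
    return max(0, min(NUM_BINS - 1, bin_index))


def box_to_loc_tokens(box: Sequence[float], width: float, height: float) -> str:
    """Encode (x1, y1, x2, y2) as <loc_x1><loc_y1><loc_x2><loc_y2>."""
    x1, y1, x2, y2 = box
    bins = [
        coord_to_bin(x1, width),
        coord_to_bin(y1, height),
        coord_to_bin(x2, width),
        coord_to_bin(y2, height),
    ]
    return "".join(f"<loc_{b}>" for b in bins)


if __name__ == "__main__":
    # A box covering the central half of a 640x480 image.
    print(box_to_loc_tokens((160, 120, 480, 360), width=640, height=480))
    # → <loc_250><loc_250><loc_749><loc_749>
```

Because x and y are normalized by their own dimension, the same token sequence describes the same relative region at any image resolution.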
The quantization introduces a small spatial error: rounding to the nearest bin moves a coordinate by at most half a bin width, roughly ±0.05% of the image dimension, which is negligible for detection and grounding tasks. Polygons for segmentation are encoded as longer token sequences (one <loc> pair per vertex), which the decoder generates autoregressively with as many vertices as the shape requires; each vertex is still limited to the same 1,000-bin grid.
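To make the error bound and the polygon encoding concrete, here is a small decoding sketch. It assumes the inverse of the rounding rule above (coord = N / 999 × image_size) and a flat x, y, x, y token order for polygon vertices; both the helper names and that ordering are assumptions for illustration.

```python
import re
from typing import List, Tuple

NUM_BINS = 1000
LOC_PATTERN = re.compile(r"<loc_(\d+)>")


def bin_to_coord(bin_index: int, size: float) -> float:
    """Map a bin index back to a pixel coordinate (inverse of the rounding rule)."""
    return bin_index / (NUM_BINS - 1) * size


def decode_polygon(token_text: str, width: float, height: float) -> List[Tuple[float, float]]:
    """Turn a run of location tokens into (x, y) vertices, assuming x/y pairs."""
    bins = [int(n) for n in LOC_PATTERN.findall(token_text)]
    if len(bins) % 2:
        raise ValueError("expected an even number of location tokens")
    return [
        (bin_to_coord(bx, width), bin_to_coord(by, height))
        for bx, by in zip(bins[0::2], bins[1::2])
    ]


def max_roundtrip_error(size: float) -> float:
    """Worst-case error of round-to-nearest quantization: half of one bin width."""
    return size / (NUM_BINS - 1) / 2


if __name__ == "__main__":
    triangle = "<loc_0><loc_0><loc_999><loc_0><loc_999><loc_999>"
    print(decode_polygon(triangle, width=640, height=480))
    print(max_roundtrip_error(640))  # about a third of a pixel on a 640-pixel side
```

Adding vertices makes the outline follow the shape more closely, but no single vertex can be placed more finely than the grid allows.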
This formulation has a key downstream benefit: the model never needs a hand-designed detection head. Detection, grounding, and segmentation simply become different output vocabularies that the decoder learns to produce conditioned on the task prompt.
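With no detection head, turning the decoder's output back into boxes is a plain string-parsing job. The sketch below assumes each detection is written as a label followed by exactly four location tokens; that layout, and the `parse_detections` helper, are illustrative assumptions rather than the released post-processing code.

```python
import re
from typing import List, Tuple

NUM_BINS = 1000
# Assumed layout: free-text label, then exactly four location tokens.
DETECTION = re.compile(r"([^<]+)((?:<loc_\d+>){4})")
LOC = re.compile(r"<loc_(\d+)>")

Box = Tuple[float, float, float, float]


def parse_detections(text: str, width: float, height: float) -> List[Tuple[str, Box]]:
    """Extract (label, pixel-space box) pairs from generated text."""
    scale_x = width / (NUM_BINS - 1)
    scale_y = height / (NUM_BINS - 1)
    results = []
    for label, locs in DETECTION.findall(text):
        x1, y1, x2, y2 = (int(n) for n in LOC.findall(locs))
        box = (x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y)
        results.append((label.strip(), box))
    return results


if __name__ == "__main__":
    generated = "car<loc_0><loc_0><loc_999><loc_999>dog<loc_100><loc_200><loc_300><loc_400>"
    for label, box in parse_detections(generated, width=999, height=999):
        print(label, box)
```

The same pattern extends to grounding or region captioning by changing what the regular expression expects around the location tokens.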
A single seq2seq architecture
Florence-2's backbone is a standard encoder-decoder transformer with one non-standard component: the image encoder.
DaViT image encoder. Instead of a plain ViT, Florence-2 uses a DaViT (Dual Attention Vision Transformer) that applies both channel attention and spatial self-attention in each block. The hierarchical design produces multi-scale features that are then flattened into a sequence of visual tokens. DaViT is pretrained on ImageNet-1K before fine-tuning on FLD-5B.
Prompt concatenation. The task prompt (a short text like <OD>) is tokenized with a standard BPE tokenizer. The visual tokens from DaViT are prepended to the prompt tokens, forming a joint sequence that is passed to the transformer encoder.
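The concatenation step amounts to joining two token sequences along the sequence axis. The sketch below uses plain Python lists in place of tensors and assumes the prompt tokens have already been embedded to the same width as the visual tokens; `concat_visual_and_prompt` is a hypothetical name.

```python
from typing import List

Vector = List[float]


def concat_visual_and_prompt(visual_tokens: List[Vector],
                             prompt_embeddings: List[Vector]) -> List[Vector]:
    """Place visual tokens first, then prompt embeddings, in one encoder input."""
    widths = {len(v) for v in visual_tokens} | {len(p) for p in prompt_embeddings}
    if len(widths) != 1:
        raise ValueError("all tokens must share one embedding width")
    return list(visual_tokens) + list(prompt_embeddings)


if __name__ == "__main__":
    # Three toy visual tokens and one toy prompt embedding, width 2.
    visual = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    prompt = [[1.0, 1.0]]
    joint = concat_visual_and_prompt(visual, prompt)
    print(len(joint))  # sequence length = visual tokens + prompt tokens
```

The encoder then treats the joint sequence like any other input: self-attention lets prompt positions attend to visual positions and vice versa.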
Transformer encoder-decoder. A standard cross-attention encoder-decoder with masked self-attention in the decoder. The decoder attends over the encoded visual+prompt sequence via cross-attention and generates the output sequence autoregressively. Output tokens are drawn from the extended vocabulary (text + 1,000 location tokens). The training objective is cross-entropy over this combined vocabulary.
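A toy version of the training objective shows why location tokens need no special treatment. The sketch assumes the 1,000 location tokens are assigned IDs directly after the text vocabulary, and uses a tiny made-up text vocabulary size; neither detail is taken from the released model.

```python
import math
from typing import List

NUM_LOC_TOKENS = 1000


def loc_token_id(bin_index: int, text_vocab_size: int) -> int:
    """ID of <loc_N>, assuming location tokens follow the text vocabulary."""
    if not 0 <= bin_index < NUM_LOC_TOKENS:
        raise ValueError("bin index out of range")
    return text_vocab_size + bin_index


def cross_entropy(logits: List[float], target_id: int) -> float:
    """Negative log-probability of target_id under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target_id]


def sequence_loss(step_logits: List[List[float]], target_ids: List[int]) -> float:
    """Mean cross-entropy over decoding steps, text and location alike."""
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    text_vocab_size = 5  # hypothetical, tiny for illustration
    vocab_size = text_vocab_size + NUM_LOC_TOKENS
    # One text target and one location target, scored by the same loss.
    targets = [2, loc_token_id(999, text_vocab_size)]
    uniform = [[0.0] * vocab_size for _ in targets]
    print(sequence_loss(uniform, targets))  # log(vocab_size) for uniform logits
```

A word and a coordinate are both just indices into the same softmax, so a single loss covers every task.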
FLD-5B training data. The model is trained on the Florence Large-scale Dataset (FLD-5B), which contains 5.4 billion text-image annotation pairs across nine task types: image-level (caption, classification), region-level (object detection, dense region caption, region proposal, referencing), and pixel-level (semantic segmentation, expression segmentation, OCR). FLD-5B was assembled by an automated annotation pipeline that combined public datasets, internet-crawled data, and machine-generated annotations, then filtered for quality.
Why it mattered
Before Florence-2, building a vision system that handled detection, grounding, segmentation, captioning, and OCR typically required a separate model for each task, each with its own training pipeline and inference stack. Florence-2 compressed all of this into a single compact model trained with one objective.
The location-token formulation was the key insight: by treating coordinates as vocabulary items rather than continuous regression targets, spatial tasks become text generation tasks. This allowed Florence-2 to inherit benefits familiar from large-scale language modeling (efficient scaling, prompt-based task switching, transferability) while retaining precise spatial capabilities that earlier vision-language models lacked.
Florence-2-base (232M parameters) and Florence-2-large (771M parameters) both demonstrate strong zero-shot performance across tasks, and after fine-tuning on task-specific data they are competitive with much larger specialist models. The design has since influenced subsequent unified vision models that adopt similar prompt-and-token-sequence formulations.
Related Reading
- SAM: a complementary foundation model for segmentation that uses spatial prompts (points, boxes) rather than text tokens, showing an alternative design for promptable spatial understanding
- DETR: the transformer-based detection architecture that brought encoder-decoder transformers to object detection as a set-prediction problem, paving the way for Florence-2's unified formulation
- CLIP: the contrastive vision-language pretraining approach that established joint image-text embedding spaces, which Florence-2 extends to generative seq2seq modeling across spatial tasks
