OneFormer: One Transformer to Rule Universal Image Segmentation

CVPR 2023

Jitesh Jain1,3         Jiachen Li1*         MangTik Chiu1*         Ali Hassani1         Nikita Orlov2         Humphrey Shi1,2

1 SHI Labs @ U of Oregon & UIUC     2 Picsart AI Research     3 IIT Roorkee
*Equal Contribution











Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation in the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. First, we propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Second, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Third, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20K, Cityscapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even more performance improvement. We believe OneFormer is a significant step towards making image segmentation more universal and accessible.


We propose OneFormer, the first multi-task universal image segmentation framework based on transformers that needs to be trained only once, with a single universal architecture, a single model, and a single dataset, to outperform existing frameworks across semantic, instance, and panoptic segmentation tasks, even though the latter must be trained separately on each task using several times the resources. We introduce a task input to condition the model on the task in focus, making our architecture task-guided for training and task-dynamic for inference, all with a single model. Furthermore, we jointly train our framework by uniformly sampling ground truths from all three image segmentation domains during training. Following the recent success of transformer-based frameworks in the computer vision domain, we also formulate our framework as query-based, with an additional contrastive loss on the queries during training. Specifically, we initialize our queries as repetitions of the task token and then compute a query-text contrastive loss with the text derived from the corresponding ground-truth label. We hypothesize that a contrastive loss on the queries helps make the model more task-sensitive.

Task Conditioned Joint Training

We tackle the multi-task train-once challenge for image segmentation using a task-conditioned joint training strategy. In particular, we first uniformly sample the \(\texttt{task}\) domain \(\in\) {panoptic, semantic, instance} for the GT label. We realize the unification potential of panoptic annotations by deriving the task-specific labels from them, thus using only one set of annotations. For example, when \(\texttt{task}\) is semantic or instance, we derive the corresponding label from the GT panoptic annotation for the sampled image. Next, we extract a set of binary masks for each category present in the image from the task-specific GT label: the semantic task guarantees only one amorphous binary mask for each class present in the image, whereas the instance task yields non-overlapping binary masks for thing classes only, ignoring the stuff regions. The panoptic task denotes a single amorphous mask for each stuff class and non-overlapping masks for thing classes. Subsequently, we iterate over the set of masks to create a list of text entries (\(\mathbf{T}_\text{list}\)) with the template “a photo with a {\(\texttt{CLS}\)}”, where \(\texttt{CLS}\) is the class name for the corresponding binary mask. The number of binary masks per sample varies over the dataset. Therefore, we pad \(\mathbf{T}_\text{list}\) with “a/an {\(\texttt{task}\)} photo” entries to obtain a padded list (\(\mathbf{T}_\text{pad}\)) of constant length. We are motivated to represent the text queries as a padded list by the meaning of the object queries, which represent the objects present in an image. We later use \(\mathbf{T}_\text{pad}\) to compute a query-text contrastive loss over the object queries (\(\mathbf{Q}\)), which makes our model sensitive to inter-task differences.
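The label derivation and text-list padding described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the segment dictionaries (with "category", "is_thing", and "pixels" keys), the function names, and the choice to truncate when there are more masks than text slots are all assumptions made for the example.

```python
import random

TASKS = ("panoptic", "semantic", "instance")


def sample_task(rng):
    """Uniformly sample the task domain for one training sample."""
    return rng.choice(TASKS)


def derive_masks(segments, task):
    """Derive task-specific (class_name, pixel_set) pairs from a panoptic label.

    `segments` is a hypothetical panoptic annotation: a list of dicts with
    "category" (str), "is_thing" (bool), and "pixels" (set of pixel ids).
    """
    if task == "instance":
        # Thing instances only; stuff regions are ignored.
        return [(s["category"], set(s["pixels"])) for s in segments if s["is_thing"]]
    instances = []   # separate masks, used for things in the panoptic task
    merged = {}      # one amorphous mask per class
    for s in segments:
        if task == "panoptic" and s["is_thing"]:
            instances.append((s["category"], set(s["pixels"])))
        else:
            merged.setdefault(s["category"], set()).update(s["pixels"])
    return instances + list(merged.items())


def build_padded_text_list(masks, task, num_text):
    """Build T_list from the masks and pad it to length `num_text` (T_pad)."""
    t_list = [f"a photo with a {cls}" for cls, _ in masks]
    article = "an" if task[0] in "aeiou" else "a"
    pad_entry = f"{article} {task} photo"
    t_list = t_list[:num_text]  # assumption: drop extra entries if there are too many
    return t_list + [pad_entry] * (num_text - len(t_list))


if __name__ == "__main__":
    segments = [
        {"category": "car", "is_thing": True, "pixels": {0, 1}},
        {"category": "car", "is_thing": True, "pixels": {2, 3}},
        {"category": "road", "is_thing": False, "pixels": {4, 5, 6}},
    ]
    task = sample_task(random.Random(0))
    masks = derive_masks(segments, task)
    print(task, build_padded_text_list(masks, task, num_text=5))
```

With the two car instances above, the semantic task produces a single merged "car" entry, while the instance and panoptic tasks keep one entry per car; the remaining slots are filled with the task-specific padding text.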

We condition our architecture on the task using a task input with the template “the task is {\(\texttt{task}\)}”, which is tokenized and mapped to task-token (\(\mathbf{Q}_\text{task}\)). We use \(\mathbf{Q}_\text{task}\) to initialize and condition the object queries (\(\mathbf{Q}\)) on the \(\texttt{task}\).
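As a rough illustration of this conditioning step, the sketch below builds the task prompt, maps it to a single vector, and repeats that vector to initialize the object queries. The whitespace tokenizer, the per-word pseudo-random embeddings, and the mean pooling are toy stand-ins chosen only to keep the example self-contained; they are assumptions, not the tokenizer or learned mapping used by OneFormer.

```python
import random


def task_prompt(task):
    """Fill the task-input template."""
    return f"the task is {task}"


def embed_task(prompt, dim):
    """Toy stand-in for 'tokenize and map to a task token' (assumption).

    Each whitespace token gets a deterministic pseudo-random vector seeded by
    the token string; the task token is the mean of those vectors.
    """
    vectors = []
    for word in prompt.split():
        rng = random.Random(word)  # string seeds are deterministic across runs
        vectors.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def init_queries(q_task, num_queries):
    """Initialize the object queries as repetitions of the task token."""
    return [list(q_task) for _ in range(num_queries)]


if __name__ == "__main__":
    q_task = embed_task(task_prompt("panoptic"), dim=8)
    queries = init_queries(q_task, num_queries=4)
    print(len(queries), len(queries[0]))
```

Because every query starts from the same task-dependent vector, changing the task string changes the starting point of all queries at once, which is the sense in which the queries are conditioned on the task.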


We experiment on three widely used datasets, COCO, Cityscapes, and ADE20K, that support all three segmentation tasks: semantic, instance, and panoptic. Our OneFormer, trained only once, outperforms the individually trained specialized Mask2Former models, the previous single-architecture state of the art, on all three segmentation tasks across major datasets.

Please check our paper 📄 for extensive experiments and ablation studies.

Task Dynamic Nature of OneFormer

We demonstrate that our framework is sensitive to the task token input by setting the value of {\(\texttt{task}\)} during inference to panoptic, instance, or semantic. We report results with our Swin-L\(^{\dagger}\) OneFormer trained on the Cityscapes dataset. We observe a significant drop in the PQ and mIoU metrics when \(\texttt{task}\) is instance compared to panoptic. Moreover, PQ\(^\text{St}\) drops to 0%, and there is only a 0.2% drop in the PQ\(^\text{Th}\) metric, showing that the network learns to focus mainly on the distinct “thing” instances when the \(\texttt{task}\) is instance. Similarly, there is a sizable drop in the PQ, PQ\(^\text{Th}\), and AP metrics for the semantic task, with PQ\(^\text{St}\) improving by +0.2%, showing that our framework can segment out amorphous masks for “stuff” regions but does not predict different masks for “thing” objects. Therefore, our model can dynamically learn task-specific features during training, which is critical for a train-once multi-task architecture.

Reduced Category Misclassifications

Our query-text contrastive loss helps us establish an implicit inter-task separation during training and reduces the category misclassifications in the final predictions. Mask2Former incorrectly predicts “wall” as “fence” in the first row, “vegetation” as “terrain”, and “terrain” as “sidewalk”. In contrast, our OneFormer produces more accurate predictions in regions (inside blue boxes) with similar classes.
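This page does not spell out the form of the loss. As a rough sketch of how a query-text contrastive loss can be computed, the example below uses a symmetric InfoNCE-style objective in which the i-th object query and the i-th text embedding form the positive pair. The symmetric form, the cosine similarity, and the fixed temperature of 0.07 are assumptions made for illustration; the function names are hypothetical.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def query_text_contrastive_loss(queries, texts, temperature=0.07):
    """Symmetric contrastive loss between paired query and text embeddings."""
    q = [_normalize(v) for v in queries]
    t = [_normalize(v) for v in texts]
    # Cosine similarity of every query with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t] for qi in q]
    logits_t = [list(col) for col in zip(*logits)]
    # Query-to-text term plus text-to-query term.
    return _diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(query_text_contrastive_loss(queries, aligned))
    print(query_text_contrastive_loss(queries, swapped))
```

In this toy case the aligned pairing yields a much smaller loss than the swapped one, which is the behavior that pushes each query toward the text of its own class and task and away from the others.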

Individual vs. Joint Training

We analyze our OneFormer's performance with individual training on the panoptic, instance, and semantic segmentation tasks. OneFormer outperforms Mask2Former (the previous SOTA semi-universal image segmentation method) with every training strategy. Furthermore, with joint training, Mask2Former suffers a significant drop in performance, while OneFormer achieves the highest PQ, AP, and mIoU scores.


If you found OneFormer useful in your research, please consider starring ⭐ us on GitHub and citing 📚 us in your research!

title={{OneFormer: One Transformer to Rule Universal Image Segmentation}},
author={Jitesh Jain and Jiachen Li and MangTik Chiu and Ali Hassani and Nikita Orlov and Humphrey Shi},