Description
Hi, I'd like to share a new image segmentation paper that uses a plain ViT!
Paper: https://arxiv.org/abs/2503.19108
Code: https://github.com/tue-mps/eomt
This paper reaches near-SOTA results with a considerably simpler architecture (a vision transformer only), provided the ViT is already well pretrained. EoMT uses only the plain ViT architecture, plus a few extra learned queries and a small mask prediction module. It performs on par with ViT-Adapter + Mask2Former while being much less complex.
It would be interesting to have it in this library!
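
In case it helps whoever picks this up, here is a rough toy sketch of the idea as I read it from the description above: learned queries are processed together with the patch tokens by ordinary transformer blocks, and a small head turns the query outputs into class and mask logits. This is not the authors' code. Every name here (`toy_block`, `toy_segmenter`, `w_class`, `w_mask`) is hypothetical, the weights are random, and the single-head attention block is only a stand-in for a real pretrained ViT block. Where the queries are injected and how the mask head is built are my assumptions, so please check the paper and repo for the actual design.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def toy_block(tokens, w_qkv, w_out):
    """Stand-in for one ViT block: single-head self-attention plus a residual."""
    q, k, v = np.split(tokens @ w_qkv, 3, axis=-1)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + (attn @ v) @ w_out


def toy_segmenter(patch_tokens, queries, blocks, w_class, w_mask, grid_hw):
    """Hypothetical sketch: queries join the patch tokens, then a small head
    predicts one class vector and one mask per query."""
    num_queries = queries.shape[0]
    # Assumption: queries are simply concatenated with the patch tokens.
    tokens = np.concatenate([queries, patch_tokens], axis=0)
    for w_qkv, w_out in blocks:
        tokens = toy_block(tokens, w_qkv, w_out)
    query_out, patch_out = tokens[:num_queries], tokens[num_queries:]
    class_logits = query_out @ w_class                  # (queries, classes)
    # Assumption: mask logits are query/patch dot products at patch resolution.
    mask_logits = (query_out @ w_mask) @ patch_out.T    # (queries, patches)
    return class_logits, mask_logits.reshape(num_queries, *grid_hw)


dim, grid_hw, num_queries, num_classes = 16, (4, 4), 3, 5
num_patches = grid_hw[0] * grid_hw[1]

patch_tokens = rng.normal(size=(num_patches, dim))   # pretend patch embeddings
queries = rng.normal(size=(num_queries, dim))        # the "extra learned queries"
blocks = [
    (rng.normal(size=(dim, 3 * dim)) * 0.1, rng.normal(size=(dim, dim)) * 0.1)
    for _ in range(2)
]
w_class = rng.normal(size=(dim, num_classes)) * 0.1
w_mask = rng.normal(size=(dim, dim)) * 0.1

class_logits, mask_logits = toy_segmenter(
    patch_tokens, queries, blocks, w_class, w_mask, grid_hw
)
print(class_logits.shape, mask_logits.shape)
```

With these toy sizes, each of the 3 queries gets a 5-way class vector and a 4x4 mask at patch resolution. There is no separate decoder or adapter in the sketch, which is the part that makes the approach look easy to add on top of an existing ViT backbone.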