This project is based on aofrancani/TSformer-VO, with architectural modifications inspired by SWFormer-VO. We replace TimeSformer with a Video Swin Transformer (stages 1–3) and introduce early fusion of RGB and pseudo-depth inputs.
⚠️ This repository currently provides inference and evaluation code only.
Training code (e.g., training script, optimizer, and hyperparameter configs) will be released after paper acceptance.
This project presents VSTFusion-VO, a Swin Transformer-based monocular visual odometry framework designed for multimodal input.
By leveraging recent advances in spatiotemporal modeling and depth-aware fusion, our method achieves robust 6-DoF pose estimation from monocular RGB video and pseudo-depth maps.
Key design features include:
- A Video Swin Transformer backbone (stages 1–3) tailored for long-range temporal modeling.
- Early fusion of RGB and Depth embeddings before temporal encoding to enhance geometric consistency.
- Seamless integration into a transformer-based temporal pose estimation pipeline.
- Evaluated on the KITTI Odometry benchmark, showing consistent improvements over transformer-based baselines, including:
  - ↓3.59% Translational Error (%)
  - ↓8.76% Absolute Trajectory Error (ATE)
  - ↓2.54% Relative Pose Error (RPE)
Quantitative results (7-DoF alignment) on selected KITTI sequences:
Sequence | Trans. Error (%) | Rot. Error (°/100m) | ATE (m) | RPE (m) | RPE (°) |
---|---|---|---|---|---|
01 | 25.11 | 5.77 | 76.28 | 0.703 | 0.260 |
03 | 14.75 | 9.20 | 20.34 | 0.101 | 0.221 |
04 | 4.84 | 2.55 | 3.29 | 0.085 | 0.129 |
05 | 9.39 | 4.09 | 42.31 | 0.104 | 0.201 |
06 | 10.40 | 3.69 | 25.97 | 0.133 | 0.179 |
07 | 8.20 | 6.44 | 19.53 | 0.102 | 0.214 |
10 | 8.65 | 3.45 | 14.33 | 0.117 | 0.241 |
Evaluation was performed using kitti-odom-eval with 7-DoF alignment.
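For reference, 7-DoF alignment fits a rotation, translation, and scale that best maps the predicted trajectory onto the ground truth before the metrics are computed. The sketch below illustrates this with the standard Umeyama method; it is an illustrative example, not the toolbox's exact code.

```python
# Illustrative 7-DoF (sim(3)) alignment via the Umeyama method.
# This sketches what the evaluation toolbox does, not its exact implementation.
import numpy as np

def umeyama_alignment(pred, gt):
    """Align pred (3, N) to gt (3, N); returns rotation R, translation t, scale c."""
    mu_p = pred.mean(axis=1, keepdims=True)
    mu_g = gt.mean(axis=1, keepdims=True)
    var_p = np.mean(np.sum((pred - mu_p) ** 2, axis=0))
    cov = (gt - mu_g) @ (pred - mu_p).T / pred.shape[1]
    U, d, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                    # resolve reflection ambiguity
    R = U @ S @ Vt
    c = np.trace(np.diag(d) @ S) / var_p  # similarity scale (the 7th DoF)
    t = mu_g - c * R @ mu_p
    return R, t, c

# Aligned prediction: c * R @ pred + t
```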
VSTFusion-VO is a Swin-based monocular visual odometry framework that integrates RGB and depth information through early-stage fusion. The model leverages a Video Swin Transformer as its temporal backbone, enabling hierarchical spatiotemporal representation learning for accurate 6-DoF pose estimation. By embedding geometric information from pseudo-depth at the input level, VSTFusion-VO achieves improved results on the KITTI benchmark, demonstrating the effectiveness of multimodal fusion and video-native transformer design in visual motion estimation.
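As an illustration of the early-fusion design, the sketch below embeds RGB and pseudo-depth frames separately and fuses them into a single token stream before the Video Swin stages. Module names, dimensions, and the concatenate-and-project fusion are assumptions for illustration, not the exact VSTFusion-VO implementation.

```python
# Minimal early-fusion sketch (illustrative, not the released model code).
import torch
import torch.nn as nn

class EarlyFusionEmbed(nn.Module):
    def __init__(self, embed_dim=96, patch_size=4):
        super().__init__()
        # Separate patch embeddings for RGB (3 channels) and pseudo-depth (1 channel)
        self.rgb_embed   = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.depth_embed = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Fuse the two token streams channel-wise, then project back to embed_dim
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, rgb, depth):
        # rgb: (B, T, 3, H, W), depth: (B, T, 1, H, W)
        B, T = rgb.shape[:2]
        r = self.rgb_embed(rgb.flatten(0, 1))       # (B*T, C, H/4, W/4)
        d = self.depth_embed(depth.flatten(0, 1))   # (B*T, C, H/4, W/4)
        tokens = torch.cat([r, d], dim=1)           # (B*T, 2C, H/4, W/4)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B*T, N, 2C)
        tokens = self.fuse(tokens)                  # (B*T, N, C)
        # Restore the temporal dimension before the Video Swin stages
        return tokens.view(B, T, *tokens.shape[1:])
```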
Download the KITTI Odometry dataset (grayscale) for training and evaluation.
RGB images are stored in `.jpg` format; use `png_to_jpg.py` to convert the original `.png` files.
The depth maps are pseudo-depth maps generated by Monodepth2, predicted from the grayscale KITTI frames and saved as `.jpeg` images.
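A minimal sketch of the PNG-to-JPG conversion step is shown below; the paths are assumptions, and the actual `png_to_jpg.py` in this repository may differ.

```python
# Hypothetical conversion sketch; see png_to_jpg.py for the real script.
from pathlib import Path
from PIL import Image

SRC = Path("data/sequences")      # original KITTI .png frames (assumed location)
DST = Path("data/sequences_jpg")  # converted .jpg frames

for png in sorted(SRC.rglob("image_0/*.png")):
    out = (DST / png.relative_to(SRC)).with_suffix(".jpg")
    out.parent.mkdir(parents=True, exist_ok=True)
    Image.open(png).convert("RGB").save(out, quality=95)
```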
The data structure should be as follows:
TSformer-VO/
└── data/
├── sequences_jpg/
│ ├── 00/
│ │ └── image_0/
│ │ ├── 000000.jpg # RGB image
│ │ ├── 000000_disp.jpeg # Depth map (Monodepth2)
│ │ ├── 000001.jpg
│ │ ├── 000001_disp.jpeg
│ │ └── ...
│ ├── 01/
│ └── ...
└── poses/
├── 00.txt
├── 01.txt
└── ...
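Given this layout, an RGB frame and its Monodepth2 pseudo-depth map can be paired by filename. The helper below is a hypothetical example, not part of the released code.

```python
# Hypothetical helper showing how a frame and its pseudo-depth map are paired.
from pathlib import Path
from PIL import Image

def load_frame_pair(seq_dir: str, frame_id: int):
    base = Path(seq_dir) / "image_0"
    rgb  = Image.open(base / f"{frame_id:06d}.jpg")        # RGB frame
    disp = Image.open(base / f"{frame_id:06d}_disp.jpeg")  # Monodepth2 pseudo-depth
    return rgb, disp

# Example: first frame of sequence 00
rgb, disp = load_frame_pair("data/sequences_jpg/00", 0)
```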
Here you can find the checkpoints of our trained models.
Google Drive folder: link to checkpoints in GDrive
- Create a virtual environment using Anaconda and activate it:
conda create -n tsformer-vo python==3.8.0
conda activate tsformer-vo
- Install dependencies (with environment activated):
pip install -r requirements.txt
PS: For now, settings and hyperparameters are changed directly in variables and dictionaries inside the scripts. As future work, we plan to provide pre-set configurations via the `argparse` module for a more user-friendly interface.
In `predict_poses.py`:
- Manually set the variables that specify the checkpoint to load and the sequences to run.
Variables | Info |
---|---|
checkpoint_path | String with the path to the trained model you want to use for inference. Ex: checkpoint_path = "checkpoints/Model1" |
checkpoint_name | String with the name of the desired checkpoint (name of the .pth file). Ex: checkpoint_name = "checkpoint_model2_exp19" |
sequences | List with strings representing the KITTI sequences. Ex: sequences = ["03", "04", "10"] |
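For example, the top of `predict_poses.py` could look like this (values taken from the examples above):

```python
# Example settings in predict_poses.py (illustrative values)
checkpoint_path = "checkpoints/Model1"       # path to the trained model
checkpoint_name = "checkpoint_model2_exp19"  # name of the .pth checkpoint file
sequences = ["03", "04", "10"]               # KITTI sequences to run inference on
```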
In `plot_results.py`:
- Manually set the variables for the checkpoint and the desired sequences, as in the Inference step.
The evaluation is done with the KITTI odometry evaluation toolbox. Please go to the evaluation repository to see more details about the evaluation metrics and how to run the toolbox.
If you find this implementation helpful in your work, please consider referencing this repository.
A citation entry will be provided if a related publication becomes available.
Code adapted from TimeSformer.
Check out our previous work on monocular visual odometry: DPT-VO