This repository was archived by the owner on Dec 14, 2023. It is now read-only.
First GPU occupies more VRAM in distributed training #66
Open
Description
At this link, the cached latent should be loaded with an explicit map_location, for example:
```python
# Map the load onto this process's device instead of the device the tensor was saved from.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cached_latent = torch.load(self.cached_data_list[index], map_location=device)
```
Otherwise, in multi-GPU distributed training, torch.load restores each cached latent to the device it was saved from (typically cuda:0), so the first GPU ends up occupying far more VRAM than the other GPUs.
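For context, here is a minimal sketch of the idea under DDP; the class and names (`CachedLatentDataset`, `cached_data_list`) are illustrative, not the repository's exact code. Each process derives its device from `LOCAL_RANK` and passes it as `map_location`, so no rank accidentally deserializes its latents onto cuda:0:

```python
import os
import torch
from torch.utils.data import Dataset


class CachedLatentDataset(Dataset):
    """Illustrative dataset that loads pre-computed latents from disk."""

    def __init__(self, cached_data_list):
        self.cached_data_list = cached_data_list
        # Each DDP process targets its own GPU (or CPU if none is available).
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        self.device = (
            torch.device(f"cuda:{local_rank}")
            if torch.cuda.is_available()
            else torch.device("cpu")
        )

    def __len__(self):
        return len(self.cached_data_list)

    def __getitem__(self, index):
        # Explicit map_location keeps this rank's tensors off cuda:0.
        return torch.load(self.cached_data_list[index], map_location=self.device)
```

An alternative is `map_location="cpu"` and moving the batch to the GPU inside the training loop; that also keeps the first GPU's VRAM usage in line with the others.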