Description
My training environment is a Docker image pulled from deepspeed/deepspeed:v072_torch112_cu117, and I run it inside an overlay Docker network with:
docker run -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network train-net --name fuyx-work -v /home/fuyx/big_disk_1000/DeepSpeedExamples/applications/DeepSpeed-Chat:/root/DeepSpeed-Chat b1d
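For context, an attachable overlay network such as train-net is normally created once on a Docker Swarm manager before the containers are started; a minimal sketch, assuming default Swarm settings rather than the exact setup used here:
docker swarm init                                              # on the first host, if no swarm exists yet
docker network create --driver overlay --attachable train-net  # once, on the manager node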
Then, after completing the previous two steps, I run the last step with:
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type multi_node --step 3
My hostfile is:
jes-work slots=1
fuyx-work slots=1
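The DeepSpeed multi-node launcher dispatches to every host in the hostfile over passwordless SSH (via pdsh by default), so a quick connectivity check from the launching container looks roughly like this (hostnames taken from the hostfile above):
ssh jes-work hostname    # should print jes-work with no password prompt
ssh fuyx-work hostname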
When the run starts, I get this error:
jes-work: Traceback (most recent call last):
jes-work: File "main.py", line 522, in <module>
jes-work: main()
jes-work: File "main.py", line 390, in main
jes-work: rlhf_engine = DeepSpeedRLHFEngine(
jes-work: File "/root/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 48, in __init__
jes-work: self.actor = self._init_actor(
jes-work: File "/root/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 119, in _init_actor
jes-work: actor_engine, *_ = deepspeed.initialize(model=actor_model,
jes-work: File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 153, in initialize
jes-work: engine = DeepSpeedHybridEngine(args=args,
jes-work: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 52, in __init__
jes-work: self.create_inference_module()
jes-work: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 359, in create_inference_module
jes-work: self.create_inference_containers(self.module)
jes-work: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work: self.create_inference_containers(child, layer_id=layer_id)
jes-work: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work: self.create_inference_containers(child, layer_id=layer_id)
jes-work: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work: self.create_inference_containers(child, layer_id=layer_id)
jes-work: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 288, in create_inference_containers
jes-work: self._inference_containers.append(self.inference_policies[child.__class__][0](
jes-work: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 107, in new_inference_container
jes-work: _container.set_tensor_parallel_config(self._config.hybrid_engine.inference_tp_size, self.mp_group)
jes-work: File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
jes-work: raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
jes-work: AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group'
The deepspeed command is below; I haven't changed anything except reducing some batch sizes to lower the GPU memory pressure:
deepspeed --master_port 12346 \
--hostfile=hostfile \
main.py \
--data_path Dahoas/rm-static \
--data_split 2,4,4 \
--actor_model_name_or_path $ACTOR_MODEL_PATH \
--critic_model_name_or_path $CRITIC_MODEL_PATH \
--num_padding_at_beginning 1 \
--per_device_train_batch_size 1 \
--per_device_mini_train_batch_size 1 \
--generation_batch_numbers 1 \
--ppo_epochs 1 \
--max_answer_seq_len 256 \
--max_prompt_seq_len 256 \
--actor_learning_rate ${Actor_Lr} \
--critic_learning_rate ${Critic_Lr} \
--actor_weight_decay 0.1 \
--critic_weight_decay 0.1 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 1 \
--num_warmup_steps 100 \
--deepspeed --seed 1234 \
--enable_hybrid_engine \
--inference_tp_size 8 \
--tp_gather_partition_size 4 \
--actor_zero_stage $ACTOR_ZERO_STAGE \
--critic_zero_stage $CRITIC_ZERO_STAGE \
--actor_gradient_checkpointing \
--disable_actor_dropout \
--actor_lora_dim 128 \
--actor_lora_module_name decoder.layers. \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
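For completeness, $ACTOR_MODEL_PATH, $CRITIC_MODEL_PATH, and $OUTPUT are normally filled in from the step-1 and step-2 outputs by the top-level train.py; when launching this script by hand they would be set along these lines (the paths below are hypothetical placeholders, not the ones used in this run):
ACTOR_MODEL_PATH=/root/DeepSpeed-Chat/output/actor-models/1.3b    # hypothetical step-1 output dir
CRITIC_MODEL_PATH=/root/DeepSpeed-Chat/output/reward-models/350m  # hypothetical step-2 output dir
OUTPUT=./step3_output
mkdir -p $OUTPUT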
Activity
jomayeri commented on May 16, 2023
Hi @qingchu123, could you report which version of DeepSpeed you are running?
qingchu123 commented on May 19, 2023
@jomayeri
I ran pip show deepspeed, and it shows that I installed DeepSpeed from the latest git commit (May 13, 2023, SHA: 5c6da1f001f936234a31a238e71ca386e34eb51a).
jomayeri commented on Jun 5, 2023
@qingchu123 try adjusting --inference_tp_size to a lower number; it may be that you don't have enough GPUs across your nodes.
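For reference, the hostfile in this issue exposes only two GPUs in total, which bounds the tensor-parallel size; a quick sanity check with standard awk (a sketch, run next to the hostfile):
awk '{split($2, a, "="); s += a[2]} END {print "total slots:", s}' hostfile   # prints: total slots: 2
# --inference_tp_size must not exceed (and should evenly divide) this total, so 8 cannot work here; 1 or 2 would fit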
kkk935208447 commented on Mar 18, 2024
Thanks, it works.