[bug]AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group' #525

Open
@qingchu123

My training environment is a Docker image pulled from deepspeed/deepspeed:v072_torch112_cu117.
I run it on an overlay Docker network with:

docker run -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network train-net --name fuyx-work -v /home/fuyx/big_disk_1000/DeepSpeedExamples/applications/DeepSpeed-Chat:/root/DeepSpeed-Chat b1d

After completing the previous two steps, I ran the last step with:

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type multi_node --step 3

My hostfile is:

jes-work slots=1
fuyx-work slots=1

and I get this error:

jes-work: Traceback (most recent call last):
jes-work:   File "main.py", line 522, in <module>
jes-work:     main()
jes-work:   File "main.py", line 390, in main
jes-work:     rlhf_engine = DeepSpeedRLHFEngine(
jes-work:   File "/root/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 48, in __init__
jes-work:     self.actor = self._init_actor(
jes-work:   File "/root/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 119, in _init_actor
jes-work:     actor_engine, *_ = deepspeed.initialize(model=actor_model,
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 153, in initialize
jes-work:     engine = DeepSpeedHybridEngine(args=args,
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 52, in __init__
jes-work:     self.create_inference_module()
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 359, in create_inference_module
jes-work:     self.create_inference_containers(self.module)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work:     self.create_inference_containers(child, layer_id=layer_id)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work:     self.create_inference_containers(child, layer_id=layer_id)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work:     self.create_inference_containers(child, layer_id=layer_id)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 288, in create_inference_containers
jes-work:     self._inference_containers.append(self.inference_policies[child.__class__][0](
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 107, in new_inference_container
jes-work:     _container.set_tensor_parallel_config(self._config.hybrid_engine.inference_tp_size, self.mp_group)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
jes-work:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
jes-work: AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group'

The deepspeed command is below. I changed nothing except reducing some batch sizes to ease the pressure on the GPUs:

deepspeed --master_port 12346 \
   --hostfile=hostfile \
   main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_mini_train_batch_size 1 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 8 \
   --tp_gather_partition_size 4 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT \
    &> $OUTPUT/training.log

Activity

jomayeri (Contributor) commented on May 16, 2023

Hi @qingchu123, could you report which version of DeepSpeed you are running?

qingchu123 (Author) commented on May 19, 2023

@jomayeri I ran pip show deepspeed and it shows:

Name: deepspeed
Version: 0.9.3+5c6da1f0
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-info@microsoft.com
License: Apache Software License 2.0
Location: /opt/conda/lib/python3.8/site-packages
Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, torch, tqdm

I installed the latest DeepSpeed from git (commit from May 13, 2023, SHA: 5c6da1f001f936234a31a238e71ca386e34eb51a).
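
For anyone who needs to report the same information from inside the training environment, here is a minimal sketch using only the Python standard library. The helper names `installed_version` and `split_local_version` are made up for this example; it assumes the version string has the `public+local` shape shown in the pip output above.

```python
from importlib import metadata
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def split_local_version(version: str) -> Tuple[str, Optional[str]]:
    """Split 'X.Y.Z+local' into ('X.Y.Z', 'local'); local is None if there is no '+'."""
    public, sep, local = version.partition("+")
    return public, (local if sep else None)


if __name__ == "__main__":
    version = installed_version("deepspeed")
    if version is None:
        print("deepspeed is not installed in this environment")
    else:
        public, local = split_local_version(version)
        print(f"deepspeed {public}, local build tag: {local}")
```

For the version reported above, the local part `5c6da1f0` is the prefix of the commit SHA, which is a quick way to confirm which commit a source install was built from.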

jomayeri (Contributor) commented on Jun 5, 2023


@qingchu123 try adjusting the --inference_tp_size to a lower number, it may be you don't have enough GPUs across your nodes.
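
To make the suggestion concrete: the hostfile in the report lists two hosts with one slot each, so the job has 2 ranks, while the command passes --inference_tp_size 8. Below is a rough pre-launch check, not DeepSpeed code. The function names are hypothetical, and the rule it applies (the tensor-parallel size should not exceed, and should evenly divide, the total number of ranks) is an assumption for illustration; this thread does not show the exact check DeepSpeed performs.

```python
def count_slots(hostfile_text: str) -> int:
    """Sum the 'slots=N' values in a hostfile with lines like 'host slots=N'."""
    total = 0
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        for field in line.split()[1:]:
            key, _, value = field.partition("=")
            if key == "slots":
                total += int(value)
    return total


def largest_valid_tp_size(world_size: int, requested: int) -> int:
    """Largest size <= requested that also evenly divides world_size (assumed rule)."""
    if world_size < 1 or requested < 1:
        raise ValueError("world_size and requested must be positive")
    for tp in range(min(world_size, requested), 0, -1):
        if world_size % tp == 0:
            return tp
    return 1


if __name__ == "__main__":
    hostfile = "jes-work slots=1\nfuyx-work slots=1\n"
    world_size = count_slots(hostfile)
    requested = 8
    suggested = largest_valid_tp_size(world_size, requested)
    if suggested != requested:
        print(f"--inference_tp_size {requested} does not fit {world_size} ranks; "
              f"try {suggested} or lower")
```

Under that assumption, the setup in this issue would need --inference_tp_size 2 or 1.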

kkk935208447 commented on Mar 18, 2024

> try adjusting the --inference_tp_size to a lower number, it may be you don't have enough GPUs across your nodes.

Thanks, it works.

Labels: bug (Something isn't working), deespeed chat (DeepSpeed Chat), hybrid engine (relating to the hybrid engine)

Participants: @awan-10, @kkk935208447, @qingchu123, @jomayeri

Issue #525 · deepspeedai/DeepSpeedExamples