Qs #230

Open
zws98 opened this issue Apr 16, 2024 · 3 comments

Comments

@zws98

zws98 commented Apr 16, 2024

I trained an MoE model on 8 GPUs with 8 experts. When I ran inference in parallel, I found that each process produced a similar but slightly different result. What could be the cause of this?

@ghostplant
Contributor

You could check whether drop-less MoE mode solves your issue. It is enabled by setting capacity_factor=0.
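To illustrate why the capacity setting might matter here, below is a minimal, library-independent sketch of top-1 routing with a per-expert capacity limit. The helper name `route_top1`, the capacity formula, and the treatment of `capacity_factor=0` as "no limit" are assumptions made for illustration only; they are not taken from the library's actual implementation.

```python
import math

def route_top1(expert_ids, num_experts, capacity_factor):
    """Assign each token to its chosen expert, dropping overflow tokens.

    expert_ids: expert index picked by the gate for each token.
    capacity_factor: > 0 caps each expert; 0 is treated here as "no cap"
    (an assumption mirroring the drop-less mode mentioned above).
    Returns (kept, dropped) lists of token positions.
    """
    num_tokens = len(expert_ids)
    if capacity_factor > 0:
        # Hypothetical capacity rule: a scaled, even share of the batch.
        capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    else:
        capacity = num_tokens  # no token can overflow
    load = [0] * num_experts
    kept, dropped = [], []
    for pos, expert in enumerate(expert_ids):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(pos)
        else:
            dropped.append(pos)  # this token would skip the expert
    return kept, dropped

# Eight tokens, most of which prefer expert 0.
choices = [0, 0, 0, 0, 0, 1, 2, 3]
kept_capped, dropped_capped = route_top1(choices, num_experts=8, capacity_factor=1.25)
kept_free, dropped_free = route_top1(choices, num_experts=8, capacity_factor=0)
print("capped:", kept_capped, dropped_capped)
print("drop-less:", kept_free, dropped_free)
```

With a cap, whether a given token is kept depends on which other tokens share its batch, so the same token could be processed in one batch and dropped in another. This is one possible source of small differences, not a confirmed diagnosis.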

@zws98
Author

zws98 commented Apr 16, 2024

The results still differ from process to process, and they also differ from the results obtained with capacity_factor=1.25.

@ghostplant
Contributor

Can you share more details? I didn't follow what you described.

Outputs from different GPUs:

STEP-10: loss = 21.11541, step_time = 3.628716 sec, perf = 0.08 tflops.

[Summary] Average synchronized step_time = 0.3628715753555298 sec.
STEP-10: loss = 21.11541, step_time = 3.670310 sec, perf = 0.07 tflops.

[Summary] Average synchronized step_time = 0.36703104972839357 sec.
STEP-10: loss = 21.11541, step_time = 3.689584 sec, perf = 0.07 tflops.

[Summary] Average synchronized step_time = 0.3689584493637085 sec.
STEP-10: loss = 21.11541, step_time = 3.675405 sec, perf = 0.07 tflops.

[Summary] Average synchronized step_time = 0.36754045486450193 sec.
STEP-10: loss = 21.11541, step_time = 3.681213 sec, perf = 0.07 tflops.

[Summary] Average synchronized step_time = 0.36812126636505127 sec.
STEP-10: loss = 21.11541, step_time = 3.629702 sec, perf = 0.08 tflops.

[Summary] Average synchronized step_time = 0.3629701852798462 sec.
STEP-10: loss = 21.11541, step_time = 3.700365 sec, perf = 0.07 tflops.

[Summary] Average synchronized step_time = 0.37003653049468993 sec.
STEP-10: loss = 21.11541, step_time = 3.658189 sec, perf = 0.08 tflops.

[Summary] Average synchronized step_time = 0.3658188819885254 sec.
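One way to check logs like these mechanically is to parse the loss from each rank's `STEP` line and compare the values. The sketch below assumes the log format shown above; `parse_losses` is a hypothetical helper, and only two of the eight lines are reproduced for brevity.

```python
import re

# Matches lines such as:
# STEP-10: loss = 21.11541, step_time = 3.628716 sec, perf = 0.08 tflops.
STEP_RE = re.compile(r"STEP-(\d+): loss = ([0-9.]+), step_time = ([0-9.]+) sec")

def parse_losses(log_text):
    """Return the loss value from every STEP line found in log_text."""
    return [float(m.group(2)) for m in STEP_RE.finditer(log_text)]

log_text = """\
STEP-10: loss = 21.11541, step_time = 3.628716 sec, perf = 0.08 tflops.
STEP-10: loss = 21.11541, step_time = 3.670310 sec, perf = 0.07 tflops.
"""

losses = parse_losses(log_text)
all_equal = len(set(losses)) == 1  # True when every rank reports the same loss
print(losses, "identical across ranks:", all_equal)
```

In the output above, all eight ranks report loss = 21.11541; only step_time varies between them.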
