
[Bug] internlm docker image issue #182

Open
marks221b opened this issue Apr 8, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@marks221b

Describe the bug

The InternLM container image does not appear to have been built correctly. After pulling the image and entering a container, it cannot be used directly for training or inference, and the environment inside the container differs significantly from the runtime environment the code actually needs.
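For context, a reproduction along these lines shows the mismatch; the image name and tag here are illustrative placeholders, not confirmed tags:

```bash
# Pull the published image and open a shell in it
# (image name and tag are illustrative placeholders)
docker pull internlm/internlm:latest
docker run --gpus all -it --rm internlm/internlm:latest bash

# Inside the container, check whether the installed environment
# matches the repo's requirements before attempting training/inference
python -c "import torch; print(torch.__version__)"
pip list | grep -i -E "torch|flash"
```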

Other information

No response

@marks221b marks221b added the bug Something isn't working label Apr 8, 2024
@sunpengsdu
Contributor

Please update the Dockerfile with the latest requirements. @li126com
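(As a stopgap while the Dockerfile is being updated, the image can be rebuilt locally against the current requirements. This is a sketch only; the Dockerfile path and local tag below are assumptions, so check the repo's docker docs for the actual location.)

```bash
# Hypothetical local rebuild until the published image is fixed.
# The Dockerfile path is an assumption; verify it in the repo.
git clone https://github.com/InternLM/InternEvo.git
cd InternEvo
docker build -f docker/Dockerfile -t internevo:local .
```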

@li126com
Collaborator

li126com commented Apr 12, 2024

New docker images are ready at https://hub.docker.com/r/internlm/internevo/tags. The related docs in InternEvo have been updated.
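For anyone landing here, a typical pull-and-run flow against the new repository looks roughly like this (the tag is a placeholder; pick a concrete one from the Docker Hub page above):

```bash
# Replace <tag> with an actual tag from
# https://hub.docker.com/r/internlm/internevo/tags
docker pull internlm/internevo:<tag>
docker run --gpus all -it --rm internlm/internevo:<tag> bash
```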

yingtongxiong pushed a commit to yingtongxiong/InternEvo that referenced this issue May 20, 2024
* feat(XXX): add moe

* reformat code

* modified:   .pre-commit-config.yaml
	modified:   internlm/model/moe.py
	modified:   internlm/model/modeling_internlm.py

* modified:   internlm/model/modeling_internlm.py

* modified:   internlm/core/context/process_group_initializer.py
	modified:   internlm/core/scheduler/no_pipeline_scheduler.py
	modified:   internlm/solver/optimizer/hybrid_zero_optim.py

* modified:   internlm/model/moe.py
	modified:   internlm/moe/sharded_moe.py
	modified:   internlm/utils/parallel.py

* rollback .pre-commit-config.yaml

* add residual and other moe features

* modify grad clipping due to moe

* add param arguments

* reformat code

* add expert data support and fix bugs

* Update .pre-commit-config.yaml

* modified:   internlm/model/modeling_internlm.py

* add no-interleaved & no-overlapped moe pp support

* support zero_overlap_communication

* avoid moe parameter partition in zero optimizer

* fix the moe_loss_coeff bug

* support interleaved pp

* fix moe bugs in zero optimizer

* fix more moe bugs in zero optimizer

* fix moe bugs in zero optimizer

* add logger for moe_loss

* fix bugs with merge

* fix the pp moe bugs

* fix bug on logger

* update moe training cfg on real-dataset

* refactor code

* refactor code

* fix bugs with compute moe norm

* optimize code with moe norm computing

* fix the bug that missing scale the latent moe loss

* refactor code

* fix moe loss logger for the interleaved pp

* change the scale position for latent moe_loss

* Update 7B_sft.py

* add support for moe checkpoint

* add comments for moe

* reformat code

* fix bugs

* fix bugs

* Update .pre-commit-config.yaml

* remove moe_loss_coeff parameter passing

* fix group_norms computing in hybrid_zero_optim

* use dummy mode to generate random numbers in model construction

* replace flashatten experts by feedforward experts

* fix bugs with _compute_norm_with_moe_group

* merge upstream/develop into feature_add_moe

* merge upstream/develop into feature_add_moe

* change float16 to bfloat16

* fix interface for dense pipeline

* refactor split_moe_group code

* fix precision inconsistency

* refactor code

* Update 7B_sft.py

* refactor code

* refactor code

* refactor code

* refactor code

* refactor code for split group

* refactor code for log

* fix logger for moe

* refactor code for split param group

* fix the moe_loss for ci and val

* refactor

* fix bugs with split group

* fix bugs in save/load moe checkpoint

* add moe module to `__init__.py`

* add compatible code for old version

* update moe config file

* modify moe config file

* fix merge bugs

* update moe config file

* change condition for compatibility

---------

Co-authored-by: zhanglei <ryancheung98@163.com>
Co-authored-by: Ryan (张磊) <leizhang.real@gmail.com>