[Bug] StopIteration is raised when there is not enough data. #77

Open
Carol-gutianle opened this issue Mar 11, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@Carol-gutianle

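Describe the bug

When my data is not enough to run through the full total steps, a StopIteration error is raised. Although the call is wrapped in a try-except, next(train_state.batch_sampler_iter) on line 512 still runs past the end of the data. The reason is that train_state.batch_sampler itself advances along with the iteration, so even after train_state.batch_sampler_iter is reassigned, the new iterator is still out of bounds.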

Environment

image

Additional information

After making the following changes, the run completes successfully.
image
image

@Carol-gutianle Carol-gutianle added the bug Something isn't working label Mar 11, 2024
@sunpengsdu
Contributor

@zigzagcai

@zigzagcai
Collaborator

zigzagcai commented Mar 21, 2024

I can reproduce this bug.
The run completes without error if we switch to iter(train_state.batch_sampler.copy()) and iter(self.batch_sampler.copy()). However, train_state.batch_sampler then loses track of the current batch_count and num_consumed_samples_in_epoch, because in that case both values always stay at zero. That is not ideal when we resume training from a previous checkpoint.
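To illustrate the trade-off of the copy-based workaround, here is a minimal sketch. ToySampler is a hypothetical stand-in for a stateful batch sampler, not the project's class, and its copy() method is assumed to behave like a shallow copy. Iterating over a copy avoids the StopIteration, but the counter on the original sampler never moves.

```python
import copy


class ToySampler:
    """Hypothetical stateful sampler, used only for illustration."""

    def __init__(self, num_batches):
        self.num_batches = num_batches
        self.batch_count = 0  # progress counter that a checkpoint would save

    def copy(self):
        # Assumption: copy() returns an independent shallow copy.
        return copy.copy(self)

    def __iter__(self):
        # Iteration resumes from the stored batch_count.
        while self.batch_count < self.num_batches:
            batch = self.batch_count
            self.batch_count += 1
            yield batch


sampler = ToySampler(num_batches=3)
sampler_iter = iter(sampler.copy())

seen = []
for _ in range(5):  # ask for more steps than there are batches
    try:
        seen.append(next(sampler_iter))
    except StopIteration:
        # A fresh copy starts from the original's counter, which is still 0.
        sampler_iter = iter(sampler.copy())
        seen.append(next(sampler_iter))

print(seen)                 # batches wrap around without error
print(sampler.batch_count)  # the original sampler never recorded progress
```

In this sketch only the copies advance, so a checkpoint taken from the original sampler would record no progress.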

The root cause is that train_state.batch_sampler.batch_count and train_state.batch_sampler.num_consumed_samples_in_epoch are incremented on every iteration. Every time we re-create the generator with train_state.batch_sampler_iter = iter(train_state.batch_sampler), it starts from the last saved batch_count and num_consumed_samples_in_epoch. We want the generator to restart iteration from batch_num=0, not from the last saved position, which has already crossed the index boundary. That is why the bug occurs.
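The root cause can be reproduced in isolation. The sketch below uses a hypothetical ToySampler (not the project's sampler) whose batch_count persists across iter() calls, and a hypothetical helper next_with_naive_retry that mimics re-creating the iterator inside the exception handler without resetting any state.

```python
class ToySampler:
    """Hypothetical stateful sampler, used only for illustration."""

    def __init__(self, num_batches):
        self.num_batches = num_batches
        self.batch_count = 0  # persists across iter() calls

    def __iter__(self):
        # Resume from the saved position instead of starting at zero.
        while self.batch_count < self.num_batches:
            batch = self.batch_count
            self.batch_count += 1
            yield batch


def next_with_naive_retry(sampler, sampler_iter):
    """Re-create the iterator on StopIteration, without resetting state."""
    try:
        return next(sampler_iter)
    except StopIteration:
        # batch_count already equals num_batches, so the new generator
        # is exhausted from the start and next() raises again.
        sampler_iter = iter(sampler)
        return next(sampler_iter)


sampler = ToySampler(num_batches=3)
sampler_iter = iter(sampler)
consumed = [next(sampler_iter) for _ in range(3)]  # use up all the data

try:
    next_with_naive_retry(sampler, sampler_iter)
    retry_failed = False
except StopIteration:
    retry_failed = True

print(consumed, sampler.batch_count, retry_failed)
```

The second StopIteration is raised inside the except block, so the surrounding try-except does not catch it.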

So the solution is clear: we need to reset these two values in the StopIteration exception handler.
This bug is fixed in #102.
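The reset described above might look like the following sketch. It is not the code from #102: ToySampler and next_batch are hypothetical names, and the sampler is assumed to track progress only through batch_count and num_consumed_samples_in_epoch.

```python
class ToySampler:
    """Hypothetical stateful sampler, used only for illustration."""

    def __init__(self, num_samples, batch_size):
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.batch_count = 0
        self.num_consumed_samples_in_epoch = 0

    def __iter__(self):
        # Yield full batches, resuming from the stored counters.
        while self.num_consumed_samples_in_epoch + self.batch_size <= self.num_samples:
            start = self.num_consumed_samples_in_epoch
            self.num_consumed_samples_in_epoch += self.batch_size
            self.batch_count += 1
            yield list(range(start, start + self.batch_size))


def next_batch(sampler, sampler_iter):
    """Return (batch, iterator), restarting from zero when data runs out."""
    try:
        return next(sampler_iter), sampler_iter
    except StopIteration:
        # Reset both counters before re-creating the generator,
        # so iteration restarts from the beginning of the data.
        sampler.batch_count = 0
        sampler.num_consumed_samples_in_epoch = 0
        sampler_iter = iter(sampler)
        return next(sampler_iter), sampler_iter


sampler = ToySampler(num_samples=4, batch_size=2)
sampler_iter = iter(sampler)

batches = []
for _ in range(5):  # more steps than the data can cover in one pass
    batch, sampler_iter = next_batch(sampler, sampler_iter)
    batches.append(batch)

print(batches)
print(sampler.batch_count, sampler.num_consumed_samples_in_epoch)
```

Unlike the copy-based workaround, the counters on the sampler itself keep advancing within each pass, so they remain meaningful when saved to a checkpoint. If the data held fewer samples than one batch, the retry in this sketch would still raise StopIteration.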


4 participants