Allgather Test Suite Occasionally Sees Very Long Test Cases (Non-deterministic, Time Out) #8603
Was so far unable to reproduce the hang with a more isolated test list.
Given that individual allgather configs can easily run 100k iterations without hangs (I've also had multiple successful 1M+ runs in past days, but at 800MHz), I think this hang may have something to do with running different configurations back to back. Maybe there aren't enough configs in my more isolated test list to trigger it. I'll try again when I've got some machine downtime (i.e. when I'm doing dev work as opposed to something like active debug).
Set up a stress test pipeline branch: I added a `tt-smi-metal -r 0,1,2,3` step which resets between tests. You can use it to lazily submit many jobs with 1000s of iterations, resetting between them to ensure good board state.
I ran the allgather post-commit suite overnight in a loop with a really long timeout, so I could capture the machine in a hang state without any device or dispatcher teardown, and found something really unexpected: the post-commit tests are still running (no hangs). This is suspicious because with other attempts I was able to reproduce a hang after a couple of hours.

I found something interesting: I don't think we have a real hang here. Instead we have some pathological behaviour that's causing things to run really slowly. Here are a couple of snapshots from my log that show a multi-hour delay between adjacent test cases:

[Screenshots 1–3: log snapshots showing multi-hour gaps between adjacent test cases]

TL;DR: Not a real hang!? Instead some pathological behaviour that causes ridiculous slowdown in some part of what looks like readback.

This reminds me of an issue (#6212) I was seeing a little while ago, in that I couldn't reliably trigger the pathological behaviour deterministically, and sometimes only after a couple of runs. I wonder if that issue also popped up for the smaller shapes (like I see in a couple of the screenshots above), but I just never noticed because the extra memory never ate into swap. I wonder if the two are related.

(Update on the above: I realized I was running in debug mode. However, I tried again in release mode and saw similar behaviour -> see next comment.)
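The multi-hour gaps shown in those snapshots can be found mechanically by diffing timestamps on adjacent log lines. A minimal sketch (assuming each line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, which may not match the actual log format):

```python
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Yield (prev_line, line, gap) for adjacent timestamped log lines
    whose timestamps differ by more than `threshold`."""
    prev_ts, prev_line = None, None
    for line in lines:
        try:
            # Assumes each line starts with e.g. "2024-03-01 01:00:00"
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        if prev_ts is not None and ts - prev_ts > threshold:
            yield prev_line, line, ts - prev_ts
        prev_ts, prev_line = ts, line

# Toy log illustrating the pattern in the screenshots: the third case
# starts over three hours after the second one finished.
log = [
    "2024-03-01 01:00:00 PASSED test_all_gather[shape0]",
    "2024-03-01 01:00:05 PASSED test_all_gather[shape1]",
    "2024-03-01 04:12:41 PASSED test_all_gather[shape2]",
]
gaps = list(find_gaps(log))
```

Running this over the overnight log would list exactly the adjacent-test-case pairs with pathological delays, without having to eyeball hours of output.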
FYI @cfjchu, @tt-aho, @tt-asaigal, since this has the potential to be related to something in the runtime, and I know you guys have been dealing with difficulties related to pytorch recently. Putting it on your radar in case you have any ideas or see things in the future that could be related.
The all-gather test suite will non-deterministically hang after several successful post-commit runs.
(Update: Not actually a hang - just a very slow operation that occasionally pops up and causes the test to time out "early" -- see later comments. I think this also means this is likely not an allgather issue.)
For example, I saw the following failures on the 3rd post commit run after 2 successful ones.
For reference, here are the pytest parametrizations for `post_commit_looping`, since the post commit on main will likely look different:

There doesn't seem to be a pattern between allgather config (shape, datatype, async mode, mem config) and a hang presenting.
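As a rough illustration of what such a parametrization grid sweeps, here's a hypothetical sketch. Only the shape `[1, 1, 32, 32768]` comes from this issue; the other shapes, dtypes, and axis values are made up and will differ from the real `post_commit_looping` list:

```python
import itertools

# Hypothetical allgather parameter grid; only (1, 1, 32, 32768) is taken
# from this issue, the rest is illustrative and not the real suite.
shapes = [(1, 1, 32, 32768), (1, 1, 32, 1024)]
dtypes = ["bfloat16", "bfloat8_b"]
mem_configs = ["L1", "DRAM"]
async_modes = [True, False]

# The suite effectively runs the cartesian product of these axes
# back to back, which is where the slowdown seems to appear.
configs = list(itertools.product(shapes, dtypes, mem_configs, async_modes))
print(len(configs))  # 16 back-to-back configurations per loop
```

Because no single axis correlates with the slowdown, the suspicion above is that it's the back-to-back sequencing of many such configs, not any one of them, that triggers the behaviour.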
At this time I have no indication of the source of the hang (op vs infra vs something else). Interestingly, I have successfully run 1.5M iterations of an allgather config (8, 1, [1, 1, 32, 32768], L1, fp16), but at 800MHz.
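Given that this looks like a pathological slowdown rather than a wedge, one generic way to tell the two apart in future runs is a soft deadline that fires a callback when a test runs long but lets it keep going, instead of killing it the way a hard timeout does. A sketch using POSIX `signal.alarm` (not tt-metal-specific; assumes a Unix main thread):

```python
import signal
from contextlib import contextmanager

@contextmanager
def soft_deadline(seconds, on_expire):
    """Invoke on_expire() if the with-block runs longer than `seconds`,
    but let the block keep running: a slow op eventually finishes after
    logging the warning, while a true hang never returns at all."""
    def handler(signum, frame):
        on_expire()
    old = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)
```

Wrapping each test case (or each readback) in something like this would stamp the log the moment a case exceeds its budget, rather than only revealing the delay hours later when the next log line appears.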