
Allgather Test Suite Occasionally Sees Very Long Test Cases (Non-deterministic, Time Out) #8603

Open
SeanNijjar opened this issue May 17, 2024 · 5 comments
Labels: bug, P0_Showstopper

SeanNijjar (Contributor) commented May 17, 2024

The all-gather test suite will non-deterministically hang after several successful post-commit runs.
(Update: Not actually a hang, just a very slow operation that occasionally pops up and causes the test to time out "early"; see later comments. I think this also means it is likely not an allgather issue.)

For example, I saw the following failures on the 3rd post-commit run after 2 successful ones.

ERROR tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[True-100-mem_config1-input_dtype1-8-1-input_shape6-3-layout6] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[True-100-mem_config1-input_dtype1-4-2-input_shape0-0-layout0] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[False-100-mem_config0-input_dtype1-8-1-input_shape1-0-layout1] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[False-100-mem_config1-input_dtype0-8-1-input_shape2-3-layout2] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[False-100-mem_config1-input_dtype1-4-2-input_shape0-0-layout0] - Failed: Timeout >2400.0s

For reference, here are the pytest parametrizations for post_commit_looping, since the post-commit suite on main will likely look different. (The bracketed test IDs above decode as enable_async-num_iters-mem_config-input_dtype-num_devices-num_links-input_shape-dim-layout.)

@pytest.mark.parametrize(
    "num_devices, num_links, input_shape, dim, layout",
    [
        (4, 2, [4, 1, 256, 32], 0, ttl.tensor.Layout.TILE),
        (8, 1, [8, 1, 256, 32], 0, ttl.tensor.Layout.TILE),
        (8, 1, [1, 1, 32, 16384], 3, ttl.tensor.Layout.TILE),
        (4, 2, [1, 1, 32, 32768], 3, ttl.tensor.Layout.TILE),
        (4, 2, [4, 1, 256, 32], 0, ttl.tensor.Layout.ROW_MAJOR),
        (8, 1, [8, 1, 256, 32], 0, ttl.tensor.Layout.ROW_MAJOR),
        (8, 1, [1, 1, 32, 16384], 3, ttl.tensor.Layout.ROW_MAJOR),
        (4, 2, [1, 1, 32, 32768], 3, ttl.tensor.Layout.ROW_MAJOR),
    ],
)
@pytest.mark.parametrize(
    "input_dtype",
    [
        ttl.tensor.DataType.BFLOAT16,
        ttl.tensor.DataType.BFLOAT8_B,
    ],
)
@pytest.mark.parametrize(
    "mem_config",
    [
        ttl.tensor.MemoryConfig(buffer_type=ttl.tensor.BufferType.DRAM),
        ttl.tensor.MemoryConfig(buffer_type=ttl.tensor.BufferType.L1),
    ],
)
@pytest.mark.parametrize("num_iters", [100])  # TODO: restore to 500
@pytest.mark.parametrize("enable_async", [True, False])

There doesn't seem to be a pattern between the allgather config (shape, datatype, async mode, mem config) and whether a hang presents.
At this time I have no indication about the source of the hang (op vs infra vs something else). Interestingly, I have run 1.5M iterations of one allgather config (8, 1, [1, 1, 32, 32768], L1, fp16) successfully, but at 800MHz.

  • First thing to try is one of these configs with 1M iterations at 1GHz.
  • Next is a tight loop of these configs back to back within the same test invocation (no device close between each one) to try to force the hang out more quickly (sketch below).
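A rough sketch of that back-to-back idea. Assumptions: ttl is the same module the test file imports (guessed here as tt_lib), all_devices stands in for whatever device fixture the suite provides, and run_single_all_gather_case is a hypothetical helper representing whatever the real test does per iteration:

import tt_lib as ttl  # assumption: the same `ttl` used in the parametrization above

# Configs copied from the post_commit_looping parametrization:
# (num_devices, num_links, input_shape, dim, layout)
BACK_TO_BACK_CONFIGS = [
    (4, 2, [4, 1, 256, 32], 0, ttl.tensor.Layout.TILE),
    (8, 1, [8, 1, 256, 32], 0, ttl.tensor.Layout.TILE),
    (8, 1, [1, 1, 32, 16384], 3, ttl.tensor.Layout.TILE),
    (4, 2, [1, 1, 32, 32768], 3, ttl.tensor.Layout.TILE),
    (4, 2, [4, 1, 256, 32], 0, ttl.tensor.Layout.ROW_MAJOR),
    (8, 1, [8, 1, 256, 32], 0, ttl.tensor.Layout.ROW_MAJOR),
    (8, 1, [1, 1, 32, 16384], 3, ttl.tensor.Layout.ROW_MAJOR),
    (4, 2, [1, 1, 32, 32768], 3, ttl.tensor.Layout.ROW_MAJOR),
]


def test_all_gather_configs_back_to_back(all_devices):
    # Cycle through every config inside a single test invocation so the devices
    # are never closed and reopened between configs.
    for num_devices, num_links, input_shape, dim, layout in BACK_TO_BACK_CONFIGS:
        for _ in range(100):
            # Hypothetical stand-in for the real per-iteration test body.
            run_single_all_gather_case(
                all_devices, num_devices, num_links, input_shape, dim, layout
            )
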
SeanNijjar (Contributor, Author) commented May 17, 2024

So far I've been unable to reproduce the hang with a more isolated test list. Things I've tried:

  1. Run the above configs for 100k iterations each
    -> No hangs detected
  2. Various subsets of the post_commit_looping tests run in a loop (invoking https://pypi.org/project/pytest-repeat/ with --count=20; see the sketch after this list)
    -> No hangs detected
  3. Run the post_commit_looping test in a loop (pytest-repeat with --count=20)
    -> No hangs detected
  4. Run the post_commit_looping test in a loop (pytest-repeat with --count=20), but with each test running only num_iters=1
    -> No hangs detected
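
A sketch of the pytest-repeat invocations in (2)-(4), driven here through pytest.main (the equivalent shell form is pytest --count=20 <test node>):

import pytest

# --count is provided by the pytest-repeat plugin and reruns the selected
# tests the requested number of times within one session.
pytest.main(
    [
        "--count=20",
        "tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py"
        "::test_all_gather_on_t3000_post_commit_looping",
    ]
)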

Given that individual allgather configs can easily run 100k iterations without hangs (I've also had multiple successful 1M+ runs in recent days, but at 800MHz), I think this hang may have something to do with running different configurations back to back. I think maybe there aren't enough configs in post_commit_looping to expose whatever this bug is.

I'll try again when I've got some machine downtime (i.e. when I'm doing dev work as opposed to something like active debug).

tapspatel (Contributor) commented May 17, 2024

Set up a stress test pipeline:

branch: t3000-stress-pipeline
test_file: tt-metal/tests/scripts/t3000/run_t3000_stress_tests.sh
pipeline: https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-stress-tests.yaml

I added a tt-smi-metal -r 0,1,2,3 step that resets the boards between tests. You can use it to lazily submit many jobs with 1000s of iterations and reset between them to ensure a good board state.
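
A rough sketch of that flow (the reset command and test node come from this thread; the run count and script structure are illustrative):

import subprocess

TEST_NODE = (
    "tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py"
    "::test_all_gather_on_t3000_post_commit_looping"
)

for run in range(10):  # number of back-to-back runs is arbitrary here
    # Reset devices 0-3 so each run starts from a clean board state.
    subprocess.run(["tt-smi-metal", "-r", "0,1,2,3"], check=True)
    # Keep going even if one run fails or times out.
    subprocess.run(["pytest", TEST_NODE], check=False)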

SeanNijjar (Contributor, Author) commented May 18, 2024

I ran the allgather post-commit suite overnight in a loop with a really long timeout, so I could capture the machine in the hang state without any device or dispatcher teardown, and found something really unexpected.

The post-commit tests are still running (no hangs). This is suspicious, because in other attempts I was able to reproduce a hang after a couple of hours.

I found something interesting: I don't think we have a real hang here; instead, some pathological behaviour is causing things to run extremely slowly. Here are a couple of snapshots from my log that show a multi-hour delay between adjacent test cases:

[Screenshots 1-3: log excerpts showing multi-hour gaps between adjacent test cases]

TL;DR: Not a real hang!? Instead some pathological behaviour that causes ridiculous slowdown in some part of what looks like readback?

This reminds me of an issue (#6212) I was seeing a little while ago, in that I couldn't reliably reproduce the pathological behaviour deterministically, and sometimes only saw it after a couple of runs. I wonder if that issue also popped up for the smaller shapes (like the ones in a couple of the screenshots above), but I just never noticed because the extra memory never ate into swap. I wonder if the two are related.

(Update on the above: I realized I was running in debug mode. However, I tried again in release mode and saw similar behaviour; see the next comment.)
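
For anyone retracing these logs, a small sketch of scanning a log for such gaps automatically (the timestamp regex is an assumption about the log format and will likely need adjusting):

import re
from datetime import datetime

TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")  # assumed timestamp format


def find_gaps(log_path, min_gap_s=3600):
    # Flag adjacent timestamped log lines that are more than min_gap_s apart.
    prev_ts, prev_line = None, None
    with open(log_path) as f:
        for line in f:
            m = TS_RE.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(), "%Y-%m-%d %H:%M:%S")
            if prev_ts is not None and (ts - prev_ts).total_seconds() >= min_gap_s:
                print(f"{ts - prev_ts} gap between:\n  {prev_line.rstrip()}\n  {line.rstrip()}")
            prev_ts, prev_line = ts, line


if __name__ == "__main__":
    find_gaps("allgather_overnight.log")  # illustrative log file name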

SeanNijjar (Contributor, Author) commented:
From my release build run, I'm seeing the same thing (albeit in different places):

[Images 1-2: log excerpts from the release build showing the same multi-hour gaps]

The run is still in progress, but this is pretty reassuring: we're not actually seeing a hang at all, just some really slow operation somewhere.

SeanNijjar added the bug label on May 18, 2024
SeanNijjar changed the title from "Allgather Test Suite Hangs Non-deterministically" to "Allgather Test Suite Occasionally Sees Very Long Test Cases (Non-deterministic, Time Out)" on May 18, 2024
SeanNijjar (Contributor, Author) commented:

FYI @cfjchu, @tt-aho, @tt-asaigal: this has the potential to be related to something in the runtime, and I know you've been dealing with pytorch-related difficulties recently. Putting it on your radar in case you have any ideas or see related things in the future.
