
Allgather Test Suite Occasionally Sees Very Long Test Cases (Non-deterministic, Time Out) #8603

Open
SeanNijjar opened this issue May 17, 2024 · 5 comments
Labels: bug, P0_Showstopper

SeanNijjar (Contributor) commented May 17, 2024

The all-gather test suite will non-deterministically hang after several successful post-commit runs.
(Update: Not actually a hang, just a very slow operation that occasionally pops up and causes the test to time out "early"; see later comments. I think this also means it is likely not an allgather issue.)

For example, I saw the following failures on the 3rd post-commit run after 2 successful ones.

ERROR tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[True-100-mem_config1-input_dtype1-8-1-input_shape6-3-layout6] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[True-100-mem_config1-input_dtype1-4-2-input_shape0-0-layout0] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[False-100-mem_config0-input_dtype1-8-1-input_shape1-0-layout1] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[False-100-mem_config1-input_dtype0-8-1-input_shape2-3-layout2] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[False-100-mem_config1-input_dtype1-4-2-input_shape0-0-layout0] - Failed: Timeout >2400.0s

For reference, here are the pytest parametrizations for post_commit_looping, since the post-commit suite on main will likely look different. (The bracketed test IDs above decode as enable_async-num_iters-mem_config-input_dtype-num_devices-num_links-input_shape-dim-layout.)

@pytest.mark.parametrize(
    "num_devices, num_links, input_shape, dim, layout",
    [
        (4, 2, [4, 1, 256, 32], 0, ttl.tensor.Layout.TILE),
        (8, 1, [8, 1, 256, 32], 0, ttl.tensor.Layout.TILE),
        (8, 1, [1, 1, 32, 16384], 3, ttl.tensor.Layout.TILE),
        (4, 2, [1, 1, 32, 32768], 3, ttl.tensor.Layout.TILE),
        (4, 2, [4, 1, 256, 32], 0, ttl.tensor.Layout.ROW_MAJOR),
        (8, 1, [8, 1, 256, 32], 0, ttl.tensor.Layout.ROW_MAJOR),
        (8, 1, [1, 1, 32, 16384], 3, ttl.tensor.Layout.ROW_MAJOR),
        (4, 2, [1, 1, 32, 32768], 3, ttl.tensor.Layout.ROW_MAJOR),
    ],
)
@pytest.mark.parametrize(
    "input_dtype",
    [
        ttl.tensor.DataType.BFLOAT16,
        ttl.tensor.DataType.BFLOAT8_B,
    ],
)
@pytest.mark.parametrize(
    "mem_config",
    [
        ttl.tensor.MemoryConfig(buffer_type=ttl.tensor.BufferType.DRAM),
        ttl.tensor.MemoryConfig(buffer_type=ttl.tensor.BufferType.L1),
    ],
)
@pytest.mark.parametrize("num_iters", [100])  # TODO: restore to 500
@pytest.mark.parametrize("enable_async", [True, False])

There doesn't seem to be a pattern between the allgather config (shape, datatype, async mode, mem config) and whether a hang presents.
At this time I have no indication about the source of the hang (op vs infra vs something else). Interestingly, I have run 1.5M iterations of one allgather config (8, 1, [1, 1, 32, 32768], L1, fp16) successfully, but at 800MHz.

  • First thing to try is one of these configs with 1M iterations at 1GHz.
  • Next is a tight loop of these configs back to back within the same test invocation (no device close between each one) to try to force the hang out more quickly (sketch below).
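A rough sketch of that back-to-back idea. Assumptions: ttl is the same module the test file imports (guessed here as tt_lib), all_devices stands in for whatever device fixture the suite provides, and run_single_all_gather_case is a hypothetical helper representing whatever the real test does per iteration:

import tt_lib as ttl  # assumption: the same `ttl` used in the parametrization above

# Configs copied from the post_commit_looping parametrization:
# (num_devices, num_links, input_shape, dim, layout)
BACK_TO_BACK_CONFIGS = [
    (4, 2, [4, 1, 256, 32], 0, ttl.tensor.Layout.TILE),
    (8, 1, [8, 1, 256, 32], 0, ttl.tensor.Layout.TILE),
    (8, 1, [1, 1, 32, 16384], 3, ttl.tensor.Layout.TILE),
    (4, 2, [1, 1, 32, 32768], 3, ttl.tensor.Layout.TILE),
    (4, 2, [4, 1, 256, 32], 0, ttl.tensor.Layout.ROW_MAJOR),
    (8, 1, [8, 1, 256, 32], 0, ttl.tensor.Layout.ROW_MAJOR),
    (8, 1, [1, 1, 32, 16384], 3, ttl.tensor.Layout.ROW_MAJOR),
    (4, 2, [1, 1, 32, 32768], 3, ttl.tensor.Layout.ROW_MAJOR),
]


def test_all_gather_configs_back_to_back(all_devices):
    # Cycle through every config inside a single test invocation so the devices
    # are never closed and reopened between configs.
    for num_devices, num_links, input_shape, dim, layout in BACK_TO_BACK_CONFIGS:
        for _ in range(100):
            # Hypothetical stand-in for the real per-iteration test body.
            run_single_all_gather_case(
                all_devices, num_devices, num_links, input_shape, dim, layout
            )
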
SeanNijjar (Contributor, Author) commented May 17, 2024

So far I've been unable to reproduce the hang with a more isolated test list. Things I've tried:

  1. Run the above configs for 100k iterations each
    -> No hangs detected
  2. Various subsets of the post_commit_looping tests run in a loop (invoking https://pypi.org/project/pytest-repeat/ with --count=20; see the sketch after this list)
    -> No hangs detected
  3. Run the post_commit_looping test in a loop (pytest-repeat with --count=20)
    -> No hangs detected
  4. Run the post_commit_looping test in a loop (pytest-repeat with --count=20), but with each test running only num_iters=1
    -> No hangs detected
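
A sketch of the pytest-repeat invocations in (2)-(4), driven here through pytest.main (the equivalent shell form is pytest --count=20 <test node>):

import pytest

# --count is provided by the pytest-repeat plugin and reruns the selected
# tests the requested number of times within one session.
pytest.main(
    [
        "--count=20",
        "tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py"
        "::test_all_gather_on_t3000_post_commit_looping",
    ]
)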

Given that individual allgather configs can easily run 100k iterations without hangs (I've also had multiple successful 1M+ runs in recent days, but at 800MHz), I think this hang may have something to do with running different configurations back to back. I think maybe there aren't enough configs in post_commit_looping to expose whatever this bug is.

I'll try again when I've got some machine downtime (i.e. when I'm doing dev work as opposed to something like active debug).

tapspatel (Contributor) commented May 17, 2024

Set up a stress test pipeline:

branch: t3000-stress-pipeline
test_file: tt-metal/tests/scripts/t3000/run_t3000_stress_tests.sh
pipeline: https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-stress-tests.yaml

I added a tt-smi-metal -r 0,1,2,3 step that resets the boards between tests. You can use it to lazily submit many jobs with 1000s of iterations and reset between them to ensure a good board state.
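
A rough sketch of that flow (the reset command and test node come from this thread; the run count and script structure are illustrative):

import subprocess

TEST_NODE = (
    "tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py"
    "::test_all_gather_on_t3000_post_commit_looping"
)

for run in range(10):  # number of back-to-back runs is arbitrary here
    # Reset devices 0-3 so each run starts from a clean board state.
    subprocess.run(["tt-smi-metal", "-r", "0,1,2,3"], check=True)
    # Keep going even if one run fails or times out.
    subprocess.run(["pytest", TEST_NODE], check=False)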

SeanNijjar (Contributor, Author) commented May 18, 2024

I ran the allgather post-commit suite overnight in a loop with a really long timeout, so I could capture the machine in the hang state without any device or dispatcher teardown, and found something really unexpected.

The post-commit tests are still running (no hangs). This is suspicious, because in other attempts I was able to reproduce a hang after a couple of hours.

I found something interesting: I don't think we have a real hang here; instead, some pathological behaviour is causing things to run extremely slowly. Here are a couple of snapshots from my log that show a multi-hour delay between adjacent test cases:

[Screenshots 1-3: log excerpts showing multi-hour gaps between adjacent test cases]

TL;DR: Not a real hang!? Instead some pathological behaviour that causes ridiculous slowdown in some part of what looks like readback?

This reminds me of an issue (#6212) I was seeing a little while ago, in that I couldn't reliably reproduce the pathological behaviour deterministically, and sometimes only saw it after a couple of runs. I wonder if that issue also popped up for the smaller shapes (like the ones in a couple of the screenshots above), but I just never noticed because the extra memory never ate into swap. I wonder if the two are related.

(Update on the above: I realized I was running in debug mode. However, I tried again in release mode and saw similar behaviour; see the next comment.)
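
For anyone retracing these logs, a small sketch of scanning a log for such gaps automatically (the timestamp regex is an assumption about the log format and will likely need adjusting):

import re
from datetime import datetime

TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")  # assumed timestamp format


def find_gaps(log_path, min_gap_s=3600):
    # Flag adjacent timestamped log lines that are more than min_gap_s apart.
    prev_ts, prev_line = None, None
    with open(log_path) as f:
        for line in f:
            m = TS_RE.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(), "%Y-%m-%d %H:%M:%S")
            if prev_ts is not None and (ts - prev_ts).total_seconds() >= min_gap_s:
                print(f"{ts - prev_ts} gap between:\n  {prev_line.rstrip()}\n  {line.rstrip()}")
            prev_ts, prev_line = ts, line


if __name__ == "__main__":
    find_gaps("allgather_overnight.log")  # illustrative log file name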

SeanNijjar (Contributor, Author) commented:
From my release build run, I'm seeing the same thing (albeit in different places):

[Images 1-2: log excerpts from the release build showing the same multi-hour gaps]

The run is still in progress, but this is pretty reassuring: we're not actually seeing a hang at all, just some really slow operation somewhere.

SeanNijjar added the bug label on May 18, 2024
SeanNijjar changed the title from "Allgather Test Suite Hangs Non-deterministically" to "Allgather Test Suite Occasionally Sees Very Long Test Cases (Non-deterministic, Time Out)" on May 18, 2024
SeanNijjar (Contributor, Author) commented:

FYI @cfjchu, @tt-aho, @tt-asaigal: this has the potential to be related to something in the runtime, and I know you've been dealing with pytorch-related difficulties recently. Putting it on your radar in case you have any ideas or see related things in the future.
