Device Tilizer with multicore=True failing because of unhandled case of row-width being too large #8617

Open

cfjchu opened this issue May 17, 2024 · 0 comments
Labels: bug (Something isn't working), P1_critical

cfjchu commented May 17, 2024

Since our host-tilizer is pretty slow, we'd like to use device-tilizer instead.

Here's an example where the single-core on-device tilizer is significantly faster than the host-tilizer:


import time

import torch
from loguru import logger  # assumption: the logger used here is loguru, as in other tt-metal tests

import ttnn


def test_device_tilize(device):
    torch_tensor = torch.randn((4544, 18176), dtype=torch.bfloat16)

    # Host-tilizer: tilize and bfloat8_b conversion both happen on the host.
    start = time.time()
    tensor = ttnn.from_torch(torch_tensor, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT)
    end = time.time()
    logger.info(f"Time taken to convert to tensor using host-tilizer: {end - start}")

    # Device-tilizer: push the row-major bfloat16 tensor to DRAM, then tilize on device.
    # Single-core (use_multicore=False) works; the multicore default is what fails below.
    start = time.time()
    tensor = ttnn.from_torch(torch_tensor, layout=ttnn.ROW_MAJOR_LAYOUT, device=device, dtype=ttnn.bfloat16, memory_config=ttnn.DRAM_MEMORY_CONFIG)
    tensor = ttnn.experimental.tensor.tilize(tensor, output_dtype=ttnn.bfloat8_b, use_multicore=False)
    end = time.time()
    logger.info(f"Time taken to convert to tensor using device-tilizer: {end - start}")
Output:

tests/ttnn/unit_tests/operations/test_matmul.py::test_device_tilize 2024-05-17 20:10:59.307 | DEBUG    | ttnn:manage_config:143 - Set ttnn.CONFIG.report_name to tests/ttnn/unit_tests/operations/test_matmul.py::test_device_tilize: 2024-05-17 20:10:59 (UTC)
                 Device | INFO     | Opening user mode device driver
  Detecting chips (found 8)
2024-05-17 20:10:59.741 | INFO     | SiliconDriver   - Detected 4 PCI devices : {0, 1, 2, 3}
2024-05-17 20:10:59.867 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 0)
2024-05-17 20:10:59.868 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 4)
2024-05-17 20:10:59.912 | INFO     | SiliconDriver   - Detected 4 PCI devices : {0, 1, 2, 3}
2024-05-17 20:10:59.977 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 1)
2024-05-17 20:10:59.978 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 5)
2024-05-17 20:11:00.022 | INFO     | SiliconDriver   - Detected 4 PCI devices : {0, 1, 2, 3}
2024-05-17 20:11:00.087 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 2)
2024-05-17 20:11:00.088 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 6)
2024-05-17 20:11:00.130 | INFO     | SiliconDriver   - Detected 4 PCI devices : {0, 1, 2, 3}
2024-05-17 20:11:00.195 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 3)
2024-05-17 20:11:00.196 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 7)
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   800 MHz
2024-05-17 20:11:07.655 | INFO     | tests.ttnn.unit_tests.operations.test_matmul:test_device_tilize:640 - Time taken to convert to tensor using host-tilizer: 4.485525369644165
2024-05-17 20:11:08.005 | INFO     | tests.ttnn.unit_tests.operations.test_matmul:test_device_tilize:647 - Time taken to convert to tensor using device-tilizer: 0.34972691535949707
PASSED                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
2024-05-17 20:11:08.059 | DEBUG    | ttnn:manage_config:146 - Restored ttnn.CONFIG.report_name to None

Since ttnn.experimental.tensor.tilize(tensor, output_dtype=ttnn.bfloat8_b) uses use_multicore=True by default, and the multicore path fails for this row width, there is an unhandled case that needs to be fixed.
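
In the meantime, a caller-side fallback that only uses the multicore tilizer for narrower rows (roughly what the follow-up commits referenced below do) could look like the sketch here; ROW_WIDTH_LIMIT and tilize_to_bfp8 are made-up names for illustration, not real ttnn symbols, and the real fix belongs in the tilize op itself:

# Sketch only: fall back to the single-core device tilizer when the row is too wide
# for the multicore path. ROW_WIDTH_LIMIT is a hypothetical threshold, not a ttnn constant.
ROW_WIDTH_LIMIT = 8192

def tilize_to_bfp8(tensor):
    row_width = tensor.shape[-1]
    use_multicore = row_width <= ROW_WIDTH_LIMIT
    return ttnn.experimental.tensor.tilize(
        tensor, output_dtype=ttnn.bfloat8_b, use_multicore=use_multicore
    )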

Failure signature

                 Always | FATAL    | Statically allocated circular buffers on core range [(x=0,y=0) - (x=7,y=4)] grow to 1888576 B which is beyond max L1 size of 1499136 B
terminate called after throwing an instance of 'std::runtime_error'
  what():  TT_THROW @ tt_metal/impl/program/program.cpp:491: tt::exception
info:
Statically allocated circular buffers on core range [(x=0,y=0) - (x=7,y=4)] grow to 1888576 B which is beyond max L1 size of 1499136 B
backtrace:
 --- /home/jchu/tt-metal/build/lib/libtt_metal.so(+0x321206) [0x7f7cfd646206]
 --- tt::tt_metal::EnqueueProgramImpl(tt::tt_metal::CommandQueue&, std::variant<std::reference_wrapper<tt::tt_metal::Program>, std::shared_ptr<tt::tt_metal::Program> >, bool)
 --- tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&)
 --- tt::tt_metal::CommandQueue::run_command(tt::tt_metal::CommandInterface const&)
 --- tt::tt_metal::EnqueueProgram(tt::tt_metal::CommandQueue&, std::variant<std::reference_wrapper<tt::tt_metal::Program>, std::shared_ptr<tt::tt_metal::Program> >, bool)
 --- /home/jchu/tt-metal/tt_eager/tt_lib/_C.so(+0x8595ef) [0x7f7cfe0115ef]
 --- std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > tt::tt_metal::operation::detail::run_device_operation<std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > >(std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >, tt::tt_metal::operation::DeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > > const&, std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > const&, std::vector<std::optional<tt::tt_metal::Tensor const>, std::allocator<std::optional<tt::tt_metal::Tensor const> > > const&, std::vector<std::optional<tt::tt_metal::Tensor>, std::allocator<std::optional<tt::tt_metal::Tensor> > > const&)
 --- std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > tt::tt_metal::operation::run<std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > >(tt::tt_metal::operation::DeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > > const&, std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > const&, std::vector<std::optional<tt::tt_metal::Tensor const>, std::allocator<std::optional<tt::tt_metal::Tensor const> > > const&, std::vector<std::optional<tt::tt_metal::Tensor>, std::allocator<std::optional<tt::tt_metal::Tensor> > > const&)
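
As a rough sanity check on the "row-width too large" theory (assumption: the multicore tilize kernel sizes its input/output circular buffers to hold a full row of tiles per core; the real CB layout likely includes extra buffers, which would account for the larger 1888576 B figure above):

# Back-of-the-envelope CB size estimate for an 18176-wide row (assumptions noted above).
TILE_ELEMS = 32 * 32
BFLOAT16_TILE_BYTES = TILE_ELEMS * 2        # 2048 B per tile
BFLOAT8_B_TILE_BYTES = TILE_ELEMS + 64      # 1088 B per tile (1 B/elem + shared exponents)

tiles_per_row = 18176 // 32                 # 568 tiles
input_cb = tiles_per_row * BFLOAT16_TILE_BYTES    # 1163264 B
output_cb = tiles_per_row * BFLOAT8_B_TILE_BYTES  # 617984 B
print(input_cb + output_cb)                 # 1781248 B, already past the 1499136 B L1 limit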
cfjchu added the bug (Something isn't working) and P1_critical labels on May 17, 2024
cfjchu added commits that referenced this issue on May 20–21, 2024:
…..) api

- Selectively use multi-core device tilizer to work around #8617