Device Tilizer with multicore=True failing because of unhandled case of row-width being too large #8617

Open

cfjchu opened this issue May 17, 2024 · 0 comments
Labels: bug (Something isn't working), P1_critical

cfjchu commented May 17, 2024

Since our host-tilizer is pretty slow, we'd like to use device-tilizer instead.

Here's an example where the single-core on-device tilizer is significantly faster than the host-tilizer:


import time

import torch
from loguru import logger  # assumption: the logger used here is loguru, as in other tt-metal tests

import ttnn


def test_device_tilize(device):
    torch_tensor = torch.randn((4544, 18176), dtype=torch.bfloat16)

    # Host-tilizer: tilize and bfloat8_b conversion both happen on the host.
    start = time.time()
    tensor = ttnn.from_torch(torch_tensor, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT)
    end = time.time()
    logger.info(f"Time taken to convert to tensor using host-tilizer: {end - start}")

    # Device-tilizer: push the row-major bfloat16 tensor to DRAM, then tilize on device.
    # Single-core (use_multicore=False) works; the multicore default is what fails below.
    start = time.time()
    tensor = ttnn.from_torch(torch_tensor, layout=ttnn.ROW_MAJOR_LAYOUT, device=device, dtype=ttnn.bfloat16, memory_config=ttnn.DRAM_MEMORY_CONFIG)
    tensor = ttnn.experimental.tensor.tilize(tensor, output_dtype=ttnn.bfloat8_b, use_multicore=False)
    end = time.time()
    logger.info(f"Time taken to convert to tensor using device-tilizer: {end - start}")
Output:

tests/ttnn/unit_tests/operations/test_matmul.py::test_device_tilize 2024-05-17 20:10:59.307 | DEBUG    | ttnn:manage_config:143 - Set ttnn.CONFIG.report_name to tests/ttnn/unit_tests/operations/test_matmul.py::test_device_tilize: 2024-05-17 20:10:59 (UTC)
                 Device | INFO     | Opening user mode device driver
  Detecting chips (found 8)
2024-05-17 20:10:59.741 | INFO     | SiliconDriver   - Detected 4 PCI devices : {0, 1, 2, 3}
2024-05-17 20:10:59.867 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 0)
2024-05-17 20:10:59.868 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 4)
2024-05-17 20:10:59.912 | INFO     | SiliconDriver   - Detected 4 PCI devices : {0, 1, 2, 3}
2024-05-17 20:10:59.977 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 1)
2024-05-17 20:10:59.978 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 5)
2024-05-17 20:11:00.022 | INFO     | SiliconDriver   - Detected 4 PCI devices : {0, 1, 2, 3}
2024-05-17 20:11:00.087 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 2)
2024-05-17 20:11:00.088 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 6)
2024-05-17 20:11:00.130 | INFO     | SiliconDriver   - Detected 4 PCI devices : {0, 1, 2, 3}
2024-05-17 20:11:00.195 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 3)
2024-05-17 20:11:00.196 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.9.0 (Device 7)
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   800 MHz
2024-05-17 20:11:07.655 | INFO     | tests.ttnn.unit_tests.operations.test_matmul:test_device_tilize:640 - Time taken to convert to tensor using host-tilizer: 4.485525369644165
2024-05-17 20:11:08.005 | INFO     | tests.ttnn.unit_tests.operations.test_matmul:test_device_tilize:647 - Time taken to convert to tensor using device-tilizer: 0.34972691535949707
PASSED                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
2024-05-17 20:11:08.059 | DEBUG    | ttnn:manage_config:146 - Restored ttnn.CONFIG.report_name to None

Since ttnn.experimental.tensor.tilize(tensor, output_dtype=ttnn.bfloat8_b) uses use_multicore=True by default, and the multicore path fails for this row width, there is an unhandled case that needs to be fixed.
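
In the meantime, a caller-side fallback that only uses the multicore tilizer for narrower rows (roughly what the follow-up commits referenced below do) could look like the sketch here; ROW_WIDTH_LIMIT and tilize_to_bfp8 are made-up names for illustration, not real ttnn symbols, and the real fix belongs in the tilize op itself:

# Sketch only: fall back to the single-core device tilizer when the row is too wide
# for the multicore path. ROW_WIDTH_LIMIT is a hypothetical threshold, not a ttnn constant.
ROW_WIDTH_LIMIT = 8192

def tilize_to_bfp8(tensor):
    row_width = tensor.shape[-1]
    use_multicore = row_width <= ROW_WIDTH_LIMIT
    return ttnn.experimental.tensor.tilize(
        tensor, output_dtype=ttnn.bfloat8_b, use_multicore=use_multicore
    )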

Failure signature

                 Always | FATAL    | Statically allocated circular buffers on core range [(x=0,y=0) - (x=7,y=4)] grow to 1888576 B which is beyond max L1 size of 1499136 B
terminate called after throwing an instance of 'std::runtime_error'
  what():  TT_THROW @ tt_metal/impl/program/program.cpp:491: tt::exception
info:
Statically allocated circular buffers on core range [(x=0,y=0) - (x=7,y=4)] grow to 1888576 B which is beyond max L1 size of 1499136 B
backtrace:
 --- /home/jchu/tt-metal/build/lib/libtt_metal.so(+0x321206) [0x7f7cfd646206]
 --- tt::tt_metal::EnqueueProgramImpl(tt::tt_metal::CommandQueue&, std::variant<std::reference_wrapper<tt::tt_metal::Program>, std::shared_ptr<tt::tt_metal::Program> >, bool)
 --- tt::tt_metal::CommandQueue::run_command_impl(tt::tt_metal::CommandInterface const&)
 --- tt::tt_metal::CommandQueue::run_command(tt::tt_metal::CommandInterface const&)
 --- tt::tt_metal::EnqueueProgram(tt::tt_metal::CommandQueue&, std::variant<std::reference_wrapper<tt::tt_metal::Program>, std::shared_ptr<tt::tt_metal::Program> >, bool)
 --- /home/jchu/tt-metal/tt_eager/tt_lib/_C.so(+0x8595ef) [0x7f7cfe0115ef]
 --- std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > tt::tt_metal::operation::detail::run_device_operation<std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > >(std::optional<std::reference_wrapper<tt::tt_metal::CommandQueue> >, tt::tt_metal::operation::DeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > > const&, std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > const&, std::vector<std::optional<tt::tt_metal::Tensor const>, std::allocator<std::optional<tt::tt_metal::Tensor const> > > const&, std::vector<std::optional<tt::tt_metal::Tensor>, std::allocator<std::optional<tt::tt_metal::Tensor> > > const&)
 --- std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > tt::tt_metal::operation::run<std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > >(tt::tt_metal::operation::DeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > > const&, std::vector<tt::tt_metal::Tensor, std::allocator<tt::tt_metal::Tensor> > const&, std::vector<std::optional<tt::tt_metal::Tensor const>, std::allocator<std::optional<tt::tt_metal::Tensor const> > > const&, std::vector<std::optional<tt::tt_metal::Tensor>, std::allocator<std::optional<tt::tt_metal::Tensor> > > const&)
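
As a rough sanity check on the "row-width too large" theory (assumption: the multicore tilize kernel sizes its input/output circular buffers to hold a full row of tiles per core; the real CB layout likely includes extra buffers, which would account for the larger 1888576 B figure above):

# Back-of-the-envelope CB size estimate for an 18176-wide row (assumptions noted above).
TILE_ELEMS = 32 * 32
BFLOAT16_TILE_BYTES = TILE_ELEMS * 2        # 2048 B per tile
BFLOAT8_B_TILE_BYTES = TILE_ELEMS + 64      # 1088 B per tile (1 B/elem + shared exponents)

tiles_per_row = 18176 // 32                 # 568 tiles
input_cb = tiles_per_row * BFLOAT16_TILE_BYTES    # 1163264 B
output_cb = tiles_per_row * BFLOAT8_B_TILE_BYTES  # 617984 B
print(input_cb + output_cb)                 # 1781248 B, already past the 1499136 B L1 limit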
cfjchu added the bug (Something isn't working) and P1_critical labels on May 17, 2024
cfjchu added commits that referenced this issue on May 20–21, 2024:
…..) api

- Selectively use multi-core device tilizer to work around #8617