Segnet fails to load when using DLA, but PASSED using TensorRT.trtexec #1838

khsafkatamin opened this issue May 10, 2024 · 2 comments

@khsafkatamin
Hi @dusty-nv,

I am using a custom segnet model trained following the steps from Onixaz Pytorch Segmentation. I can run the model on the GPU device without any issues, but when I run it with the DLA device, it fails with the errors below (the only change I made is sketched after the log):


[TRT]    =============== Computing costs for 
[TRT]    *************** Autotuning format combination: Half(1572864,524288,1024,1) -> Half(6144,512,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    *************** Autotuning format combination: Half(1572864,524288,1024,1) -> Half(512,512:16,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    *************** Autotuning format combination: Half(524288,1:4,1024,1) -> Half(6144,512,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    *************** Autotuning format combination: Half(524288,1:4,1024,1) -> Half(512,512:16,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    *************** Autotuning format combination: Half(524288,524288:16,1024,1) -> Half(6144,512,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    *************** Autotuning format combination: Half(524288,524288:16,1024,1) -> Half(512,512:16,32,1) ***************
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
[TRT]    Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[TRT]    10: [optimizer.cpp::computeCosts::3728] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]}.)
[TRT]    device DLA_0, failed to build CUDA engine
[TRT]    device DLA_0, failed to load fcn_resnet18.onnx
[TRT]    segNet -- failed to load.
segnet:  failed to initialize segNet
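
For reference, the only change I made is roughly the following (a minimal sketch, not my exact code; the parameter order follows the segNet::Create() overload in segNet.h as I understand it, and the labels filename is hypothetical):

```cpp
// Sketch of how I select the device -- parameter names/order per my
// reading of segNet::Create() in jetson-inference/segNet.h:
#include <jetson-inference/segNet.h>

segNet* net = segNet::Create(
    NULL,                    // prototxt path (unused for ONNX)
    "fcn_resnet18.onnx",     // custom model
    "classes.txt",           // class labels (hypothetical filename)
    NULL,                    // class colors (optional)
    "input_0",               // input binding (matches the trtexec log)
    "output_0",              // output binding
    1,                       // maxBatchSize
    TYPE_FASTEST,            // default precision; on DLA this first tries
                             // INT8, which needs a calibrator (hence the
                             // warning), then falls back to FP16
    DEVICE_DLA_0,            // <-- the only change, was DEVICE_GPU
    true);                   // allowGPUFallback
```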

I tested the same model with trtexec and it does not show any errors (full output below). Can you please tell me what these errors mean?

  • What does it mean to provide a valid calibrator? To run the model on DLA, I only changed the device type to DEVICE_DLA here; do I have to change anything else?
    [TRT] requested fasted precision for device DLA_0 without providing valid calibrator, disabling INT8

  • Why is only the classifier.4 layer shown running on DLA?

[TRT]    ---------- Layers Running on DLA ----------
[TRT]    [DlaLayer] {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]}
[TRT]    ---------- Layers Running on GPU ----------
[TRT]    Trying to load shared library libcublas.so.11
[TRT]    Loaded shared library libcublas.so.11
[TRT]    Using cublas as plugin tactic source
[TRT]    Trying to load shared library libcublasLt.so.11
[TRT]    Loaded shared library libcublasLt.so.11
[TRT]    Using cublasLt as core library tactic source
[TRT]    [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +260, GPU +322, now: CPU 660, GPU 5053 (MiB)
[TRT]    Trying to load shared library libcudnn.so.8
[TRT]    Loaded shared library libcudnn.so.8
[TRT]    Using cuDNN as plugin tactic source
[TRT]    Using cuDNN as core library tactic source
[TRT]    [MemUsageChange] Init cuDNN: CPU +82, GPU +125, now: CPU 742, GPU 5178 (MiB)
[TRT]    Global timing cache in use. Profiling results in this builder pass will be stored.
[TRT]    Constructing optimization profile number 0 [1/1].
[TRT]    Reserving memory for host IO tensors. Host: 0 bytes

  • And what is the meaning of this error? I checked that all the layers are supported by DLA (how I verified this is sketched below):
[TRT]    --------------- Timing Runner: {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]} (DLA)
[TRT]    Skipping tactic 0x0000000000000003 due to exception Assertion context.dlaContext != nullptr failed. 
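
For what it's worth, this is roughly how I verified layer support: parse the ONNX model with the TensorRT API and ask the builder config whether each layer can run on DLA (a sketch using the standard TensorRT 8.x C++ API; error handling omitted):

```cpp
// Sketch: list which layers of the ONNX model TensorRT considers DLA-capable.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdio>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) printf("%s\n", msg);
    }
} gLogger;

int main() {
    auto* builder = nvinfer1::createInferBuilder(gLogger);
    auto* network = builder->createNetworkV2(
        1U << int(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
    auto* parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("fcn_resnet18.onnx",
                          int(nvinfer1::ILogger::Severity::kWARNING));

    // Configure for DLA core 0 with GPU fallback, same as the failing run.
    auto* config = builder->createBuilderConfig();
    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    config->setDLACore(0);
    config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);
    config->setFlag(nvinfer1::BuilderFlag::kFP16);

    for (int i = 0; i < network->getNbLayers(); ++i) {
        auto* layer = network->getLayer(i);
        printf("%-60s DLA-capable: %s\n", layer->getName(),
               config->canRunOnDLA(layer) ? "yes" : "no");
    }
    return 0;
}
```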

The model builds and runs without any error when tested with trtexec:

/usr/src/tensorrt/bin/trtexec --onnx=/home/galaxis/projects/amin/jetson-inference/data/pytorch-segmentation/fcn_resnet18.onnx --fp16 --useDLACore=0 --allowGPUFallback

Output:

[05/10/2024-12:24:42] [I] Output:
[05/10/2024-12:24:42] [I] === Build Options ===
[05/10/2024-12:24:42] [I] Max batch: explicit batch
[05/10/2024-12:24:42] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[05/10/2024-12:24:42] [I] minTiming: 1
[05/10/2024-12:24:42] [I] avgTiming: 8
[05/10/2024-12:24:42] [I] Precision: FP32+FP16
[05/10/2024-12:24:42] [I] LayerPrecisions: 
[05/10/2024-12:24:42] [I] Calibration: 
[05/10/2024-12:24:42] [I] Refit: Disabled
[05/10/2024-12:24:42] [I] Sparsity: Disabled
[05/10/2024-12:24:42] [I] Safe mode: Disabled
[05/10/2024-12:24:42] [I] DirectIO mode: Disabled
[05/10/2024-12:24:42] [I] Restricted mode: Disabled
[05/10/2024-12:24:42] [I] Build only: Disabled
[05/10/2024-12:24:42] [I] Save engine: 
[05/10/2024-12:24:42] [I] Load engine: 
[05/10/2024-12:24:42] [I] Profiling verbosity: 0
[05/10/2024-12:24:42] [I] Tactic sources: Using default tactic sources
[05/10/2024-12:24:42] [I] timingCacheMode: local
[05/10/2024-12:24:42] [I] timingCacheFile: 
[05/10/2024-12:24:42] [I] Heuristic: Disabled
[05/10/2024-12:24:42] [I] Preview Features: Use default preview flags.
[05/10/2024-12:24:42] [I] Input(s)s format: fp32:CHW
[05/10/2024-12:24:42] [I] Output(s)s format: fp32:CHW
[05/10/2024-12:24:42] [I] Input build shapes: model
[05/10/2024-12:24:42] [I] Input calibration shapes: model
[05/10/2024-12:24:42] [I] === System Options ===
[05/10/2024-12:24:42] [I] Device: 0
[05/10/2024-12:24:42] [I] DLACore: 0(With GPU fallback)
[05/10/2024-12:24:42] [I] Plugins:
[05/10/2024-12:24:42] [I] === Inference Options ===
[05/10/2024-12:24:42] [I] Batch: Explicit
[05/10/2024-12:24:42] [I] Input inference shapes: model
[05/10/2024-12:24:42] [I] Iterations: 10
[05/10/2024-12:24:42] [I] Duration: 3s (+ 200ms warm up)
[05/10/2024-12:24:42] [I] Sleep time: 0ms
[05/10/2024-12:24:42] [I] Idle time: 0ms
[05/10/2024-12:24:42] [I] Streams: 1
[05/10/2024-12:24:42] [I] ExposeDMA: Disabled
[05/10/2024-12:24:42] [I] Data transfers: Enabled
[05/10/2024-12:24:42] [I] Spin-wait: Disabled
[05/10/2024-12:24:42] [I] Multithreading: Disabled
[05/10/2024-12:24:42] [I] CUDA Graph: Disabled
[05/10/2024-12:24:42] [I] Separate profiling: Disabled
[05/10/2024-12:24:42] [I] Time Deserialize: Disabled
[05/10/2024-12:24:42] [I] Time Refit: Disabled
[05/10/2024-12:24:42] [I] NVTX verbosity: 0
[05/10/2024-12:24:42] [I] Persistent Cache Ratio: 0
[05/10/2024-12:24:42] [I] Inputs:
[05/10/2024-12:24:42] [I] === Reporting Options ===
[05/10/2024-12:24:42] [I] Verbose: Disabled
[05/10/2024-12:24:42] [I] Averages: 10 inferences
[05/10/2024-12:24:42] [I] Percentiles: 90,95,99
[05/10/2024-12:24:42] [I] Dump refittable layers:Disabled
[05/10/2024-12:24:42] [I] Dump output: Disabled
[05/10/2024-12:24:42] [I] Profile: Disabled
[05/10/2024-12:24:42] [I] Export timing to JSON file: 
[05/10/2024-12:24:42] [I] Export output to JSON file: 
[05/10/2024-12:24:42] [I] Export profile to JSON file: 
[05/10/2024-12:24:42] [I] 
[05/10/2024-12:24:42] [I] === Device Information ===
[05/10/2024-12:24:42] [I] Selected Device: Xavier
[05/10/2024-12:24:42] [I] Compute Capability: 7.2
[05/10/2024-12:24:42] [I] SMs: 8
[05/10/2024-12:24:42] [I] Compute Clock Rate: 1.377 GHz
[05/10/2024-12:24:42] [I] Device Global Memory: 31002 MiB
[05/10/2024-12:24:42] [I] Shared Memory per SM: 96 KiB
[05/10/2024-12:24:42] [I] Memory Bus Width: 256 bits (ECC disabled)
[05/10/2024-12:24:42] [I] Memory Clock Rate: 1.377 GHz
[05/10/2024-12:24:42] [I] 
[05/10/2024-12:24:42] [I] TensorRT version: 8.5.2
[05/10/2024-12:24:43] [I] [TRT] [MemUsageChange] Init CUDA: CPU +187, GPU +0, now: CPU 216, GPU 5564 (MiB)
[05/10/2024-12:24:44] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +106, GPU +100, now: CPU 344, GPU 5678 (MiB)
[05/10/2024-12:24:44] [I] Start parsing network model
[05/10/2024-12:24:44] [I] [TRT] ----------------------------------------------------------------
[05/10/2024-12:24:44] [I] [TRT] Input filename:   /home/galaxis/projects/amin/jetson-inference/data/pytorch-segmentation/fcn_resnet18.onnx
[05/10/2024-12:24:44] [I] [TRT] ONNX IR version:  0.0.7
[05/10/2024-12:24:44] [I] [TRT] Opset version:    14
[05/10/2024-12:24:44] [I] [TRT] Producer name:    pytorch
[05/10/2024-12:24:44] [I] [TRT] Producer version: 2.0.0
[05/10/2024-12:24:44] [I] [TRT] Domain:           
[05/10/2024-12:24:44] [I] [TRT] Model version:    0
[05/10/2024-12:24:44] [I] [TRT] Doc string:       
[05/10/2024-12:24:44] [I] [TRT] ----------------------------------------------------------------
[05/10/2024-12:24:44] [I] Finish parsing network model
[05/10/2024-12:24:48] [I] [TRT] ---------- Layers Running on DLA ----------
[05/10/2024-12:24:48] [I] [TRT] [DlaLayer] {ForeignNode[/backbone/conv1/Conv.../classifier/classifier.4/Conv]}
[05/10/2024-12:24:48] [I] [TRT] ---------- Layers Running on GPU ----------
[05/10/2024-12:24:50] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +260, GPU +190, now: CPU 650, GPU 5970 (MiB)
[05/10/2024-12:24:50] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +82, GPU +80, now: CPU 732, GPU 6050 (MiB)
[05/10/2024-12:24:50] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[05/10/2024-12:24:56] [I] [TRT] Total Activation Memory: 32512303104
[05/10/2024-12:24:56] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[05/10/2024-12:24:57] [I] [TRT] Total Host Persistent Memory: 96
[05/10/2024-12:24:57] [I] [TRT] Total Device Persistent Memory: 0
[05/10/2024-12:24:57] [I] [TRT] Total Scratch Memory: 0
[05/10/2024-12:24:57] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 26 MiB, GPU 22 MiB
[05/10/2024-12:24:57] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 2 steps to complete.
[05/10/2024-12:24:57] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.051169ms to assign 2 blocks to 2 nodes requiring 4206592 bytes.
[05/10/2024-12:24:57] [I] [TRT] Total Activation Memory: 4206592
[05/10/2024-12:24:57] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +26, GPU +0, now: CPU 26, GPU 0 (MiB)
[05/10/2024-12:24:57] [I] Engine built in 14.6812 sec.
[05/10/2024-12:24:57] [I] [TRT] Loaded engine size: 26 MiB
[05/10/2024-12:24:57] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +26, GPU +0, now: CPU 26, GPU 0 (MiB)
[05/10/2024-12:24:57] [I] Engine deserialized in 0.00459798 sec.
[05/10/2024-12:24:57] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +4, now: CPU 26, GPU 4 (MiB)
[05/10/2024-12:24:57] [I] Setting persistentCacheLimit to 0 bytes.
[05/10/2024-12:24:57] [I] Using random values for input input_0
[05/10/2024-12:24:57] [I] Created input binding for input_0 with dimensions 1x3x512x1024
[05/10/2024-12:24:57] [I] Using random values for output output_0
[05/10/2024-12:24:57] [I] Created output binding for output_0 with dimensions 1x12x16x32
[05/10/2024-12:24:57] [I] Starting inference
[05/10/2024-12:25:00] [I] Warmup completed 10 queries over 200 ms
[05/10/2024-12:25:00] [I] Timing trace has 145 queries over 3.05476 s
[05/10/2024-12:25:00] [I] 
[05/10/2024-12:25:00] [I] === Trace details ===
[05/10/2024-12:25:00] [I] Trace averages of 10 runs:
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9335 ms - Host latency: 21.4217 ms (enqueue 0.418831 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.882 ms - Host latency: 21.372 ms (enqueue 0.459833 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9385 ms - Host latency: 21.4352 ms (enqueue 0.355542 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.8787 ms - Host latency: 21.365 ms (enqueue 0.490649 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9028 ms - Host latency: 21.4042 ms (enqueue 0.437756 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9199 ms - Host latency: 21.4204 ms (enqueue 0.409814 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.882 ms - Host latency: 21.3914 ms (enqueue 0.393835 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.936 ms - Host latency: 21.439 ms (enqueue 0.41676 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.8821 ms - Host latency: 21.3883 ms (enqueue 0.431421 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9209 ms - Host latency: 21.4322 ms (enqueue 0.465869 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9178 ms - Host latency: 21.425 ms (enqueue 0.431006 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9407 ms - Host latency: 21.4461 ms (enqueue 0.395068 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 21.0056 ms - Host latency: 21.5125 ms (enqueue 0.414478 ms)
[05/10/2024-12:25:00] [I] Average on 10 runs - GPU latency: 20.9628 ms - Host latency: 21.4675 ms (enqueue 0.46853 ms)
[05/10/2024-12:25:00] [I] 
[05/10/2024-12:25:00] [I] === Performance summary ===
[05/10/2024-12:25:00] [I] Throughput: 47.467 qps
[05/10/2024-12:25:00] [I] Latency: min = 21.3491 ms, max = 21.6797 ms, mean = 21.4235 ms, median = 21.4022 ms, percentile(90%) = 21.5027 ms, percentile(95%) = 21.571 ms, percentile(99%) = 21.6519 ms
[05/10/2024-12:25:00] [I] Enqueue Time: min = 0.251709 ms, max = 0.654846 ms, mean = 0.43047 ms, median = 0.422363 ms, percentile(90%) = 0.566406 ms, percentile(95%) = 0.595459 ms, percentile(99%) = 0.63623 ms
[05/10/2024-12:25:00] [I] H2D Latency: min = 0.468445 ms, max = 0.602295 ms, mean = 0.495972 ms, median = 0.494629 ms, percentile(90%) = 0.51416 ms, percentile(95%) = 0.517578 ms, percentile(99%) = 0.540039 ms
[05/10/2024-12:25:00] [I] GPU Compute Time: min = 20.861 ms, max = 21.1443 ms, mean = 20.9222 ms, median = 20.908 ms, percentile(90%) = 21.0013 ms, percentile(95%) = 21.0591 ms, percentile(99%) = 21.1084 ms
[05/10/2024-12:25:00] [I] D2H Latency: min = 0.00415039 ms, max = 0.00634766 ms, mean = 0.00532332 ms, median = 0.00524902 ms, percentile(90%) = 0.00561523 ms, percentile(95%) = 0.00579834 ms, percentile(99%) = 0.00610352 ms
[05/10/2024-12:25:00] [I] Total Host Walltime: 3.05476 s
[05/10/2024-12:25:00] [I] Total GPU Compute Time: 3.03372 s
[05/10/2024-12:25:00] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/10/2024-12:25:00] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=/home/galaxis/projects/amin/jetson-inference/data/pytorch-segmentation/fcn_resnet18.onnx --fp16 --useDLACore=0 --allowGPUFallback
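
Since trtexec builds the engine successfully from the same ONNX file, one workaround I'm considering is building the engine offline with trtexec (adding --saveEngine=fcn_resnet18.engine to the command above) and deserializing it directly at runtime, bypassing the in-process build that hits the dlaContext assertion. A minimal sketch, assuming the standard TensorRT runtime API, with error handling omitted:

```cpp
// Sketch: deserialize an engine pre-built by trtexec with --saveEngine,
// instead of building it in-process.
#include <NvInfer.h>
#include <fstream>
#include <iterator>
#include <vector>

nvinfer1::ICudaEngine* loadEngine(const char* path, nvinfer1::ILogger& logger)
{
    // Read the serialized engine file into memory.
    std::ifstream file(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    runtime->setDLACore(0);  // must match the core the engine was built for
    return runtime->deserializeCudaEngine(blob.data(), blob.size());
}
```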

@dusty-nv
Owner

@khsafkatamin not sure, I haven't tried those models on DLA; it seems DLA doesn't support all the layers. You could check DeepStream for other models that work with DLA.

@khsafkatamin
Author

@dusty-nv Thank you for your prompt response and suggestion. I will look into DeepStream.

But one thing: I set allowGPUFallback=true, but it still shows the error. Do you know anything about that?
