[Performance] LDM optimization patches #15824

drhead · 2024-05-17T16:16:17Z

Description

Change 1: Timestep Embedding Patch

Fixes a blocking op in the timestep embedding. It was creating a tensor on the CPU and then moving it to the GPU, which forced a sync every step.
Combined with the other performance PRs (mine and HCL's), Torch's dispatch queue should be completely unblocked (until extensions with similar problems mess it up). This should allow near-constant 100% GPU usage.
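For illustration, the difference can be sketched as below. This is a minimal sketch and not the exact code in this PR: both function names are hypothetical, the body is a simplified sinusoidal embedding in the style of LDM's timestep embedding, and it assumes an even `dim` (padding for odd sizes is left out).

```python
import math

import torch


def timestep_embedding_cpu_then_move(timesteps, dim, max_period=10000):
    """Old pattern (hypothetical name): build freqs on the CPU, then copy over."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(max_period) * torch.arange(half, dtype=torch.float32) / half
    ).to(device=timesteps.device)  # host-to-device copy on every call
    args = timesteps[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


def timestep_embedding_on_device(timesteps, dim, max_period=10000):
    """Patched pattern (hypothetical name): allocate freqs on the target device."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(max_period)
        * torch.arange(half, dtype=torch.float32, device=timesteps.device)
        / half
    )  # no CPU tensor is created, so there is nothing to copy across
    args = timesteps[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    t = torch.tensor([0, 10, 500, 999], device=device)
    old = timestep_embedding_cpu_then_move(t, 320)
    new = timestep_embedding_on_device(t, 320)
    print(new.shape, torch.allclose(old, new))
```

The two functions should return the same values; the only intended difference is where the intermediate tensor is allocated.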

Change 2: SpatialTransformer.forward einops removal

Changes the function to use native torch reshape/view/permute ops and removes the .contiguous() call.
Prevents 32 calls to aten::copy_ and void at::native::elementwise_kernel<128, 4, at::nati... per forward pass (SD 1.5). The speedup seems to be around 6-8 ms per forward pass, but my profiler is a little inconsistent with the timing (512x512, batch 4, overclocked 3090).
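As a rough sketch of the idea (not the exact code in this PR), the einops patterns 'b c h w -> b (h w) c' and its inverse can be written with native ops as follows. The helper names `to_tokens` and `to_image` are hypothetical, and `reshape` is used here as the conservative choice since it falls back to a copy only when a view is not possible.

```python
import torch


def to_tokens(x):
    """(b, c, h, w) -> (b, h*w, c), like 'b c h w -> b (h w) c'."""
    b, c, h, w = x.shape
    # permute only changes strides; h and w can then be merged without a copy
    # when x is a contiguous NCHW tensor.
    return x.permute(0, 2, 3, 1).reshape(b, h * w, c)


def to_image(tokens, h, w):
    """(b, h*w, c) -> (b, c, h, w), like 'b (h w) c -> b c h w'."""
    b, _, c = tokens.shape
    return tokens.reshape(b, h, w, c).permute(0, 3, 1, 2)


if __name__ == "__main__":
    x = torch.randn(4, 320, 64, 64)
    tokens = to_tokens(x)
    restored = to_image(tokens, 64, 64)
    print(tokens.shape, restored.shape, torch.equal(x, restored))
```

Neither helper calls .contiguous(), so no extra copy is requested explicitly; whether a downstream op triggers one depends on what that op requires of its input layout.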

drhead · 2024-05-17T16:22:07Z

I think #18620 might need to be merged before tests will pass on this.

w-e-w · 2024-05-17T16:44:25Z

so we need to wait 2769 new posts to merge this 🙃

drhead · 2024-05-17T16:46:32Z

Upon further review I think it would be sufficient for #15820 to be merged first lol

drhead · 2024-05-17T17:19:55Z

Added another patch, and it passes tests now.

Patch timestep embedding to create tensor on-device

53d6708

drhead requested a review from AUTOMATIC1111 as a code owner May 17, 2024 16:16

Add transformer forward patch

cc9ca67

drhead changed the title ~~Patch timestep embedding to create tensor on-device~~ LDM optimization patches May 17, 2024

drhead changed the title ~~LDM optimization patches~~ [Performance] LDM optimization patches May 21, 2024

AUTOMATIC1111 approved these changes Jun 8, 2024


Merge branch 'dev' into patch-4

ebfc9f6

AUTOMATIC1111 merged commit 93b53dc into AUTOMATIC1111:dev Jun 8, 2024
3 checks passed