Add Betsy to speed up BC6 compression #91535

Open
wants to merge 1 commit into
base: master

BlueCube3310
Contributor

@BlueCube3310 BlueCube3310 commented May 3, 2024

Excerpt of #91150.

Implements the Betsy GPU texture compressor to drastically speed up BC6 compression, which is essential for compressed baked lightmaps.

Performance (compression ONLY):
Build - Production 64 bit,
CPU - Ryzen 9 5900x,
GPU - RTX 4060 TI

| Image | CVTT | Betsy |
| --- | --- | --- |
| Symmetrical garden 8k .hdr with mipmaps | 92.4 s | 595 ms |
| Cobblestone Street Night .hdr 4k with mipmaps | 26.5 s | 362 ms |
| Laufenurg Church 8k .hdr with mipmaps | 99.3 s | 522 ms |
| Little Paris 8k .hdr with mipmaps | 92.7 s | 467 ms |

TODO:

  • optimize the setup (shader compilation, RGB->RGBA conversion), since that takes up most of the time,
  • fix signed compression (when importing a layered texture, some slices are signed and others aren't, so the combined texture breaks).

@BlueCube3310 BlueCube3310 marked this pull request as ready for review May 3, 2024 20:44
@BlueCube3310 BlueCube3310 requested review from a team as code owners May 3, 2024 20:44
@clayjohn
Member

clayjohn commented May 3, 2024

This is going to help immensely with our lightmapper! Currently the lightmapper doesn't compress its results because compression is so slow. With this, we can finally re-enable compression by default.

@BlueCube3310
Contributor Author

In terms of performance, a layered 4096x2048x8 image takes ~11 seconds to import on an RTX 4060TI.

Update: After optimizing the RGB->RGBA conversion and disabling the forced auto mipmap generation for layered textures (is there any reason why it was enabled in the first place?), the compression time goes down to ~7 seconds, and that's still mostly CPU bound. I'm hoping to bring it down to around 2 to 3 seconds.

@BlueCube3310
Contributor Author

BlueCube3310 commented May 4, 2024

Signed compression is now supported, though because the sign detection happens on individual layers, directional lightmaps break when recombined. The same issue happens in master.

@fire
Member

fire commented May 4, 2024

Is there a way for BC6 to run sign detection either on individual layers or on the entire set at once, without too much research time?

It seems like we need to implement a batch system?

@BlueCube3310
Contributor Author

This is tricky since each layer is compressed as an independent image, and as such one layer isn't aware of the other ones. IMO the layered/3d/cube texture import pipeline needs a light rework as the current approach also causes other problems due to this issue.
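
The mismatch described above can be illustrated with a small sketch. The function names and the two format labels are hypothetical and are not Godot's import code; the sketch only contrasts a per-layer sign check with a check over the whole set of layers.

```python
# Hypothetical sketch: names and format labels are illustrative only.

def has_negative(texels):
    """Return True if any channel value in the layer is below zero."""
    return any(v < 0.0 for v in texels)

def pick_format_per_layer(layers):
    """Per-layer decision: each layer only looks at its own texels."""
    return ["BC6H_SIGNED" if has_negative(layer) else "BC6H_UNSIGNED"
            for layer in layers]

def pick_format_for_set(layers):
    """Whole-set decision: one negative value anywhere makes every layer signed."""
    signed = any(has_negative(layer) for layer in layers)
    fmt = "BC6H_SIGNED" if signed else "BC6H_UNSIGNED"
    return [fmt] * len(layers)

layers = [
    [0.2, 0.8, 1.5],    # all values non-negative
    [-0.3, 0.1, 0.9],   # contains a negative value
]
per_layer = pick_format_per_layer(layers)
whole_set = pick_format_for_set(layers)
print(per_layer)  # layers disagree, so they cannot be recombined as-is
print(whole_set)  # every layer shares one format
```

With a per-layer check the two slices end up in different formats, which is the situation that breaks the recombined texture; a single decision over the whole set avoids that at the cost of needing all layers available before compression starts.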

@BlueCube3310
Contributor Author

BlueCube3310 commented May 6, 2024

I did some benchmarks; the difference in performance is incredible (up to 200 times faster on some large images).
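
As a rough check of the "up to 200 times" figure, the per-image speedups can be recomputed from the compression-only timings listed in the PR description. This is only arithmetic on those published numbers; the variable names are made up for the example.

```python
# Compression-only timings from the PR description: (CVTT seconds, Betsy seconds).
timings = {
    "Symmetrical garden 8k": (92.4, 0.595),
    "Cobblestone Street Night 4k": (26.5, 0.362),
    "Laufenurg Church 8k": (99.3, 0.522),
    "Little Paris 8k": (92.7, 0.467),
}

# Speedup is simply the CVTT time divided by the Betsy time.
speedups = {name: cvtt / betsy for name, (cvtt, betsy) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")

best = max(speedups.values())
worst = min(speedups.values())
print(f"range: {worst:.1f}x to {best:.1f}x")
```

The largest ratio comes from the Little Paris image and lands just under 200×, while the 4k image shows the smallest gain.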

Member

@Calinou Calinou left a comment

Tested locally, it works as expected.

Benchmark

PC specifications
  • CPU: Intel Core i9-13900K
  • GPU: NVIDIA GeForce RTX 4090
  • RAM: 64 GB (2×32 GB DDR5-5800 C30)
  • SSD: Solidigm P44 Pro 2 TB
  • OS: Linux (Fedora 39)

Using a Linux optimized (optimize=speed lto=full) editor build.

| Image | CVTT | Betsy |
| --- | --- | --- |
| French garden 8k with mipmaps | 52.3 s | 4.2 s |
| German town at night 4k with mipmaps | 14.1 s | 1.1 s |
| Church at starry night 8k with mipmaps | 56.7 s | 4.1 s |

This is on average a speedup of 13.1×.
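
How the 13.1× average was derived is not stated. A quick check on the timings above, offered as an assumption rather than the method actually used, suggests it matches the ratio of total times rather than the mean of the per-image ratios:

```python
# Whole-import timings in seconds, taken from the benchmark table above.
cvtt = [52.3, 14.1, 56.7]
betsy = [4.2, 1.1, 4.1]

# Option 1: total CVTT time divided by total Betsy time.
ratio_of_totals = sum(cvtt) / sum(betsy)

# Option 2: average of the three per-image speedups.
mean_of_ratios = sum(c / b for c, b in zip(cvtt, betsy)) / len(cvtt)

print(f"ratio of totals: {ratio_of_totals:.1f}x")
print(f"mean of ratios:  {mean_of_ratios:.1f}x")
```

Both figures agree to within about 0.1×, so the conclusion is the same either way.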

It's strange how my import speeds are quite a bit slower than @BlueCube3310's though, considering my GPU is significantly faster. I measured the times by starting the editor with --verbose, which prints the time taken after every reimport. The texture was assigned to a StandardMaterial3D so it was detected as used in 3D, which enables VRAM compression and mipmaps.

Output quality seems pretty much identical – I can't spot any differences.

Impact on stripped binary size:

115,089,512  godot.linuxbsd.editor.x86_64
115,118,184  godot.linuxbsd.editor.x86_64.betsy-bc6h

This is only 30 KB and only affects editor builds, so it's pretty reasonable.
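
The size difference can be verified directly from the two byte counts listed above; the snippet below is only that subtraction, with illustrative variable names.

```python
# Stripped binary sizes in bytes, copied from the listing above.
baseline = 115_089_512
with_betsy = 115_118_184

delta = with_betsy - baseline
print(f"{delta} bytes = {delta / 1024:.0f} KiB")
print(f"{delta / baseline * 100:.3f}% of the baseline binary")
```

The difference is 28,672 bytes, which is exactly 28 KiB and consistent with the rounded 30 KB figure.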

Great work 🙂

PS: hdri-haven.com is a typosquatting site, use hdrihaven.com (which redirects to polyhaven.com) instead.

@BlueCube3310
Contributor Author

BlueCube3310 commented May 6, 2024

> It's strange how my import speeds are quite a bit slower than @BlueCube3310's though, considering my GPU is significantly faster.

My results only contain the time it takes to execute the compress function, and not the whole import process (decoding, mipmap generation, etc.). On my machine the whole process took ~14s with Betsy.

Edit: Both CVTT and Betsy now print the compression time with verbose.

> PS: hdri-haven.com is a typosquatting site, use hdrihaven.com (which redirects to polyhaven.com) instead.

Redirected the sources to polyhaven.

@BlueCube3310 BlueCube3310 force-pushed the betsy-bc6h branch 3 times, most recently from 35cee57 to b567ad0 Compare May 7, 2024 13:14
@fire
Member

fire commented May 7, 2024

I was thinking about the TODO item "fix signed compression (when importing a layered texture, some slices are signed and others aren't, so the combined texture breaks)". Can we disable the signed path or resolve it some other way, or do we have to implement batched sign detection properly?

@BlueCube3310
Contributor Author

The GPU compressor can now be toggled from the project settings and the compressor works in rendering modes other than Forward+.

@BlueCube3310
Contributor Author

So I tried compressing a directional lightmap with Betsy and noticed some weird artifacts:
[image: betsy]

I thought this was a compressor-specific issue, so I tried CVTT (the one used by Godot):
[image: cvtt]

I have no idea why this happens. It could be a format limitation, or simply a shared issue between the compressors. Either way, directional lightmaps should probably remain uncompressed.

@jcostello
Contributor

Ohh, that is a shame. Directional lightmaps are so heavy. Maybe for the moment you can use compression on non-directional lightmaps until the issue is solved.

@fire
Member

fire commented May 9, 2024

Can you compress something else that has signed floats, for comparison, and also test it against a stable signed-float BC6 encoder?

@BlueCube3310
Contributor Author

BlueCube3310 commented May 9, 2024

> Can you compress something else that has signed floats, for comparison, and also test it against a stable signed-float BC6 encoder?

[image: bc6]

Left: Microsoft's DirectXTex,
Middle: CVTT,
Right: Betsy

So it looks like it's an issue with the compressors.

@fire
Member

fire commented May 9, 2024

Is there a quality knob we can maximize for Betsy before we try too hard on debugging?

There is a "quality" knob mentioned here https://github.com/knarkowicz/GPURealTimeBC6H?tab=readme-ov-file#gpurealtimebc6h

#include "UavCrossPlatform_piece_all.glsl"

#VERSION_DEFINES
#define QUALITY
Member

@fire fire May 9, 2024

@BlueCube3310 Here is the quality define.

Contributor Author

Disabling the quality setting gets rid of the artifacts, but it also removes the negative values. Similarly, DirectXTex also seems to clamp the values to 0.

This comment was marked as outdated.

Contributor

@BlueCube3310 What GPU and OS are you using? Are you able to try other GPUs?

Sometimes the artifacts are caused by genuine driver bugs.

Contributor Author

I'm on Windows 10 64bit, I've tested an RTX 4060TI and a GTX 1660TI on the latest drivers, both exhibit the artifacting.

@akien-mga
Member

CC @darksylinc, FYI. Finally putting Betsy to good use in Godot!

Comment on lines +593 to +598
EncodeP1(block, blockMSLE, texels);
#ifdef QUALITY
for (uint i = 0u; i < 32u; ++i) {
EncodeP2Pattern(block, blockMSLE, i, texels);
}
#endif
Member

@fire fire May 10, 2024

Suggested change
- EncodeP1(block, blockMSLE, texels);
- #ifdef QUALITY
- for (uint i = 0u; i < 32u; ++i) {
- EncodeP2Pattern(block, blockMSLE, i, texels);
- }
- #endif
+ EncodeP1(block, blockMSLE, texels);
+ #ifdef QUALITY
+ for (uint i = 0u; i < 64u; ++i) {
+ EncodeP2Pattern(block, blockMSLE, i, texels);
+ }
+ #endif

I wonder if I'll cause a driver hang if I double the search space for an encoding value. What's your setup to output the three images for bc6?

Contributor Author

I'm using this image as a source: testout.zip.

For CVTT I disable the GPU compressor from the settings, for Betsy I leave it enabled, and for DirectXTex I use a custom program to convert the EXR to DDS with BC6SF compression.

Edit:
It looks like NVIDIA Texture Tools can preserve the negative data without artifacts:
[image: nvtt]

Member

@fire fire May 10, 2024

I'll stop poking at it for now since I have to go, but one approach is to search more BC6 block modes.

@BlueCube3310
Contributor Author

It looks like ASTC doesn't handle negative values either.

@Calinou
Member

Calinou commented May 12, 2024

Could we bias directional lightmaps when baking (and reverse this biasing in the shader) so that we don't need the texture to use signed compression? This would make them ASTC-friendly as well.

Doing so will impact precision somewhat, so I assume we'll need to find a reasonable bias value that is flexible enough for very bright lights that you might use in a real world project (e.g. the equivalent of Color(10, 10, 10)).

@darksylinc
Contributor

> Could we bias directional lightmaps when baking (and reverse this biasing in the shader) so that we don't need the texture to use signed compression? This would make them ASTC-friendly as well.
>
> Doing so will impact precision somewhat, so I assume we'll need to find a reasonable bias value that is flexible enough for very bright lights that you might use in a real world project (e.g. the equivalent of Color(10, 10, 10)).

Yes. But several biasing strategies would have to be tested to see which one provides overall the best precision.

e.g.

val = val * 0.5f + 0.5f; // Mapping [-1; 1] -> [0; 1]
val = 1.0f - (val * 0.5f + 0.5f); // Mapping [-1; 1] -> [1; 0]
val = exp( val ); // Or exp2
val = exp( 1 - val );
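
Each of these strategies needs a matching inverse in the shader. A minimal sketch of the encode/decode pairs follows, assuming inputs in [-1; 1] and using the [1; 0] form of the second mapping; the strategy names and the `check` helper are illustrative and not part of Godot.

```python
import math

# Each entry pairs an encode function (bake time) with its inverse (shader side).
# The pairing is an illustration only, not code from Godot or Betsy.
STRATEGIES = {
    "linear": (lambda v: v * 0.5 + 0.5, lambda s: (s - 0.5) * 2.0),
    "linear_flipped": (lambda v: 1.0 - (v * 0.5 + 0.5), lambda s: (0.5 - s) * 2.0),
    "exp": (math.exp, math.log),
    "exp_flipped": (lambda v: math.exp(1.0 - v), lambda s: 1.0 - math.log(s)),
}

def check(samples):
    """Return (min stored, max stored, worst round-trip error) per strategy."""
    results = {}
    for name, (encode, decode) in STRATEGIES.items():
        stored = [encode(v) for v in samples]
        worst = max(abs(decode(s) - v) for s, v in zip(stored, samples))
        results[name] = (min(stored), max(stored), worst)
    return results

# 101 evenly spaced inputs covering [-1, 1].
samples = [i / 50.0 - 1.0 for i in range(101)]
results = check(samples)
for name, (lo, hi, worst) in results.items():
    print(f"{name}: stored range [{lo:.3f}, {hi:.3f}], round-trip error {worst:.2e}")
```

All four keep the stored value non-negative, which is the property needed to avoid signed compression. The round trip here runs in double precision, so it says nothing yet about precision after compression.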

@fire
Member

fire commented May 13, 2024

I've been playing around with a Python script and ChatGPT-4.

import numpy as np

def ulp(x):
    return np.nextafter(x, np.inf) - x

def map_minus_one_to_one(val):
    return val * 0.5 + 0.5  # Mapping [-1; 1] -> [0; 1]

def map_one_to_zero(val):
    return 1.0 - (val * 0.5 + 0.5)  # Mapping [-1; 1] -> [1; 0]

def map_exp(val):
    return np.exp(val)

def map_exp2(val):
    return np.exp2(val)

def map_exp_inverse(val):
    return np.exp(1 - val)

def map_abs(val):
    return np.abs(val)

def map_half_plus_half(val):
    return 0.5 * val + 0.5

def map_cos(val):
    return (np.cos(val) + 1) / 2

def map_sin(val):
    return (np.sin(val) + 1) / 2

def map_tanh(val):
    return (np.tanh(val) + 1) / 2

values = np.linspace(-1, 1, 1000)
mappings = {'map_minus_one_to_one': map_minus_one_to_one, 
            'map_one_to_zero': map_one_to_zero, 
            'map_exp': map_exp, 
            'map_exp2': map_exp2, 
            'map_exp_inverse': map_exp_inverse,
            'map_abs': map_abs,
            'map_half_plus_half': map_half_plus_half,
            'map_cos': map_cos,
            'map_sin': map_sin,
            'map_tanh': map_tanh
            }

for name, mapping in mappings.items():
    mapped_values = mapping(values)
    ulps = ulp(mapped_values)
    print(f"Mapping {name}:")
    print(f"Min ULP: {np.min(ulps)}")
    print(f"Max ULP: {np.max(ulps)}")
    print(f"Mean ULP: {np.mean(ulps)}")
    print()

Clipped the results.

@darksylinc
Contributor

That's not exactly what I meant (although it's not bad, and it's close in spirit).

What I meant is that the average use case is going to use certain ranges more than others; thus the testing would need to generate several lightmaps and compare using real world data.
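
One way such a comparison could be set up is sketched below, under stated assumptions: the sample list is placeholder data standing in for texels read from real baked lightmaps, only two mappings are shown, and precision loss is approximated by rounding to half precision (the 16-bit float type BC6H is built around). The block encoder would lose more on top of this, so the numbers are indicative only.

```python
import math
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# (encode, decode) pairs; names are illustrative.
MAPPINGS = {
    "linear": (lambda v: v * 0.5 + 0.5, lambda s: (s - 0.5) * 2.0),
    "exp": (math.exp, math.log),
}

def mean_abs_error(samples, encode, decode):
    """Average error after encoding, rounding to half precision, and decoding."""
    errors = [abs(decode(to_half(encode(v))) - v) for v in samples]
    return sum(errors) / len(errors)

# Placeholder data only: a real test would use values from baked lightmaps,
# so that frequently used ranges dominate the average.
samples = [-1.0, -0.5, -0.1, -0.01, 0.0, 0.01, 0.1, 0.5, 1.0]

report = {name: mean_abs_error(samples, enc, dec)
          for name, (enc, dec) in MAPPINGS.items()}
for name, err in report.items():
    print(f"{name}: mean abs error {err:.2e}")
```

Because the average is taken over the samples themselves, swapping the placeholder list for real lightmap texels automatically weights the result toward the value ranges that occur most often.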

@lyuma
Contributor

lyuma commented May 13, 2024

My understanding is that BC6H and ASTC HDR allow negative values.
I don't think we should change the format of directional lightmaps. We should debug the Betsy shader code.

We should double check it isn't a shader optimizer issue. For example, in HLSL, (some versions of?) fxc / DX11 will assume that reads from Texture2D<float4> return positive values. Something like float4 x = tex2D(...); if (x.x < 0) { ... } will be optimized out, or float foo = max(0.0, tex2D(...).x) will be optimized to float foo = tex2D(...).x.
Maybe GLSL optimizers have a similar issue.

@clayjohn
Member

Once this is added, we should look into compressing ReflectionProbes and Skys when the update mode is set to "update once"

@Calinou
Member

Calinou commented May 15, 2024

> Once this is added, we should look into compressing ReflectionProbes and Skys when the update mode is set to "update once"

For binary size reasons, VRAM compression is not available in export template builds (only decompression is available). I don't see a way of doing this unless we add ReflectionProbe prerendering (and saving to disk so it can be included in the PCK).

There are good arguments for including VRAM compression libraries in export templates (such as user-generated content or runtime lightmap baking), but it should be discussed in its own proposal.

Besides, the current Once update mode for reflection and sky rendering is not a "true" Once update mode as described in godotengine/godot-proposals#2934. This means that if a ReflectionProbe with the Once update mode is moving continuously, VRAM compression would need to be performed every frame, which is too slow.

Doing this for sky shaders that don't use TIME is an interesting idea, but it would also limit the resolution of the sky "texture" as you'd no longer have a sky with true per-pixel rendering. Instead, you'd be looking at a rasterized version of the sky shader.

@BlueCube3310
Contributor Author

> My understanding is BC6H and ASTC hdr allow negative values. I don't think we should change the format of dir lightmaps. We should debug the betsy shader code.

That is true (for BC6 at least, not sure about ASTC), although we would also have to modify astcenc, since it clamps the values to 0. The required modifications for both astcenc and Betsy will also take significantly more time.

@darksylinc
Contributor

Hi.

BC6H supports negative values; it is Betsy's encoder that does not.

ASTC has two variants: LDR & HDR. The LDR variant does not support negative values. I don't know about the HDR variant. However most GPUs do not support the HDR variant.
