Add Betsy to speed up BC6 compression #91535
base: master
Conversation
This is going to help immensely with our lightmapper! Currently the lightmapper doesn't compress the results since it is so slow. But with this, we can finally re-enable compression by default.
Update: After optimizing RGB-RGBA conversion and disabling the forced automatic mipmap generation for layered textures (is there any reason why it was enabled in the first place?), the compression time goes down to ~7 seconds, and that's still mostly CPU-bound. I'm hoping to bring it down to around 2 to 3 seconds.
Signed compression is now supported, though because the sign detection happens on individual layers, directional lightmaps break when recombined. The same issue happens in master.
Is there a way for BC6 to use sign detection on individual layers, or on the entire set at once, without too much research time? It seems like we'd need to implement a batch system?
This is tricky since each layer is compressed as an independent image, and as such one layer isn't aware of the other ones. IMO the layered/3d/cube texture import pipeline needs a light rework as the current approach also causes other problems due to this issue. |
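One lightweight alternative to a full pipeline rework would be to make the format decision once for the whole layer set rather than per layer. A minimal Python sketch of the idea — `pick_bc6_variant` and the format-name strings are illustrative, not Godot API:

```python
import numpy as np

def pick_bc6_variant(layers):
    """Choose a single BC6H variant for an entire layered texture.

    Scans every layer before deciding, so all layers end up in the same
    signed or unsigned format instead of each layer deciding on its own.
    """
    has_negative = any(float(np.min(layer)) < 0.0 for layer in layers)
    return "BC6H_SF16" if has_negative else "BC6H_UF16"

# One layer containing negatives forces the signed variant for all layers.
layers = [np.array([0.0, 0.5]), np.array([-0.25, 1.0])]
print(pick_bc6_variant(layers))  # prints "BC6H_SF16"
```

Each layer would still be compressed independently; only the variant selection becomes a shared, up-front pass.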
Did some benchmarks, the difference in performance is incredible (up to 200 times faster on some large images). |
Tested locally, it works as expected.
Benchmark
PC specifications
- CPU: Intel Core i9-13900K
- GPU: NVIDIA GeForce RTX 4090
- RAM: 64 GB (2×32 GB DDR5-5800 C30)
- SSD: Solidigm P44 Pro 2 TB
- OS: Linux (Fedora 39)
Using a Linux optimized (optimize=speed lto=full) editor build.
| Image | CVTT | Betsy |
|---|---|---|
| French garden 8k with mipmaps | 52.3s | 4.2s |
| German town at night 4k with mipmaps | 14.1s | 1.1s |
| Church at starry night 8k with mipmaps | 56.7s | 4.1s |
This is on average a speedup of 13.1×.
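As a sanity check, the per-image speedups can be recomputed from the table above (times copied from the benchmark; depending on rounding, the arithmetic mean lands at roughly 13×):

```python
# Times in seconds, copied from the benchmark table above (CVTT, Betsy).
timings = {
    "French garden 8k": (52.3, 4.2),
    "German town at night 4k": (14.1, 1.1),
    "Church at starry night 8k": (56.7, 4.1),
}

speedups = {name: cvtt / betsy for name, (cvtt, betsy) in timings.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.1f}x")

average = sum(speedups.values()) / len(speedups)
print(f"Average speedup: {average:.1f}x")
```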
It's strange how my import speeds are quite a bit slower than @BlueCube3310's though, considering my GPU is significantly faster. I measured the times by starting the editor with --verbose, which prints the time taken after every reimport. The texture was assigned to a StandardMaterial3D so it was detected as used in 3D, which enables VRAM compression and mipmaps.
Output quality seems pretty much identical – I can't spot any differences.
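For a less eyeball-based comparison, a metric like PSNR over the decoded outputs would put a number on "pretty much identical". A minimal sketch, assuming both compressed results have been decoded back to float arrays in the same range (the variable names in the usage comment are hypothetical):

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio between two images; higher means closer."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0.0:
        return float("inf")  # bit-identical images
    return 10.0 * np.log10(peak * peak / mse)

# Hypothetical usage: compare CVTT- and Betsy-decoded outputs of one source.
# print(psnr(cvtt_decoded, betsy_decoded, peak=float(cvtt_decoded.max())))
```

For HDR content, peak should be the actual maximum of the source rather than 1.0, otherwise the dB values aren't comparable across images.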
Impact on stripped binary size:
115,089,512 godot.linuxbsd.editor.x86_64
115,118,184 godot.linuxbsd.editor.x86_64.betsy-bc6h
This is only about 28 KB and only affects editor builds, so it's pretty reasonable.
Great work 🙂
PS: hdri-haven.com is a typosquatting site, use hdrihaven.com (which redirects to polyhaven.com) instead.
My results only contain the time it takes to execute the compression itself.
Edit: Both CVTT and Betsy now print the compression time with --verbose.
Redirected the sources to Poly Haven.
I was thinking about
The GPU compressor can now be toggled from the project settings, and the compressor works in rendering modes other than Forward+.
Ohh, that is a shame. Directional lightmaps are so heavy. Maybe for the moment you can use compression on non-directional lightmaps until the issue is solved.
Can you compress something else that has signed floats for comparison, and test against some stable signed-float BC6 encoder?
Is there a quality knob we can maximize for Betsy before we try too hard on debugging? There is a "quality" knob mentioned here https://github.com/knarkowicz/GPURealTimeBC6H?tab=readme-ov-file#gpurealtimebc6h |
```glsl
#include "UavCrossPlatform_piece_all.glsl"

#VERSION_DEFINES

#define QUALITY
```
@BlueCube3310 Here is the quality define.
Disabling the quality setting gets rid of the artifacts, but it also removes the negative values. DirectXTex seems to clamp the values to 0 as well.
@BlueCube3310 What GPU and OS are you using? Are you able to try other GPUs?
Sometimes the artifacts are caused by genuine driver bugs.
I'm on Windows 10 64-bit. I've tested an RTX 4060 Ti and a GTX 1660 Ti on the latest drivers; both exhibit the artifacting.
CC @darksylinc, FYI. Finally putting Betsy to good use in Godot!
```glsl
EncodeP1(block, blockMSLE, texels);
#ifdef QUALITY
for (uint i = 0u; i < 32u; ++i) {
	EncodeP2Pattern(block, blockMSLE, i, texels);
}
#endif
```
Suggested change (bumping the two-subset pattern search from 32 to 64 iterations). Original:

```glsl
EncodeP1(block, blockMSLE, texels);
#ifdef QUALITY
for (uint i = 0u; i < 32u; ++i) {
	EncodeP2Pattern(block, blockMSLE, i, texels);
}
#endif
```

Suggestion:

```glsl
EncodeP1(block, blockMSLE, texels);
#ifdef QUALITY
for (uint i = 0u; i < 64u; ++i) {
	EncodeP2Pattern(block, blockMSLE, i, texels);
}
#endif
```
I wonder if I'll cause a driver hang if I double the search space for an encoding value. What's your setup to output the three images for BC6?
I'm using this image as a source: testout.zip.
For CVTT I disable the GPU compressor from the settings, for Betsy I leave it enabled, and for DirectXTex I use a custom program to convert the EXR to DDS with BC6SF compression.
Edit:
It looks like NVIDIA Texture Tools can preserve the negative data without artifacts:
I'll stop poking at it; one approach is to search more BC6 block modes, but I have to go now.
It looks like ASTC doesn't handle negative values either.
Could we bias directional lightmaps when baking (and reverse this biasing in the shader) so that we don't need the texture to use signed compression? This would make them ASTC-friendly as well. Doing so will impact precision somewhat, so I assume we'll need to find a reasonable bias value that is flexible enough for very bright lights that you might use in a real world project (e.g. the equivalent of
Yes. But several biasing strategies would have to be tested to see which one provides the best overall precision, e.g.:

```glsl
val = val * 0.5f + 0.5f; // Mapping [-1; 1] -> [0; 1]
val = 1.0f - (val * 0.5f + 0.5f); // Mapping [-1; 1] -> [1; 0]
val = exp( val ); // Or exp2
val = exp( 1 - val );
```
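One quick way to compare candidate biases is to simulate the precision loss by round-tripping through half floats. This is only a rough stand-in for the codec — BC6H works in a half-float-like space, but numpy's float16 is not the actual encoder — so treat it as a sketch:

```python
import numpy as np

def roundtrip_error(bias, unbias, n=10001):
    """Max absolute error after bias -> float16 quantize -> unbias."""
    vals = np.linspace(-1.0, 1.0, n)
    quantized = bias(vals).astype(np.float16).astype(np.float64)
    return float(np.max(np.abs(unbias(quantized) - vals)))

# Candidate: map [-1; 1] -> [0; 1] on bake, reverse it in the shader.
err = roundtrip_error(lambda v: v * 0.5 + 0.5, lambda v: v * 2.0 - 1.0)
print(f"max round-trip error: {err:.2e}")
```

With the linear bias above, the worst error stays around a float16 ULP near 1.0 (roughly 5e-4), which is the sort of baseline the exp-style mappings would need to beat in the value ranges lightmaps actually use.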
Been playing around with a Python script and ChatGPT-4.

```python
import numpy as np

def ulp(x):
    return np.nextafter(x, np.inf) - x

def map_minus_one_to_one(val):
    return val * 0.5 + 0.5  # Mapping [-1; 1] -> [0; 1]

def map_one_to_zero(val):
    return 1.0 - (val * 0.5 + 0.5)  # Mapping [-1; 1] -> [1; 0]

def map_exp(val):
    return np.exp(val)

def map_exp2(val):
    return np.exp2(val)

def map_exp_inverse(val):
    return np.exp(1 - val)

def map_abs(val):
    return np.abs(val)

def map_half_plus_half(val):
    return 0.5 * val + 0.5

def map_cos(val):
    return (np.cos(val) + 1) / 2

def map_sin(val):
    return (np.sin(val) + 1) / 2

def map_tanh(val):
    return (np.tanh(val) + 1) / 2

values = np.linspace(-1, 1, 1000)
mappings = {
    'map_minus_one_to_one': map_minus_one_to_one,
    'map_one_to_zero': map_one_to_zero,
    'map_exp': map_exp,
    'map_exp2': map_exp2,
    'map_exp_inverse': map_exp_inverse,
    'map_abs': map_abs,
    'map_half_plus_half': map_half_plus_half,
    'map_cos': map_cos,
    'map_sin': map_sin,
    'map_tanh': map_tanh,
}

for name, mapping in mappings.items():
    mapped_values = mapping(values)
    ulps = ulp(mapped_values)
    print(f"Mapping {name}:")
    print(f"Min ULP: {np.min(ulps)}")
    print(f"Max ULP: {np.max(ulps)}")
    print(f"Mean ULP: {np.mean(ulps)}")
    print()
```

Clipped the results.
That's not exactly what I meant (although it's not bad, and it's close in spirit). What I meant is that the average use case is going to use certain ranges more than others; thus the testing would need to generate several lightmaps and compare using real-world data.
My understanding is BC6H and ASTC HDR allow negative values. We should double-check it isn't a shader optimizer issue. For example, in HLSL, some versions of fxc / DX11 will assume that reads from
Once this is added, we should look into compressing ReflectionProbes and Skys when the update mode is set to "update once" |
For binary size reasons, VRAM compression is not available in export template builds (only decompression is available). I don't see a way of doing this unless we add ReflectionProbe prerendering (and saving to disk so it can be included in the PCK). There are good arguments for including VRAM compression libraries in export templates (such as user-generated content or runtime lightmap baking), but it should be discussed in its own proposal.

Besides, the current Once update mode for reflection and sky rendering is not a "true" Once update mode as described in godotengine/godot-proposals#2934. This means that if a ReflectionProbe with the Once update mode is moving continuously, VRAM compression would need to be performed every frame, which is too slow. Doing this for sky shaders that don't use
That is true (for BC6 at least, not sure about ASTC), although we would also have to modify astcenc, since it clamps the values to 0. The required modifications for both astcenc and Betsy will also take significantly more time.
Hi. BC6H supports negative values, but it is Betsy's encoder which does not. ASTC has two variants: LDR & HDR. The LDR variant does not support negative values. I don't know about the HDR variant. However, most GPUs do not support the HDR variant.
Excerpt of #91150.
Implements the Betsy GPU texture compressor to drastically speed up BC6 compression, which is essential for compressed baked lightmaps.
Performance (compression ONLY):
- Build: Production, 64-bit
- CPU: Ryzen 9 5900X
- GPU: RTX 4060 Ti
TODO: