Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excessive memory footprint in gltf_loader #942

Open
jherico opened this issue Feb 24, 2024 · 6 comments
Open

Excessive memory footprint in gltf_loader #942

jherico opened this issue Feb 24, 2024 · 6 comments
Labels
framework This is relevant to the framework

Comments

@jherico
Copy link
Contributor

jherico commented Feb 24, 2024

The current gltf loader code uses threading to load images from disk into memory prior to pushing them to the GPU.

However, it does this by creating a thread-pool of std::thread::hardware_concurrency() size.

auto thread_count = std::thread::hardware_concurrency();
thread_count = thread_count == 0 ? 1 : thread_count;
ctpl::thread_pool thread_pool(thread_count);
auto image_count = to_u32(model.images.size());
std::vector<std::future<std::unique_ptr<sg::Image>>> image_component_futures;
for (size_t image_index = 0; image_index < image_count; image_index++)
{
auto fut = thread_pool.push(
[this, image_index](size_t) {
auto image = parse_image(model.images[image_index]);
LOGI("Loaded gltf image #{} ({})", image_index, model.images[image_index].uri.c_str());
return image;
});
image_component_futures.push_back(std::move(fut));
}

The code attempts to mitigate this later on by limiting the number of staging buffers that will be created to 64 MB at a time:

// Upload images to GPU. We do this in batches of 64MB of data to avoid needing
// double the amount of memory (all the images and all the corresponding buffers).
// This helps keep memory footprint lower which is helpful on smaller devices.
size_t image_index = 0;
while (image_index < image_count)
{
std::vector<core::Buffer> transient_buffers;
auto &command_buffer = device.request_command_buffer();
command_buffer.begin(VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, 0);
size_t batch_size = 0;
// Deal with 64MB of image data at a time to keep memory footprint low
while (image_index < image_count && batch_size < 64 * 1024 * 1024)
{

but IMO the "damage" is already done. My machine peaks at over 5 GB of RAM usage at loading the large scenes used in performance examples.

While the threading might be useful for speeding up the overall load, consuming all the CPUs on the machine to do so feels excessive, and I would suggest that at least 1 CPU be left in reserve for the main thread.

Additionally, it should be possible to limit the number of images loaded concurrently by using a blocking queue or a (C++) semaphore to block the thread pool from doing more work when there pending images waiting for upload.

@tomadamatkinson
Copy link
Collaborator

Does your machine natively support ASTC?

One of the issues with the loader is how we decode ASTC inline which may be adding to your excessive memory usage. Some of the scenes are also 4x duplicates to simulate load on a device which doesn't help the loader.

In some samples the goal is to demonstrate the performance delta when using a specific technique - hence the large inefficient scenes. The goal is not to make every sample 100% efficient in all aspects (although an efficient GLTF Loader is definitely a goal)

The loader is on the chopping block soon and will likely be refactored. This includes sorting out the image loading a decoding phase (see framework/scene_graph/images/* if you are interested in those issues)

Completely agree with your assessment though, it is currently not in the best state!

@SaschaWillems
Copy link
Collaborator

See #860. ASTC is mobile only, so pretty much all desktop platforms take ages to load larger assets using ASTC compressed textures.

@SaschaWillems SaschaWillems added the framework This is relevant to the framework label Feb 26, 2024
@jherico
Copy link
Contributor Author

jherico commented Feb 26, 2024

I get that the ASTC conversion exacerbates memory overhead, and that this code may be a target for refactoring at some point, but I still think it would be a relatively small engineering challenge to not potentially load every image into memory while still allowing it to use threading to pre-load the assets for better performance. I'll probably try to make a small PR for this.

@tomadamatkinson
Copy link
Collaborator

@jherico apologies, I wasn't trying to dismiss improvements. Just letting you know what we know at this point

More than happy for you to take a shot at it!

@jherico
Copy link
Contributor Author

jherico commented Feb 27, 2024

My initial work did a little to improve the memory footprint, but not as much as I expected, and it drastically slowed down the load time.

I did some profiling and found a lot of memory being allocated and not released in surprising places deep inside the astc library. I'm not 100% certain that this accounts for the amount of memory being consumed, but the astc code, at least the version in use by this repo, is very not thread safe, and it leaves a ton of performance on the table by not having any SIMD support.

I've started working on a branch to at least update astc to a recent version, but getting it set up properly will probably be a chunk of work. It does, however, fix the threading issue and also includes support for NEON, SSE2, SSE4.1 and AVX2 instruction sets, but I would need to find a cmake friendly method of determining which one to enable for a given build environment, but the "native" ISA is always available so there's a fallback for people working in obscure environments.

Alternatively, I could put some effort into refactoring the assets to convert he astc assets to basisu, and include the mips in the files so they don't have to be generated at runtime.

Yet another option for improving performance would be to use a temp directory and whenever an astc resource is uncompressed, write the uncompressed version and its mips to the temp folder, so that the next time they could be loaded directly.

@SaschaWillems
Copy link
Collaborator

We've been discussing this several times, and we should replace the astc files. basisu is the way to go, I already did a sample for it, so we could leverage the code from that. Will discuss this on our next meeting. It's probably best to have the original asset authors update them for basisu instead of converting the astc (which would probably result in a quality loss).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
framework This is relevant to the framework
Projects
None yet
Development

No branches or pull requests

3 participants