Excessive memory footprint in gltf_loader #942

jherico · 2024-02-24T23:19:00Z

The current gltf loader code uses threading to load images from disk into memory prior to pushing them to the GPU.

However, it does this by creating a thread-pool of std::thread::hardware_concurrency() size.

Vulkan-Samples/framework/gltf_loader.cpp

Lines 545 to 564 in 216e065

    
           auto thread_count = std::thread::hardware_concurrency(); 
        
           thread_count      = thread_count == 0 ? 1 : thread_count; 
        
           ctpl::thread_pool thread_pool(thread_count); 
        
           auto image_count = to_u32(model.images.size()); 
        
           std::vector<std::future<std::unique_ptr<sg::Image>>> image_component_futures; 
        
           for (size_t image_index = 0; image_index < image_count; image_index++) 
        
           { 
        
           	auto fut = thread_pool.push( 
        
           	    [this, image_index](size_t) { 
        
           		    auto image = parse_image(model.images[image_index]); 
        
           		    LOGI("Loaded gltf image #{} ({})", image_index, model.images[image_index].uri.c_str()); 
        
           		    return image; 
        
           	    }); 
        
           	image_component_futures.push_back(std::move(fut)); 
        
           }

The code attempts to mitigate this later on by limiting the number of staging buffers that will be created to 64 MB at a time:

Vulkan-Samples/framework/gltf_loader.cpp

Lines 568 to 584 in 216e065

    
           // Upload images to GPU. We do this in batches of 64MB of data to avoid needing 
        
           // double the amount of memory (all the images and all the corresponding buffers). 
        
           // This helps keep memory footprint lower which is helpful on smaller devices. 
        
           size_t image_index = 0; 
        
           while (image_index < image_count) 
        
           { 
        
           	std::vector<core::Buffer> transient_buffers; 
        
           	auto &command_buffer = device.request_command_buffer(); 
        
           	command_buffer.begin(VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, 0); 
        
           	size_t batch_size = 0; 
        
           	// Deal with 64MB of image data at a time to keep memory footprint low 
        
           	while (image_index < image_count && batch_size < 64 * 1024 * 1024) 
        
           	{

but IMO the "damage" is already done. My machine peaks at over 5 GB of RAM usage at loading the large scenes used in performance examples.

While the threading might be useful for speeding up the overall load, consuming all the CPUs on the machine to do so feels excessive, and I would suggest that at least 1 CPU be left in reserve for the main thread.

Additionally, it should be possible to limit the number of images loaded concurrently by using a blocking queue or a (C++) semaphore to block the thread pool from doing more work when there pending images waiting for upload.

The text was updated successfully, but these errors were encountered:

tomadamatkinson · 2024-02-24T23:29:26Z

Does your machine natively support ASTC?

One of the issues with the loader is how we decode ASTC inline which may be adding to your excessive memory usage. Some of the scenes are also 4x duplicates to simulate load on a device which doesn't help the loader.

In some samples the goal is to demonstrate the performance delta when using a specific technique - hence the large inefficient scenes. The goal is not to make every sample 100% efficient in all aspects (although an efficient GLTF Loader is definitely a goal)

The loader is on the chopping block soon and will likely be refactored. This includes sorting out the image loading a decoding phase (see framework/scene_graph/images/* if you are interested in those issues)

Completely agree with your assessment though, it is currently not in the best state!

SaschaWillems · 2024-02-25T10:23:27Z

See #860. ASTC is mobile only, so pretty much all desktop platforms take ages to load larger assets using ASTC compressed textures.

jherico · 2024-02-26T20:47:36Z

I get that the ASTC conversion exacerbates memory overhead, and that this code may be a target for refactoring at some point, but I still think it would be a relatively small engineering challenge to not potentially load every image into memory while still allowing it to use threading to pre-load the assets for better performance. I'll probably try to make a small PR for this.

tomadamatkinson · 2024-02-26T21:13:24Z

@jherico apologies, I wasn't trying to dismiss improvements. Just letting you know what we know at this point

More than happy for you to take a shot at it!

jherico · 2024-02-27T04:18:47Z

My initial work did a little to improve the memory footprint, but not as much as I expected, and it drastically slowed down the load time.

I did some profiling and found a lot of memory being allocated and not released in surprising places deep inside the astc library. I'm not 100% certain that this accounts for the amount of memory being consumed, but the astc code, at least the version in use by this repo, is very not thread safe, and it leaves a ton of performance on the table by not having any SIMD support.

I've started working on a branch to at least update astc to a recent version, but getting it set up properly will probably be a chunk of work. It does, however, fix the threading issue and also includes support for NEON, SSE2, SSE4.1 and AVX2 instruction sets, but I would need to find a cmake friendly method of determining which one to enable for a given build environment, but the "native" ISA is always available so there's a fallback for people working in obscure environments.

Alternatively, I could put some effort into refactoring the assets to convert he astc assets to basisu, and include the mips in the files so they don't have to be generated at runtime.

Yet another option for improving performance would be to use a temp directory and whenever an astc resource is uncompressed, write the uncompressed version and its mips to the temp folder, so that the next time they could be loaded directly.

SaschaWillems · 2024-02-28T19:12:59Z

We've been discussing this several times, and we should replace the astc files. basisu is the way to go, I already did a sample for it, so we could leverage the code from that. Will discuss this on our next meeting. It's probably best to have the original asset authors update them for basisu instead of converting the astc (which would probably result in a quality loss).

SaschaWillems added the framework This is relevant to the framework label Feb 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Excessive memory footprint in gltf_loader #942

Excessive memory footprint in gltf_loader #942

jherico commented Feb 24, 2024 •

edited

tomadamatkinson commented Feb 24, 2024

SaschaWillems commented Feb 25, 2024

jherico commented Feb 26, 2024

tomadamatkinson commented Feb 26, 2024

jherico commented Feb 27, 2024

SaschaWillems commented Feb 28, 2024

Excessive memory footprint in gltf_loader #942

Excessive memory footprint in gltf_loader #942

Comments

jherico commented Feb 24, 2024 • edited

tomadamatkinson commented Feb 24, 2024

SaschaWillems commented Feb 25, 2024

jherico commented Feb 26, 2024

tomadamatkinson commented Feb 26, 2024

jherico commented Feb 27, 2024

SaschaWillems commented Feb 28, 2024

jherico commented Feb 24, 2024 •

edited