Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skinned Skeleton3D calling add_child() with specific classes cause crash on Android with directional shadow in forward+/mobile #90459

Closed
TCROC opened this issue Apr 10, 2024 · 81 comments · Fixed by #91514

Comments

@TCROC
Copy link
Contributor

TCROC commented Apr 10, 2024

NOTE:

Significant research has gone into this issue. The initial report is a bit dated. See Summary of current findings below the Original Issue Report for more up to date details.

Original Issue Report

Tested versions

  • Reproducible in: v4.3.dev.mono.custom_build [a7b8602]

System information

Godot v4.3.dev.mono (a7b8602) - Pop!_OS 22.04 LTS - Wayland - Vulkan (Forward+) - dedicated Intel(R) Arc(tm) A750 Graphics (DG2) () - AMD Ryzen 7 5800X 8-Core Processor (16 Threads)

Issue description

When running on Android, if shadows are enabled on the directional light while my model is in the scene, it will crash. I tested with a simple CsgBox3d and it was fine, but my Blocky_Guy.blend model causes it to crash. Disabling shadows works fine tho.

Previous to the current master, it was actually working with shadows, but crashing when I eqipped additional items to the player's skeleton at runtime. So I am unsure if its specifically an issue with shadows, or an issue with how Godot handles my model on Android.

Steps to reproduce

  1. Download the MRP
  2. Build godot with double and mono support We reproduced this without double and mono. Just standard godot causes it.
  3. Export to Android
  4. Run the scene on Android and see that it crashes

This was specifically tested against Galaxy S10e model number: SM-G970U1

Edit: Also reproduced on Google Pixel

Minimal reproduction project (MRP)

example-crash.zip

Edit:

Here is an MRP using a standard t-pose from mixamo. So it does not appear to be specific to my model: #90459 (comment)

Edit 2:

Here is a current summary of the findings for easy reading

Summary of current findings

Renderers Tested:

forward+ = crash ❌
mobile = crash ❌
gl_compatibility = works ✅

Bug introducing PRs:

  1. Implement a base class SkeletonModifier3D as refactoring for nodes that may modify Skeleton3D #87888
  2. Acyclic Command Graph for Rendering Device #84976

MRP test summaries

  1. Skeleton mesh + directional light + shadows = crash ❌
  2. Skeleton mesh + directional light + shadows + changing skeleton + add_child in _process = crash ❌
  3. Skeleton mesh + directional light + shadows + bone attachment + add_child gpu particles 3d = crash ❌
  4. Skeleton mesh + directional light + shadows + add_child + bone attachment + child mesh with skeleton = crash ❌
  5. Skeleton mesh + directional light + shadows + add_child + bone attachment + child mesh without skeleton = works ✅

MRP Details

Skeleton mesh + directional light + shadows

Pr that introduced the bug: #87888

Steps to reproduce:

  1. Export any of the below MRPs to Android.
  2. Open the application
  3. Watch it instantly crash when it tries to render the character

Known workarounds:

  1. Build with DISABLED_DEPRECATED as suggested by @TokageItLab here: Skinned Skeleton3D calling add_child() with specific classes cause crash on Android with directional shadow in forward+/mobile #90459 (comment)
  2. Disable shadows on the directional light
  3. use gl_compatibility renderer

MRP with Blocky Ball model: https://github.com/godotengine/godot/files/14925776/example-crash.zip

MRP with Mixamo T Pose model: https://github.com/godotengine/godot/files/14948327/example-crash2.zip

MRP with Godot Character Model from examples repo: https://github.com/godotengine/godot/files/14951312/example-crash3.zip

IMPORTANT:

The following crash examples require #87888 to be reverted, fixed, or a commit checked out before it was in the repo. Otherwise, the following examples will crash on startup due to the fact they all require a skeleton and mesh present.

Skeleton mesh + directional light + shadows + changing skeleton + add_child in _process

Pr that introduced the bug: #84976
Steps to reproduce:

  1. Export any of the below MRPs to Android.
  2. Open the application
  3. Watch it crash after 1 second when it attempts to equip the "crown" to the Blocky Ball character

Known Workarounds

  1. Disable shadows on directional light
  2. use gl_compatibility renderer

MRP with Blocky Ball model: https://github.com/godotengine/godot/files/14967690/example-crash-change-skeleton.zip

And here is the code for equipping:

extends Node

var secs: float = 0
var is_equipped: bool = false

# # This works
# func _ready() -> void:
# 	var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
# 	var crown = (load("res://Cosmetics_Hat_Crown.blend") as PackedScene).instantiate()
# 	var mesh = crown.get_child(0).get_child(0).get_child(0) as MeshInstance3D
# 	mesh.get_parent().remove_child(mesh)
# 	crown.queue_free()
# 	skeleton.add_child(mesh)

# This crashes
func _process(delta: float) -> void:
	if secs < 1:
		secs += delta
		return

	if is_equipped:
		return

	is_equipped = true
	var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
	var crown = (load("res://Cosmetics_Hat_Crown.blend") as PackedScene).instantiate()
	var mesh = crown.get_child(0).get_child(0).get_child(0) as MeshInstance3D
	mesh.get_parent().remove_child(mesh)
	crown.queue_free()
	skeleton.add_child(mesh)

Note that equipping in _ready works. Equipping in _process after waiting for a second crashes. I tried equipping in _process right away without a delay and that also worked.

Skeleton mesh + directional light + shadows + bone attachment + add_child gpu particles 3d

Pr that introduced the bug: #84976

Steps to reproduce:

  1. Export any of the below MRPs to Android.
  2. Open the application
  3. Watch it crash after 1 second when it attempts to equip the "crown" to the Blocky Ball character

MRP:
example-crash-particles-3d.zip

Code for equipping:

extends Node

var secs: float = 0
var is_equipped: bool = false

# This crashes
#func _ready() -> void:
	#var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
	#var particles = load("res://gpu_particles_3d.tscn") as PackedScene
	#skeleton.get_node("Foot_Front_R").add_child(particles.instantiate())
	#skeleton.get_node("Foot_Front_L").add_child(particles.instantiate())
	#skeleton.get_node("Foot_Back_R").add_child(particles.instantiate())
	#skeleton.get_node("Foot_Back_L").add_child(particles.instantiate())

# This crashes
func _process(delta: float) -> void:
	if secs < 1:
		secs += delta
		return

	if is_equipped:
		return

	is_equipped = true
	var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
	var particles = load("res://gpu_particles_3d.tscn") as PackedScene
	skeleton.get_node("Foot_Front_R").add_child(particles.instantiate())
	skeleton.get_node("Foot_Front_L").add_child(particles.instantiate())
	skeleton.get_node("Foot_Back_R").add_child(particles.instantiate())
	skeleton.get_node("Foot_Back_L").add_child(particles.instantiate())

Note: In this situation, equipping in _ready and _process both cause a crash

Known Workarounds

  1. Disable shadows on directional light
  2. use gl_compatibility renderer

Skeleton mesh + directional light + shadows + add_child + bone attachment + child mesh with skeleton

Pr that introduced the bug: #84976

Steps to reproduce:

  1. Export any of the below MRPs to Android.
  2. Open the application
  3. Watch it crash after 1 second when it attempts to equip another Blocky Ball character as a hat on the main Blocky Ball character

MRP:
example-crash-equip-mesh.zip

Code for equipping:

extends Node

var secs: float = 0
var is_equipped: bool = false

# This works
# func _ready() -> void:
# 	var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
# 	var hat = load("res://Cube_Guy_Hat.tscn") as PackedScene
# 	skeleton.add_child(hat.instantiate())

# This crashes
func _process(delta: float) -> void:
	if secs < 1:
		secs += delta
		return

	if is_equipped:
		return

	is_equipped = true
	var skeleton = get_node("Cube_Guy/Cube_Guy_Armature/Skeleton3D")
	var hat = load("res://Cube_Guy_Hat.tscn") as PackedScene
	skeleton.add_child(hat.instantiate())

Note: In this situation, equipping in _ready works

Known Workarounds

  1. Disable shadows on directional light
  2. use gl_compatibility renderer

Skeleton mesh + directional light + shadows + add_child + bone attachment + child mesh without skeleton

This situation actually works! ✅
Here is an MRP demonstrating it working:
example-working-mesh.zip

So if the model being added as a child of a bone attachment doesn't itself have a skeleton, it works fine. In our game, we allow users to equip pets as "hats" for fun. We want to play cool animations with them. Hence the reason we are equipping items with skeletons.

@TCROC TCROC changed the title Android directional light shadows causes crash with blender model Android directional light shadow causes crash with blender model Apr 10, 2024
@TCROC
Copy link
Contributor Author

TCROC commented Apr 10, 2024

Note that I'm not sure if it's mono and double related. That's just what I tested with.

@TCROC
Copy link
Contributor Author

TCROC commented Apr 10, 2024

Here is a stack trace captured from debugging in Android Studio:

art_sigsegv_fault 0x0000006ca7f12544
art::FaultManager::HandleFault(int, siginfo *, void *) 0x0000006ca7f12aa4
art::SignalChain::Handler(int, siginfo *, void *) 0x0000006faa1dd340
__kernel_rt_sigreturn 0x0000006fca6c5668
!!!0000!7ecf83797c9f36db9a6d042c7e4763!3dad7f8ed7! 0x0000006c12b02250
!!!0000!84c95bb772bf7dd08fc240b7dd7f78!3dad7f8ed7! 0x0000006c12af5648
!!!0000!2a418ddbe06bac73e12cf0f41836dd!3dad7f8ed7! 0x0000006c12ab9420
vulkan::api::CmdDispatch(VkCommandBuffer_T *, unsigned int, unsigned int, unsigned int) 0x0000006f9eae27c0
RenderingDeviceGraph::_run_compute_list_command(RenderingDeviceDriver::CommandBufferID, const unsigned char *, unsigned int) rendering_device_graph.cpp:615
RenderingDeviceGraph::_run_render_commands(RenderingDeviceDriver::CommandBufferID, int, const RenderingDeviceGraph::RecordedCommandSort *, unsigned int, int &, int &) rendering_device_graph.cpp:784
RenderingDeviceGraph::end(RenderingDeviceDriver::CommandBufferID, bool, bool) rendering_device_graph.cpp:1949
RenderingDevice::_end_frame() rendering_device.cpp:4875
RenderingDevice::swap_buffers() rendering_device.cpp:4707
RenderingServerDefault::_draw(bool, double) rendering_server_default.cpp:95
Main::iteration() main.cpp:4023
OS_Android::main_loop_iterate(bool *) os_android.cpp:303
Java_org_godotengine_godot_GodotLib_step(JNIEnv *, jclass) java_godot_lib_jni.cpp:269
art_quick_generic_jni_trampoline 0x0000006ca7e22248
art_quick_invoke_static_stub 0x0000006ca7e18bec
art::ArtMethod::Invoke(art::Thread *, unsigned int *, unsigned int, art::JValue *, const char *) 0x0000006ca7e86010
art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread *, art::ArtMethod *, art::ShadowFrame *, unsigned short, art::JValue *) 0x0000006ca7fea4cc
art::interpreter::DoCall<…>(art::ArtMethod *, art::Thread *, art::ShadowFrame &, const art::Instruction *, unsigned short, art::JValue *) 0x0000006ca7fe5068
art::interpreter::ExecuteSwitchImplCpp<…>(art::interpreter::SwitchImplContext *) 0x0000006ca7e2ce70
ExecuteSwitchImplAsm 0x0000006ca7e24bdc
art::interpreter::ExecuteSwitch(art::Thread *, const art::CodeItemDataAccessor &, art::ShadowFrame &, art::JValue, bool) 0x0000006ca7fe4a14
art::interpreter::Execute(art::Thread *, const art::CodeItemDataAccessor &, art::ShadowFrame &, art::JValue, bool, bool) 0x0000006ca7fdcdb4
art::interpreter::ArtInterpreterToInterpreterBridge(art::Thread *, const art::CodeItemDataAccessor &, art::ShadowFrame *, art::JValue *) 0x0000006ca7fe4588
art::interpreter::DoCall<…>(art::ArtMethod *, art::Thread *, art::ShadowFrame &, const art::Instruction *, unsigned short, art::JValue *) 0x0000006ca7fe5044
art::interpreter::ExecuteSwitchImplCpp<…>(art::interpreter::SwitchImplContext *) 0x0000006ca7e2dc70
ExecuteSwitchImplAsm 0x0000006ca7e24bdc
art::interpreter::ExecuteSwitch(art::Thread *, const art::CodeItemDataAccessor &, art::ShadowFrame &, art::JValue, bool) 0x0000006ca7fe4a14
art::interpreter::Execute(art::Thread *, const art::CodeItemDataAccessor &, art::ShadowFrame &, art::JValue, bool, bool) 0x0000006ca7fdcdb4
artQuickToInterpreterBridge 0x0000006ca834ed08
art_quick_to_interpreter_bridge 0x0000006ca7e2237c
art_quick_invoke_stub 0x0000006ca7e18968
art::ArtMethod::Invoke(art::Thread *, unsigned int *, unsigned int, art::JValue *, const char *) 0x0000006ca7e85ff4
art::InvokeVirtualOrInterfaceWithJValues<…>(const art::ScopedObjectAccessAlreadyRunnable &, _jobject *, art::ArtMethod *, const jvalue *) 0x0000006ca822e430
art::Thread::CreateCallback(void *) 0x0000006ca827e2ec
__pthread_start(void *) 0x0000006fad65ee48
__start_thread 0x0000006fad5fb458

@clayjohn
Copy link
Member

As we discussed on RocketChat. It seems like the actual crash is happening within the Vulkan drivers on the device.

Unfortunately, you can't really debug drivers on Android devices. So we need to figure out the cause of the crash from outside the driver. We can do that in two ways:

  1. Put a breakpoint at vkCmdDispatch() and check that the values being passed into the function make sense
  2. Run the application with validation layers enabled and check for any errors that arise.

I think between the two, we should get a pretty good idea of if we are doing something wrong. If we aren't doing something wrong, then it should at least help point us towards a workaround.

@TCROC
Copy link
Contributor Author

TCROC commented Apr 10, 2024

Sounds good. I'm currently working on building the validation layers from source so we can also step through those and look for interesting info. But I can right now check what is being passed to vkCmdDispatch. I'll set that break point real quick

@TCROC
Copy link
Contributor Author

TCROC commented Apr 10, 2024

Here are the values:

This is the last line of godot code before entering the driver stack:

driver->command_compute_dispatch(p_command_buffer, dispatch_instruction->x_groups, dispatch_instruction->y_groups, dispatch_instruction->z_groups);

driver->command_compute_dispatch(p_command_buffer, dispatch_instruction->x_groups, dispatch_instruction->y_groups, dispatch_instruction->z_groups);

And then these are the values of the parameters:

p_command_buffer
image
12970367393844968752

dispatch_instruction->x_groups
image
74

dispatch_instruction->y_groups
image
1

dispatch_instruction->z_groups
image
1

And then here is the entire dispatch_instruction struct:

image

And then here is the driver struct:

image
image

I wonder if Android Studio has a means of exporting structs as text. Someting akin to:

SomeStruct
|
- SomeField = "some val"

Would be nice to take a "snapshot" so to speak instead of sending snips

@TCROC
Copy link
Contributor Author

TCROC commented Apr 10, 2024

Here is the stacktrace with vulkan validation enabled:

T-CROC - Wed Apr 10 2024 18 58 25 GMT-0400 (Eastern Daylight Time).txt

And here are the key points of interest (copied in from the chat in the godot dev server):

Very first error is just indicating that I built validation layers in debug mode. So we can just ignore that. Then we get to the first error of interest:

USER WARNING: Vulkan Debug Report: object - -5476376679864662384
Validation Performance Warning: [ UNASSIGNED-CoreValidation-SwapchainPreTransform ] Object 0: handle = 0xb400006cbc6f0690, type = VK_OBJECT_TYPE_PHYSICAL_DEVICE; | MessageID = 0x214b6dd0 | vkCreateSwapchainKHR(): pCreateInfo->preTransform (VK_SURFACE_TRANSFORM_IDENTITY_BIT_KHR) doesn't match the currentTransform (VK_SURFACE_TRANSFORM_ROTATE_90_BIT_KHR) returned by vkGetPhysicalDeviceSurfaceCapabilitiesKHR, the presentation engine will transform the image content as part of the presentation operation.
2024-04-10 18:56:15.691  3247-3353  godot                   com.example.examplecrash             E     at: _debug_report_callback (drivers/vulkan/rendering_context_driver_vulkan.cpp:324)

And then not too long after that, this repeats over and over again until it crashes:

USER ERROR: Vulkan Debug Report: object - -6869287954972407602
Validation Error: [ VUID-vkCmdDrawIndexed-magFilter-04553 ] Object 0: handle = 0xa0ab6300000008ce, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; Object 1: handle = 0x47610f00000007e1, type = VK_OBJECT_TYPE_SAMPLER; Object 2: handle = 0x980b0000000002e, type = VK_OBJECT_TYPE_IMAGE_VIEW; | MessageID = 0x9c7248ee | vkCmdDrawIndexed: Descriptor set VkDescriptorSet 0xa0ab6300000008ce[] Sampler (VkSampler 0x47610f00000007e1[]) is set to use VK_FILTER_LINEAR with compareEnable is set to VK_FALSE, but image view's (VkImageView 0x980b0000000002e[]) format (VK_FORMAT_D16_UNORM) does not contain VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT in its format features. The Vulkan spec states: If a VkSampler created with magFilter or minFilter equal to VK_FILTER_LINEAR and compareEnable equal to VK_FALSE is used to sample a VkImageView as a result of this command, then the image view's format features must contain VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT (ht
2024-04-10 18:56:24.204  3247-3353  godot                   com.example.examplecrash             E     at: _debug_report_callback (drivers/vulkan/rendering_context_driver_vulkan.cpp:327)

@clayjohn
Copy link
Member

clayjohn commented Apr 11, 2024

The Vulkan profile for the affected device is here: https://vulkan.gpuinfo.org/displayreport.php?id=16301

I can reproduce the validation layer error when using this profile.

It appears that the crash is coming from the skeleton compute shader. I think there is a good chance that the crash is happening due to an issue with this specific skeleton. But we can't rule out that the Vulkan error is somehow related. So let's try to fix the validation layer error first and then re-evaluate and see if the crash goes away.

For the validation error, it seems that it comes from not supporting linear samplers with VK_FORMAT_D16_UNORM. On my device the validation layer is slightly different, but I think it is because it is falling back to another code path it thinks will be better support. I get the following warning print: format VK_FORMAT_D16_UNORM is simulating unsupported features! and then the validation layer error I get is:

ERROR: VALIDATION - Message Id Number: 1397510317 | Message Id Name: VUID-vkCmdDrawIndexed-None-06479
	Validation Error: [ VUID-vkCmdDrawIndexed-None-06479 ] Object 0: handle = 0x587167d18ee1, name = RID:644395418256182, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; Object 1: handle = 0x58714d4c5890, name = RID:30064771077 View, type = VK_OBJECT_TYPE_IMAGE_VIEW; | MessageID = 0x534c50ad | vkCmdDrawIndexed: Descriptor set VkDescriptorSet 0x587167d18ee1[RID:644395418256182] in binding #5 index 0, VkImageView 0x58714d4c5890[RID:30064771077 View], image view format VK_FORMAT_D16_UNORM feature flags (VK_FORMAT_FEATURE_2_SAMPLED_IMAGE_BIT|VK_FORMAT_FEATURE_2_DEPTH_STENCIL_ATTACHMENT_BIT|VK_FORMAT_FEATURE_2_BLIT_SRC_BIT|VK_FORMAT_FEATURE_2_TRANSFER_SRC_BIT|VK_FORMAT_FEATURE_2_TRANSFER_DST_BIT) doesn't contain VK_FORMAT_FEATURE_2_SAMPLED_IMAGE_DEPTH_COMPARISON_BIT The Vulkan spec states: If a VkImageView is sampled with depth comparison, the image view's format features must contain VK_FORMAT_FEATURE_2_SAMPLED_IMAGE_DEPTH_COMPARISON_BIT (https://vulkan.lunarg.com/doc/view/1.3.250.1/linux/1.3-extensions/vkspec.html#VUID-vkCmdDrawIndexed-None-06479)
	Objects - 2
		Object[0] - VK_OBJECT_TYPE_DESCRIPTOR_SET, Handle 97244096335585, Name "RID:644395418256182"
		Object[1] - VK_OBJECT_TYPE_IMAGE_VIEW, Handle 97243651397776, Name "RID:30064771077 View"

@clayjohn
Copy link
Member

For further testing, can you try with this diff:

diff --git a/servers/rendering/renderer_rd/environment/gi.cpp b/servers/rendering/renderer_rd/environment/gi.cpp
index c7752f8a86..80a83b761b 100644
--- a/servers/rendering/renderer_rd/environment/gi.cpp
+++ b/servers/rendering/renderer_rd/environment/gi.cpp
@@ -2672,7 +2672,7 @@ void GI::VoxelGIInstance::update(bool p_update_light_instances, const Vector<RID
                                        if (dynamic_maps.size() == 0) {
                                                // Render depth for first one.
                                                // Use 16-bit depth when supported to improve performance.
-                                               dtf.format = RD::get_singleton()->texture_is_format_supported_for_usage(RD::DATA_FORMAT_D16_UNORM, RD::TEXTURE_USAGE_DEPTH_STENCIL_A
TTACHMENT_BIT) ? RD::DATA_FORMAT_D16_UNORM : RD::DATA_FORMAT_X8_D24_UNORM_PACK32;
+                                               dtf.format = RD::get_singleton()->texture_is_format_supported_for_usage(RD::DATA_FORMAT_D16_UNORM, RD::TEXTURE_USAGE_DEPTH_STENCIL_A
TTACHMENT_BIT) ? RD::DATA_FORMAT_X8_D24_UNORM_PACK32 : RD::DATA_FORMAT_X8_D24_UNORM_PACK32;
                                                dtf.usage_bits = RD::TEXTURE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT;
                                                dmap.fb_depth = RD::get_singleton()->texture_create(dtf, RD::TextureView());
                                                RD::get_singleton()->set_resource_name(dmap.fb_depth, "VoxelGI Instance DMap FB Depth");
diff --git a/servers/rendering/renderer_rd/shaders/forward_clustered/scene_forward_clustered.glsl b/servers/rendering/renderer_rd/shaders/forward_clustered/scene_forward_clustered.
glsl
index 359d7799e5..372b63bb83 100644
--- a/servers/rendering/renderer_rd/shaders/forward_clustered/scene_forward_clustered.glsl
+++ b/servers/rendering/renderer_rd/shaders/forward_clustered/scene_forward_clustered.glsl
@@ -1975,7 +1975,7 @@ void fragment_shader(in SceneData scene_data) {
                                        vec4 trans_coord = directional_lights.data[i].shadow_matrix1 * trans_vertex;
                                        trans_coord /= trans_coord.w;
 
-                                       float shadow_z = textureLod(sampler2D(directional_shadow_atlas, SAMPLER_LINEAR_CLAMP), trans_coord.xy, 0.0).r;
+                                       float shadow_z = textureLod(sampler2D(directional_shadow_atlas, SAMPLER_NEAREST_CLAMP), trans_coord.xy, 0.0).r;
                                        shadow_z *= directional_lights.data[i].shadow_z_range.x;
                                        float z = trans_coord.z * directional_lights.data[i].shadow_z_range.x;
 
@@ -1985,7 +1985,7 @@ void fragment_shader(in SceneData scene_data) {
                                        vec4 trans_coord = directional_lights.data[i].shadow_matrix2 * trans_vertex;
                                        trans_coord /= trans_coord.w;
 
-                                       float shadow_z = textureLod(sampler2D(directional_shadow_atlas, SAMPLER_LINEAR_CLAMP), trans_coord.xy, 0.0).r;
+                                       float shadow_z = textureLod(sampler2D(directional_shadow_atlas, SAMPLER_NEAREST_CLAMP), trans_coord.xy, 0.0).r;
                                        shadow_z *= directional_lights.data[i].shadow_z_range.y;
                                        float z = trans_coord.z * directional_lights.data[i].shadow_z_range.y;
 
@@ -1995,7 +1995,7 @@ void fragment_shader(in SceneData scene_data) {
                                        vec4 trans_coord = directional_lights.data[i].shadow_matrix3 * trans_vertex;
                                        trans_coord /= trans_coord.w;
 
-                                       float shadow_z = textureLod(sampler2D(directional_shadow_atlas, SAMPLER_LINEAR_CLAMP), trans_coord.xy, 0.0).r;
+                                       float shadow_z = textureLod(sampler2D(directional_shadow_atlas, SAMPLER_NEAREST_CLAMP), trans_coord.xy, 0.0).r;
                                        shadow_z *= directional_lights.data[i].shadow_z_range.z;
                                        float z = trans_coord.z * directional_lights.data[i].shadow_z_range.z;
 
@@ -2006,7 +2006,7 @@ void fragment_shader(in SceneData scene_data) {
                                        vec4 trans_coord = directional_lights.data[i].shadow_matrix4 * trans_vertex;
                                        trans_coord /= trans_coord.w;
                                        
-                                       float shadow_z = textureLod(sampler2D(directional_shadow_atlas, SAMPLER_LINEAR_CLAMP), trans_coord.xy, 0.0).r;
+                                       float shadow_z = textureLod(sampler2D(directional_shadow_atlas, SAMPLER_NEAREST_CLAMP), trans_coord.xy, 0.0).r;
                                        shadow_z *= directional_lights.data[i].shadow_z_range.w;
                                        float z = trans_coord.z * directional_lights.data[i].shadow_z_range.w;
 
diff --git a/servers/rendering/renderer_rd/storage_rd/light_storage.cpp b/servers/rendering/renderer_rd/storage_rd/light_storage.cpp
index d1ff9fc362..666a1ace06 100644
--- a/servers/rendering/renderer_rd/storage_rd/light_storage.cpp
+++ b/servers/rendering/renderer_rd/storage_rd/light_storage.cpp
@@ -1995,7 +1995,7 @@ void LightStorage::shadow_atlas_free(RID p_atlas) {
 void LightStorage::_update_shadow_atlas(ShadowAtlas *shadow_atlas) {
        if (shadow_atlas->size > 0 && shadow_atlas->depth.is_null()) {
                RD::TextureFormat tf;
-               tf.format = shadow_atlas->use_16_bits ? RD::DATA_FORMAT_D16_UNORM : RD::DATA_FORMAT_D32_SFLOAT;
+               tf.format = shadow_atlas->use_16_bits ? RD::DATA_FORMAT_D32_SFLOAT : RD::DATA_FORMAT_D32_SFLOAT;
                tf.width = shadow_atlas->size;
                tf.height = shadow_atlas->size;
                tf.usage_bits = RD::TEXTURE_USAGE_SAMPLING_BIT | RD::TEXTURE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT;
@@ -2388,7 +2388,7 @@ void LightStorage::shadow_atlas_update(RID p_atlas) {
 void LightStorage::update_directional_shadow_atlas() {
        if (directional_shadow.depth.is_null() && directional_shadow.size > 0) {
                RD::TextureFormat tf;
-               tf.format = directional_shadow.use_16_bits ? RD::DATA_FORMAT_D16_UNORM : RD::DATA_FORMAT_D32_SFLOAT;
+               tf.format = directional_shadow.use_16_bits ? RD::DATA_FORMAT_D32_SFLOAT : RD::DATA_FORMAT_D32_SFLOAT;
                tf.width = directional_shadow.size;
                tf.height = directional_shadow.size;
                tf.usage_bits = RD::TEXTURE_USAGE_SAMPLING_BIT | RD::TEXTURE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT;

@TCROC
Copy link
Contributor Author

TCROC commented Apr 11, 2024

I just tested with those changes. I still get the same error leading up to a crash:

Validation Error: [ VUID-vkCmdDrawIndexed-magFilter-04553 ] Object 0: handle = 0xa0ab6300000008ce, type = VK_OBJECT_TYPE_DESCRIPTOR_SET; Object 1: handle = 0x47610f00000007e1, type = VK_OBJECT_TYPE_SAMPLER; Object 2: handle = 0x980b0000000002e, type = VK_OBJECT_TYPE_IMAGE_VIEW; | MessageID = 0x9c7248ee | vkCmdDrawIndexed: Descriptor set VkDescriptorSet 0xa0ab6300000008ce[] Sampler (VkSampler 0x47610f00000007e1[]) is set to use VK_FILTER_LINEAR with compareEnable is set to VK_FALSE, but image view's (VkImageView 0x980b0000000002e[]) format (VK_FORMAT_D16_UNORM) does not contain VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT in its format features. The Vulkan spec states: If a VkSampler created with magFilter or minFilter equal to VK_FILTER_LINEAR and compareEnable equal to VK_FALSE is used to sample a VkImageView as a result of this command, then the image view's format features must contain VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT (ht
2024-04-11 08:40:39.710 25574-25614 godot                   com.example.examplecrash             E     at: _debug_report_callback (drivers/vulkan/rendering_context_driver_vulkan.cpp:327)

@TCROC
Copy link
Contributor Author

TCROC commented Apr 11, 2024

Ok so I found that in drivers/vulkan/render_device_vulkan.cpp lines 1799 and 1800, if I manually set the magFilter to VK_FILTER_NEAREST, the validation errors go away. But it still crashes. Lets see if we can get anything interesting out of these crash logs now.

rendering_device_driver_vulkan.cpp

	sampler_create_info.magFilter = p_state.mag_filter == SAMPLER_FILTER_LINEAR ? VK_FILTER_NEAREST : VK_FILTER_NEAREST;
	sampler_create_info.minFilter = p_state.min_filter == SAMPLER_FILTER_LINEAR ? VK_FILTER_NEAREST : VK_FILTER_NEAREST;

Here is the stack trace:

art_sigsegv_fault 0x0000006ca7f12544
art::FaultManager::HandleFault(int, siginfo *, void *) 0x0000006ca7f12aa4
art::SignalChain::Handler(int, siginfo *, void *) 0x0000006faa1dd340
__kernel_rt_sigreturn 0x0000006fca6c5668
!!!0000!7ecf83797c9f36db9a6d042c7e4763!3dad7f8ed7! 0x0000006c1090b250
!!!0000!84c95bb772bf7dd08fc240b7dd7f78!3dad7f8ed7! 0x0000006c108fe648
!!!0000!2a418ddbe06bac73e12cf0f41836dd!3dad7f8ed7! 0x0000006c108c2420
DispatchCmdDispatch(VkCommandBuffer_T *, unsigned int, unsigned int, unsigned int) layer_chassis_dispatch.cpp:3491
vulkan_layer_chassis::CmdDispatch(VkCommandBuffer_T *, unsigned int, unsigned int, unsigned int) chassis.cpp:3147
vulkan::api::CmdDispatch(VkCommandBuffer_T *, unsigned int, unsigned int, unsigned int) 0x0000006f9eae27c0
<unknown> 0x0000006c24e89108
<unknown> 0x0000006c24e89910
<unknown> 0x0000006c24e8dd00
<unknown> 0x0000006c24e0f55c
<unknown> 0x0000006c24e0f460
<unknown> 0x0000006c24e8f2f8
<unknown> 0x0000006c22efae68
<unknown> 0x0000006c22eb1010
Java_org_godotengine_godot_GodotLib_step 0x0000006c22ec5038
art_quick_generic_jni_trampoline 0x0000006ca7e22248
art_quick_invoke_static_stub 0x0000006ca7e18bec
art::ArtMethod::Invoke(art::Thread *, unsigned int *, unsigned int, art::JValue *, const char *) 0x0000006ca7e86010
art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread *, art::ArtMethod *, art::ShadowFrame *, unsigned short, art::JValue *) 0x0000006ca7fea4cc
art::interpreter::DoCall<…>(art::ArtMethod *, art::Thread *, art::ShadowFrame &, const art::Instruction *, unsigned short, art::JValue *) 0x0000006ca7fe5068
art::interpreter::ExecuteSwitchImplCpp<…>(art::interpreter::SwitchImplContext *) 0x0000006ca7e2ce70
ExecuteSwitchImplAsm 0x0000006ca7e24bdc
art::interpreter::ExecuteSwitch(art::Thread *, const art::CodeItemDataAccessor &, art::ShadowFrame &, art::JValue, bool) 0x0000006ca7fe4a14
art::interpreter::Execute(art::Thread *, const art::CodeItemDataAccessor &, art::ShadowFrame &, art::JValue, bool, bool) 0x0000006ca7fdcdb4
artQuickToInterpreterBridge 0x0000006ca834ed08
art_quick_to_interpreter_bridge 0x0000006ca7e2237c
<unknown> 0x000000009d0309e8
art::interpreter::ExecuteSwitchImplCpp<…>(art::interpreter::SwitchImplContext *) 0x0000006ca7e2c380
ExecuteSwitchImplAsm 0x0000006ca7e24bdc
art::interpreter::ExecuteSwitch(art::Thread *, const art::CodeItemDataAccessor &, art::ShadowFrame &, art::JValue, bool) 0x0000006ca7fe4a14
art::interpreter::Execute(art::Thread *, const art::CodeItemDataAccessor &, art::ShadowFrame &, art::JValue, bool, bool) 0x0000006ca7fdcdb4
artQuickToInterpreterBridge 0x0000006ca834ed08
art_quick_to_interpreter_bridge 0x0000006ca7e2237c
art_quick_invoke_stub 0x0000006ca7e18968
art::ArtMethod::Invoke(art::Thread *, unsigned int *, unsigned int, art::JValue *, const char *) 0x0000006ca7e85ff4
art::InvokeVirtualOrInterfaceWithJValues<…>(const art::ScopedObjectAccessAlreadyRunnable &, _jobject *, art::ArtMethod *, const jvalue *) 0x0000006ca822e430
art::Thread::CreateCallback(void *) 0x0000006ca827e2ec
__pthread_start(void *) 0x0000006fad65ee48
__start_thread 0x0000006fad5fb458

@TCROC
Copy link
Contributor Author

TCROC commented Apr 11, 2024

I just tested with a t-pose model from mixamo and it also crashes: https://www.mixamo.com

So it appears that skeletons in general are causing godot to crash on Android. Should we update the title of the issue to reflect our findings?

@TCROC
Copy link
Contributor Author

TCROC commented Apr 11, 2024

Here is the MRP with the humanoid from mixamo that also causes the crash:

example-crash2.zip

@TCROC
Copy link
Contributor Author

TCROC commented Apr 11, 2024

And this is the logcat trace:

12:42:52.915  I  Late-enabling -Xcheck:jni
12:42:52.935  E  USNET: appName: com.example.examplecrash
12:42:52.937  D  Binder ioctl to enable oneway spam detection failed: Invalid argument
12:42:52.939  D  setConscryptValidator
12:42:52.939  D  setConscryptValidator - put
12:42:52.975  W  DexFile /data/data/com.example.examplecrash/code_cache/.studio/instruments-54177d1b.jar is in boot class path but is not in a known location
12:42:53.092  W  Current dex file has more than one class in it. Calling RetransformClasses on this class might fail if no transformations are applied to it!
12:42:53.154  W  Current dex file has more than one class in it. Calling RetransformClasses on this class might fail if no transformations are applied to it!
12:42:53.155  W  Redefining intrinsic method java.lang.Thread java.lang.Thread.currentThread(). This may cause the unexpected use of the original definition of java.lang.Thread java.lang.Thread.currentThread()in methods that have already been compiled.
12:42:53.155  W  Redefining intrinsic method boolean java.lang.Thread.interrupted(). This may cause the unexpected use of the original definition of boolean java.lang.Thread.interrupted()in methods that have already been compiled.
12:42:53.166  W  Current dex file has more than one class in it. Calling RetransformClasses on this class might fail if no transformations are applied to it!
12:42:53.295  W  Current dex file has more than one class in it. Calling RetransformClasses on this class might fail if no transformations are applied to it!
12:42:53.446  W  Current dex file has more than one class in it. Calling RetransformClasses on this class might fail if no transformations are applied to it!
12:42:53.451  D  handleBindApplication()++ app=com.example.examplecrash
12:42:53.452  D  Compat change id reported: 171979766; UID 10375; state: ENABLED
12:42:53.458  W  Application com.example.examplecrash is waiting for the debugger on port 8100...
12:42:53.458  I  Sending WAIT chunk
12:42:54.942  W  A resource failed to call close. 
12:42:57.854  I  Debugger has connected
12:42:57.855  I  waiting for debugger to settle...
12:42:58.057  I  waiting for debugger to settle...
12:42:58.259  I  waiting for debugger to settle...
12:42:58.461  I  waiting for debugger to settle...
12:42:58.665  I  waiting for debugger to settle...
12:42:58.868  I  waiting for debugger to settle...
12:42:59.070  I  waiting for debugger to settle...
12:42:59.274  I  waiting for debugger to settle...
12:42:59.478  I  debugger has settled (1332)
12:42:59.488  W  Slow operation: 6036ms so far, now at handleBindApplication: Before HardwareRenderer
12:42:59.514  W  Slow operation: 6063ms so far, now at handleBindApplication: After HardwareRenderer
12:43:00.593  V  ANGLE Developer option for 'com.example.examplecrash' set to: 'default'
12:43:00.594  V  App is not on the allowlist for updatable production driver.
12:43:00.615  D  LoadedApk::makeApplication() appContext.mOpPackageName=com.example.examplecrash appContext.mBasePackageName=com.example.examplecrash
12:43:00.615  D  No Network Security Config specified, using platform default
12:43:00.753  D  No Network Security Config specified, using platform default
12:43:00.776  D  handleBindApplication() --
12:43:00.798  D  RenderThread::requireGlContext()
12:43:00.800  I  QUALCOMM build                   : 3dad7f8ed7, I593c16c433
                 Build Date                       : 10/01/21
                 OpenGL ES Shader Compiler Version: EV031.32.02.02
                 Local Branch                     : 
                 Remote Branch                    : refs/tags/AU_LINUX_ANDROID_LA.UM.9.1.R1.11.00.00.604.073
                 Remote Branch                    : NONE
                 Reconstruct Branch               : NOTHING
12:43:00.800  I  Build Config                     : S P 10.0.7 AArch64
12:43:00.800  I  Driver Path                      : /vendor/lib64/egl/libGLESv2_adreno.so
12:43:00.949  I  PFP: 0x016ee190, ME: 0x00000000
12:43:00.958  D  RenderThread::setGrContext()
12:43:01.032  I  [INFO] isPopOver=false, config=true
12:43:01.032  I  updateCaptionType >> DecorView@ca403e0[], isFloating=false, isApplication=true, hasWindowControllerCallback=true, hasWindowDecorCaption=false
12:43:01.033  D  setCaptionType = 0, this = DecorView@ca403e0[]
12:43:01.044  I  getCurrentDensityDpi: from real metrics. densityDpi=480 msg=resources_loaded
12:43:01.050  V  Creating new Godot fragment instance.
12:43:03.738  D  registerListener :: 11, LSM6DSO Accelerometer, 20000, 0, 
12:43:03.750  D  registerListener :: 91, gravity  Non-wakeup, 20000, 0, org.godotengine.godot.Godot@45e16cc
12:43:03.770  D  registerListener :: 21, AK09918 Magnetometer, 20000, 0, org.godotengine.godot.Godot@45e16cc
12:43:03.803  D  registerListener :: 41, LSM6DSO Gyroscope, 20000, 0, org.godotengine.godot.Godot@45e16cc
12:43:03.846  I  notifyKeepScreenOnChanged: keepScreenOn=false
12:43:03.851  D  setRequestedVisible: visible=false, type=21, host=com.example.examplecrash/com.godot.game.GodotApp, from=android.view.InsetsSourceConsumer.hide:242 android.view.InsetsController.collectSourceControls:1215 android.view.InsetsController.controlAnimationUnchecked:1077 android.view.InsetsController.applyAnimation:1456 android.view.InsetsController.applyAnimation:1437 android.view.InsetsController.hide:1006 android.view.InsetsController.hide:981 android.view.ViewRootImpl.controlInsetsForCompatibility:3142 android.view.ViewRootImpl.setView:1555 android.view.WindowManagerGlobal.addView:509 
12:43:03.853  D  setRequestedVisible: visible=false, type=20, host=com.example.examplecrash/com.godot.game.GodotApp, from=android.view.InsetsSourceConsumer.hide:242 android.view.InsetsController.collectSourceControls:1215 android.view.InsetsController.controlAnimationUnchecked:1077 android.view.InsetsController.applyAnimation:1456 android.view.InsetsController.applyAnimation:1437 android.view.InsetsController.hide:1006 android.view.InsetsController.hide:981 android.view.ViewRootImpl.controlInsetsForCompatibility:3142 android.view.ViewRootImpl.setView:1555 android.view.WindowManagerGlobal.addView:509 
12:43:03.855  D  setRequestedVisible: visible=false, type=1, host=com.example.examplecrash/com.godot.game.GodotApp, from=android.view.InsetsSourceConsumer.hide:242 android.view.InsetsController.collectSourceControls:1215 android.view.InsetsController.controlAnimationUnchecked:1077 android.view.InsetsController.applyAnimation:1456 android.view.InsetsController.applyAnimation:1437 android.view.InsetsController.hide:1006 android.view.InsetsController.hide:981 android.view.ViewRootImpl.controlInsetsForCompatibility:3142 android.view.ViewRootImpl.setView:1555 android.view.WindowManagerGlobal.addView:509 
12:43:03.857  D  setRequestedVisible: visible=false, type=0, host=com.example.examplecrash/com.godot.game.GodotApp, from=android.view.InsetsSourceConsumer.hide:242 android.view.InsetsController.collectSourceControls:1215 android.view.InsetsController.controlAnimationUnchecked:1077 android.view.InsetsController.applyAnimation:1456 android.view.InsetsController.applyAnimation:1437 android.view.InsetsController.hide:1006 android.view.InsetsController.hide:981 android.view.ViewRootImpl.controlInsetsForCompatibility:3142 android.view.ViewRootImpl.setView:1555 android.view.WindowManagerGlobal.addView:509 
12:43:03.876  I  setView = com.android.internal.policy.DecorView@ca403e0 TM=true
12:43:03.917  I  onWindowVisibilityChanged(0) true org.godotengine.godot.GodotVulkanRenderView{3176227 VFE...C.. .F....I. 0,0-0,0} of ViewRootImpl@ebbd3b8[GodotApp]
12:43:03.966  I  Relayout returned: old=(0,0,2280,1080) new=(0,0,2280,1080) req=(2280,1080)0 dur=20 res=0x7 s={true -5476376673690511552} ch=true fn=-1
12:43:03.993  D  Binder ioctl to enable oneway spam detection failed: Invalid argument
12:43:03.994  D  eglCreateWindowSurface
12:43:03.997  I  windowStopped(false) true org.godotengine.godot.GodotVulkanRenderView{3176227 VFE...C.. .F....ID 0,0-2280,1080} of ViewRootImpl@ebbd3b8[GodotApp]
12:43:03.999  I  [DP] dp(1) 1 android.view.ViewRootImpl.reportNextDraw:11420 android.view.ViewRootImpl.performTraversals:4193 android.view.ViewRootImpl.doTraversal:2919 
12:43:04.011  I  pST: sr = Rect(0, 0 - 2280, 1080) sw = 2280 sh = 1080
12:43:04.011  I  onSSPAndSRT: pl = 0 pt = 0 sx = 1.0 sy = 1.0
12:43:04.012  I  pST: mTmpTransaction.apply, mTmpTransaction = android.view.SurfaceControl$Transaction@effd501
12:43:04.012  I  updateSurface: mVisible = true mSurface.isValid() = true
12:43:04.012  I  updateSurface: mSurfaceCreated = false surfaceChanged = true visibleChanged = true
12:43:04.013  I  surfaceCreated 1 #8 org.godotengine.godot.GodotVulkanRenderView{3176227 VFE...C.. .F....ID 0,0-2280,1080}
12:43:04.014  I  surfaceChanged (2280,1080) 1 #8 org.godotengine.godot.GodotVulkanRenderView{3176227 VFE...C.. .F....ID 0,0-2280,1080}
12:43:04.015  I  [DP] dp(2) 1 android.view.SurfaceView.updateSurface:1375 android.view.SurfaceView.lambda$new$1$SurfaceView:254 android.view.SurfaceView$$ExternalSyntheticLambda2.onPreDraw:2 
12:43:04.016  I  [DP] pdf(1) 1 android.view.SurfaceView.notifyDrawFinished:599 android.view.SurfaceView.performDrawFinished:586 android.view.SurfaceView.$r8$lambda$st27mCkd9jfJkTrN_P3qIGKX6NY:0 
12:43:04.016  I  Godot Engine v4.3.dev.mono.custom_build.a7b860250 (2024-04-09 08:42:45 UTC) - https://godotengine.org
12:43:04.030  I  uSP: rtp = Rect(0, 0 - 2280, 1080) rtsw = 2280 rtsh = 1080
12:43:04.030  I  onSSPAndSRT: pl = 0 pt = 0 sx = 1.0 sy = 1.0
12:43:04.031  I  aOrMT: uB = true t = android.view.SurfaceControl$Transaction@3cf7d94 fN = 1 android.view.SurfaceView.access$500:124 android.view.SurfaceView$SurfaceViewPositionUpdateListener.positionChanged:1728 android.graphics.RenderNode$CompositePositionUpdateListener.positionChanged:319 
12:43:04.031  I  aOrMT: vR.mWNT, vR = ViewRootImpl@ebbd3b8[GodotApp]
12:43:04.031  I  mWNT: t = android.view.SurfaceControl$Transaction@3cf7d94 fN = 1 android.view.SurfaceView.applyOrMergeTransaction:1628 android.view.SurfaceView.access$500:124 android.view.SurfaceView$SurfaceViewPositionUpdateListener.positionChanged:1728 
12:43:04.031  I  mWNT: merge t to BBQ
12:43:04.038  I  [ViewRootImpl@ebbd3b8[GodotApp]#0(BLAST Consumer)0](id:dbe00000000,api:1,p:3518,c:3518) queueBuffer: queued for the first time.
12:43:04.068  I  [DP] pdf(0) 1 android.view.ViewRootImpl.lambda$addFrameCompleteCallbackIfNeeded$3$ViewRootImpl:4995 android.view.ViewRootImpl$$ExternalSyntheticLambda16.run:6 android.os.Handler.handleCallback:938 
12:43:04.068  I  [DP] rdf()
12:43:04.068  D  reportDrawFinished (fn: -1) 
12:43:04.070  I  notifyKeepScreenOnChanged: keepScreenOn=true
12:43:04.284  D  searching for layers in '/data/app/~~lIfLcd4UqnnumILWcmjqMg==/com.example.examplecrash-Jt1xXKzin5CdtIGFYHTsfQ==/lib/arm64'
12:43:04.285  D  searching for layers in '/data/app/~~lIfLcd4UqnnumILWcmjqMg==/com.example.examplecrash-Jt1xXKzin5CdtIGFYHTsfQ==/base.apk!/lib/arm64-v8a'
12:43:05.519  I  Relayout returned: old=(0,0,2280,1080) new=(0,0,2280,1080) req=(2280,1080)0 dur=198 res=0x1 s={true -5476376673690511552} ch=false fn=2
12:43:05.520  I  updateBoundsLayer: t = android.view.SurfaceControl$Transaction@a34a183 sc = Surface(name=Bounds for - com.example.examplecrash/com.godot.game.GodotApp@0)/@0xff0df00 frame = 2
12:43:05.521  I  mWNT: t = android.view.SurfaceControl$Transaction@a34a183 fN = 2 android.view.ViewRootImpl.prepareSurfaces:2778 android.view.ViewRootImpl.performTraversals:4024 android.view.ViewRootImpl.doTraversal:2919 
12:43:05.521  I  mWNT: merge t to BBQ
12:43:05.537  I  mWNT: t = android.view.SurfaceControl$Transaction@7812339 fN = 2 android.view.SyncRtSurfaceTransactionApplier.applyTransaction:94 android.view.SyncRtSurfaceTransactionApplier.lambda$scheduleApply$0$SyncRtSurfaceTransactionApplier:71 android.view.SyncRtSurfaceTransactionApplier$$ExternalSyntheticLambda0.onFrameDraw:4 
12:43:05.537  I  mWNT: merge t to BBQ
12:43:05.554  I  Davey! duration=1634ms; Flags=0, FrameTimelineVsyncId=769797, IntendedVsync=205966706523332, Vsync=205966856523326, InputEventId=0, HandleInputStart=205966867697053, AnimationStart=205966867716480, PerformTraversalsStart=205966867747001, DrawStart=205968329020646, FrameDeadline=205966739856664, FrameInterval=205966867608876, FrameStartTime=16666666, SyncQueued=205968330364292, SyncStart=205968330777574, IssueDrawCommandsStart=205968330857417, SwapBuffers=205968333785803, FrameCompleted=205968341058667, DequeueBufferDuration=2365260, QueueBufferDuration=995261, GpuCompleted=205968341058667, SwapBuffersCompleted=205968334961376, DisplayPresentTime=0, 
12:43:05.611  I  mWNT: t = android.view.SurfaceControl$Transaction@315d2df fN = 3 android.view.SyncRtSurfaceTransactionApplier.applyTransaction:94 android.view.SyncRtSurfaceTransactionApplier.lambda$scheduleApply$0$SyncRtSurfaceTransactionApplier:71 android.view.SyncRtSurfaceTransactionApplier$$ExternalSyntheticLambda0.onFrameDraw:4 
12:43:05.611  I  mWNT: merge t to BBQ
12:43:05.620  D  added global layer 'VK_LAYER_KHRONOS_validation' from library '/data/app/~~lIfLcd4UqnnumILWcmjqMg==/com.example.examplecrash-Jt1xXKzin5CdtIGFYHTsfQ==/base.apk!/lib/arm64-v8a/libVkLayer_khronos_validation.so'
12:43:05.999  I  mWNT: t = android.view.SurfaceControl$Transaction@e04eb2c fN = 4 android.view.SyncRtSurfaceTransactionApplier.applyTransaction:94 android.view.SyncRtSurfaceTransactionApplier.lambda$scheduleApply$0$SyncRtSurfaceTransactionApplier:71 android.view.SyncRtSurfaceTransactionApplier$$ExternalSyntheticLambda0.onFrameDraw:4 
12:43:05.999  I  mWNT: merge t to BBQ
12:43:06.005  I  MSG_WINDOW_FOCUS_CHANGED 1 1
12:43:06.010  D  startInputInner - Id : 0
12:43:06.010  I  startInputInner - mService.startInputOrWindowGainedFocus
12:43:06.018  D  startInputInner - Id : 0
12:43:06.083  I  Loaded layer VK_LAYER_KHRONOS_validation
12:43:06.087  I  ===== BEGIN DUMP OF OVERRIDDEN SETTINGS =====
12:43:06.087  I  ===== END DUMP OF OVERRIDDEN SETTINGS =====
12:43:06.087  I  QUALCOMM build          : 3dad7f8ed7, I593c16c433
                 Build Date              : 10/01/21
                 Shader Compiler Version : EV031.32.02.02
                 Local Branch            : 
                 Remote Branch           : refs/tags/AU_LINUX_ANDROID_LA.UM.9.1.R1.11.00.00.604.073
                 Remote Branch           : NONE
                 Reconstruct Branch      : NOTHING
12:43:06.087  I  Build Config            : S P 10.0.7 AArch64
12:43:06.088  I  Driver Path             : /vendor/lib64/hw/vulkan.adreno.so
12:43:06.090  I  Vulkan Debug Report: object - -5476376679864662576
                 Validation Information: [ UNASSIGNED-CreateInstance-status-message ] Object 0: handle = 0xb400006cbc6f05d0, type = VK_OBJECT_TYPE_INSTANCE; | MessageID = 0x87d47f57 | Khronos Validation Layer Active:
                     Settings File: None. Default location is /\vk_layer_settings.txt.
                     Current Enables: None.
                     Current Disables: None.
12:43:06.090  E  USER WARNING: Vulkan Debug Report: object - -5476376679864662576
                 Validation Performance Warning: [ UNASSIGNED-CreateInstance-debug-warning ] Object 0: handle = 0xb400006cbc6f05d0, type = VK_OBJECT_TYPE_INSTANCE; | MessageID = 0xe8d1a9fe | VALIDATION LAYERS WARNING: Using debug builds of the validation layers *will* adversely affect performance.
12:43:06.090  E     at: _debug_report_callback (drivers/vulkan/rendering_context_driver_vulkan.cpp:324)
12:43:06.098  I  Vulkan Debug Report: object - -5476376679864653008
                 Validation Information: [ UNASSIGNED-cache-file-error ] Object 0: handle = 0xb400006cbc6f2b30, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xf0bb3995 | Cannot open shader validation cache at /tmp/shader_validation_cache-10375.bin for reading (it may not exist yet)
12:43:06.100  I  Vulkan 1.1.128 - Forward Mobile - Using Device #0: Qualcomm - Adreno (TM) 640
12:43:06.104  I  [SurfaceView - com.example.examplecrash/com.godot.game.GodotApp@3176227@0#1(BLAST Consumer)1](id:dbe00000001,api:1,p:3518,c:3518) FrameBooster: VULKAN surface was catched
12:43:06.105  D  FrameBooster: InterpolationGui: UID 10375 detected as using Vulkan
12:43:06.105  E  USER WARNING: Vulkan Debug Report: object - -5476376679864662800
                 Validation Performance Warning: [ UNASSIGNED-CoreValidation-SwapchainPreTransform ] Object 0: handle = 0xb400006cbc6f04f0, type = VK_OBJECT_TYPE_PHYSICAL_DEVICE; | MessageID = 0x214b6dd0 | vkCreateSwapchainKHR(): pCreateInfo->preTransform (VK_SURFACE_TRANSFORM_IDENTITY_BIT_KHR) doesn't match the currentTransform (VK_SURFACE_TRANSFORM_ROTATE_90_BIT_KHR) returned by vkGetPhysicalDeviceSurfaceCapabilitiesKHR, the presentation engine will transform the image content as part of the presentation operation.
12:43:06.105  E     at: _debug_report_callback (drivers/vulkan/rendering_context_driver_vulkan.cpp:324)
12:43:09.271  D  Installing profile for com.example.examplecrash
12:43:09.886  D  PlayerBase::PlayerBase()
12:43:09.889  D  TrackPlayerBase::TrackPlayerBase()
12:43:09.889  I  Emulating old channel mask behavior (ignoring positional mask 0x3, using default mask 0x3 based on channel count of 2)
12:43:09.902  D  createTrack_l(0): AUDIO_OUTPUT_FLAG_FAST denied by server; frameCount 0 -> 1764
12:43:09.902  I  Need throttle time for OpenSLES player
12:43:09.911  I  
12:43:13.538  D  OnGodotSetupCompleted
12:43:13.541  D  OnGodotMainLoopStarted
12:43:14.925  I  [SurfaceView - com.example.examplecrash/com.godot.game.GodotApp@3176227@0#1(BLAST Consumer)1](id:dbe00000001,api:1,p:3518,c:3518) queueBuffer: queued for the first time.
12:46:13.097  W  restartIfDisabled(325): releaseBuffer() track 0xb400006dbc881040 disabled due to previous underrun, restarting
12:46:13.098  A  Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xf0 in tid 3677 (VkThread), pid 3518 (le.examplecrash)
12:46:13.100  E  USER ERROR: BUG: Unreferenced static string to 0: quarter_texture
12:46:13.100  E     at: unref (core/string/string_name.cpp:127)
12:46:13.100  E  USER ERROR: BUG: Unreferenced static string to 0: half_texture
12:46:13.100  E     at: unref (core/string/string_name.cpp:127)
12:46:13.100  E  USER ERROR: BUG: Unreferenced static string to 0: sky_buffers
12:46:13.100  E     at: unref (core/string/string_name.cpp:127)
12:46:13.100  E  USER ERROR: BUG: Unreferenced static string to 0: blur_0
12:46:13.100  E     at: unref (core/string/string_name.cpp:127)
12:46:13.100  E  USER ERROR: BUG: Unreferenced static string to 0: back_color
12:46:13.100  E     at: unref (core/string/string_name.cpp:127)
12:46:13.100  E  USER ERROR: BUG: Unreferenced static string to 0: back_depth
12:46:13.100  E     at: unref (core/string/string_name.cpp:127)
12:46:13.100  E  USER ERROR: BUG: Unreferenced static string to 0: Fog
12:46:13.100  E     at: unref (core/string/string_name.cpp:127)
12:46:13.100  E  USER ERROR: BUG: Unreferenced static string to 0: VRS
12:46:13.100  E     at: unref (core/string/string_name.cpp:127)
12:46:13.100  E  USER ERROR: BUG: Unreferenced static string to 0: _compression
12:46:13.100  E     at: unref (core/string/string_name.cpp:127)
12:46:13.230  A  FORTIFY: pthread_mutex_lock called on a destroyed mutex (0x6c1f5dde24)

@TCROC
Copy link
Contributor Author

TCROC commented Apr 11, 2024

I just tested this with godot's player model here: https://github.com/godotengine/godot-demo-projects/blob/master/3d/platformer/player/player.glb

And it also crashed

Here's the new MRP using it
example-crash3.zip

@TCROC
Copy link
Contributor Author

TCROC commented Apr 12, 2024

After A LOT of debugging and tireless help from @clayjohn , I am reasonably confident to have bisected and determined this PR to be the root cause of the crash: #87888

Yay!!! :) Now I can sleep peacefully!

Edit:

Literally, its almost midnight and I'm very tired 😅 Excited to see what new discoveries await tomorrow!

@akien-mga
Copy link
Member

CC @TokageItLab @reduz

@akien-mga akien-mga changed the title Android directional light shadow causes crash with blender model 3D scene with Skeletons crashes on Android after SkeletonModifier3D refactoring Apr 12, 2024
@akien-mga akien-mga added this to the 4.3 milestone Apr 12, 2024
@TokageItLab
Copy link
Member

TokageItLab commented Apr 12, 2024

The only change related to the #87888 renderer was to reduce the duplicated Skin drawings, so perhaps a change in the BoneAttachment process is the problem. BoneAttachment has an abnormal process with overuse of signals compared to all other modules, and it may not be able to fully handle it on a particular platform.

@TCROC If there is no BoneAttachment, does it work correctly?

@TokageItLab
Copy link
Member

I can come up with several possible fixes to BoneAttachment, but I have sent #90575 for now. Please test to see if it works.

Other possible fixes:

  1. Leave the existing implementation as it is and connect all signals in the BoneAttachment with deferred
  2. If Remove bone_pose_updated signal and replace it with the skeleton_updated signal #90575 has no effect, further Remove bone_pose_updated signal and replace it with the skeleton_updated signal #90575 is connected with deferred

@TCROC
Copy link
Contributor Author

TCROC commented Apr 12, 2024

Just woke up. I'm gonna grab some coffee and check these out. Thanks @TokageItLab :)

@TCROC
Copy link
Contributor Author

TCROC commented Apr 13, 2024

To summarize:

forward+ = crash
mobile = crash
gl_compatibility = works

@TCROC
Copy link
Contributor Author

TCROC commented Apr 14, 2024

Hello @TokageItLab . I believe I finished updating the original issue description with all of the details we found. I also tested a few additional things and added to the report while I was at it. Here are the ones to note. (All of these are in the main issue description now as well).

Above you asked:

If that doesn't work, I would like you to test what happens when you place a PhysicalBody that does not use Skeleton.

I tested this after reverting the PR that causes skeletons with meshes in general to crash. I simply duplicated my character, removed the skeleton, and equipped him as a hat to the character with a skeleton. Interestingly it worked! ✅ :) MRP: https://github.com/godotengine/godot/files/14971317/example-working-mesh.zip

Then I also tested a couple more things:

  1. Adding particle3ds as a child of a bone attachment. This crashed ❌. MRP: https://github.com/godotengine/godot/files/14971175/example-crash-particles-3d.zip

  2. Adding an item with a skeleton as a child of bone attachment BUT NOT CHANGING THAT ITEM'S SKELETON. This crashed ❌. MRP: https://github.com/godotengine/godot/files/14971312/example-crash-equip-mesh.zip

Let me know if you have any questions, would like anything else from me, or would like me to test something. Thanks again! :)

@demolke
Copy link
Contributor

demolke commented Apr 15, 2024

(deleting this for now as it wasn't relevant)

@TokageItLab
Copy link
Member

TokageItLab commented Apr 16, 2024

@demolke I can't test it because blender is broken to begin with. Can you provide a project output as glTF to isolate the problem? If you do encounter a crash there, I'm guessing from the error log that it probably has nothing to do with this issue, and you'll need to send it as another issue.

@TokageItLab TokageItLab changed the title 3D scene with Skeletons crashes on Android after SkeletonModifier3D refactoring Skinned Skeleton3D crashes on Android with directional shadow in forward+/mobile Apr 16, 2024
@TokageItLab TokageItLab changed the title Skinned Skeleton3D crashes on Android with directional shadow in forward+/mobile Skinned Skeleton3D calling add_child() cause crash on Android with directional shadow in forward+/mobile Apr 16, 2024
@TokageItLab
Copy link
Member

TokageItLab commented Apr 16, 2024

@TCROC I have changed the title based on the results of the previous tests, can you confirm that this is true?

I would expect a crash on the Android in mobile renderer when calling add_child() in skinned Skeleton with regardless of any type of Node the child to be added (...maybe only Node3D? You can experiment with adding Node or Node2D as children).

If you can confirm that in a disable_deprecated build or revert #87888 build (it would be great if you could check each separately), we can be sure of that. FYI, #87888 does add_child internally on enter_tree, which appears to be what caused #87888 to crash, but the fundamental cause of this issue is definitely #84976.

@TCROC
Copy link
Contributor Author

TCROC commented Apr 16, 2024

Yes I think this is a good title change. With the exception of the first MRP where skeletons + mesh in general are crashing. Everything else is add_child.

I will do those tests tomorrow as well! :)

@TCROC
Copy link
Contributor Author

TCROC commented Apr 16, 2024

@TokageItLab Here are the results of the requested tests:

I tested the "disable_deprecated" and "revert #87888" separately. Both gave the same results. They are as follows:

✅ = works
❌ = crash

add_child() in skinned Skeleton + Node type child = ✅
add_child() in skinned Skeleton + Node2D type child = ✅
add_child() in skinned Skeleton + Node3D type child = ✅
add_child() in skinned Skeleton + Mesh Without Skeleton child = ✅
add_child() in skinned Skeleton + Mesh With Skeleton child = ❌
add_child() in skinned Skeleton + change skeleton of child to be parent = ❌
add_child() in skinned Skeleton + GPUParticles3D type child = ❌

✅ MRP Mesh Without Skeleton Child: https://github.com/godotengine/godot/files/14971317/example-working-mesh.zip
❌ MRP Mesh With Skeleton Child: https://github.com/godotengine/godot/files/14971312/example-crash-equip-mesh.zip
❌ MRP Particles Child: https://github.com/godotengine/godot/files/14971175/example-crash-particles-3d.zip
❌ MRP Change Skeleton Child: https://github.com/godotengine/godot/files/14967690/example-crash-change-skeleton.zip

So I'm beginning to think the MRPs that causes the crash may actually be parent skeleton adding child with skeleton AND parent skeleton adding child of ParticlesGPU3D

@TokageItLab TokageItLab changed the title Skinned Skeleton3D calling add_child() cause crash on Android with directional shadow in forward+/mobile Skinned Skeleton3D calling add_child() with specific classes cause crash on Android with directional shadow in forward+/mobile Apr 16, 2024
@demolke
Copy link
Contributor

demolke commented Apr 17, 2024

@demolke I can't test it because blender is broken to begin with. Can you provide a project output as glTF to isolate the problem? If you do encounter a crash there, I'm guessing from the error log that it probably has nothing to do with this issue, and you'll need to send it as another issue.

Thanks, I've removed it for now and will file an issue once I understand it better

@TCROC
Copy link
Contributor Author

TCROC commented Apr 17, 2024

@demolke I can't test it because blender is broken to begin with. Can you provide a project output as glTF to isolate the problem? If you do encounter a crash there, I'm guessing from the error log that it probably has nothing to do with this issue, and you'll need to send it as another issue.

Thanks, I've removed it for now and will file an issue once I understand it better

It may not be a bad idea to file a new issue with the blender example causing a crash. I filed this issue even with only a broad understanding and then we have slowly narrowed in on it. I think its useful to at least have things tracked so they don't get lost or forgotten.

@DarioSamo
Copy link
Contributor

DarioSamo commented Apr 25, 2024

I figure I should give my two cents since it is suspected #84976 might be the cause of the issue.

I gave some of the MRPs linked here a test on desktop at least in master with the strictest possible validation enabled. None of the projects reported anything back, so we can probably rule out synchronization as far as the VLL goes.

However, there's some flags you can play around with if you feel it should be more paranoid about it:

// When true, the command graph will attempt to reorder the rendering commands submitted by the user based on the dependencies detected from
// the commands automatically. This should improve rendering performance in most scenarios at the cost of some extra CPU overhead.
//
// This behavior can be disabled if it's suspected that the graph is not detecting dependencies correctly and more control over the order of
// the commands is desired (e.g. debugging).

#define RENDER_GRAPH_REORDER 1

// Synchronization barriers are issued between the graph's levels only with the necessary amount of detail to achieve the correct result. If
// it's suspected that the graph is not doing this correctly, full barriers can be issued instead that will block all types of operations
// between the synchronization levels. This setting will have a very negative impact on performance when enabled, so it's only intended for
// debugging purposes.

#define RENDER_GRAPH_FULL_BARRIERS 0

When reordering is disabled, the ARG will basically fall back to the behavior of the previous version and submit commands in their original order. When full barriers are enabled, then it'll basically be completely paranoid about synchronization. However, given we have no reliance on stuff like indirect drawing or dispatches, I'm very suspicious of synchronization being a problem here that could cause a crash. At worst I could imagine reordering having some effect significant enough for a crash.

There's a lot of degrees of separation here between what was originally reported and the issue that could be causing a crash so it's a bit tough to understand the possible cause to me yet. Instancing elements from scripts or what they're attached to shouldn't cause any visible difference to the renderer, as no hierarchy is visible from its side nor are the objects treated differently depending on their origin. So I think pretty much anything should be reproducible on a static version of the scene if it exists.

My suggestion for now would be to see if you see any behavior change whatsoever messing with these options, but I figured it'd be worth noting that the pattern has usually been that the introduction of the ARG has exposed some long-standing bugs due to minimizing the amount of barriers and transitions necessary. Either Godot's own issues were exposed or more obscure driver issues that end up requiring workarounds (like how @akien-mga's setup required the introduction of buffer barriers as the driver seemingly did not follow global barriers very well).

@TCROC
Copy link
Contributor Author

TCROC commented Apr 25, 2024

@DarioSamo I'm curious. Were you able to reproduce the crashes with the MRPs? And did you run into any issues with those MRPs or were they good?

@clayjohn
Copy link
Member

clayjohn commented Apr 25, 2024

I just tested the "Skeleton mesh + directional light + shadows + bone attachment + add_child gpu particles 3d" on a pixel 4 with the following results:
4.2.2: works fine
4.3-dev2: crash
4.3-dev4: crash

So clearly this is a mobile specific issue and something is getting through both our validation code and Vulkan's validation layers

@TCROC
Copy link
Contributor Author

TCROC commented Apr 25, 2024

Thanks @clayjohn ! :) Should we update the label with "confirmed" instead of "needs testing"? Or is that not what those labels mean?

@DarioSamo
Copy link
Contributor

DarioSamo commented Apr 25, 2024

@DarioSamo I'm curious. Were you able to reproduce the crashes with the MRPs? And did you run into any issues with those MRPs or were they good?

Given Clay just got a crash I figure we can investigate it from here. And no, I did not run into any visual issues with them either.

It's been brought to my attention from the call stack posted before the crash is actually at the driver level but not the GPU execution, so it's probably some situation the renderer causes on this particular driver.

@clayjohn
Copy link
Member

clayjohn commented Apr 26, 2024

Interesting progress. I can't find anything unique about the dispatch that crashes. I have also tested out forcing the engine to rebind the pipeline and all uniform sets, but the crash persists.

I noticed that both devices (the Samsung S10e and my device, the Pixel 4) have an Adreno 640 GPU. So I tested on a device with a Mali GPU and confirmed that the crash is not happening there. I have also tested on a device with an Adreno 530 GPU and it didn't crash there either.

Looks like this is a driver bug after all that we are somehow triggering

To refine my thoughts. Given that:

  1. The issue appeared after the introduction of the ARG and
  2. The issue requires a skeleton, directional light shadows and GPUParticles or another skeleton to be added at runtime
    It would seem that the issue comes from barriers. But I enabled FORCE_FULL_ACCESS_BITS and RENDER_GRAPH_FULL_BARRIERS and they didn't resolve the crash

@clayjohn
Copy link
Member

Small update on the ARG side of things. We have confirmed that this is indeed a driver bug in certain Adreno drivers (seems to be 6XX series). The exact bug is that if a scissor rect is set in a given command buffer, any compute commands dispatched after will crash. Technically compute should ignore the scissor rect, but on this driver version it doesn't.

We are still investigating solutions. The reason this didn't crash previously is that in the mobile renderer the only compute work we do is skeletons and particles which are submitted at the beginning of the frame. So they always ran before the scissor rect was set. But now that we re-order work, it is possible for a graphics command to set the scissor rect before certain compute commands and cause this crash.

@TCROC
Copy link
Contributor Author

TCROC commented Apr 30, 2024

Amazing work @clayjohn and everyone investigating!!! That sounds like an EXTREMELY hard edge case to detect! Very impressive! I look forward to seeing what solution or potential solutions we come up with! :)

@DarioSamo
Copy link
Contributor

DarioSamo commented May 3, 2024

I've submitted #91514 for fixing this issue.

At the moment however, the detection logic is not implemented. I pointed out the relevant line where it can be forced to use the workaround if you wish to check if it fixes the crash. It might be worth gathering the exact driver versions and device names that require this workaround.

@clayjohn
Copy link
Member

clayjohn commented May 3, 2024

Judging from this report, it seems at least 620, 630, 640, and 650 are affected. I think that it is probably safe to just use the workaround for all Adreno 6XX devices

@TCROC
Copy link
Contributor Author

TCROC commented May 3, 2024

Great work @DarioSamo and @clayjohn !! :) @DarioSamo is your PR ready to be tested yet? If so I can merge it into my fork and see how it goes. Or r you still touching some things up on it?

@DarioSamo
Copy link
Contributor

Great work @DarioSamo and @clayjohn !! :) @DarioSamo is your PR ready to be tested yet? If so I can merge it into my fork and see how it goes. Or r you still touching some things up on it?

Clay just added the device name detection for devices for Adreno 6XX devices. Give it a shot.

@TCROC
Copy link
Contributor Author

TCROC commented May 3, 2024

I just tested and can confirm this PR fixes it! Great work everyone! Shoutout to @clayjohn @DarioSamo and @TokageItLab !! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants