Vulkan Acceleration #50

Open
DifferentialityDevelopment opened this issue May 13, 2024 · 32 comments
Comments

@DifferentialityDevelopment
Contributor

Hi @b4rtaz

I was tinkering a bit over the weekend and figured it might be possible to create a version of worker/main that accelerates the inference by offloading some work to the GPU to handle.

I've never really worked with compute shaders, or Vulkan for that matter, but I put together a simple demo that successfully ran a compute shader using Vulkan.
The compute shader currently just takes an input buffer and copies the data to an output buffer.

This is what I have so far
compute-shader-example.zip

My next step is to upgrade it to do a matmul on two matrices, run the same operation on the CPU, and compare the results. I'm hopeful that I could utilize the worker/root node's dedicated/integrated GPU to do the heavy lifting.

I'll do some experiments on my fork on integrating it once I have a matmul compute shader working and let you know how it goes.

Right now I just want to get something working where I give it two matrices and it computes the resulting matmul output.
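
A minimal sketch of the CPU half of that comparison, written in Python for brevity rather than in the project's own language. The names `matmul_cpu` and `max_abs_diff` are hypothetical, and `gpu_out` is a hard-coded placeholder standing in for whatever the compute shader would write to its output buffer.

```python
def matmul_cpu(a, b):
    """Naive reference matmul: a is n x k, b is k x m (lists of rows)."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
cpu_out = matmul_cpu(a, b)

# Placeholder (assumption): in a real test this would be read back from the
# output buffer of the compute shader.
gpu_out = [[19.0, 22.0], [43.0, 50.0]]

print("max abs diff:", max_abs_diff(cpu_out, gpu_out))
```

Comparing with a maximum absolute difference rather than exact equality leaves room for the GPU accumulating in a different order than the CPU loop.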

@DifferentialityDevelopment
Contributor Author

I've managed to upgrade it to multiply two square matrices of size x * x.

image
image

As a test I multiplied two 2048 * 2048 matrices, using a float for each element.

Next I want to upgrade it to handle matrices whose dimensions are not a multiple of 32 and which are not square, then try to build a compute shader that can do attention with softmax on qkv matrices.
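
One common way to handle dimensions that are not a multiple of the workgroup size is to round the number of workgroups up and have every invocation check its coordinates before writing. The sketch below simulates that on the CPU in Python. The 32 x 32 tile is an assumption taken from the "multiple of 32" remark, and `group_count` and `covered_cells` are hypothetical helper names, not part of the demo code.

```python
TILE = 32  # assumed workgroup size per dimension


def group_count(size, tile=TILE):
    """Number of workgroups needed to cover `size` elements (ceiling division)."""
    return (size + tile - 1) // tile


def covered_cells(rows, cols, tile=TILE):
    """Simulate a 2D dispatch where each invocation bounds-checks its cell.

    Returns (written, skipped): invocations that map to a real output cell
    versus invocations that fall outside the matrix and must do nothing.
    """
    written = skipped = 0
    for gy in range(group_count(rows, tile)):
        for gx in range(group_count(cols, tile)):
            for ly in range(tile):
                for lx in range(tile):
                    r, c = gy * tile + ly, gx * tile + lx
                    if r < rows and c < cols:
                        written += 1   # in-bounds: the shader would write out[r][c]
                    else:
                        skipped += 1   # out-of-bounds: the shader would return early
    return written, skipped


# A non-square matrix whose sides are not multiples of 32.
written, skipped = covered_cells(100, 70)
print(group_count(100), group_count(70), written, skipped)
```

The cost of this approach is the idle invocations in the last row and column of workgroups; padding the buffers up to a multiple of the tile size is the usual alternative.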

The only real tricky thing is building the compute shader correctly, integrating it wouldn't be that hard, the workers/root node could selectively offload certain calculations to be processed in the compute shader by the GPU, and you can account for GPU memory restrictions by splitting the load up into sequential calls to the GPU.
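
A rough sketch of that splitting idea, in Python: because each output row of a matmul depends on only one row of the weight matrix, the weights can be sent in row ranges that each fit a memory budget. The function name `row_chunks` and the 4 MiB budget are assumptions for illustration only.

```python
def row_chunks(n_rows, row_bytes, budget_bytes):
    """Yield (start, count) row ranges whose weights fit within budget_bytes."""
    rows_per_call = budget_bytes // row_bytes
    if rows_per_call == 0:
        raise ValueError("a single row does not fit in the budget")
    start = 0
    while start < n_rows:
        count = min(rows_per_call, n_rows - start)
        yield start, count
        start += count


# 2048 x 2048 float32 weights (4 bytes per element) with a hypothetical
# 4 MiB budget for the weight buffer.
chunks = list(row_chunks(2048, 2048 * 4, 4 * 1024 * 1024))
for start, count in chunks:
    # Each range would become one sequential dispatch writing
    # output rows start .. start + count - 1.
    print(start, count)
```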

As before here is a copy of the code as it is now:
compute-matmul.zip

@b4rtaz
Owner

b4rtaz commented May 13, 2024

Great! I was wondering about CUDA as the first accelerator but for Raspberry Pi Vulkan may be a better choice.

Please check the llama.cpp repository, they have implemented the matrix multiplication already.

@DifferentialityDevelopment
Contributor Author

Great! I was wondering about CUDA as the first accelerator but for Raspberry Pi Vulkan may be a better choice.

Please check the llama.cpp repository, they have implemented the matrix multiplication already.

Yeah, Vulkan is nice because it has a wide range of support, not just for SBCs but also for computers with AMD or Nvidia cards.

I've basically got dot product multiplication done but there are some nitty gritty difficult issues that I still need to figure out.

Also yeah not a bad idea, busy looking at these shaders right now.
https://github.com/ggerganov/llama.cpp/blob/master/kompute-shaders/op_mul_mat_f16.comp

I've at least gained most of the knowledge now that I need to actually utilize their shaders.

@DifferentialityDevelopment
Contributor Author

I've been working on how I'd integrate it into distributed-llama and I think I have a decent idea of how to go about it.
With the way I currently have it, I can offload certain things to Vulkan compute shaders. For instance, I can start with just the llamaQkv task: where it does the 3 matmul calls, it instead calls matmulVulkan.
Everything is added seamlessly behind ifdefs.
All you would need to do to enable the Vulkan features is run make main VULKAN=1

I am working towards getting at least the 6 matmul functions offloaded to Vulkan, then I'll submit a pull request for it. We will have to see in practice how well it performs.

Ideally I'd have 1 compute shader for all 6, but for simplicity's sake I'm going to use 6 different compute shaders, one for each: matmulF32, matmulF16, matmulQ40, matmulQ80, matmulQ40vQ80 & matmulQ80vQ80
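
To show what a quantized variant such as matmulQ40vQ80 has to compute per block, here is a Python sketch of a single block dot product. It assumes the block layout follows llama.cpp's Q4_0 / Q8_0 conventions (32 elements per block, one scale per block, 4-bit values stored with an offset of 8 and packed two per byte). The quantizers are deliberately simplified, scales are kept as Python floats rather than 16-bit floats, and all function names are hypothetical, so this is not the project's actual code.

```python
QK = 32  # elements per block; assumed, following llama.cpp's Q4_0 / Q8_0


def quantize_q40(values):
    """Simplified Q40 block: (scale, 16 bytes holding 32 packed 4-bit values)."""
    assert len(values) == QK
    amax = max(abs(v) for v in values)
    d = amax / 7.0
    q = [8] * QK if d == 0 else [min(15, max(0, round(v / d) + 8)) for v in values]
    # Byte j keeps element j in its low nibble and element j + 16 in its high nibble.
    packed = bytes(q[j] | (q[j + QK // 2] << 4) for j in range(QK // 2))
    return d, packed


def quantize_q80(values):
    """Simplified Q80 block: (scale, 32 signed 8-bit integers)."""
    assert len(values) == QK
    amax = max(abs(v) for v in values)
    d = amax / 127.0
    qs = [0] * QK if d == 0 else [max(-127, min(127, round(v / d))) for v in values]
    return d, qs


def dot_q40_q80(block_w, block_x):
    """Integer dot product of one Q40 block with one Q80 block, then rescale."""
    dw, packed = block_w
    dx, qx = block_x
    acc = 0
    for j in range(QK // 2):
        lo = (packed[j] & 0x0F) - 8   # element j
        hi = (packed[j] >> 4) - 8     # element j + 16
        acc += lo * qx[j] + hi * qx[j + QK // 2]
    return acc * dw * dx


w = [((i * 7) % 15 - 7) * 0.1 for i in range(QK)]
x = [((i * 5) % 11 - 5) * 0.3 for i in range(QK)]
exact = sum(a * b for a, b in zip(w, x))
approx = dot_q40_q80(quantize_q40(w), quantize_q80(x))
print("float dot:", exact, "quantized dot:", approx)
```

A full matmul row is the sum of such block dot products, which is why the shader needs 8-bit integer types to read the packed data directly.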

@zhengpeirong

Great! I want to share the news and video that show the Vulkan GPU hardware support is available on Raspberry Pi OS officially.

@DifferentialityDevelopment
Contributor Author

Great! I want to share the news and video that show the Vulkan GPU hardware support is available on Raspberry Pi OS officially.

Very cool! Would be interesting to see what sort of speedup it has on the Raspberry Pi.

@DifferentialityDevelopment
Contributor Author

This is my in-progress branch:
https://github.com/DifferentialityDevelopment/distributed-llama/tree/vulkan-acceleration
It's not quite working just yet, but it's mostly integrated. I just need to sort out the kinks now.

@b4rtaz Getting there...

@unclemusclez

Great! I want to share the news and video that show the Vulkan GPU hardware support is available on Raspberry Pi OS officially.

no 3b native support :(

@DifferentialityDevelopment
Contributor Author

Well, I actually managed to get Vulkan acceleration working!

./vulkan-test
WARNING: dzn is not a conformant Vulkan implementation, testing use only.
Created Vulkan Instance!
Device Name: Microsoft Direct3D12 (NVIDIA GeForce RTX 3060)
API Version: 1.2.274
Create the buffers
Get memory requirements for the buffers
Allocate and map memory for the buffers
Bind the memory to the buffers
Copy the weights to GPU memory
Copy the input to GPU memory
Copy the matmul info to GPU memory
Bind the buffers to the descriptor sets
Write and update the descriptor sets
Create a pointer to the commandBuffer member
Bind pipeline and descriptor sets to the command buffer
Wait for the compute shader to finish
Got the output from the compute shader
✅ matmulQ80
✅ matmulQ80vQ80

Only matmulF32 at the moment. I want to do these next:
matmulF16, matmulQ40, matmulQ80, matmulQ40vQ80 and matmulQ80vQ80

Once I've got them done as well then I'll do some speed tests to see what kind of an uplift this has.

@DifferentialityDevelopment
Contributor Author

Need to figure out what Vulkan extensions I need to enable to support the Q40 and Q80 data types in the compute shader.

The actual shader implementation isn't that complicated luckily, but I need to use specific data types.

@DifferentialityDevelopment
Contributor Author

I'm very close now, just need to get the compute shader code to work correctly.
I wrote tests to be able to compare the matmul results between CPU and GPU to ensure correctness.

@DifferentialityDevelopment
Contributor Author

I've successfully gotten matmulQ40Q80 to run via a compute shader on Vulkan 🔥 and get results back that are nearly 1:1 with the CPU-calculated results.
You won't be able to run Vulkan mode under WSL, not until int8 support comes via the drivers.
I'm going to do some speed tests now. I have to install Linux on an SSD and boot directly into Linux, which is the only way to get native support for int8 on Vulkan.
However, this whole thing is only a problem as long as this project is Linux-only. If it could run natively on Windows it would be a different story altogether, but that would require a threading implementation that works cross-platform, since Windows doesn't work with pthread.h.

One last thing I need to figure out: if I run it in inference mode with more than 1 thread running at a time, it bugs out. It's not yet multithread capable, but I will sort that out soon.

In the meantime, I can check how fast it is compared to CPU on a single matmul pass.

I just printed the first 32 floats from each:

CPU Results: 0.00731812 0.00678142 0.00680221 0.00671218 0.00693316 0.00689348 0.00708914 0.00695994 0.00664946 0.00704365 0.00665891 0.00735391 0.00661354 0.00689124 0.00729823 0.0068318 0.00696582 0.00684787 0.00673844 0.0071383 0.00692065 0.00697429 0.00682781 0.00695222 0.0068927 0.00702631 0.00696984 0.00717608 0.00726813 0.00741034 0.00734587 0.00691799

Vulkan Results: 0.00737184 0.0069693 0.00689848 0.00662084 0.00692127 0.00690478 0.00713998 0.00661354 0.00675865 0.00720126 0.00705041 0.00735674 0.00685534 0.00662042 0.00709984 0.00699035 0.00676422 0.00702015 0.00673056 0.00712065 0.00695963 0.00703757 0.00696171 0.00701772 0.00682509 0.00709918 0.00704923 0.00718651 0.00713319 0.00719681 0.00736942 0.00710319

✅ matmulQ40Q80
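
To put a number on how close the two result sets are, the printed values can be compared element by element. The Python sketch below uses the 32 CPU and 32 Vulkan values exactly as posted above. The variable names are hypothetical, and no pass/fail tolerance is implied, since the thread does not state which threshold the test uses.

```python
cpu = [float(t) for t in """
0.00731812 0.00678142 0.00680221 0.00671218 0.00693316 0.00689348 0.00708914 0.00695994
0.00664946 0.00704365 0.00665891 0.00735391 0.00661354 0.00689124 0.00729823 0.0068318
0.00696582 0.00684787 0.00673844 0.0071383 0.00692065 0.00697429 0.00682781 0.00695222
0.0068927 0.00702631 0.00696984 0.00717608 0.00726813 0.00741034 0.00734587 0.00691799
""".split()]

vulkan = [float(t) for t in """
0.00737184 0.0069693 0.00689848 0.00662084 0.00692127 0.00690478 0.00713998 0.00661354
0.00675865 0.00720126 0.00705041 0.00735674 0.00685534 0.00662042 0.00709984 0.00699035
0.00676422 0.00702015 0.00673056 0.00712065 0.00695963 0.00703757 0.00696171 0.00701772
0.00682509 0.00709918 0.00704923 0.00718651 0.00713319 0.00719681 0.00736942 0.00710319
""".split()]

# Absolute and relative (to the CPU value) differences per element.
abs_diffs = [abs(c - v) for c, v in zip(cpu, vulkan)]
rel_diffs = [d / abs(c) for d, c in zip(abs_diffs, cpu)]

worst = max(range(len(abs_diffs)), key=abs_diffs.__getitem__)
print("largest abs diff:", abs_diffs[worst], "at index", worst)
print("largest rel diff:", max(rel_diffs))
```

For these 32 values the largest gap is about 3.9e-4, at index 10, so where to set a tolerance depends on how much quantization error is considered acceptable.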

@DifferentialityDevelopment
Contributor Author

I set up Linux on another partition so that I could get native int8 GPU functionality without WSL ruining the party.

Then I ran some speed tests of a single pass. The shader runs fairly well, though at very low matrix dimensions the CPU is actually faster. I haven't yet been able to test large enough dimensions because of a segfault whose cause I haven't figured out yet.

n: number of rows
d: number of columns

n = 512, d = 256
CPU matmulQ40Q80 - Duration: 0.015148 ms
Shader execution time: 0.117442 ms
Vulkan matmulQ40Q80 - Duration: 0.965542 ms

n = 512, d = 1024
CPU matmulQ40Q80 - Duration: 0.07343 ms
Shader execution time: 0.207296 ms
Vulkan matmulQ40Q80 - Duration: 1.44746 ms

n = 512, d = 3072
CPU matmulQ40Q80 - Duration: 0.184752 ms
Shader execution time: 0.292468 ms
Vulkan matmulQ40Q80 - Duration: 2.18394 ms
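
The three runs can be broken down with a little arithmetic. The Python sketch below assumes that the "Vulkan matmulQ40Q80 - Duration" figure is the total time of the call, including buffer setup and copies, and that "Shader execution time" is only the dispatch itself. That interpretation is not stated in the thread, so the "overhead" column is an estimate under that assumption.

```python
# (n, d, cpu_ms, shader_ms, vulkan_total_ms) copied from the timings above.
runs = [
    (512, 256, 0.015148, 0.117442, 0.965542),
    (512, 1024, 0.07343, 0.207296, 1.44746),
    (512, 3072, 0.184752, 0.292468, 2.18394),
]

breakdown = []
for n, d, cpu_ms, shader_ms, total_ms in runs:
    overhead_ms = total_ms - shader_ms      # time spent outside the shader (assumed)
    overhead_share = overhead_ms / total_ms
    shader_vs_cpu = shader_ms / cpu_ms      # > 1 means the CPU is faster
    breakdown.append((overhead_ms, overhead_share, shader_vs_cpu))
    print(f"n={n} d={d}: overhead {overhead_ms:.3f} ms "
          f"({overhead_share:.0%} of total), shader/CPU ratio {shader_vs_cpu:.2f}")
```

Under that reading, most of each Vulkan call is spent outside the shader, and the shader-to-CPU ratio shrinks as d grows, which is consistent with the remark that the CPU only wins at very low dimensions.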

@unclemusclez

Raspberry Pi 3B+, Ubuntu Server 22.04, Vulkan information:

ubuntu@ubuntu:~$ vulkaninfo
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Failed to CreateInstance in ICD 4.  Skipping ICD.
'DISPLAY' environment variable not set... skipping surface info
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.204


Instance Extensions: count = 20
===============================
        VK_EXT_acquire_drm_display             : extension revision 1
        VK_EXT_acquire_xlib_display            : extension revision 1
        VK_EXT_debug_report                    : extension revision 10
        VK_EXT_debug_utils                     : extension revision 2
        VK_EXT_direct_mode_display             : extension revision 1
        VK_EXT_display_surface_counter         : extension revision 1
        VK_EXT_swapchain_colorspace            : extension revision 4
        VK_KHR_device_group_creation           : extension revision 1
        VK_KHR_display                         : extension revision 23
        VK_KHR_external_fence_capabilities     : extension revision 1
        VK_KHR_external_memory_capabilities    : extension revision 1
        VK_KHR_external_semaphore_capabilities : extension revision 1
        VK_KHR_get_display_properties2         : extension revision 1
        VK_KHR_get_physical_device_properties2 : extension revision 2
        VK_KHR_get_surface_capabilities2       : extension revision 1
        VK_KHR_surface                         : extension revision 25
        VK_KHR_surface_protected_capabilities  : extension revision 1
        VK_KHR_wayland_surface                 : extension revision 6
        VK_KHR_xcb_surface                     : extension revision 6
        VK_KHR_xlib_surface                    : extension revision 6

Layers: count = 2
=================
VK_LAYER_MESA_device_select (Linux device selection layer) Vulkan version 1.3.211, layer version 1:
        Layer Extensions: count = 0
        Devices: count = 1
                GPU id = 0 (llvmpipe (LLVM 15.0.7, 128 bits))
                Layer-Device Extensions: count = 0

VK_LAYER_MESA_overlay (Mesa Overlay layer) Vulkan version 1.3.211, layer version 1:
        Layer Extensions: count = 0
        Devices: count = 1
                GPU id = 0 (llvmpipe (LLVM 15.0.7, 128 bits))
                Layer-Device Extensions: count = 0

Device Groups:
==============
Group 0:
        Properties:
                physicalDevices: count = 1
                        llvmpipe (LLVM 15.0.7, 128 bits) (ID: 0)
                subsetAllocation = 0

        Present Capabilities:
                llvmpipe (LLVM 15.0.7, 128 bits) (ID: 0):
                        Can present images from the following devices: count = 1
                                llvmpipe (LLVM 15.0.7, 128 bits) (ID: 0)
                Present modes: count = 1
                        DEVICE_GROUP_PRESENT_MODE_LOCAL_BIT_KHR


Device Properties and Extensions:
=================================
GPU0:
VkPhysicalDeviceProperties:
---------------------------
        apiVersion        = 4206847 (1.3.255)
        driverVersion     = 1 (0x0001)
        vendorID          = 0x10005
        deviceID          = 0x0000
        deviceType        = PHYSICAL_DEVICE_TYPE_CPU
        deviceName        = llvmpipe (LLVM 15.0.7, 128 bits)
        pipelineCacheUUID = 32332e32-2e31-2d31-7562-756e7475332e

VkPhysicalDeviceLimits:
-----------------------
        maxImageDimension1D                             = 16384
        maxImageDimension2D                             = 16384
        maxImageDimension3D                             = 4096
        maxImageDimensionCube                           = 32768
        maxImageArrayLayers                             = 2048
        maxTexelBufferElements                          = 134217728
        maxUniformBufferRange                           = 65536
        maxStorageBufferRange                           = 134217728
        maxPushConstantsSize                            = 256
        maxMemoryAllocationCount                        = 4294967295
        maxSamplerAllocationCount                       = 32768
        bufferImageGranularity                          = 0x00000040
        sparseAddressSpaceSize                          = 0x00000000
        maxBoundDescriptorSets                          = 8
        maxPerStageDescriptorSamplers                   = 1000000
        maxPerStageDescriptorUniformBuffers             = 1000000
        maxPerStageDescriptorStorageBuffers             = 1000000
        maxPerStageDescriptorSampledImages              = 1000000
        maxPerStageDescriptorStorageImages              = 1000000
        maxPerStageDescriptorInputAttachments           = 1000000
        maxPerStageResources                            = 1000000
        maxDescriptorSetSamplers                        = 1000000
        maxDescriptorSetUniformBuffers                  = 1000000
        maxDescriptorSetUniformBuffersDynamic           = 1000000
        maxDescriptorSetStorageBuffers                  = 1000000
        maxDescriptorSetStorageBuffersDynamic           = 1000000
        maxDescriptorSetSampledImages                   = 1000000
        maxDescriptorSetStorageImages                   = 1000000
        maxDescriptorSetInputAttachments                = 1000000
        maxVertexInputAttributes                        = 32
        maxVertexInputBindings                          = 32
        maxVertexInputAttributeOffset                   = 2047
        maxVertexInputBindingStride                     = 2048
        maxVertexOutputComponents                       = 128
        maxTessellationGenerationLevel                  = 64
        maxTessellationPatchSize                        = 32
        maxTessellationControlPerVertexInputComponents  = 128
        maxTessellationControlPerVertexOutputComponents = 128
        maxTessellationControlPerPatchOutputComponents  = 128
        maxTessellationControlTotalOutputComponents     = 4096
        maxTessellationEvaluationInputComponents        = 128
        maxTessellationEvaluationOutputComponents       = 128
        maxGeometryShaderInvocations                    = 32
        maxGeometryInputComponents                      = 64
        maxGeometryOutputComponents                     = 128
        maxGeometryOutputVertices                       = 1024
        maxGeometryTotalOutputComponents                = 1024
        maxFragmentInputComponents                      = 128
        maxFragmentOutputAttachments                    = 8
        maxFragmentDualSrcAttachments                   = 2
        maxFragmentCombinedOutputResources              = 104
        maxComputeSharedMemorySize                      = 32768
        maxComputeWorkGroupCount: count = 3
                65535
                65535
                65535
        maxComputeWorkGroupInvocations                  = 1024
        maxComputeWorkGroupSize: count = 3
                1024
                1024
                1024
        subPixelPrecisionBits                           = 8
        subTexelPrecisionBits                           = 8
        mipmapPrecisionBits                             = 4
        maxDrawIndexedIndexValue                        = 4294967295
        maxDrawIndirectCount                            = 4294967295
        maxSamplerLodBias                               = 16
        maxSamplerAnisotropy                            = 16
        maxViewports                                    = 16
        maxViewportDimensions: count = 2
                16384
                16384
        viewportBoundsRange: count = 2
                -32768
                32768
        viewportSubPixelBits                            = 0
        minMemoryMapAlignment                           = 64
        minTexelBufferOffsetAlignment                   = 0x00000010
        minUniformBufferOffsetAlignment                 = 0x00000010
        minStorageBufferOffsetAlignment                 = 0x00000010
        minTexelOffset                                  = -32
        maxTexelOffset                                  = 31
        minTexelGatherOffset                            = -32
        maxTexelGatherOffset                            = 31
        minInterpolationOffset                          = -2
        maxInterpolationOffset                          = 2
        subPixelInterpolationOffsetBits                 = 8
        maxFramebufferWidth                             = 16384
        maxFramebufferHeight                            = 16384
        maxFramebufferLayers                            = 2048
        framebufferColorSampleCounts: count = 2
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
        framebufferDepthSampleCounts: count = 2
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
        framebufferStencilSampleCounts: count = 2
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
        framebufferNoAttachmentsSampleCounts: count = 2
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
        maxColorAttachments                             = 8
        sampledImageColorSampleCounts: count = 2
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
        sampledImageIntegerSampleCounts: count = 2
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
        sampledImageDepthSampleCounts: count = 2
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
        sampledImageStencilSampleCounts: count = 2
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
        storageImageSampleCounts: count = 2
                SAMPLE_COUNT_1_BIT
                SAMPLE_COUNT_4_BIT
        maxSampleMaskWords                              = 1
        timestampComputeAndGraphics                     = true
        timestampPeriod                                 = 1
        maxClipDistances                                = 8
        maxCullDistances                                = 8
        maxCombinedClipAndCullDistances                 = 8
        discreteQueuePriorities                         = 2
        pointSizeRange: count = 2
                0
                255
        lineWidthRange: count = 2
                1
                255
        pointSizeGranularity                            = 0.125
        lineWidthGranularity                            = 0.0078125
        strictLines                                     = true
        standardSampleLocations                         = true
        optimalBufferCopyOffsetAlignment                = 0x00000080
        optimalBufferCopyRowPitchAlignment              = 0x00000080
        nonCoherentAtomSize                             = 0x00000040

VkPhysicalDeviceSparseProperties:
---------------------------------
        residencyStandard2DBlockShape            = false
        residencyStandard2DMultisampleBlockShape = false
        residencyStandard3DBlockShape            = false
        residencyAlignedMipSize                  = false
        residencyNonResidentStrict               = false

VkPhysicalDeviceCustomBorderColorPropertiesEXT:
-----------------------------------------------
        maxCustomBorderColorSamplers = 32768

VkPhysicalDeviceDepthStencilResolveProperties:
----------------------------------------------
        supportedDepthResolveModes: count = 2
                RESOLVE_MODE_SAMPLE_ZERO_BIT
                RESOLVE_MODE_AVERAGE_BIT
        supportedStencilResolveModes: count = 1
                RESOLVE_MODE_SAMPLE_ZERO_BIT
        independentResolveNone = false
        independentResolve     = false

VkPhysicalDeviceDescriptorIndexingProperties:
---------------------------------------------
        maxUpdateAfterBindDescriptorsInAllPools              = 4294967295
        shaderUniformBufferArrayNonUniformIndexingNative     = true
        shaderSampledImageArrayNonUniformIndexingNative      = true
        shaderStorageBufferArrayNonUniformIndexingNative     = true
        shaderStorageImageArrayNonUniformIndexingNative      = true
        shaderInputAttachmentArrayNonUniformIndexingNative   = true
        robustBufferAccessUpdateAfterBind                    = true
        quadDivergentImplicitLod                             = true
        maxPerStageDescriptorUpdateAfterBindSamplers         = 1000000
        maxPerStageDescriptorUpdateAfterBindUniformBuffers   = 1000000
        maxPerStageDescriptorUpdateAfterBindStorageBuffers   = 1000000
        maxPerStageDescriptorUpdateAfterBindSampledImages    = 1000000
        maxPerStageDescriptorUpdateAfterBindStorageImages    = 1000000
        maxPerStageDescriptorUpdateAfterBindInputAttachments = 1000000
        maxPerStageUpdateAfterBindResources                  = 1000000
        maxDescriptorSetUpdateAfterBindSamplers              = 1000000
        maxDescriptorSetUpdateAfterBindUniformBuffers        = 1000000
        maxDescriptorSetUpdateAfterBindUniformBuffersDynamic = 1000000
        maxDescriptorSetUpdateAfterBindStorageBuffers        = 1000000
        maxDescriptorSetUpdateAfterBindStorageBuffersDynamic = 1000000
        maxDescriptorSetUpdateAfterBindSampledImages         = 1000000
        maxDescriptorSetUpdateAfterBindStorageImages         = 1000000
        maxDescriptorSetUpdateAfterBindInputAttachments      = 1000000

VkPhysicalDeviceDriverProperties:
---------------------------------
        driverID           = DRIVER_ID_MESA_LLVMPIPE
        driverName         = llvmpipe
        driverInfo         = Mesa 23.2.1-1ubuntu3.1~22.04.2 (LLVM 15.0.7)
        conformanceVersion = 1.3.1.1

VkPhysicalDeviceExternalMemoryHostPropertiesEXT:
------------------------------------------------
        minImportedHostPointerAlignment = 0x00001000

VkPhysicalDeviceFloatControlsProperties:
----------------------------------------
        denormBehaviorIndependence            = SHADER_FLOAT_CONTROLS_INDEPENDENCE_ALL
        roundingModeIndependence              = SHADER_FLOAT_CONTROLS_INDEPENDENCE_ALL
        shaderSignedZeroInfNanPreserveFloat16 = true
        shaderSignedZeroInfNanPreserveFloat32 = true
        shaderSignedZeroInfNanPreserveFloat64 = true
        shaderDenormPreserveFloat16           = false
        shaderDenormPreserveFloat32           = false
        shaderDenormPreserveFloat64           = false
        shaderDenormFlushToZeroFloat16        = false
        shaderDenormFlushToZeroFloat32        = false
        shaderDenormFlushToZeroFloat64        = false
        shaderRoundingModeRTEFloat16          = true
        shaderRoundingModeRTEFloat32          = true
        shaderRoundingModeRTEFloat64          = true
        shaderRoundingModeRTZFloat16          = false
        shaderRoundingModeRTZFloat32          = false
        shaderRoundingModeRTZFloat64          = false

VkPhysicalDeviceIDProperties:
-----------------------------
        deviceUUID      = 6d657361-3233-2e32-2e31-2d3175627500
        driverUUID      = 6c6c766d-7069-7065-5555-494400000000
        deviceNodeMask  = 0
        deviceLUIDValid = false

VkPhysicalDeviceInlineUniformBlockProperties:
---------------------------------------------
        maxInlineUniformBlockSize                               = 4096
        maxPerStageDescriptorInlineUniformBlocks                = 8
        maxPerStageDescriptorUpdateAfterBindInlineUniformBlocks = 8
        maxDescriptorSetInlineUniformBlocks                     = 8
        maxDescriptorSetUpdateAfterBindInlineUniformBlocks      = 8

VkPhysicalDeviceLineRasterizationPropertiesEXT:
-----------------------------------------------
        lineSubPixelPrecisionBits = 8

VkPhysicalDeviceMaintenance3Properties:
---------------------------------------
        maxPerSetDescriptors    = 1000000
        maxMemoryAllocationSize = 0x80000000

VkPhysicalDeviceMaintenance4Properties:
---------------------------------------
        maxBufferSize = 0xffffffff

VkPhysicalDeviceMultiDrawPropertiesEXT:
---------------------------------------
        maxMultiDrawCount = 2048

VkPhysicalDeviceMultiviewProperties:
------------------------------------
        maxMultiviewViewCount     = 6
        maxMultiviewInstanceIndex = 2147483647

VkPhysicalDevicePointClippingProperties:
----------------------------------------
        pointClippingBehavior = POINT_CLIPPING_BEHAVIOR_ALL_CLIP_PLANES

VkPhysicalDeviceProtectedMemoryProperties:
------------------------------------------
        protectedNoFault = false

VkPhysicalDeviceProvokingVertexPropertiesEXT:
---------------------------------------------
        provokingVertexModePerPipeline                       = true
        transformFeedbackPreservesTriangleFanProvokingVertex = true

VkPhysicalDevicePushDescriptorPropertiesKHR:
--------------------------------------------
        maxPushDescriptors = 32

VkPhysicalDeviceRobustness2PropertiesEXT:
-----------------------------------------
        robustStorageBufferAccessSizeAlignment = 0x00000001
        robustUniformBufferAccessSizeAlignment = 0x00000001

VkPhysicalDeviceSamplerFilterMinmaxProperties:
----------------------------------------------
        filterMinmaxSingleComponentFormats = true
        filterMinmaxImageComponentMapping  = true

VkPhysicalDeviceShaderIntegerDotProductProperties:
--------------------------------------------------
        integerDotProduct8BitUnsignedAccelerated                                      = false
        integerDotProduct8BitSignedAccelerated                                        = false
        integerDotProduct8BitMixedSignednessAccelerated                               = false
        integerDotProduct4x8BitPackedUnsignedAccelerated                              = false
        integerDotProduct4x8BitPackedSignedAccelerated                                = false
        integerDotProduct4x8BitPackedMixedSignednessAccelerated                       = false
        integerDotProduct16BitUnsignedAccelerated                                     = false
        integerDotProduct16BitSignedAccelerated                                       = false
        integerDotProduct16BitMixedSignednessAccelerated                              = false
        integerDotProduct32BitUnsignedAccelerated                                     = false
        integerDotProduct32BitSignedAccelerated                                       = false
        integerDotProduct32BitMixedSignednessAccelerated                              = false
        integerDotProduct64BitUnsignedAccelerated                                     = false
        integerDotProduct64BitSignedAccelerated                                       = false
        integerDotProduct64BitMixedSignednessAccelerated                              = false
        integerDotProductAccumulatingSaturating8BitUnsignedAccelerated                = false
        integerDotProductAccumulatingSaturating8BitSignedAccelerated                  = false
        integerDotProductAccumulatingSaturating8BitMixedSignednessAccelerated         = false
        integerDotProductAccumulatingSaturating4x8BitPackedUnsignedAccelerated        = false
        integerDotProductAccumulatingSaturating4x8BitPackedSignedAccelerated          = false
        integerDotProductAccumulatingSaturating4x8BitPackedMixedSignednessAccelerated = false
        integerDotProductAccumulatingSaturating16BitUnsignedAccelerated               = false
        integerDotProductAccumulatingSaturating16BitSignedAccelerated                 = false
        integerDotProductAccumulatingSaturating16BitMixedSignednessAccelerated        = false
        integerDotProductAccumulatingSaturating32BitUnsignedAccelerated               = false
        integerDotProductAccumulatingSaturating32BitSignedAccelerated                 = false
        integerDotProductAccumulatingSaturating32BitMixedSignednessAccelerated        = false
        integerDotProductAccumulatingSaturating64BitUnsignedAccelerated               = false
        integerDotProductAccumulatingSaturating64BitSignedAccelerated                 = false
        integerDotProductAccumulatingSaturating64BitMixedSignednessAccelerated        = false

VkPhysicalDeviceSubgroupProperties:
-----------------------------------
        subgroupSize              = 4
        supportedStages: count = 6
                SHADER_STAGE_FRAGMENT_BIT
                SHADER_STAGE_COMPUTE_BIT
                SHADER_STAGE_ALL_GRAPHICS
                SHADER_STAGE_ALL
                SHADER_STAGE_TASK_BIT_NV
                SHADER_STAGE_MESH_BIT_NV
        supportedOperations: count = 7
                SUBGROUP_FEATURE_BASIC_BIT
                SUBGROUP_FEATURE_VOTE_BIT
                SUBGROUP_FEATURE_ARITHMETIC_BIT
                SUBGROUP_FEATURE_BALLOT_BIT
                SUBGROUP_FEATURE_SHUFFLE_BIT
                SUBGROUP_FEATURE_SHUFFLE_RELATIVE_BIT
                SUBGROUP_FEATURE_QUAD_BIT
        quadOperationsInAllStages = false

VkPhysicalDeviceSubgroupSizeControlProperties:
----------------------------------------------
        minSubgroupSize              = 4
        maxSubgroupSize              = 4
        maxComputeWorkgroupSubgroups = 32
        requiredSubgroupSizeStages: count = 4
                SHADER_STAGE_FRAGMENT_BIT
                SHADER_STAGE_COMPUTE_BIT
                SHADER_STAGE_ALL_GRAPHICS
                SHADER_STAGE_ALL

VkPhysicalDeviceTexelBufferAlignmentProperties:
-----------------------------------------------
        storageTexelBufferOffsetAlignmentBytes       = 0x00000010
        storageTexelBufferOffsetSingleTexelAlignment = true
        uniformTexelBufferOffsetAlignmentBytes       = 0x00000010
        uniformTexelBufferOffsetSingleTexelAlignment = true

VkPhysicalDeviceTimelineSemaphoreProperties:
--------------------------------------------
        maxTimelineSemaphoreValueDifference = 18446744073709551615

VkPhysicalDeviceTransformFeedbackPropertiesEXT:
-----------------------------------------------
        maxTransformFeedbackStreams                = 4
        maxTransformFeedbackBuffers                = 4
        maxTransformFeedbackBufferSize             = 0xffffffff
        maxTransformFeedbackStreamDataSize         = 512
        maxTransformFeedbackBufferDataSize         = 512
        maxTransformFeedbackBufferDataStride       = 512
        transformFeedbackQueries                   = true
        transformFeedbackStreamsLinesTriangles     = false
        transformFeedbackRasterizationStreamSelect = false
        transformFeedbackDraw                      = true

VkPhysicalDeviceVertexAttributeDivisorPropertiesEXT:
----------------------------------------------------
        maxVertexAttribDivisor = 4294967295

VkPhysicalDeviceVulkan11Properties:
-----------------------------------
        deviceUUID                        = 6d657361-3233-2e32-2e31-2d3175627500
        driverUUID                        = 6c6c766d-7069-7065-5555-494400000000
        deviceNodeMask                    = 0
        deviceLUIDValid                   = false
        subgroupSize                      = 4
        subgroupSupportedStages: count = 6
                SHADER_STAGE_FRAGMENT_BIT
                SHADER_STAGE_COMPUTE_BIT
                SHADER_STAGE_ALL_GRAPHICS
                SHADER_STAGE_ALL
                SHADER_STAGE_TASK_BIT_NV
                SHADER_STAGE_MESH_BIT_NV
        subgroupSupportedOperations: count = 7
                SUBGROUP_FEATURE_BASIC_BIT
                SUBGROUP_FEATURE_VOTE_BIT
                SUBGROUP_FEATURE_ARITHMETIC_BIT
                SUBGROUP_FEATURE_BALLOT_BIT
                SUBGROUP_FEATURE_SHUFFLE_BIT
                SUBGROUP_FEATURE_SHUFFLE_RELATIVE_BIT
                SUBGROUP_FEATURE_QUAD_BIT
        subgroupQuadOperationsInAllStages = false
        pointClippingBehavior             = POINT_CLIPPING_BEHAVIOR_ALL_CLIP_PLANES
        maxMultiviewViewCount             = 6
        maxMultiviewInstanceIndex         = 2147483647
        protectedNoFault                  = false
        maxPerSetDescriptors              = 1000000
        maxMemoryAllocationSize           = 0x80000000

VkPhysicalDeviceVulkan12Properties:
-----------------------------------
        driverID                                             = DRIVER_ID_MESA_LLVMPIPE
        driverName                                           = llvmpipe
        driverInfo                                           = Mesa 23.2.1-1ubuntu3.1~22.04.2 (LLVM 15.0.7)
        conformanceVersion                                   = 1.3.1.1
        denormBehaviorIndependence                           = SHADER_FLOAT_CONTROLS_INDEPENDENCE_ALL
        roundingModeIndependence                             = SHADER_FLOAT_CONTROLS_INDEPENDENCE_ALL
        shaderSignedZeroInfNanPreserveFloat16                = true
        shaderSignedZeroInfNanPreserveFloat32                = true
        shaderSignedZeroInfNanPreserveFloat64                = true
        shaderDenormPreserveFloat16                          = false
        shaderDenormPreserveFloat32                          = false
        shaderDenormPreserveFloat64                          = false
        shaderDenormFlushToZeroFloat16                       = false
        shaderDenormFlushToZeroFloat32                       = false
        shaderDenormFlushToZeroFloat64                       = false
        shaderRoundingModeRTEFloat16                         = true
        shaderRoundingModeRTEFloat32                         = true
        shaderRoundingModeRTEFloat64                         = true
        shaderRoundingModeRTZFloat16                         = false
        shaderRoundingModeRTZFloat32                         = false
        shaderRoundingModeRTZFloat64                         = false
        maxUpdateAfterBindDescriptorsInAllPools              = 4294967295
        shaderUniformBufferArrayNonUniformIndexingNative     = true
        shaderSampledImageArrayNonUniformIndexingNative      = true
        shaderStorageBufferArrayNonUniformIndexingNative     = true
        shaderStorageImageArrayNonUniformIndexingNative      = true
        shaderInputAttachmentArrayNonUniformIndexingNative   = true
        robustBufferAccessUpdateAfterBind                    = true
        quadDivergentImplicitLod                             = true
        maxPerStageDescriptorUpdateAfterBindSamplers         = 1000000
        maxPerStageDescriptorUpdateAfterBindUniformBuffers   = 1000000
        maxPerStageDescriptorUpdateAfterBindStorageBuffers   = 1000000
        maxPerStageDescriptorUpdateAfterBindSampledImages    = 1000000
        maxPerStageDescriptorUpdateAfterBindStorageImages    = 1000000
        maxPerStageDescriptorUpdateAfterBindInputAttachments = 1000000
        maxPerStageUpdateAfterBindResources                  = 1000000
        maxDescriptorSetUpdateAfterBindSamplers              = 1000000
        maxDescriptorSetUpdateAfterBindUniformBuffers        = 1000000
        maxDescriptorSetUpdateAfterBindUniformBuffersDynamic = 1000000
        maxDescriptorSetUpdateAfterBindStorageBuffers        = 1000000
        maxDescriptorSetUpdateAfterBindStorageBuffersDynamic = 1000000
        maxDescriptorSetUpdateAfterBindSampledImages         = 1000000
        maxDescriptorSetUpdateAfterBindStorageImages         = 1000000
        maxDescriptorSetUpdateAfterBindInputAttachments      = 1000000
        supportedDepthResolveModes: count = 2
                RESOLVE_MODE_SAMPLE_ZERO_BIT
                RESOLVE_MODE_AVERAGE_BIT
        supportedStencilResolveModes: count = 1
                RESOLVE_MODE_SAMPLE_ZERO_BIT
        independentResolveNone                               = false
        independentResolve                                   = false
        filterMinmaxSingleComponentFormats                   = true
        filterMinmaxImageComponentMapping                    = true
        maxTimelineSemaphoreValueDifference                  = 18446744073709551615
        framebufferIntegerColorSampleCounts: count = 1
                SAMPLE_COUNT_1_BIT

VkPhysicalDeviceVulkan13Properties:
-----------------------------------
        minSubgroupSize                                                               = 4
        maxSubgroupSize                                                               = 4
        maxComputeWorkgroupSubgroups                                                  = 32
        requiredSubgroupSizeStages: count = 4
                SHADER_STAGE_FRAGMENT_BIT
                SHADER_STAGE_COMPUTE_BIT
                SHADER_STAGE_ALL_GRAPHICS
                SHADER_STAGE_ALL
        maxInlineUniformBlockSize                                                     = 4096
        maxPerStageDescriptorInlineUniformBlocks                                      = 8
        maxPerStageDescriptorUpdateAfterBindInlineUniformBlocks                       = 8
        maxDescriptorSetInlineUniformBlocks                                           = 8
        maxDescriptorSetUpdateAfterBindInlineUniformBlocks                            = 8
        maxInlineUniformTotalSize                                                     = 262144
        integerDotProduct8BitUnsignedAccelerated                                      = false
        integerDotProduct8BitSignedAccelerated                                        = false
        integerDotProduct8BitMixedSignednessAccelerated                               = false
        integerDotProduct4x8BitPackedUnsignedAccelerated                              = false
        integerDotProduct4x8BitPackedSignedAccelerated                                = false
        integerDotProduct4x8BitPackedMixedSignednessAccelerated                       = false
        integerDotProduct16BitUnsignedAccelerated                                     = false
        integerDotProduct16BitSignedAccelerated                                       = false
        integerDotProduct16BitMixedSignednessAccelerated                              = false
        integerDotProduct32BitUnsignedAccelerated                                     = false
        integerDotProduct32BitSignedAccelerated                                       = false
        integerDotProduct32BitMixedSignednessAccelerated                              = false
        integerDotProduct64BitUnsignedAccelerated                                     = false
        integerDotProduct64BitSignedAccelerated                                       = false
        integerDotProduct64BitMixedSignednessAccelerated                              = false
        integerDotProductAccumulatingSaturating8BitUnsignedAccelerated                = false
        integerDotProductAccumulatingSaturating8BitSignedAccelerated                  = false
        integerDotProductAccumulatingSaturating8BitMixedSignednessAccelerated         = false
        integerDotProductAccumulatingSaturating4x8BitPackedUnsignedAccelerated        = false
        integerDotProductAccumulatingSaturating4x8BitPackedSignedAccelerated          = false
        integerDotProductAccumulatingSaturating4x8BitPackedMixedSignednessAccelerated = false
        integerDotProductAccumulatingSaturating16BitUnsignedAccelerated               = false
        integerDotProductAccumulatingSaturating16BitSignedAccelerated                 = false
        integerDotProductAccumulatingSaturating16BitMixedSignednessAccelerated        = false
        integerDotProductAccumulatingSaturating32BitUnsignedAccelerated               = false
        integerDotProductAccumulatingSaturating32BitSignedAccelerated                 = false
        integerDotProductAccumulatingSaturating32BitMixedSignednessAccelerated        = false
        integerDotProductAccumulatingSaturating64BitUnsignedAccelerated               = false
        integerDotProductAccumulatingSaturating64BitSignedAccelerated                 = false
        integerDotProductAccumulatingSaturating64BitMixedSignednessAccelerated        = false
        storageTexelBufferOffsetAlignmentBytes                                        = 0x00000010
        storageTexelBufferOffsetSingleTexelAlignment                                  = true
        uniformTexelBufferOffsetAlignmentBytes                                        = 0x00000010
        uniformTexelBufferOffsetSingleTexelAlignment                                  = true
        maxBufferSize                                                                 = 0xffffffff


Device Extensions: count = 114
        VK_ARM_rasterization_order_attachment_access  : extension revision 1
        VK_EXT_4444_formats                           : extension revision 1
        VK_EXT_attachment_feedback_loop_dynamic_state : extension revision 1
        VK_EXT_attachment_feedback_loop_layout        : extension revision 2
        VK_EXT_border_color_swizzle                   : extension revision 1
        VK_EXT_calibrated_timestamps                  : extension revision 2
        VK_EXT_color_write_enable                     : extension revision 1
        VK_EXT_conditional_rendering                  : extension revision 2
        VK_EXT_custom_border_color                    : extension revision 12
        VK_EXT_depth_clip_control                     : extension revision 1
        VK_EXT_depth_clip_enable                      : extension revision 1
        VK_EXT_depth_range_unrestricted               : extension revision 1
        VK_EXT_descriptor_buffer                      : extension revision 1
        VK_EXT_descriptor_indexing                    : extension revision 2
        VK_EXT_dynamic_rendering_unused_attachments   : extension revision 1
        VK_EXT_extended_dynamic_state                 : extension revision 1
        VK_EXT_extended_dynamic_state2                : extension revision 1
        VK_EXT_extended_dynamic_state3                : extension revision 2
        VK_EXT_external_memory_host                   : extension revision 1
        VK_EXT_graphics_pipeline_library              : extension revision 1
        VK_EXT_host_query_reset                       : extension revision 1
        VK_EXT_image_2d_view_of_3d                    : extension revision 1
        VK_EXT_image_robustness                       : extension revision 1
        VK_EXT_image_sliced_view_of_3d                : extension revision 1
        VK_EXT_index_type_uint8                       : extension revision 1
        VK_EXT_inline_uniform_block                   : extension revision 1
        VK_EXT_line_rasterization                     : extension revision 1
        VK_EXT_memory_budget                          : extension revision 1
        VK_EXT_memory_priority                        : extension revision 1
        VK_EXT_mesh_shader                            : extension revision 1
        VK_EXT_multi_draw                             : extension revision 1
        VK_EXT_multisampled_render_to_single_sampled  : extension revision 1
        VK_EXT_mutable_descriptor_type                : extension revision 1
        VK_EXT_non_seamless_cube_map                  : extension revision 1
        VK_EXT_pageable_device_local_memory           : extension revision 1
        VK_EXT_pipeline_creation_cache_control        : extension revision 3
        VK_EXT_pipeline_creation_feedback             : extension revision 1
        VK_EXT_post_depth_coverage                    : extension revision 1
        VK_EXT_primitive_topology_list_restart        : extension revision 1
        VK_EXT_primitives_generated_query             : extension revision 1
        VK_EXT_private_data                           : extension revision 1
        VK_EXT_provoking_vertex                       : extension revision 1
        VK_EXT_rasterization_order_attachment_access  : extension revision 1
        VK_EXT_robustness2                            : extension revision 1
        VK_EXT_sampler_filter_minmax                  : extension revision 2
        VK_EXT_scalar_block_layout                    : extension revision 1
        VK_EXT_separate_stencil_usage                 : extension revision 1
        VK_EXT_shader_atomic_float                    : extension revision 1
        VK_EXT_shader_atomic_float2                   : extension revision 1
        VK_EXT_shader_demote_to_helper_invocation     : extension revision 1
        VK_EXT_shader_object                          : extension revision 1
        VK_EXT_shader_stencil_export                  : extension revision 1
        VK_EXT_shader_subgroup_ballot                 : extension revision 1
        VK_EXT_shader_subgroup_vote                   : extension revision 1
        VK_EXT_shader_viewport_index_layer            : extension revision 1
        VK_EXT_subgroup_size_control                  : extension revision 2
        VK_EXT_texel_buffer_alignment                 : extension revision 1
        VK_EXT_transform_feedback                     : extension revision 1
        VK_EXT_vertex_attribute_divisor               : extension revision 3
        VK_EXT_vertex_input_dynamic_state             : extension revision 2
        VK_GOOGLE_decorate_string                     : extension revision 1
        VK_GOOGLE_hlsl_functionality1                 : extension revision 1
        VK_KHR_16bit_storage                          : extension revision 1
        VK_KHR_8bit_storage                           : extension revision 1
        VK_KHR_bind_memory2                           : extension revision 1
        VK_KHR_buffer_device_address                  : extension revision 1
        VK_KHR_copy_commands2                         : extension revision 1
        VK_KHR_create_renderpass2                     : extension revision 1
        VK_KHR_dedicated_allocation                   : extension revision 3
        VK_KHR_depth_stencil_resolve                  : extension revision 1
        VK_KHR_descriptor_update_template             : extension revision 1
        VK_KHR_device_group                           : extension revision 4
        VK_KHR_draw_indirect_count                    : extension revision 1
        VK_KHR_driver_properties                      : extension revision 1
        VK_KHR_dynamic_rendering                      : extension revision 1
        VK_KHR_external_fence                         : extension revision 1
        VK_KHR_external_memory                        : extension revision 1
        VK_KHR_external_memory_fd                     : extension revision 1
        VK_KHR_external_semaphore                     : extension revision 1
        VK_KHR_format_feature_flags2                  : extension revision 2
        VK_KHR_get_memory_requirements2               : extension revision 1
        VK_KHR_image_format_list                      : extension revision 1
        VK_KHR_imageless_framebuffer                  : extension revision 1
        VK_KHR_incremental_present                    : extension revision 2
        VK_KHR_maintenance1                           : extension revision 2
        VK_KHR_maintenance2                           : extension revision 1
        VK_KHR_maintenance3                           : extension revision 1
        VK_KHR_maintenance4                           : extension revision 2
        VK_KHR_multiview                              : extension revision 1
        VK_KHR_pipeline_library                       : extension revision 1
        VK_KHR_push_descriptor                        : extension revision 2
        VK_KHR_relaxed_block_layout                   : extension revision 1
        VK_KHR_sampler_mirror_clamp_to_edge           : extension revision 3
        VK_KHR_separate_depth_stencil_layouts         : extension revision 1
        VK_KHR_shader_atomic_int64                    : extension revision 1
        VK_KHR_shader_clock                           : extension revision 1
        VK_KHR_shader_draw_parameters                 : extension revision 1
        VK_KHR_shader_float16_int8                    : extension revision 1
        VK_KHR_shader_float_controls                  : extension revision 4
        VK_KHR_shader_integer_dot_product             : extension revision 1
        VK_KHR_shader_non_semantic_info               : extension revision 1
        VK_KHR_shader_subgroup_extended_types         : extension revision 1
        VK_KHR_shader_terminate_invocation            : extension revision 1
        VK_KHR_spirv_1_4                              : extension revision 1
        VK_KHR_storage_buffer_storage_class           : extension revision 1
        VK_KHR_swapchain                              : extension revision 70
        VK_KHR_swapchain_mutable_format               : extension revision 1
        VK_KHR_synchronization2                       : extension revision 1
        VK_KHR_timeline_semaphore                     : extension revision 2
        VK_KHR_uniform_buffer_standard_layout         : extension revision 1
        VK_KHR_variable_pointers                      : extension revision 1
        VK_KHR_vulkan_memory_model                    : extension revision 3
        VK_KHR_zero_initialize_workgroup_memory       : extension revision 1
        VK_NV_device_generated_commands               : extension revision 3

VkQueueFamilyProperties:
========================
        queueProperties[0]:
        -------------------
                minImageTransferGranularity = (1,1,1)
                queueCount                  = 1
                queueFlags                  = QUEUE_GRAPHICS | QUEUE_COMPUTE | QUEUE_TRANSFER
                timestampValidBits          = 64
                present support             = false

VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 1
        memoryHeaps[0]:
                size   = 949276672 (0x3894d000) (905.30 MiB)
                budget = 949276672 (0x3894d000) (905.30 MiB)
                usage  = 264392704 (0x0fc25000) (252.14 MiB)
                flags: count = 1
                        MEMORY_HEAP_DEVICE_LOCAL_BIT
memoryTypes: count = 1
        memoryTypes[0]:
                heapIndex     = 0
                propertyFlags = 0x000f: count = 4
                        MEMORY_PROPERTY_DEVICE_LOCAL_BIT
                        MEMORY_PROPERTY_HOST_VISIBLE_BIT
                        MEMORY_PROPERTY_HOST_COHERENT_BIT
                        MEMORY_PROPERTY_HOST_CACHED_BIT
                usable for:
                        IMAGE_TILING_OPTIMAL:
                                color images
                                FORMAT_D16_UNORM
                                FORMAT_X8_D24_UNORM_PACK32
                                FORMAT_D32_SFLOAT
                                FORMAT_S8_UINT
                                FORMAT_D24_UNORM_S8_UINT
                                FORMAT_D32_SFLOAT_S8_UINT
                                (non-sparse)
                        IMAGE_TILING_LINEAR:
                                color images
                                (non-sparse)

VkPhysicalDeviceFeatures:
=========================
        robustBufferAccess                      = true
        fullDrawIndexUint32                     = true
        imageCubeArray                          = true
        independentBlend                        = true
        geometryShader                          = true
        tessellationShader                      = true
        sampleRateShading                       = true
        dualSrcBlend                            = true
        logicOp                                 = true
        multiDrawIndirect                       = true
        drawIndirectFirstInstance               = true
        depthClamp                              = true
        depthBiasClamp                          = true
        fillModeNonSolid                        = true
        depthBounds                             = false
        wideLines                               = true
        largePoints                             = true
        alphaToOne                              = true
        multiViewport                           = true
        samplerAnisotropy                       = true
        textureCompressionETC2                  = false
        textureCompressionASTC_LDR              = false
        textureCompressionBC                    = true
        occlusionQueryPrecise                   = true
        pipelineStatisticsQuery                 = true
        vertexPipelineStoresAndAtomics          = true
        fragmentStoresAndAtomics                = true
        shaderTessellationAndGeometryPointSize  = true
        shaderImageGatherExtended               = true
        shaderStorageImageExtendedFormats       = true
        shaderStorageImageMultisample           = true
        shaderStorageImageReadWithoutFormat     = true
        shaderStorageImageWriteWithoutFormat    = true
        shaderUniformBufferArrayDynamicIndexing = true
        shaderSampledImageArrayDynamicIndexing  = true
        shaderStorageBufferArrayDynamicIndexing = true
        shaderStorageImageArrayDynamicIndexing  = true
        shaderClipDistance                      = true
        shaderCullDistance                      = true
        shaderFloat64                           = true
        shaderInt64                             = true
        shaderInt16                             = true
        shaderResourceResidency                 = false
        shaderResourceMinLod                    = false
        sparseBinding                           = false
        sparseResidencyBuffer                   = false
        sparseResidencyImage2D                  = false
        sparseResidencyImage3D                  = false
        sparseResidency2Samples                 = false
        sparseResidency4Samples                 = false
        sparseResidency8Samples                 = false
        sparseResidency16Samples                = false
        sparseResidencyAliased                  = false
        variableMultisampleRate                 = false
        inheritedQueries                        = false

VkPhysicalDevice16BitStorageFeatures:
-------------------------------------
        storageBuffer16BitAccess           = true
        uniformAndStorageBuffer16BitAccess = true
        storagePushConstant16              = true
        storageInputOutput16               = false

VkPhysicalDevice4444FormatsFeaturesEXT:
---------------------------------------
        formatA4R4G4B4 = true
        formatA4B4G4R4 = true

VkPhysicalDevice8BitStorageFeatures:
------------------------------------
        storageBuffer8BitAccess           = true
        uniformAndStorageBuffer8BitAccess = true
        storagePushConstant8              = true

VkPhysicalDeviceBorderColorSwizzleFeaturesEXT:
----------------------------------------------
        borderColorSwizzle          = true
        borderColorSwizzleFromImage = true

VkPhysicalDeviceBufferDeviceAddressFeatures:
--------------------------------------------
        bufferDeviceAddress              = true
        bufferDeviceAddressCaptureReplay = false
        bufferDeviceAddressMultiDevice   = false

VkPhysicalDeviceColorWriteEnableFeaturesEXT:
--------------------------------------------
        colorWriteEnable = true

VkPhysicalDeviceConditionalRenderingFeaturesEXT:
------------------------------------------------
        conditionalRendering          = true
        inheritedConditionalRendering = false

VkPhysicalDeviceCustomBorderColorFeaturesEXT:
---------------------------------------------
        customBorderColors             = true
        customBorderColorWithoutFormat = true

VkPhysicalDeviceDepthClipControlFeaturesEXT:
--------------------------------------------
        depthClipControl = true

VkPhysicalDeviceDepthClipEnableFeaturesEXT:
-------------------------------------------
        depthClipEnable = true

VkPhysicalDeviceDescriptorIndexingFeatures:
-------------------------------------------
        shaderInputAttachmentArrayDynamicIndexing          = true
        shaderUniformTexelBufferArrayDynamicIndexing       = true
        shaderStorageTexelBufferArrayDynamicIndexing       = true
        shaderUniformBufferArrayNonUniformIndexing         = true
        shaderSampledImageArrayNonUniformIndexing          = true
        shaderStorageBufferArrayNonUniformIndexing         = true
        shaderStorageImageArrayNonUniformIndexing          = true
        shaderInputAttachmentArrayNonUniformIndexing       = true
        shaderUniformTexelBufferArrayNonUniformIndexing    = true
        shaderStorageTexelBufferArrayNonUniformIndexing    = true
        descriptorBindingUniformBufferUpdateAfterBind      = true
        descriptorBindingSampledImageUpdateAfterBind       = true
        descriptorBindingStorageImageUpdateAfterBind       = true
        descriptorBindingStorageBufferUpdateAfterBind      = true
        descriptorBindingUniformTexelBufferUpdateAfterBind = true
        descriptorBindingStorageTexelBufferUpdateAfterBind = true
        descriptorBindingUpdateUnusedWhilePending          = true
        descriptorBindingPartiallyBound                    = true
        descriptorBindingVariableDescriptorCount           = true
        runtimeDescriptorArray                             = true

VkPhysicalDeviceDynamicRenderingFeatures:
-----------------------------------------
        dynamicRendering = true

VkPhysicalDeviceExtendedDynamicState2FeaturesEXT:
-------------------------------------------------
        extendedDynamicState2                   = true
        extendedDynamicState2LogicOp            = true
        extendedDynamicState2PatchControlPoints = true

VkPhysicalDeviceExtendedDynamicStateFeaturesEXT:
------------------------------------------------
        extendedDynamicState = true

VkPhysicalDeviceHostQueryResetFeatures:
---------------------------------------
        hostQueryReset = true

VkPhysicalDeviceImageRobustnessFeatures:
----------------------------------------
        robustImageAccess = true

VkPhysicalDeviceImagelessFramebufferFeatures:
---------------------------------------------
        imagelessFramebuffer = true

VkPhysicalDeviceIndexTypeUint8FeaturesEXT:
------------------------------------------
        indexTypeUint8 = true

VkPhysicalDeviceInlineUniformBlockFeatures:
-------------------------------------------
        inlineUniformBlock                                 = true
        descriptorBindingInlineUniformBlockUpdateAfterBind = true

VkPhysicalDeviceLineRasterizationFeaturesEXT:
---------------------------------------------
        rectangularLines         = true
        bresenhamLines           = true
        smoothLines              = true
        stippledRectangularLines = true
        stippledBresenhamLines   = true
        stippledSmoothLines      = true

VkPhysicalDeviceMaintenance4Features:
-------------------------------------
        maintenance4 = true

VkPhysicalDeviceMemoryPriorityFeaturesEXT:
------------------------------------------
        memoryPriority = true

VkPhysicalDeviceMultiDrawFeaturesEXT:
-------------------------------------
        multiDraw = true

VkPhysicalDeviceMultiviewFeatures:
----------------------------------
        multiview                   = true
        multiviewGeometryShader     = true
        multiviewTessellationShader = true

VkPhysicalDevicePageableDeviceLocalMemoryFeaturesEXT:
-----------------------------------------------------
        pageableDeviceLocalMemory = true

VkPhysicalDevicePipelineCreationCacheControlFeatures:
-----------------------------------------------------
        pipelineCreationCacheControl = true

VkPhysicalDevicePrimitiveTopologyListRestartFeaturesEXT:
--------------------------------------------------------
        primitiveTopologyListRestart      = true
        primitiveTopologyPatchListRestart = true

VkPhysicalDevicePrivateDataFeatures:
------------------------------------
        privateData = true

VkPhysicalDeviceProtectedMemoryFeatures:
----------------------------------------
        protectedMemory = false

VkPhysicalDeviceProvokingVertexFeaturesEXT:
-------------------------------------------
        provokingVertexLast                       = true
        transformFeedbackPreservesProvokingVertex = true

VkPhysicalDeviceRobustness2FeaturesEXT:
---------------------------------------
        robustBufferAccess2 = true
        robustImageAccess2  = true
        nullDescriptor      = true

VkPhysicalDeviceSamplerYcbcrConversionFeatures:
-----------------------------------------------
        samplerYcbcrConversion = false

VkPhysicalDeviceScalarBlockLayoutFeatures:
------------------------------------------
        scalarBlockLayout = true

VkPhysicalDeviceSeparateDepthStencilLayoutsFeatures:
----------------------------------------------------
        separateDepthStencilLayouts = true

VkPhysicalDeviceShaderAtomicFloat2FeaturesEXT:
----------------------------------------------
        shaderBufferFloat16Atomics      = false
        shaderBufferFloat16AtomicAdd    = false
        shaderBufferFloat16AtomicMinMax = false
        shaderBufferFloat32AtomicMinMax = true
        shaderBufferFloat64AtomicMinMax = false
        shaderSharedFloat16Atomics      = false
        shaderSharedFloat16AtomicAdd    = false
        shaderSharedFloat16AtomicMinMax = false
        shaderSharedFloat32AtomicMinMax = true
        shaderSharedFloat64AtomicMinMax = false
        shaderImageFloat32AtomicMinMax  = true
        sparseImageFloat32AtomicMinMax  = false

VkPhysicalDeviceShaderAtomicFloatFeaturesEXT:
---------------------------------------------
        shaderBufferFloat32Atomics   = true
        shaderBufferFloat32AtomicAdd = true
        shaderBufferFloat64Atomics   = false
        shaderBufferFloat64AtomicAdd = false
        shaderSharedFloat32Atomics   = true
        shaderSharedFloat32AtomicAdd = true
        shaderSharedFloat64Atomics   = false
        shaderSharedFloat64AtomicAdd = false
        shaderImageFloat32Atomics    = true
        shaderImageFloat32AtomicAdd  = true
        sparseImageFloat32Atomics    = false
        sparseImageFloat32AtomicAdd  = false

VkPhysicalDeviceShaderAtomicInt64Features:
------------------------------------------
        shaderBufferInt64Atomics = true
        shaderSharedInt64Atomics = true

VkPhysicalDeviceShaderClockFeaturesKHR:
---------------------------------------
        shaderSubgroupClock = true
        shaderDeviceClock   = true

VkPhysicalDeviceShaderDemoteToHelperInvocationFeatures:
-------------------------------------------------------
        shaderDemoteToHelperInvocation = true

VkPhysicalDeviceShaderDrawParametersFeatures:
---------------------------------------------
        shaderDrawParameters = true

VkPhysicalDeviceShaderFloat16Int8Features:
------------------------------------------
        shaderFloat16 = true
        shaderInt8    = true

VkPhysicalDeviceShaderIntegerDotProductFeatures:
------------------------------------------------
        shaderIntegerDotProduct = true

VkPhysicalDeviceShaderSubgroupExtendedTypesFeatures:
----------------------------------------------------
        shaderSubgroupExtendedTypes = true

VkPhysicalDeviceShaderTerminateInvocationFeatures:
--------------------------------------------------
        shaderTerminateInvocation = true

VkPhysicalDeviceSubgroupSizeControlFeatures:
--------------------------------------------
        subgroupSizeControl  = true
        computeFullSubgroups = true

VkPhysicalDeviceSynchronization2Features:
-----------------------------------------
        synchronization2 = true

VkPhysicalDeviceTexelBufferAlignmentFeaturesEXT:
------------------------------------------------
        texelBufferAlignment = true

VkPhysicalDeviceTextureCompressionASTCHDRFeatures:
--------------------------------------------------
        textureCompressionASTC_HDR = false

VkPhysicalDeviceTimelineSemaphoreFeatures:
------------------------------------------
        timelineSemaphore = true

VkPhysicalDeviceTransformFeedbackFeaturesEXT:
---------------------------------------------
        transformFeedback = true
        geometryStreams   = true

VkPhysicalDeviceUniformBufferStandardLayoutFeatures:
----------------------------------------------------
        uniformBufferStandardLayout = true

VkPhysicalDeviceVariablePointersFeatures:
-----------------------------------------
        variablePointersStorageBuffer = true
        variablePointers              = true

VkPhysicalDeviceVertexAttributeDivisorFeaturesEXT:
--------------------------------------------------
        vertexAttributeInstanceRateDivisor     = true
        vertexAttributeInstanceRateZeroDivisor = true

VkPhysicalDeviceVertexInputDynamicStateFeaturesEXT:
---------------------------------------------------
        vertexInputDynamicState = true

VkPhysicalDeviceVulkan11Features:
---------------------------------
        storageBuffer16BitAccess           = true
        uniformAndStorageBuffer16BitAccess = true
        storagePushConstant16              = true
        storageInputOutput16               = false
        multiview                          = true
        multiviewGeometryShader            = true
        multiviewTessellationShader        = true
        variablePointersStorageBuffer      = true
        variablePointers                   = true
        protectedMemory                    = false
        samplerYcbcrConversion             = false
        shaderDrawParameters               = true

VkPhysicalDeviceVulkan12Features:
---------------------------------
        samplerMirrorClampToEdge                           = true
        drawIndirectCount                                  = true
        storageBuffer8BitAccess                            = true
        uniformAndStorageBuffer8BitAccess                  = true
        storagePushConstant8                               = true
        shaderBufferInt64Atomics                           = true
        shaderSharedInt64Atomics                           = true
        shaderFloat16                                      = true
        shaderInt8                                         = true
        descriptorIndexing                                 = true
        shaderInputAttachmentArrayDynamicIndexing          = true
        shaderUniformTexelBufferArrayDynamicIndexing       = true
        shaderStorageTexelBufferArrayDynamicIndexing       = true
        shaderUniformBufferArrayNonUniformIndexing         = true
        shaderSampledImageArrayNonUniformIndexing          = true
        shaderStorageBufferArrayNonUniformIndexing         = true
        shaderStorageImageArrayNonUniformIndexing          = true
        shaderInputAttachmentArrayNonUniformIndexing       = true
        shaderUniformTexelBufferArrayNonUniformIndexing    = true
        shaderStorageTexelBufferArrayNonUniformIndexing    = true
        descriptorBindingUniformBufferUpdateAfterBind      = true
        descriptorBindingSampledImageUpdateAfterBind       = true
        descriptorBindingStorageImageUpdateAfterBind       = true
        descriptorBindingStorageBufferUpdateAfterBind      = true
        descriptorBindingUniformTexelBufferUpdateAfterBind = true
        descriptorBindingStorageTexelBufferUpdateAfterBind = true
        descriptorBindingUpdateUnusedWhilePending          = true
        descriptorBindingPartiallyBound                    = true
        descriptorBindingVariableDescriptorCount           = true
        runtimeDescriptorArray                             = true
        samplerFilterMinmax                                = true
        scalarBlockLayout                                  = true
        imagelessFramebuffer                               = true
        uniformBufferStandardLayout                        = true
        shaderSubgroupExtendedTypes                        = true
        separateDepthStencilLayouts                        = true
        hostQueryReset                                     = true
        timelineSemaphore                                  = true
        bufferDeviceAddress                                = true
        bufferDeviceAddressCaptureReplay                   = false
        bufferDeviceAddressMultiDevice                     = false
        vulkanMemoryModel                                  = true
        vulkanMemoryModelDeviceScope                       = true
        vulkanMemoryModelAvailabilityVisibilityChains      = true
        shaderOutputViewportIndex                          = true
        shaderOutputLayer                                  = true
        subgroupBroadcastDynamicId                         = true

VkPhysicalDeviceVulkan13Features:
---------------------------------
        robustImageAccess                                  = true
        inlineUniformBlock                                 = true
        descriptorBindingInlineUniformBlockUpdateAfterBind = true
        pipelineCreationCacheControl                       = true
        privateData                                        = true
        shaderDemoteToHelperInvocation                     = true
        shaderTerminateInvocation                          = true
        subgroupSizeControl                                = true
        computeFullSubgroups                               = true
        synchronization2                                   = true
        textureCompressionASTC_HDR                         = false
        shaderZeroInitializeWorkgroupMemory                = true
        dynamicRendering                                   = true
        shaderIntegerDotProduct                            = true
        maintenance4                                       = true

VkPhysicalDeviceVulkanMemoryModelFeatures:
------------------------------------------
        vulkanMemoryModel                             = true
        vulkanMemoryModelDeviceScope                  = true
        vulkanMemoryModelAvailabilityVisibilityChains = true

VkPhysicalDeviceZeroInitializeWorkgroupMemoryFeatures:
------------------------------------------------------
        shaderZeroInitializeWorkgroupMemory = true

@DifferentialityDevelopment
Contributor Author

Nice, so the Raspberry Pi should be able to support it as well. From what I see, I think I will need to lower the workgroup size to a max of 3.

@DifferentialityDevelopment
Contributor Author

sudo nice -n -20 ./main inference --model ..dllama_original_q40.bin --tokenizer ..dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 1
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 32768 kB
⏩ Loaded 6175568 kB
Created Vulkan Instance!
Device Name: NVIDIA GeForce RTX 3060
API Version: 1.3.242
Memory Heaps: 2
Heap 0: 12288 MB
Heap 1: 24004 MB
Memory Types: 5
Type 0: 1
Type 1: 0 Device Local
Type 2: 1 Host Visible Host Coherent
Type 3: 1 Host Visible Host Coherent
Type 4: 0 Device Local Host Visible Host Coherent
Created pipeline F32_F32
Created pipeline Q40_Q80
🔶 G 1307 ms I 1307 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 1315 ms I 1315 ms T 0 ms S 0 kB R 0 kB world
🔶 G 1384 ms I 1382 ms T 0 ms S 0 kB R 0 kB OO
🔶 G 1520 ms I 1518 ms T 0 ms S 0 kB R 0 kB AAAAAAAA
🔶 G 1595 ms I 1591 ms T 0 ms S 0 kB R 0 kB gambar
🔶 G 1583 ms I 1580 ms T 0 ms S 0 kB R 0 kB and
🔶 G 1605 ms I 1600 ms T 0 ms S 0 kB R 0 kB HUD
🔶 G 1623 ms I 1618 ms T 0 ms S 0 kB R 0 kB Sm
🔶 G 1616 ms I 1612 ms T 0 ms S 0 kB R 0 kB AG
🔶 G 1561 ms I 1556 ms T 0 ms S 0 kB R 0 kB asz
🔶 G 1578 ms I 1573 ms T 0 ms S 0 kB R 0 kB asse
🔶 G 1598 ms I 1593 ms T 0 ms S 0 kB R 0 kB imagination

Not quite there yet it seems

@DifferentialityDevelopment
Contributor Author

It looks like the best approach is to determine how many layers of weights fit into GPU memory, then process a layer in Vulkan if its weights are resident on the GPU and on the CPU otherwise.
The reason is that I'm now pretty sure that loading the weights and input into GPU memory, doing the calculation, and reading the results back will in most cases be slower than just doing it on the CPU.
It might still be faster for Llama 70B, though.
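
The placement idea above can be sketched as follows. This is a minimal sketch, not the project's implementation: `plan_offload`, the per-layer size, and the reserved headroom are hypothetical names and numbers chosen for illustration. It assigns layers to the GPU in order until the next one no longer fits, and leaves the rest on the CPU.

```python
def plan_offload(layer_sizes, gpu_budget_bytes, reserve_bytes=0):
    """Return one placement ('gpu' or 'cpu') per layer.

    Layers go to the GPU in order until the next layer would not fit in the
    remaining budget; every later layer stays on the CPU.
    """
    available = gpu_budget_bytes - reserve_bytes
    placements = []
    gpu_full = False
    for size in layer_sizes:
        if not gpu_full and size <= available:
            placements.append("gpu")
            available -= size
        else:
            gpu_full = True
            placements.append("cpu")
    return placements


# Hypothetical numbers: 32 equally sized layers of 190 MiB each, a 4 GiB
# budget, and 256 MiB kept free for activations and staging buffers.
MIB = 1024 * 1024
plan = plan_offload([190 * MIB] * 32, 4096 * MIB, reserve_bytes=256 * MIB)
print(plan.count("gpu"), "layers on GPU,", plan.count("cpu"), "layers on CPU")
```

During inference, the per-layer placement would then decide whether a layer's matmul is dispatched to the Vulkan pipeline or to the existing CPU path.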

This is what my compute shader for Q40Q80 looks like right now. The best result so far for a 4096 x 4096 weight matrix and a 1 x 4096 input matrix is 2 ms, which is about the same as the CPU matmul.
That excludes the overhead of setting up the buffers, copying to and from GPU memory, and dispatching the workload.

matmulQ40Q80.txt
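
For checking a shader like this against a known-good result, a slow CPU reference can help. The sketch below is not the attached shader and not the project's code. It assumes a ggml-style block layout (32 values per block, one scale per block, 4-bit codes offset by 8 for Q40, signed 8-bit values for Q80), keeps scales as Python floats instead of 16-bit floats, and uses hypothetical function names.

```python
import random

BLOCK = 32  # values per quantization block (assumed, ggml-style layout)


def quantize_q40(values):
    """Pack floats into blocks of (scale, 16 bytes holding 32 4-bit codes)."""
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        vmax = max(chunk, key=abs)      # signed value with the largest magnitude
        d = vmax / -8.0                 # scale chosen so that vmax maps to code 0
        inv = 1.0 / d if d else 0.0
        codes = [min(15, int(v * inv + 8.5)) for v in chunk]
        # Byte j holds element j in the low nibble, element j + 16 in the high nibble.
        packed = bytes(codes[j] | (codes[j + 16] << 4) for j in range(16))
        blocks.append((d, packed))
    return blocks


def quantize_q80(values):
    """Quantize floats into blocks of (scale, 32 signed 8-bit integers)."""
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        d = max(abs(v) for v in chunk) / 127.0
        inv = 1.0 / d if d else 0.0
        blocks.append((d, [round(v * inv) for v in chunk]))
    return blocks


def dot_q40_q80(w_blocks, x_blocks):
    """Dot product of a Q40 row with a Q80 vector, one block at a time."""
    total = 0.0
    for (wd, packed), (xd, xq) in zip(w_blocks, x_blocks):
        acc = 0  # integer accumulator for this block
        for j in range(16):
            lo = (packed[j] & 0x0F) - 8
            hi = (packed[j] >> 4) - 8
            acc += lo * xq[j] + hi * xq[j + 16]
        total += wd * xd * acc  # apply both scales once per block
    return total


# The vector length must be a multiple of BLOCK in this sketch.
random.seed(0)
n = 4096
w = [random.uniform(-1, 1) for _ in range(n)]
x = [random.uniform(-1, 1) for _ in range(n)]
exact = sum(a * b for a, b in zip(w, x))
approx = dot_q40_q80(quantize_q40(w), quantize_q80(x))
print(f"float dot: {exact:.4f}, Q40 x Q80 dot: {approx:.4f}")
```

Running the same row and input through the shader and through `dot_q40_q80` gives a direct way to see whether a mismatch comes from the unpacking order or from the accumulation.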

There is still a lot I have to figure out, but I'm making good headway.

@b4rtaz
Owner

b4rtaz commented May 25, 2024

I'm probably doing something wrong, but I wanted to compare the performance of the CPU with Vulkan on my Mac. llama.cpp has already implemented Vulkan support, so I used it:

CPU:

llama_print_timings: prompt eval time =     148.06 ms /    31 tokens (    4.78 ms per token,   209.37 tokens per second)
llama_print_timings:        eval time =    7150.30 ms /   254 runs   (   28.15 ms per token,    35.52 tokens per second)

Vulkan:

llama_print_timings: prompt eval time =     517.43 ms /    31 tokens (   16.69 ms per token,    59.91 tokens per second)
llama_print_timings:        eval time =   18112.97 ms /   255 runs   (   71.03 ms per token,    14.08 tokens per second)

Additionally, I get some weird characters in the response, so maybe something is broken.

@DifferentialityDevelopment did you observe any speedup with Vulkan on llama.cpp?

@DifferentialityDevelopment
Contributor Author

> I'm probably doing something wrong, but I wanted to compare the performance of the CPU with Vulkan on my Mac. llama.cpp has already implemented Vulkan support, so I used it:
>
> CPU:
>
> llama_print_timings: prompt eval time =     148.06 ms /    31 tokens (    4.78 ms per token,   209.37 tokens per second)
> llama_print_timings:        eval time =    7150.30 ms /   254 runs   (   28.15 ms per token,    35.52 tokens per second)
>
> Vulkan:
>
> llama_print_timings: prompt eval time =     517.43 ms /    31 tokens (   16.69 ms per token,    59.91 tokens per second)
> llama_print_timings:        eval time =   18112.97 ms /   255 runs   (   71.03 ms per token,    14.08 tokens per second)
>
> Additionally, I get some weird characters in the response, so maybe something is broken.
>
> @DifferentialityDevelopment did you observe any speedup with Vulkan on llama.cpp?

Do you have one of those with the unified memory architecture?

For me Vulkan is much faster than CPU-only inference, as it makes use of my RTX 3060.
Vulkan is almost as fast as the CUDA version of llama.cpp.

Also, I know llama.cpp just had a patch that supposedly fixes some issues with Vulkan, so maybe you just had an older version?

@DifferentialityDevelopment
Contributor Author

I've been having a rough time getting Vulkan to work properly and efficiently with distributed Llama. I'm not sure exactly what I'm doing wrong yet. The tests I've run indicate that the Vulkan inference functions match the CPU matmul functions within the error margin. However, in the main inference loop the results differ significantly once QKV is handled by the Vulkan compute shader that implements matmulQ40_Q80. So, I'm not quite sure what's going on.

My plan is to offload as many layers as possible to the GPU at startup (possibly configurable manually through a setting). During inference, it can then use the weights already in GPU memory. It will probably take me a month or more to get it working correctly. I will try to keep the Vulkan branch up to date with the main branch as much as possible so that it will be easier to merge when I eventually make a pull request.

It already offloads the layers to GPU memory on startup (sort of), but the computation results are still not correct.
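
One way to narrow down where the GPU path starts to diverge is to capture the CPU and GPU outputs after each step and compare them in order. The helper below is a hedged sketch: the function names, the stage names (`q`, `k`, `v`), the tolerances, and the sample values are all hypothetical, and it assumes both paths can dump their intermediate outputs as flat lists of floats.

```python
import math


def compare_outputs(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Return (max_abs_error, first_bad_index); the index is None if all match."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    max_err = 0.0
    first_bad = None
    for i, (ref, got) in enumerate(zip(reference, candidate)):
        err = abs(ref - got)
        if math.isnan(err):
            err = math.inf  # a NaN in either output counts as a mismatch
        max_err = max(max_err, err)
        if first_bad is None and not math.isclose(
            ref, got, rel_tol=rel_tol, abs_tol=abs_tol
        ):
            first_bad = i
    return max_err, first_bad


def first_divergent_stage(stages):
    """stages: iterable of (name, cpu_output, gpu_output) in execution order."""
    for name, cpu_out, gpu_out in stages:
        max_err, bad_index = compare_outputs(cpu_out, gpu_out)
        if bad_index is not None:
            return name, bad_index, max_err
    return None


# Made-up values: the "k" stage is the first one where the outputs disagree.
stages = [
    ("q", [1.0, 2.0], [1.0, 2.0]),
    ("k", [0.5, -1.0], [0.5, -0.25]),
    ("v", [3.0], [9.0]),
]
print(first_divergent_stage(stages))
```

If the isolated matmul tests pass but a stage in the loop fails, the difference is more likely in how buffers are bound, sized, or reused between dispatches than in the shader arithmetic itself.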

@b4rtaz
Owner

b4rtaz commented May 25, 2024

> Do you have one of those with the unified memory architecture?

Yeah, maybe the CPU is too fast.

> My plan is to offload as many layers as possible to the GPU at startup (possibly configurable manually through a setting).

This is how it works in llama.cpp: there is the -ngl argument.

@unclemusclez

can i help with testing this? i have some gpus i want to throw into the mix that are not ROCm/CUDA capable, plus i want to try my pi's.

@b4rtaz
Owner

b4rtaz commented Jun 4, 2024

After some time I made a little progress: I have a shader that is faster than a single M1 core:

// m1
CPU: 70 ms
GPU: 26 ms

Unfortunately the same shader on Raspberry Pi:

// raspberry pi 5
CPU: 58 ms
GPU: 1045 ms

🤯


The weird thing is that vulkaninfo --summary on the Raspberry Pi 5 returns:

Devices:
========
GPU0:
        apiVersion         = 1.2.255
        driverVersion      = 23.2.1
        vendorID           = 0x14e4
        deviceID           = 0x55701c33
        deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
        deviceName         = V3D 7.1.7
        driverID           = DRIVER_ID_MESA_V3DV
        driverName         = V3DV Mesa
        driverInfo         = Mesa 23.2.1-1~bpo12+rpt3
        conformanceVersion = 1.3.6.1
        deviceUUID         = 5fd8106e-741a-cafa-e080-fdb16cf11a80
        driverUUID         = 1698c6ef-161f-3213-5159-557202953ee9
GPU1:
        apiVersion         = 1.3.255
        driverVersion      = 0.0.1
        vendorID           = 0x10005
        deviceID           = 0x0000
        deviceType         = PHYSICAL_DEVICE_TYPE_CPU
        deviceName         = llvmpipe (LLVM 15.0.6, 128 bits)
        driverID           = DRIVER_ID_MESA_LLVMPIPE
        driverName         = llvmpipe
        driverInfo         = Mesa 23.2.1-1~bpo12+rpt3 (LLVM 15.0.6)
        conformanceVersion = 1.3.1.1
        deviceUUID         = 6d657361-3233-2e32-2e31-2d317e627000
        driverUUID         = 6c6c766d-7069-7065-5555-494400000000

And I get the same speed on both devices.
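
The summary lists two devices: the V3D hardware GPU and llvmpipe, a CPU-based software implementation. To tell them apart when picking a device, one can filter on `deviceType`. The sketch below parses the text format shown above; `parse_devices` and `pick_hardware_gpu` are hypothetical helper names, and the sample is abbreviated from the output in this thread. In a Vulkan program the same field is available as `deviceType` in `VkPhysicalDeviceProperties`.

```python
import re

SAMPLE = """\
Devices:
========
GPU0:
        deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
        deviceName         = V3D 7.1.7
        driverName         = V3DV Mesa
GPU1:
        deviceType         = PHYSICAL_DEVICE_TYPE_CPU
        deviceName         = llvmpipe (LLVM 15.0.6, 128 bits)
        driverName         = llvmpipe
"""


def parse_devices(summary_text):
    """Turn the 'GPUn:' sections of a vulkaninfo summary into dictionaries."""
    devices = []
    current = None
    for line in summary_text.splitlines():
        header = re.fullmatch(r"GPU(\d+):", line.strip())
        if header:
            current = {"index": int(header.group(1))}
            devices.append(current)
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return devices


def pick_hardware_gpu(devices):
    """Return the first device that is not a CPU implementation, or None."""
    for device in devices:
        if device.get("deviceType") != "PHYSICAL_DEVICE_TYPE_CPU":
            return device
    return None


chosen = pick_hardware_gpu(parse_devices(SAMPLE))
print(chosen["index"], chosen["deviceName"])
```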

@DifferentialityDevelopment
Contributor Author

Something I have learned is that just copying the data into Vulkan buffers isn't the whole picture: the data also has to be moved into GPU memory, which is where access is much faster.
You first write the data to a host-visible staging buffer, and Vulkan then copies it into device-local GPU memory. You generally can't write directly from the host into device-local memory that is not host-visible.
Also, apparently the Vulkan Memory Allocator (VMA) can take over a lot of this and reduce the amount of boilerplate code needed.
https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/choosing_memory_type.html#choosing_memory_type_usage
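
The staging approach depends on choosing two different memory types: a host-visible one for the staging buffer and a device-local one for the final buffer. The selection logic can be sketched without any Vulkan bindings. The flag values below are the ones defined by Vulkan for `VkMemoryPropertyFlagBits`; the memory type list mirrors the property flags printed for the RTX 3060 earlier in this thread, and `find_memory_type` is a hypothetical helper name. `type_bits` stands in for `memoryTypeBits` from `VkMemoryRequirements`, assumed here to allow every type.

```python
DEVICE_LOCAL = 0x1   # VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
HOST_VISIBLE = 0x2   # VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
HOST_COHERENT = 0x4  # VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

# Property flags per memory type, as listed in the RTX 3060 log above.
MEMORY_TYPES = [
    0,
    DEVICE_LOCAL,
    HOST_VISIBLE | HOST_COHERENT,
    HOST_VISIBLE | HOST_COHERENT,
    DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT,
]


def find_memory_type(type_bits, memory_types, required):
    """Return the first allowed memory type that has all required flags."""
    for index, flags in enumerate(memory_types):
        allowed = type_bits & (1 << index)
        if allowed and (flags & required) == required:
            return index
    raise LookupError("no suitable memory type")


ALL_TYPES = 0b11111  # assumption: the buffer may use any of the five types
staging_type = find_memory_type(ALL_TYPES, MEMORY_TYPES, HOST_VISIBLE | HOST_COHERENT)
device_type = find_memory_type(ALL_TYPES, MEMORY_TYPES, DEVICE_LOCAL)
print("staging buffer memory type:", staging_type)
print("device-local buffer memory type:", device_type)
```

With both buffers allocated, the weights are written into the mapped staging buffer and then transferred with a buffer copy command such as `vkCmdCopyBuffer`. VMA wraps this kind of selection behind its usage flags.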

@DifferentialityDevelopment
Contributor Author

Also, the warp size and the local/global group sizes are very important for fully utilizing the GPU.
I haven't had time to work on it further. Well done on getting it to work!

Thinking about it some more, the Raspberry Pi has no dedicated GPU memory; it uses the system RAM. That kind of explains why your M1 was faster on the GPU: with the unified memory architecture the GPU and the CPU access memory at the same speed, and the GPU cores can process the data in parallel much more efficiently than the CPU can.

Still, I'm sure the Raspberry Pi's GPU should be able to do the computations faster than its CPU can; I just wonder how to make that happen. I don't have a Raspberry Pi to test with myself, but I'll soon be able to upgrade to a 4-node setup (PCs).

@b4rtaz
Owner

b4rtaz commented Jun 5, 2024

Some progress: 🫣

// raspberry pi 5
CPU: 51 ms
GPU: 303 ms

@DifferentialityDevelopment
Contributor Author

> Some progress: 🫣
>
> // raspberry pi 5
> CPU: 51 ms
> GPU: 303 ms

What size matrices are you testing it with?

@b4rtaz
Owner

b4rtaz commented Jun 5, 2024

n = 4096;
d = 14336;

This requires around 229448 kB of memory (the total size of the input, weights, and output).

I'm trying to implement matrix x vector multiplication. The sizes are basically taken from the Llama model.
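
The 229448 kB figure can be reproduced with a short calculation. The sketch assumes 4-byte (F32) values, which is an inference from the number matching rather than something stated above; the function names and the row-major weight layout are likewise assumptions. A tiny reference matrix x vector product is included to pin down the shapes: n inputs, d outputs, and d rows of n weights.

```python
def matvec_footprint_kb(n, d, bytes_per_value=4):
    """Total size in kB of an n-vector, a d x n weight matrix, and a d-vector."""
    input_bytes = n * bytes_per_value
    weight_bytes = n * d * bytes_per_value
    output_bytes = d * bytes_per_value
    return (input_bytes + weight_bytes + output_bytes) // 1024


def matvec(weights, x, n, d):
    """Multiply a flat row-major d x n weight matrix by a vector of length n."""
    return [
        sum(weights[row * n + col] * x[col] for col in range(n))
        for row in range(d)
    ]


print(matvec_footprint_kb(4096, 14336))  # 229448

# Small shape check: 2 rows of 3 weights times a 3-vector.
print(matvec([1, 2, 3, 4, 5, 6], [1, 0, -1], n=3, d=2))  # [-2, -2]
```

The weights account for 229376 kB of the total, so the input and output vectors (16 kB and 56 kB) are negligible next to the matrix.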

@unclemusclez

https://github.com/LostRuins/koboldcpp/tree/318d5b87fc1602ef16d8271bfdd937ef416a8182/include/vulkan
koboldcpp seems to work decently on Windows. Not sure how it differs from llama.cpp.

https://github.com/Const-me/Cgml might offer some insight, although it is primarily for Direct3D.

https://github.com/CNugteren/CLBlast seems to be the consensus on embedded and AMD hardware.

@zhengpeirong

> After some time I made a little progress: I have a shader that is faster than a single M1 core:
>
> // m1
> CPU: 70 ms
> GPU: 26 ms
>
> Unfortunately the same shader on Raspberry Pi:
>
> // raspberry pi 5
> CPU: 58 ms
> GPU: 1045 ms
>
> 🤯
>
> The weird thing is that vulkaninfo --summary on the Raspberry Pi 5 returns:
>
> Devices:
> ========
> GPU0:
>         apiVersion         = 1.2.255
>         driverVersion      = 23.2.1
>         vendorID           = 0x14e4
>         deviceID           = 0x55701c33
>         deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
>         deviceName         = V3D 7.1.7
>         driverID           = DRIVER_ID_MESA_V3DV
>         driverName         = V3DV Mesa
>         driverInfo         = Mesa 23.2.1-1~bpo12+rpt3
>         conformanceVersion = 1.3.6.1
>         deviceUUID         = 5fd8106e-741a-cafa-e080-fdb16cf11a80
>         driverUUID         = 1698c6ef-161f-3213-5159-557202953ee9
> GPU1:
>         apiVersion         = 1.3.255
>         driverVersion      = 0.0.1
>         vendorID           = 0x10005
>         deviceID           = 0x0000
>         deviceType         = PHYSICAL_DEVICE_TYPE_CPU
>         deviceName         = llvmpipe (LLVM 15.0.6, 128 bits)
>         driverID           = DRIVER_ID_MESA_LLVMPIPE
>         driverName         = llvmpipe
>         driverInfo         = Mesa 23.2.1-1~bpo12+rpt3 (LLVM 15.0.6)
>         conformanceVersion = 1.3.1.1
>         deviceUUID         = 6d657361-3233-2e32-2e31-2d317e627000
>         driverUUID         = 6c6c766d-7069-7065-5555-494400000000
>
> And I get the same speed on both devices.

The reason why Vulkan is slow is here:
Tencent/ncnn#2435 (comment)

pi@raspberrypi:~/ncnn/benchmark $ ./benchncnn 10 4 0 0 -1 >> text.out
[0 V3D 7.1.7]  queueC=0[1]  queueG=0[1]  queueT=0[1]
[0 V3D 7.1.7]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0
[0 V3D 7.1.7]  fp16-p/s/u/a=1/1/1/0  int8-p/s/u/a=1/1/1/0
[0 V3D 7.1.7]  subgroup=16  basic/vote/ballot/shuffle=1/0/0/0
[0 V3D 7.1.7]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0

The Vulkan driver for the Raspberry Pi lacks arithmetic support for 16-bit floating point and 8-bit integers: the last (a) field of both fp16-p/s/u/a and int8-p/s/u/a in the output above is 0.
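
The same capability can be checked from a full vulkaninfo feature dump, which prints sections such as `VkPhysicalDeviceShaderFloat16Int8Features` with `shaderFloat16` and `shaderInt8` entries. The sketch below parses that text format. It assumes that these two features correspond to the arithmetic flags ncnn reports; the helper names are hypothetical and the sample values are made up for illustration, not taken from a real Raspberry Pi dump.

```python
# Illustrative input in vulkaninfo's format; the values are made up.
SAMPLE_DUMP = """\
VkPhysicalDeviceShaderFloat16Int8Features:
------------------------------------------
        shaderFloat16 = false
        shaderInt8    = false

VkPhysicalDeviceTimelineSemaphoreFeatures:
------------------------------------------
        timelineSemaphore = true
"""


def parse_features(dump_text):
    """Map each feature struct name to a dict of its boolean features."""
    features = {}
    section = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line.endswith(":") and "=" not in line:
            section = line[:-1]
            features[section] = {}
        elif "=" in line and section is not None:
            name, value = (part.strip() for part in line.split("=", 1))
            if value in ("true", "false"):
                features[section][name] = value == "true"
    return features


def fp16_int8_arithmetic(features):
    """Return (fp16_supported, int8_supported) for shader arithmetic."""
    section = features.get("VkPhysicalDeviceShaderFloat16Int8Features", {})
    return section.get("shaderFloat16", False), section.get("shaderInt8", False)


print(fp16_int8_arithmetic(parse_features(SAMPLE_DUMP)))
```

A shader that relies on 16-bit float or 8-bit integer arithmetic would need a fallback variant on a device where either value is false.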

@unclemusclez

unclemusclez commented Jun 6, 2024

http://raspbian.raspberrypi.com/raspbian/pool/main/c/clblast/
https://forums.raspberrypi.com/viewtopic.php?t=11177 "What is hard-float?"

i asked about hard-float RPi earlier, not realizing that hf in this community means hugging face.

RPi in 32-bit/armhf might be more capable? https://cdimage.ubuntu.com/releases/22.04.4/release/
Preinstalled server image >> Raspberry Pi Generic (Hard-Float) preinstalled server image

https://launchpad.net/ubuntu/jammy/+source/clblast natively available in 64 and 32 bit distributions

https://en.wikipedia.org/wiki/ARM_architecture_family#Floating-point_(VFP):

"...armhf (ARM hard float) refers to the ARMv7 architecture including the additional VFP3-D16 floating-point hardware extension (and Thumb-2) above. Software packages and cross-compiler tools use the armhf vs. arm/armel suffixes to differentiate
VFPv4-D16"

"Implemented on most Cortex-A8 and A9 ARMv7 processors. It is backward-compatible with VFPv2, except that it cannot trap floating-point exceptions. VFPv3 has 32 64-bit FPU registers as standard, adds VCVT instructions to convert between scalar, float and double, adds immediate mode to VMOV such that constants can be loaded into FPU registers."
"As above, but it has only 16 64-bit FPU registers. Implemented on Cortex-A5 and A7 processors in the case of an FPU without Neon."

https://en.wikipedia.org/wiki/IEEE_754
https://blog.tensorflow.org/2023/11/half-precision-inference-doubles-on-device-inference-performance.html
https://github.com/petewarden/tensorflow_makefile/blob/master/tensorflow/core/framework/bfloat16.h

interesting comparison:
https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407
