diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/_index.md b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/_index.md new file mode 100644 index 0000000000..973967fa28 --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/_index.md @@ -0,0 +1,60 @@ +--- +title: KleidiAI SME2 matmul microkernel for quantized models explained + +minutes_to_complete: 40 + +who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners + +learning_objectives: + - Learn how a KleidiAI matmul microkernel performs matrix multiplication with quantized data + - Learn how SME2 INT8 Outer Product Accumulate instructions are used for matrix multiplication + - Learn how a KleidiAI SME2 matmul microkernel accelerates matmul operators in a Large Language Model + - Learn how to integrate KleidiAI SME2 matmul microkernels into an AI framework or application + +prerequisites: + - Knowledge of KleidiAI and SME2 + +author: Zenon Zhilong Xiu + +### Tags +skilllevels: Advanced +subjects: ML +armips: + - Arm C1 CPU + - Arm SME2 unit +tools_software_languages: + - C++ + - KleidiAI + - llama.cpp +operatingsystems: + - Android + - Linux + + + +further_reading: + - resource: + title: part 1 Arm Scalable Matrix Extension Introduction + link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction + type: blog + - resource: + title: part 2 Arm Scalable Matrix Extension Instructions + link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2 + type: blog + - resource: + title: part 4 Arm SME2 Introduction + link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/part4-arm-sme2-introduction + type: blog + - 
resource: + title: Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels + link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/ + type: blog + + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- \ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/_next-steps.md b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/_next-steps.md new file mode 100644 index 0000000000..727b395ddd --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # The weight controls the order of the pages. _index.md always has weight 1. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. 
+--- diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/explain_with_an_example_p1.md b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/explain_with_an_example_p1.md new file mode 100644 index 0000000000..414c677fad --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/explain_with_an_example_p1.md @@ -0,0 +1,55 @@ +--- +title: Explain the SME2 matmul microkernel with an example - Part 1 +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Explain the SME2 matmul microkernel with an example - Part 1 +By integrating the SME2-optimized KleidiAI kernels into llama.cpp, the heavy matrix-multiplication workloads in the K, Q, and V computations of the attention blocks, as well as in the FFN layers, can be delegated to the SME2 matmul microkernel when running the Llama-3.2-3B-Q4_0.gguf model. +In these operators, the LHS (activation) data type is FP32, while the RHS (weight) uses the GGML Q4_0 quantized type. + +To make the demonstration easier in this learning path, the LHS dimension [m, k] is simplified to [16, 64], the RHS dimension [n, k] is simplified to [64, 64], and the SME2 SVL is set to 512 bits. + +### Packing the RHS +Although the original Q4_0 RHS (weight) in the model is already INT4-quantized, Q4_0 stores each value as an unsigned nibble with an offset of 8, rather than as the signed INT4 value that the SME2 matmul microkernel expects (the `qsi4` in its name). Moreover, the layout of the INT4 quantized data and the quantization scales does not meet the requirements of the SME2 matmul microkernel either. Therefore, the RHS from the model needs to be converted to signed INT4 data and repacked. +Since the RHS (weight) remains unchanged during inference, this conversion and packing needs to be performed only once, when loading the model. 
+ + +Let us first take a close look at GGML Q4_0 quantization to see how the original FP32 weights are quantized to the Q4_0 format. +In the Q4_0 model, the Q4_0 weights are stored in a layout of [n, k]. +GGML Q4_0 quantizes weights in blocks of 32 floats. For each block, it calculates a scale and then converts each value into a 4-bit integer in the range [-8, 7], which is stored as an unsigned nibble with an offset of 8. The scale is stored as FP16. +GGML Q4_0 then packs the 32 values of a block into 16 bytes as follows: +- the low nibble (bits 0–3) of byte j holds element j of the block +- the high nibble (bits 4–7) of byte j holds element j+16 of the block + +Thus, each byte contains a low/high pair. +The following diagram shows how GGML Q4_0 quantizes and packs the original [n, k] FP32 matrix into the Q4_0 type with a layout of [n, k]. +![Figure showing GGML Q4_0 quantization alt-text#center](images/q4_0_format.jpg "GGML Q4_0 quantization") + +Unfortunately, the Q4_0 format does not meet the requirements of the SME2 matmul microkernel. It needs to be converted and repacked using the *kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon* function, whose name describes a `qsu4c32s16s0` input (unsigned INT4) and a `qsi4c32ps1s0` packed output (signed INT4). + +In this example, we use m=16 and k=64. +- The required mr value for the SME2 matmul kernel is obtained using *kai_get_mr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, mr=16. +- The required nr value for the SME2 matmul kernel is obtained using *kai_get_nr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, nr=64. +- The required kr value for the SME2 matmul kernel is obtained using *kai_get_kr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, kr=4. +- The required sr value for the SME2 matmul kernel is obtained using *kai_get_sr_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa*. Here, sr=2 (two INT4 elements in a byte). 
+ +The function call stack for this process in llama.cpp when loading the model is as follows: +```text +llama_model_load + llama_model::load_tensors + llama_model_loader::load_all_data + ggml_backend_tensor_set + ggml_backend_cpu_kleidiai_buffer_set_tensor + ggml::cpu::kleidiai::tensor_traits::repack + kai_run_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon +``` +This process is illustrated in the diagram below. +![Figure showing RHS packing with KleidiAI alt-text#center](images/kai_kernel_packed_rhs.jpg "RHS packing with KleidiAI") + +The numerical label of an element in the diagram indicates its row and column number in the original matrix. For example, +![Figure showing Row_Col label alt-text#center](images/row_col_lable.png "Row_Col label") +indicates that the element is located at row 01, column 02 in the original matrix. This row and column number remains unchanged in the quantized and packed matrix, so that the location of the element can be tracked easily. + +Now the RHS is converted and packed into a format that can be handled by the SME2 matmul microkernel, allowing the packed RHS to be loaded into SME2 Z registers with sequential memory accesses. This improves memory access efficiency and reduces cache misses. 
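
The GGML Q4_0 block quantization described earlier on this page can be sketched in C as follows. This is a minimal illustration, not GGML's source: the type and function names (`q4_0_block_sketch`, `quantize_block_q4_0_sketch`, `dequantize_q4_0_sketch`) are hypothetical, the scale is kept as a `float` instead of FP16 to avoid conversion helpers, and the nibble order assumes that element j of a block goes to the low nibble of byte j and element j+16 to the high nibble.

```c
#include <stdint.h>

#define QK 32  /* elements per Q4_0 block */

/* Illustrative block type (assumption): GGML stores the scale as FP16;
 * a float is used here only to keep the sketch self-contained. */
typedef struct {
    float   scale;        /* per-block scale d */
    uint8_t qs[QK / 2];   /* 32 four-bit values, two per byte */
} q4_0_block_sketch;

/* Quantize one block of 32 floats. */
static void quantize_block_q4_0_sketch(const float *x, q4_0_block_sketch *out) {
    /* Find the value with the largest magnitude, keeping its sign. */
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < QK; ++i) {
        const float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) { amax = a; max = x[i]; }
    }
    /* Map that value to -8 so that the block covers the range [-8, 7]. */
    const float d  = max / -8.0f;
    const float id = (d != 0.0f) ? 1.0f / d : 0.0f;
    out->scale = d;

    for (int j = 0; j < QK / 2; ++j) {
        /* Round to nearest and add the offset of 8: unsigned nibble 0..15. */
        int q0 = (int)(x[j] * id + 8.5f);
        int q1 = (int)(x[j + QK / 2] * id + 8.5f);
        if (q0 > 15) q0 = 15;
        if (q1 > 15) q1 = 15;
        /* Element j in the low nibble, element j+16 in the high nibble. */
        out->qs[j] = (uint8_t)(q0 | (q1 << 4));
    }
}

/* Recover element i (0..31) of a block as a float. */
static float dequantize_q4_0_sketch(const q4_0_block_sketch *b, int i) {
    const uint8_t byte = b->qs[i % (QK / 2)];
    const int nibble = (i < QK / 2) ? (byte & 0x0F) : (byte >> 4);
    return (float)(nibble - 8) * b->scale;
}
```

Subtracting 8 from each nibble, as `dequantize_q4_0_sketch` does, is the step that turns the stored unsigned nibble back into a signed INT4 value; the RHS packing function has to apply an equivalent conversion while rearranging the data.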
\ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/explain_with_an_example_p2.md b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/explain_with_an_example_p2.md new file mode 100644 index 0000000000..4e4851dc64 --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/explain_with_an_example_p2.md @@ -0,0 +1,34 @@ +--- +title: Explain the SME2 matmul microkernel with an example - Part 2 +weight: 6 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Explain the SME2 matmul microkernel with an example - Part 2 +Next, the FP32 LHS (activation) needs to be quantized and packed when the llama.cpp graph runner computes the matmul nodes/operators. + +### Quantization and Packing of the LHS +Since the LHS (activation) keeps changing, we need to dynamically quantize the original FP32 matrix and pack it into the qsi8d32p1vlx4 format. This can be achieved using the *kai_run_lhs_quant_pack_qsi8d32p_f32_neon* microkernel. 
+ +The function call stack for this process in llama.cpp is as follows: +```text +llama_context::decode + llama_context::process_ubatch + llama_context::graph_compute + ggml_backend_sched_compute_splits + ggml_backend_cpu_graph_compute + ggml_graph_compute //kick off the compute thread + ggml_graph_compute_thread //the compute thread + ggml_compute_forward + ggml_cpu_extra_compute_forward + ggml::cpu::kleidiai::tensor_traits::compute_forward + ggml::cpu::kleidiai::tensor_traits::compute_forward_q4_0 + kai_run_lhs_quant_pack_qsi8d32p_f32_neon +``` +The diagram below illustrates how the LHS is quantized and packed by *kai_run_lhs_quant_pack_qsi8d32p_f32_neon*: +![Figure showing Quantization and Packing of the LHS alt-text#center](images/kai_run_lhs_quant_pack_qsi8d32p_f32_neon_for_sme2.jpg "Quantization and Packing of the LHS") + +The values of mr, nr, and kr can be obtained in the same way as described above. +The mr, nr, and kr values, together with the matrix dimensions m and k, are passed as parameters to *kai_run_lhs_quant_pack_qsi8d32p_f32_neon*. This function quantizes the FP32 LHS to signed INT8 and packs the quantized data and quantization scales as shown in the diagram above. It divides the m x k matrix into submatrices of size mr x kr (here 16 x 4), shown as blocks outlined by dashed lines in the upper matrix of the diagram, and then sequentially packs the rows within each submatrix. This allows the SME2 matmul kernel to load an entire submatrix into an SME2 Z register from contiguous memory, reducing cache misses by avoiding loads that span multiple rows. 
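
The two steps performed on the LHS, per-block symmetric INT8 quantization and tile-wise reordering, can be sketched in C as below. This is only an illustration under stated assumptions, not the KleidiAI implementation: the function names are hypothetical, m and k are assumed to be multiples of mr and kr, and the scales are returned separately, whereas the real packed format also stores the quantization scales inside the packed buffer as shown in the diagram.

```c
#include <stddef.h>
#include <stdint.h>

/* Symmetric quantization of one block of `bl` floats to INT8:
 * scale = max|x| / 127, q = round(x / scale). Returns the scale. */
static float quantize_block_i8_sketch(const float *x, size_t bl, int8_t *q) {
    float amax = 0.0f;
    for (size_t i = 0; i < bl; ++i) {
        const float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    const float scale = amax / 127.0f;
    const float inv   = (scale != 0.0f) ? 1.0f / scale : 0.0f;
    for (size_t i = 0; i < bl; ++i) {
        const float v = x[i] * inv;
        /* Round half away from zero; v is within [-127, 127]. */
        q[i] = (int8_t)(v < 0.0f ? v - 0.5f : v + 0.5f);
    }
    return scale;
}

/* Reorder a row-major [m, k] INT8 matrix into consecutive mr x kr tiles.
 * Tiles are visited along K first; inside a tile, rows are stored one
 * after another, so each tile occupies mr * kr contiguous bytes. */
static void pack_lhs_tiles_sketch(const int8_t *src, size_t m, size_t k,
                                  size_t mr, size_t kr, int8_t *dst) {
    size_t out = 0;
    for (size_t m0 = 0; m0 < m; m0 += mr)
        for (size_t k0 = 0; k0 < k; k0 += kr)
            for (size_t r = 0; r < mr; ++r)
                for (size_t c = 0; c < kr; ++c)
                    dst[out++] = src[(m0 + r) * k + (k0 + c)];
}
```

With mr=16 and kr=4, each tile written by `pack_lhs_tiles_sketch` is 64 bytes, which is exactly one 512-bit Z register, so one contiguous vector load fetches a whole 16 x 4 submatrix.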
diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/explain_with_an_example_p3.md b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/explain_with_an_example_p3.md new file mode 100644 index 0000000000..caa702f4bd --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/explain_with_an_example_p3.md @@ -0,0 +1,78 @@ +--- +title: Explain the SME2 matmul microkernel with an example - Part 3 +weight: 7 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Explain the SME2 matmul microkernel with an example - Part 3 +Once the required LHS and RHS are both ready, the *kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa* microkernel can be run. + +### Run the SME2 matmul microkernel +The operations performed to compute a 16x64 result submatrix (four 16x16 submatrices, 1VL x 4VL) are as follows: + +- Iterate over the blocks along the K dimension + - Iterate within a block with a step of kr (kr=4) + - Load one SME2 SVL (512 bits) of data from the quantized and packed LHS (containing 64 INT8 values) into one SME2 Z register + - Load two SME2 SVLs of data from the packed RHS (containing 2 x 64 x 2 INT4 values) into two SME2 Z registers, then use the SME2 LUTI4 lookup table instruction to convert these INT4 values into INT8 type, expanding them to four SME2 Z registers (4VL). + - Use the SME2 INT8 Outer Product Accumulate (MOPA) instruction to perform outer product operations between the LHS Z register and each of the four RHS Z registers, accumulating the results in four ZA tiles (which are initialized to zero). This produces the intermediate results of four 16x16 output submatrices. 
+ The first iteration is illustrated in the diagram below: +![Figure showing the first iteration of the inner loop alt-text#center](images/run_matmul_sme2_step1.jpg "The first iteration of the inner loop") + The diagram below illustrates the second iteration along the K dimension: +![Figure showing the second iteration of the inner loop alt-text#center](images/run_matmul_sme2_step2.jpg "The second iteration of the inner loop") + - After completing the iterations in the block, the intermediate INT32 results of the four 16x16 output submatrices are dequantized to FP32 with the per-block LHS and RHS scales, using Floating-point Multiply (FMUL), Floating-point Multiply and Accumulate (FMLA) and Signed fixed-point Convert to Floating-point (SCVTF) vector instructions. This produces the intermediate FP32 results of the four 16x16 output submatrices. + - Accumulate the FP32 results above + +After completing the iterations along the K dimension, the FP32 results of the four 16x16 output submatrices are ready. Then, save the results to memory. + +The code can be found [here](https://github.com/ARM-software/kleidiai/blob/main/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa_asm.S#L80). +Some comments have been added to the code to make it easier to understand. 
+```asm +KAI_ASM_LABEL(label_3) // K Loop + KAI_ASM_INST(0xc00800ff) // zero {za}: zeros the four ZA tiles (za0.s, za1.s, za2.s, za3.s) + mov x11, x4 // set block size +KAI_ASM_LABEL(label_4) // Block Loop + KAI_ASM_INST(0xa0404342) // ld1w {z2.s - z3.s}, pn8/z, [x26] // load two VLs of packed RHS data (64x2x2 INT4 values) + addvl x26, x26, #2 // increase RHS address by two VLs + ld1h {z8.h}, p0/z, [x3] // load one VL of quantized and packed LHS data (64 INT8 values) + addvl x3, x3, #1 // increase LHS address by one VL + KAI_ASM_INST(0xc08a4044) // luti4 {z4.b - z5.b}, zt0, z2[0] // use the LUTI4 instruction to convert INT4 to INT8, one source VL produces two VLs of results + KAI_ASM_INST(0xc08a4066) // luti4 {z6.b - z7.b}, zt0, z3[0] // use the LUTI4 instruction to convert INT4 to INT8, one source VL produces two VLs of results + KAI_ASM_INST(0xa0840100) // smopa za0.s, p0/m, p0/m, z8.b, z4.b // Outer Product Accumulate with the VL of LHS, the first VL of RHS and ZA0.S + KAI_ASM_INST(0xa0850101) // smopa za1.s, p0/m, p0/m, z8.b, z5.b // Outer Product Accumulate with the VL of LHS, the second VL of RHS and ZA1.S + KAI_ASM_INST(0xa0860102) // smopa za2.s, p0/m, p0/m, z8.b, z6.b // Outer Product Accumulate with the VL of LHS, the third VL of RHS and ZA2.S + KAI_ASM_INST(0xa0870103) // smopa za3.s, p0/m, p0/m, z8.b, z7.b // Outer Product Accumulate with the VL of LHS, the fourth VL of RHS and ZA3.S + + subs x11, x11, #4 // block_index - 4 + b.gt label_4 // end of block iteration? 
+ + // the code below performs per-block dequantization of the four tiles with the LHS and RHS scales + mov w12, #0 + mov x25, x24 + ld1b {z17.b}, p4/z, [x3] // lhs sum + ld1b {z16.b}, p4/z, [x3, #1, mul vl] // lhs scale + addvl x3, x3, #2 + KAI_ASM_INST(0xa040c354) // ld1w { z20.s - z23.s }, pn8/z, [x26] // rhs zp + KAI_ASM_INST(0xa041c340) // ld1w { z0.s - z3.s }, pn8/z, [x26, #4, mul vl ] // rhs scale + addvl x26, x26, #8 + pfalse p3.b +KAI_ASM_LABEL(label_5) + // the code that performs the per-block dequantization and saves the result to memory is omitted here + …… + blt label_5 + subs x10, x10, x4 // decrease the K index + b.gt label_3 // end of K loop? + +``` +In a single iteration of the block loop, four pipelined SME2 INT8 MOPA instructions perform 4,096 MAC operations (4 tiles x 16 x 16 outputs x 4 INT8 products each), calculating the intermediate results for the four 16x16 submatrices. This illustrates how SME2 MOPA can significantly improve matrix multiplication performance. + +To help understand the whole process, we map the first iteration of the LHS and RHS quantization and packing steps, as well as the SME2 outer product accumulate operation and dequantization, back to the original FP32 LHS and RHS operations. Essentially, they perform the equivalent of the operation shown below (apart from some quantization loss): +![Figure showing the original matrix representation of the first iteration alt-text#center](images/run_matmul_sme2_original_present_step1.jpg "The original matrix representation of the first iteration") + +The second iteration can be mapped back to the original FP32 LHS and RHS operations as below: +![Figure showing the original matrix representation of the second iteration alt-text#center](images/run_matmul_sme2_original_present_step2.jpg "The original matrix representation of the second iteration") + +**Note**: In this diagram, the RHS is laid out in the dimension of [N, K], which is different from the [K, N] dimension layout of the RHS in the video demonstration of 1VLx4VL. 
If you interpret the RHS in the diagrams above using the [K, N] dimension, you can match the previous video demonstration with the diagrams above. + +By repeating the submatrix computation across the M and N dimensions, the entire result matrix can be calculated. If a non-empty bias is passed to the SME2 matmul microkernel, it also adds the bias to the result matrix. diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/kai_kernel_packed_rhs.jpg b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/kai_kernel_packed_rhs.jpg new file mode 100644 index 0000000000..590e7595a0 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/kai_kernel_packed_rhs.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/kai_matmul_kernel.jpg b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/kai_matmul_kernel.jpg new file mode 100644 index 0000000000..37800a20cc Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/kai_matmul_kernel.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/kai_run_lhs_quant_pack_qsi8d32p_f32_neon_for_sme2.jpg b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/kai_run_lhs_quant_pack_qsi8d32p_f32_neon_for_sme2.jpg new file mode 100644 index 0000000000..3d9de4e05e Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/kai_run_lhs_quant_pack_qsi8d32p_f32_neon_for_sme2.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/q4_0_format.jpg 
b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/q4_0_format.jpg new file mode 100644 index 0000000000..8a8e29af17 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/q4_0_format.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/row_col_lable.png b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/row_col_lable.png new file mode 100644 index 0000000000..14497adcf7 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/row_col_lable.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_original_present_step1.jpg b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_original_present_step1.jpg new file mode 100644 index 0000000000..e4df39ef0c Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_original_present_step1.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_original_present_step2.jpg b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_original_present_step2.jpg new file mode 100644 index 0000000000..ecc96e3b58 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_original_present_step2.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_step1.jpg b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_step1.jpg new file mode 
100644 index 0000000000..2cfddfb42f Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_step1.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_step2.jpg b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_step2.jpg new file mode 100644 index 0000000000..6f18b50b39 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/run_matmul_sme2_step2.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/sme2_mopa.jpg b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/sme2_mopa.jpg new file mode 100644 index 0000000000..84c790710e Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/images/sme2_mopa.jpg differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/kai_matmul_kernel_overview.md b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/kai_matmul_kernel_overview.md new file mode 100644 index 0000000000..f7d9ebdafe --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/kai_matmul_kernel_overview.md @@ -0,0 +1,57 @@ +--- +title: How does a KleidiAI matmul microkernel perform matrix multiplication with quantized data? +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## How does a KleidiAI matmul microkernel perform matrix multiplication with quantized data? +Essentially, a KleidiAI matmul microkernel uses tile-based matrix multiplication (matmul), where small submatrices of the output are computed one by one. 
+- **mr**: number of rows of Matrix C (and Matrix A) computed at once +- **nr**: number of columns of Matrix C (and Matrix B) computed at once +- **bl**: number of elements from the K dimension processed per block at once +- **kr**: number of elements from the K dimension processed per inner step + +The video below demonstrates how matrix multiplication is carried out using this method. +![Figure showing Tile-Based matrix multiplication with KleidiAI alt-text#center](videos/matrix_tile.gif "Tile-Based matrix multiplication with KleidiAI") + +This process can be expressed with the following pseudocode: +```c +// RHS N LOOP +for(n_idx = 0; n_idx < n; n_idx+=nr){ + // LHS M LOOP + for(m_idx = 0; m_idx < m; m_idx+=mr){ + // K LOOP, break K into blocks first + blocks_in_K = k/bl; // bl is the block length + // Block Loop + for(bl_idx = 0; bl_idx < blocks_in_K; bl_idx += 1) { + // Loop inside a block + krs_in_block = bl/kr; // kr is the number of elements in K dimension per inner loop + for(k_idx = 0; k_idx < krs_in_block; k_idx += 1) { + // Perform the matrix multiplication with source submatrices of size [mr, kr] and [kr, nr] + // Accumulate the matrix multiplication result above into per block level result. + … + } + // Accumulate per block level results along K dimension. When iteration on K dimension is completed, a submatrix of size [mr, nr] of the output matrix is ready + } + // Continue computing a submatrix of size [mr, nr] of the output matrix along M dimension + } + // Continue computing a submatrix of size [mr, nr] of the output matrix along N dimension +} +``` +In general, KleidiAI matmul microkernels implement matrix multiplication in a way similar to this pseudocode. + +KleidiAI also provides corresponding packing microkernels for the matmul microkernels, so that the inputs of the matrix multiplication can be read with efficient contiguous memory accesses, reducing cache misses. + +KleidiAI supports quantized matrix multiplication to speed up AI inference on Arm CPUs. 
Instead of multiplying full precision (FP32) matrices A and B directly, it quantizes: +- The Left Hand Side (LHS, or Left Hand Matrix/activation) matrix to 8-bit integers +- The Right Hand Side (RHS, or Right Hand Matrix/weights) matrix to 4-bit or 8-bit integers + +It then packs those quantized values into memory layouts suitable for CPU vector instructions such as Dotprod, I8MM, and SME2 instructions. +Finally, it runs a microkernel that efficiently computes on the packed quantized data and scales the result back to floating point. + +This process is illustrated in the following diagram: +![Figure showing quantized matrix multiplication with KleidiAI kernels alt-text#center](images/kai_matmul_kernel.jpg "Quantized matrix multiplication with KleidiAI kernel") + +You can find more information in the learning path [Accelerate Generative AI workloads using KleidiAI](https://learn.arm.com/learning-paths/cross-platform/kleidiai-explainer/). \ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/sme2_mpoa_matmul.md b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/sme2_mpoa_matmul.md new file mode 100644 index 0000000000..6b8ec9a77c --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/sme2_mpoa_matmul.md @@ -0,0 +1,43 @@ +--- +title: How are SME2 INT8 Outer Product Accumulate instructions used in a matrix multiplication? +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## How are SME2 INT8 Outer Product Accumulate instructions used in a matrix multiplication? +The INT8 Outer Product Accumulate instructions calculate the sum of four INT8 outer products, widen the results to INT32, and destructively add them to the destination tile. 
+![Figure showing SME2 INT8 MOPA instruction alt-text#center](images/sme2_mopa.jpg "SME2 INT8 MOPA instruction") + +When the SME2 SVL is 512 bits, each input register (Zn.B, Zm.B) is treated as a matrix of 16x4 INT8 elements, as if each block of four contiguous elements were transposed. +- The first source, Zn.B, contains a 16x4 sub-matrix of 8-bit integer values. +- The second source, Zm.B, contains a 16x4 sub-matrix of 8-bit integer values. +- The INT8 MOPA instruction calculates a 16x16 widened 32-bit integer sum of outer products, which is then destructively added to the 32-bit integer destination tile, ZAda. + +The video below shows how SME2 INT8 Outer Product Accumulate instructions are used for matrix multiplication. +![Figure showing Matrix Multiplication with 1VLx1VL SME2 MOPA alt-text#center](videos/matrix_mopa_sme2_1vl.gif "Matrix Multiplication with 1VLx1VL SME2 MOPA") +To calculate the result of a 16x16 sub-matrix in matrix C (element type: INT32): + +First, +- a 16x4 sub-matrix in matrix A (element type: INT8) is loaded into an SME2 Z register, +- a 4x16 sub-matrix in matrix B (element type: INT8) is loaded into another SME2 Z register, +- a 16x16 sub-matrix in matrix C is held in an SME2 ZA tile, which is initialized to zero only once. + +Then, the SME2 INT8 MOPA instruction uses the data from these two Z registers to perform the outer product operation and accumulates the results into the ZA tile, which holds the 16x16 sub-matrix of matrix C, thus obtaining an intermediate result for this 16x16 sub-matrix. + +Iterate over the K dimension, repeatedly loading 16x4 submatrices from matrix A and 4×16 submatrices from matrix B. For each step, use the SME2 INT8 MOPA instruction to compute outer products and accumulate the results into the same ZA tile. After completing the iteration over K, this ZA tile holds the final values for the corresponding 16×16 submatrix of matrix C. Finally, store the contents of the ZA tile back to memory. 
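
The K-dimension iteration described above can be modelled in plain C as a sketch. It is a scalar model written for illustration, not SME2 code: `smopa_model` and `matmul_16x16_model` are hypothetical names, a 512-bit SVL is assumed, and matrix B is assumed to be stored N-major as [16][k] so that both operands are read as 16x4 groups.

```c
#include <stdint.h>
#include <string.h>

#define DIM 16  /* 512-bit SVL / 32-bit accumulator elements */
#define KR  4   /* INT8 values per 32-bit container */

/* Scalar model of one INT8 outer-product-accumulate step:
 * za[i][j] += sum over t of zn[i][t] * zm[j][t], widened to INT32. */
static void smopa_model(int32_t za[DIM][DIM],
                        const int8_t zn[DIM][KR],
                        const int8_t zm[DIM][KR]) {
    for (int i = 0; i < DIM; ++i)
        for (int j = 0; j < DIM; ++j)
            for (int t = 0; t < KR; ++t)
                za[i][j] += (int32_t)zn[i][t] * (int32_t)zm[j][t];
}

/* Compute one 16x16 INT32 tile of C from A stored as [16][k] and B stored
 * as [16][k] (N-major). k must be a multiple of KR. */
static void matmul_16x16_model(const int8_t *a, const int8_t *b, int k,
                               int32_t c[DIM][DIM]) {
    int8_t zn[DIM][KR], zm[DIM][KR];
    memset(c, 0, sizeof(int32_t) * DIM * DIM);  /* zero the tile only once */
    for (int k0 = 0; k0 < k; k0 += KR) {
        /* "Load" one 16x4 slice of A and of B into the two model registers. */
        for (int i = 0; i < DIM; ++i)
            for (int t = 0; t < KR; ++t) {
                zn[i][t] = a[i * k + k0 + t];
                zm[i][t] = b[i * k + k0 + t];
            }
        smopa_model(c, zn, zm);  /* accumulate into the same tile */
    }
}
```

Each call to `smopa_model` stands for a single MOPA instruction: 16 x 16 outputs, each receiving four INT8 products, which is 1,024 multiply-accumulates per instruction.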
+ +Apply the same process to all 16x16 sub-matrices in matrix C to complete the entire matrix computation. + +To improve performance, we can pipeline four MOPA instructions and fully utilize the four ZA tiles in ZA storage, with each MOPA instruction using one ZA tile. +The video below demonstrates how the four MOPA instructions are used to perform matrix multiplication of one 16x4 submatrix from matrix A and four 4x16 submatrices from matrix B in a single iteration. This approach can be referred to as 1VLx4VL. + +![Figure showing Matrix Multiplication with 1VLx4VL SME2 MOPA alt-text#center](videos/1vlx4vl_sme2_mopa.gif "Matrix Multiplication with 1VLx4VL SME2 MOPA") +The intermediate results of the four 16x16 output submatrices are held in four ZA.S tiles. + +You can find more information about SME2 MOPA here: +- [part 1 Arm Scalable Matrix Extension Introduction](https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction) +- [part 2 Arm Scalable Matrix Extension Instructions](https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2) +- [part 4 Arm SME2 Introduction](https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/part4-arm-sme2-introduction) + \ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/summary.md b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/summary.md new file mode 100644 index 0000000000..469cd9030a --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/summary.md @@ -0,0 +1,10 @@ +--- +title: Summary +weight: 8 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Summary +This learning path vividly explains how an SME2-optimized KleidiAI microkernel performs quantization and packing of 
the RHS and LHS, and how it leverages the powerful SME2 MOPA instructions to enhance matrix multiplication performance. We hope this learning path helps developers learn how to integrate the KleidiAI microkernel into their ML/AI frameworks or applications, or to design their own SME2-optimized kernels, thus fully utilizing the potential of SME2. \ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/the_sme2_matmul_microkernel.md b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/the_sme2_matmul_microkernel.md new file mode 100644 index 0000000000..6843b44fa2 --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/the_sme2_matmul_microkernel.md @@ -0,0 +1,31 @@ +--- +title: What is the sme2 1vlx4vl microkernel? +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## What is the sme2 1vlx4vl microkernel? +We use a KleidiAI microkernel, *kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa*, to explain KleidiAI SME2 microkernels in detail. It is referred to as ‘the SME2 matmul microkernel’ from here on in this learning path, unless otherwise noted. + +“_1vlx4vl” in the name indicates that, in a single inner loop iteration, it computes an intermediate result for a 1VL x 4VL submatrix (one SME2 Streaming Vector Length x four SME2 Streaming Vector Lengths) of the output matrix. Assuming the SME2 SVL is 512 bits, this is a 16 x 64 submatrix, that is, (512/32) x (4 x 512/32), where 32 is the bit width of one FP32 element. + +To improve performance, we can pipeline four MOPA instructions and fully utilize the four ZA tiles in ZA storage, with each MOPA instruction using one ZA tile. +The video below demonstrates how the four MOPA instructions are used to perform matrix multiplication of one 16x4 submatrix (1VL) from matrix A and four 4x16 submatrices from matrix B (4VL) in a single iteration. 
+![Figure showing Matrix Multiplication with 1VLx4VL SME2 MOPA alt-text#center](videos/1vlx4vl_sme2_mopa.gif "Matrix Multiplication with 1VLx4VL SME2 MOPA") +The intermediate result of the 4x16x16 output submatrix is held in four ZA.S tiles. + +“qsi8d32p1vlx4” in the name indicates that it expects the LHS with a layout of [M, K] to be symmetrically quantized into signed INT8 type within blocks of 32 elements. +The entire quantized LHS is then divided into submatrices of size 1VL × 4 (since the SME2 SVL is assumed to be 512 bits, it is 16 × 4). Each submatrix is then packed row-wise into a contiguous memory layout, and all the submatrices are packed in this way one after another, so that when the SME2 matmul microkernel reads the packed LHS, its memory accesses are to contiguous addresses, which improves cache locality. + +“qsi4c32p4vlx4” in its name indicates that the SME2 matmul microkernel expects the RHS with a layout of [N, K] to be symmetrically quantized into signed INT4 type within blocks of 32 elements. +The entire quantized RHS is then divided into submatrices of size 4VL × 4 (since the SME2 SVL is assumed to be 512 bits, it is (4×16) × 4). Each submatrix is packed row-wise into a contiguous memory layout. Since the quantization type is INT4, each byte contains two INT4 elements. In the SME2 matmul microkernel, the SME2 LUTI instructions efficiently dequantize INT4 elements into INT8 type, thereby enabling fast matrix multiplication with SME2 INT8 MOPA instructions. + +“_f32_” in its name indicates that the SME2 matmul microkernel outputs an FP32 result matrix. The INT32 result produced by SME2 INT8 MOPA instructions has to be dequantized back to FP32 type. + +Sometimes, the original LHS or RHS may not conform to the quantization and packing format requirements of the SME2 matmul microkernel. In that case, the software needs to quantize and pack the LHS and RHS appropriately first. 
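The block-wise symmetric quantization and INT4 packing described above can be sketched in plain C++ as follows. This is only an illustration under stated assumptions, not the KleidiAI or llama.cpp implementation: the names `QuantBlockI8`, `quantize_block_i8`, `pack_int4_pair`, `unpack_int4`, and `dequantize_acc` are hypothetical, the nibble order and two's-complement INT4 encoding are assumptions, and the real packed formats differ in detail.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Block length used by the "d32"/"c32" parts of the microkernel name.
constexpr std::size_t kBlockLen = 32;

// Hypothetical container: one scale plus 32 signed INT8 values.
struct QuantBlockI8 {
    float scale;
    std::array<int8_t, kBlockLen> q;
};

// Symmetric quantization of one block of 32 floats to INT8 (no zero point).
QuantBlockI8 quantize_block_i8(const float* x) {
    float amax = 0.0f;
    for (std::size_t i = 0; i < kBlockLen; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    QuantBlockI8 out{};
    out.scale = amax / 127.0f;
    const float inv = (out.scale != 0.0f) ? 1.0f / out.scale : 0.0f;
    for (std::size_t i = 0; i < kBlockLen; ++i) {
        long r = std::lround(x[i] * inv);
        r = std::max(-127L, std::min(127L, r));  // keep within the symmetric range
        out.q[i] = static_cast<int8_t>(r);
    }
    return out;
}

// Pack two signed INT4 values into one byte (assumed order: low nibble first).
uint8_t pack_int4_pair(int8_t lo, int8_t hi) {
    assert(lo >= -8 && lo <= 7 && hi >= -8 && hi <= 7);
    return static_cast<uint8_t>((lo & 0x0F) | ((hi & 0x0F) << 4));
}

// Extract one nibble and sign-extend it to INT8.
int8_t unpack_int4(uint8_t byte, bool high) {
    const int nib = high ? (byte >> 4) : (byte & 0x0F);
    return static_cast<int8_t>((nib ^ 0x08) - 0x08);
}

// Convert an INT32 accumulator back to FP32 using the two block scales.
float dequantize_acc(int32_t acc, float lhs_scale, float rhs_scale) {
    return static_cast<float>(acc) * lhs_scale * rhs_scale;
}
```

In this sketch, each block of 32 values shares one scale, so an INT32 dot product over a block is converted back to FP32 by multiplying it by the LHS and RHS scales for that block.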
+ +Next, we will use llama.cpp and the Llama-3.2-3B-Q4_0.gguf model as an example to demonstrate: +- how to quantize and pack the LHS and RHS +- how to perform matrix multiplication using the SME2 matmul microkernel \ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/videos/1vlx4vl_sme2_mopa.gif b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/videos/1vlx4vl_sme2_mopa.gif new file mode 100644 index 0000000000..012291cad0 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/videos/1vlx4vl_sme2_mopa.gif differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/videos/matrix_mopa_sme2_1vl.gif b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/videos/matrix_mopa_sme2_1vl.gif new file mode 100644 index 0000000000..e3b471a398 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/videos/matrix_mopa_sme2_1vl.gif differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/videos/matrix_tile.gif b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/videos/matrix_tile.gif new file mode 100644 index 0000000000..e91549875c Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/kai_sme2_matmul_ukernel_explained/videos/matrix_tile.gif differ