From a40b2f456f3feb192a54b72e35140c5699482e97 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Pekka=20J=C3=A4=C3=A4skel=C3=A4inen?= Date: Tue, 13 Dec 2022 13:47:12 +0200 Subject: [PATCH 1/9] cl_khr_defined_builtin_kernels First WiP draft of a defined BiKs extension. --- ext/cl_khr_defined_builtin_kernels.asciidoc | 330 +++++ ext/cl_khr_defined_builtin_kernels.html | 1288 +++++++++++++++++++ 2 files changed, 1618 insertions(+) create mode 100644 ext/cl_khr_defined_builtin_kernels.asciidoc create mode 100644 ext/cl_khr_defined_builtin_kernels.html diff --git a/ext/cl_khr_defined_builtin_kernels.asciidoc b/ext/cl_khr_defined_builtin_kernels.asciidoc new file mode 100644 index 000000000..a503b5a2c --- /dev/null +++ b/ext/cl_khr_defined_builtin_kernels.asciidoc @@ -0,0 +1,330 @@ +// Copyright 2018-2022 The Khronos Group. This work is licensed under a +// Creative Commons Attribution 4.0 International License; see +// http://creativecommons.org/licenses/by/4.0/ += cl_khr_defined_builtin_kernels = + +:source-highlighter: coderay + +[[cl_khr_defined_builtin_kernels]] +== Khronos-Defined Built-in Kernels (Early Draft) + +The purpose of this extension is to provide a standardized set of built-in +kernels with well-defined semantics useful for accelerating applications +from various domains. The extension specification is designed to rapidly +expand and "live" via the addition of new well-defined built-in kernel +definitions and the updating of previously defined ones. + +=== General Information + +==== Name Strings + +`cl_khr_defined_builtin_kernels` + +==== Version History + +[cols="1,1,3",options="header",] +|==== +| *Date* | *Version* | *Description* +| 2022-12-13 | 0.1.0 | First formulation as an extension specification, as proposed by Ben Ashbaugh. +|==== + +==== Dependencies + +This extension is written against the OpenCL Specification version 3.0.12. + +This extension requires OpenCL 1.2 or later. + +==== Contributors + +Pekka Jääskeläinen, Intel and Tampere University. +
+ +Topi Leppänen, Tampere University. + +Jan Solanti, Tampere University. + +Ben Ashbaugh, Intel. + + +=== Overview + +OpenCL 1.2 specifies a built-in kernel (BiK) as a kernel that is executed on +an OpenCL device or custom device by fixed-function hardware or in firmware. +Applications can query the built-in kernels supported by a device or custom +device. + +BiKs are referred to by a name (a C string) without any semantics attached +to the functionality. The semantics behind the name is completely device +specific, typically documented in vendor-specific extension specifications. + +The goal for this extension is to lower the bar for utilizing hardware +accelerated functions in drivers by providing a library of +well-defined BiKs with good coverage for common acceleration needs +and which is designed to easily evolve over time. + +The device drivers that implement this extension can freely choose which +subset of defined BiKs they implement and advertise to the clients. The +clients can use the BiKs to accelerate their applications by manually +invoking the BiKs. The extension is designed to also support using +automated task graph lowering tooling later. + +==== Background + +ASIC-based coarse-grained hardware accelerators are specialized logic meant to +speed up execution of workloads of interest, or to provide improvements in +energy-efficiency. Examples of contemporary workloads that are beneficially hardware +accelerated over software-based implementations include video coding, deep learning, +cryptography, software-defined radio and graphics rendering. + +FPGAs form a special case somewhere between instruction-set architectures and fixed +function hardware accelerators. While advances in high-level synthesis tools +have attempted to bridge the programmability gap between GPU and FPGA programming, +FPGAs are still considered devices on which efficient +implementations are challenging to achieve.
Due to extensive manual optimization work required for efficient +implementations of the accelerated functionality, defining FPGA designs as +a system of "hardware accelerator IPs" is still a widely used "application abstraction". +FPGAs can thus be seen as a platform that can realize and integrate any +hardware accelerator implementable with the programmable fabric. + +The means to utilize hardware accelerators have typically been +vendor-specific and abstracted behind domain-specific libraries. +The overhead with the "bunch of libraries" approach is seen in the lowest level +of integration: The libraries utilize a low-level library (typically +vendor-specific) to interface with the actual hardware, and thus do not +integrate efficiently with other libraries or software-programmable processors +that might be available on the same chip. + +==== Rationale + +OpenCL's built-in kernel abstraction allows pushing both hardware +accelerated and software defined kernels to the same command-queues, +providing a powerful means for asynchronous execution of heterogeneous +task graphs on diverse heterogeneous platforms. The ability to invoke hardware +accelerators while being able to synchronize and optimize data transfers at +the lowest levels of the driver stack can provide significant latency benefits, +especially when combined with the command-buffering mechanism. + +However, the BiK abstraction works well only when it is widely adopted by +vendors, and when multiple vendors implement the same definitions. Otherwise +each vendor specifies and implements their own BiKs closely matching their +own hardware accelerator properties, resulting in a lack of cross-vendor +portability in the API abstraction presented to the upper layers of +heterogeneous computing software stacks.
+ +This extension standardizes a set of well-defined BiKs that clients can +call from higher-level programming stacks built with different languages +and multiple libraries, possibly mixing accelerator calls with software kernel +commands, and relying on the driver stack to optimize the execution (especially +the synchronization and communication) as a low-level heterogeneous task graph. +It aims to promote the use of BiKs as a programming model for hardware accelerated +functionality, to improve cross-vendor portability of hardware accelerated computing. + +=== Modifications to section 4.2 of the OpenCL API Specification + +Modify *Table 5*, _Device Queries_, of section 4.2, by adding the following +sentences to the description cell of `CL_DEVICE_BUILT_IN_KERNELS`: + +[quote] +The semantics of the returned built-in kernels are undefined or defined in +vendor-specific documentation, unless the name starts with the prefix `khr_`, +which indicates a built-in kernel whose semantics are defined in Appendix I. + +=== Add new appendix "Appendix I - Defined Built-in Kernels" to OpenCL API Specification + +This chapter describes standard built-in kernels (BiK) with well-defined +semantics. A conformant device can report support for zero or more of the built-in +kernels via the `CL_DEVICE_BUILT_IN_KERNELS` or `CL_DEVICE_BUILT_IN_KERNELS_WITH_VERSION` device queries. + +The general client-side abstraction of the defined built-in kernels is similar to a call +to a C function whose implementation is hidden. The device driver can invoke one or +more physical hardware accelerators combined with firmware to implement the semantics +as efficiently as possible. + +It is the driver's responsibility to handle efficient synchronization and communication +with the hardware accelerator, internal accelerator state management, and resource sharing +across multiple OpenCL contexts.
+ +==== Standard Built-in Kernels ==== + +The following list of recognized built-ins is organized according to application +domain and handled data types. It is expected to grow and be updated while preserving backwards +compatibility. + +[caption="Table A.I.1. "] +.Standard Built-in Kernels and Their Semantics. *The table has been populated with a small set of non-trivial example entries, which are subject to change; the list is expected to expand during drafting.* +[cols="1,3,2,2"] +|=== +4+| *General linear algebra* +// https://netlib.org/blas/blasqr.pdf +| Name | Description | NDRange Dimensions | Arguments +| *khr_blas_gemm_float* +| xGEMM: General matrix multiplication with real single-precision floating point numbers as described in Basic Linear Algebra Subprograms. Performs C = alpha * trans(A) * trans(B) + beta * C, where A, B and C are matrices, and alpha and beta are scalars. trans() is a configurable transpose operation. +a| +[start=1] +. The height. +. The width. +a| +[start=0] +. int: transpose operation (trans) type for matrix A (0 = none, 1 = transpose, 2 = conjugate transpose) +. int: transpose type for matrix B (0 = none, 1 = transpose, 2 = conjugate transpose) +. float: scalar (alpha) to multiply the matrix multiplication result elements with +. float* (input): matrix A +. int: leading dimension of A (0 = row-major, 1 = column-major) +. float* (input): matrix B +. int: leading dimension of B (0 = row-major, 1 = column-major) +. float: scalar (beta) to multiply the C matrix elements with before adding it to the result +. float* (input & output): matrix C, which is added to the matrix multiplication result and stores the output +.
int: leading dimension of C (0 = row-major, 1 = column-major) +4+| OpenCL C Semantics +4+a| +[source,c] +---- +__kernel void __khr_blas_gemm_float( + int transA, int transB, float alpha, const global float *A, int ldA, + const global float *B, int ldB, + float beta, global float *C, int ldC) { + // TBD: An example implementation that can be used for verification + // and as a fallback SW implementation. +} +---- + +4+| *OpenVX Neural Network Extension Compatible Kernels* +// Copied from https://registry.khronos.org/OpenVX/extensions/vx_khr_nn/1.2/html/d6/d9a/group__group__cnn.html#ga69764625f436c14d739fc467515c1584 +| Name | Description | NDRange Dimensions | Arguments +| *khr_openvx_nn_extension_convolution_uchar* +| Convolution for 8-bit unsigned integer inputs and weights. +a| +[start=1] +. Batch size. +. Width. +. Height. +a| +[start=0] +. uchar* [in]: The input tensor data. The 3 lower dimensions represent a single input; all following dimensions represent the number of batches, possibly nested. The dimension order is [width, height, #IFM, #batches]. +. uchar* [in]: Weights, as a 4D tensor with dimensions [kernel_x, kernel_y, #IFM, #OFM]. +. uchar* [in]: Biases (optional, ignored if NULL). The biases, which may be shared (one per ofm) or unshared (one per ofm * output location). The possible layouts are either [#OFM] or [width, height, #OFM]. The biases' data type must match the data type of the inputs. (Kernel parameter #2) +. size_t: (dilation_x) “inflate” the kernel by inserting zeros between the kernel elements in the x direction. The value is the number of zeros to insert. +. size_t: (dilation_y) “inflate” the kernel by inserting zeros between the kernel elements in the y direction. The value is the number of zeros to insert. +. int: Rounding method for calculating output dimensions. +. int: A VX_TYPE_ENUM of the vx_convert_policy_e enumeration. +. size_t: Number of elements padded at each side in the x dimension of the input. +.
size_t: Number of elements padded at each side in the y dimension of the input. +. int: A VX_TYPE_ENUM of the vx_round_policy_e enumeration. +. uchar* [out]: The output tensor data. Output will have the same number and structure of dimensions as the input. The output tensor data type must be the same as the input's. (Kernel parameter #4) + +4+| OpenCL C Semantics +4+a| +[source,c] +---- +__kernel void __khr_openvx_nn_extension_convolution_uchar( + const uchar *input, const uchar *weights, const uchar *biases, + size_t dilation_x, size_t dilation_y, + int down_scale_rounding, int overflow_policy, size_t padding_x, size_t padding_y, + int rounding_policy, uchar *output) { + // TBD. +} +---- + +4+| *Direct Input/Output Operations* +4+| Kernels for accessing data sources and destinations directly without host involvement. +| Name | Description | NDRange Dimensions | Arguments +| *khr_io_stream_in_uchar* +| Non-blocking read of data from a sensor/stream associated with the device. +a| - +a| +[start=0] +. uchar* [out]: The data. +. size_t* [in+out]: In: Number of bytes to read. Out: Number of bytes that could be read (can be 0). (Compatible with the `cl_pocl_content_size` extension for optimizing data transfers.) + +4+| OpenCL C Semantics +4+a| +[source,c] +---- +__kernel void __khr_io_stream_in_uchar( + uchar *output, size_t *num) { + // It is not feasible to describe this kernel in OpenCL C as I/O devices + // are not representable with it. +} +---- + +| *khr_io_stream_out_uchar* +| Non-blocking write of data to an output/sink associated with the device. +a| - +a| +[start=0] +. uchar* [in]: The data to write. +. size_t* [in+out]: In: Number of bytes to write. Out: Number of bytes that could be written (can be 0). +4+| OpenCL C Semantics +4+a| +[source,c] +---- +__kernel void __khr_io_stream_out_uchar( + uchar *input, size_t *num) { + // It is not feasible to describe this kernel in OpenCL C as I/O devices + // are not representable with it.
+} +---- + +| *khr_io_stream_in_blocking_uchar* +| Blocking read of data from a sensor/stream associated with the device. +a| - +a| +[start=0] +. uchar* [out]: The data. +. size_t* [in]: How many bytes to read before returning. + +4+| OpenCL C Semantics +4+a| +[source,c] +---- +__kernel void __khr_io_stream_in_blocking_uchar(uchar *output, size_t *num) { + while (*num) { + size_t num_read = *num; + __khr_io_stream_in_uchar(output, &num_read); + *num -= num_read; + output += num_read; + } +} +---- + +|=== + +==== Launching BiKs from the Device Side ==== + +BiKs are primarily meant to be launched as kernel commands via host-side command queues. +Optionally, they can be callable from the device side via +`enqueue_kernel`. This capability can be queried on a per-BiK basis at compile time in OpenCL C by checking for macro definitions that follow the naming convention `cl_khr_bik_BUILTIN_KERNEL_NAME`. If such a macro is defined, a kernel named `__khr_BUILTIN_KERNEL_NAME()` can be enqueued by the program on the device side like a software-defined kernel. + + +=== Open questions + +. Should we enable launching BiKs from the device side without requiring device-side enqueue? The main problem is kernels with an NDRange, as they are not simple single-work-item helper functions. ++ +-- +*UNRESOLVED* + +-- + +. Should the NDRange be used at all in BiKs? It feels somewhat unnatural, as typically the NDRange is used to imply SPMD parallelism while the hardware/firmware is free to choose whatever degree of parallelism to implement the function with. On the other hand, a similar argument applies to software kernel launches, as the work-items can be executed serially if barrier semantics are adhered to. ++ +-- +*UNRESOLVED* + +-- + +. Different accelerators prefer different channel orders (NHWC vs. NCHW...) for the processed data. Should the channel order be passed as a BiK argument (like in the example GEMM's row/column order) or is it better to have different BiK variations for each?
++ +-- +*UNRESOLVED* + +-- + +. How to denote preference? Some of the BiKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with suboptimal channel order in NN accelerators) might still be beneficially accelerated. ++ +-- +*UNRESOLVED* + +-- + +. Since the defined built-in kernel concept is basically just a C-like API inside another API, should it be made more generic and thus directly usable for SYCL and Vulkan as well? ++ +-- +*UNRESOLVED* + +-- + diff --git a/ext/cl_khr_defined_builtin_kernels.html b/ext/cl_khr_defined_builtin_kernels.html new file mode 100644 index 000000000..3fda4c9dc --- /dev/null +++ b/ext/cl_khr_defined_builtin_kernels.html @@ -0,0 +1,1288 @@
From 22cfe758c59a1422e01cb2ea30f545c25a158ec4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Wed, 11 Oct 2023 14:37:56 +0300 Subject: [PATCH 2/9] * BiK -> DBK (defined built-in kernel) * Add API function for describing (and querying at the same time) a DBK. * Add API function for creating a program from described DBKs. * Add API function for creating a kernel handle for a DBK. * Change DBK description. * Replaced old DBKs with a simpler starting set for illustrating new features. * Use (not-yet-specified) tensors. * Add constraints for usage. * Add sample code for queuing a DBK. * Device-side launching needs redesign. Added note for it. --- ext/cl_khr_defined_builtin_kernels.asciidoc | 515 +++-- ext/cl_khr_defined_builtin_kernels.html | 2274 ++++++++----------- 2 files changed, 1341 insertions(+), 1448 deletions(-) diff --git a/ext/cl_khr_defined_builtin_kernels.asciidoc b/ext/cl_khr_defined_builtin_kernels.asciidoc index a503b5a2c..164baac0a 100644 --- a/ext/cl_khr_defined_builtin_kernels.asciidoc +++ b/ext/cl_khr_defined_builtin_kernels.asciidoc @@ -26,6 +26,7 @@ definitions and updating of previously defined ones. |==== | *Date* | *Version* | *Description* | 2022-12-13 | 0.1.0 | First formulation as an extension specification, as proposed by Ben Ashbaugh. +| TODO | TODO | TODO |==== ==== Dependencies @@ -34,6 +35,8 @@ This extension is written against the OpenCL Specification version 3.0.12. This extension requires OpenCL 1.2 or later. +This extension requires cl_khr_tensor (Note: unpublished draft, work in progress). + ==== Contributors Pekka Jääskeläinen, Intel and Tampere University. + @@ -43,24 +46,25 @@ Ben Ashbaugh, Intel. + === Overview -OpenCL 1.2 specifies a built-in kernel (BiK) as a kernel that is executed on +OpenCL 1.2 specifies a built-in kernel as a kernel that is executed on an OpenCL device or custom device by fixed-function hardware or in firmware.
Applications can query the built-in kernels supported by a device or custom device. -BiKs are referred to by a name (a C string) without any semantics attached -to the functionality. The semantics behind the name is completely device -specific, typically documented in vendor-specific extension specifications. +Built-in kernels are referred to by a name (a C string) without any +semantics attached to the functionality. The semantics behind the name +are completely device specific, typically documented in vendor-specific +extension specifications. The goal for this extension is to lower the bar for utilizing hardware accelerated functions in drivers by providing a library of -well-defined BiKs with good coverage for common acceleration needs +well-defined built-in kernels with good coverage for common acceleration needs and which is designed to easily evolve over time. The device drivers that implement this extension can freely choose which -subset of defined BiKs they implement and advertise to the clients. The -clients can use the BiKs to accelerate their applications by manually -executing invoking the BiKs. The extension is designed to also support using +subset of DBKs they implement and advertise to the clients. The +clients can use the DBKs to accelerate their applications by manually +invoking the DBKs. The extension is designed to also support using automated task graph lowering tooling later. ==== Background @@ -99,222 +103,414 @@ accelerators while being able to synchronize and optimize data transfers at the lowest levels of the driver stack can provide significant latency benefits, especially when combined with the command-buffering mechanism. -However, the BiK abstraction works well only when it is widely adopted by +However, the built-in kernel abstraction works well only when it is widely adopted by vendors, and when multiple vendors implement the same definitions. 
Otherwise -each vendor specifies and implements their own BiKs closely matching their +each vendor specifies and implements their own built-in kernels closely matching their own hardware accelerator properties, resulting in lack of cross-vendor portability in the API abstraction presented to the upper layers of heterogeneous computing software stacks. -This extension standardizes a set of well-defined BiKs the clients can +This extension standardizes a set of well-defined built-in kernels the clients can call from higher level programming stacks built with different languages and multiple libraries, possibly mix accelerator calls with calls to software kernel commands, and rely on the driver stack to optimize the execution (especially the synchronization and communication) as a low level heterogeneous task graph. -It aims to promote the use of BiKs as a programming model for hardware accelerated +It aims to promote the use of built-in kernels as a programming model for hardware accelerated functionality, to improve cross-vendor portability of hardware accelerated computing. -=== Modifications to section 4.2 of the OpenCL API Specification -Modify *Table 5*, _Device Queries_, of section 4.2, by adding the following -sentences to the description cell of `CL_DEVICE_BUILT_IN_KERNELS`: +=== Add new section X.Y.Z Querying Defined Built-in Kernels + +To request a defined built-in kernel to be executed in the given +devices use: + +[source,c] +---- +cl_dbk_descriptor clCreateDefinedBuiltInKernelDescriptor( + cl_context context, + cl_uint num_devices, + const cl_device_id* device_list, + cl_dbk_name kernel_name, + const void *kernel_attributes, + const cl_dbk_mode_properties* kernel_config, + cl_int *errcode_ret); +---- + +* _context_ must be a valid OpenCL context. + +* _num_devices_ is the number of devices listed in + device_list. _num_devices_ must be non-zero. + +* _device_list_ is a pointer to a list of devices that are in + context. 
_device_list_ must be a non-NULL value. The defined built-in kernels + are loaded for devices specified in this list. + +* _kernel_name_ is the name of the defined built-in kernel listed in Appendix I. + +* _kernel_attributes_ is a pointer to the structure declared in + the description of the kernel in Appendix I. The structure holds + the kernel's attributes. + +* _kernel_config_ is a pointer to a list of defined built-in + kernel mode properties. The supported mode properties are listed in + the DBK's entry with default settings in Appendix I. It is valid to set + this argument to NULL in which case the default properties apply (if + any). + +*clCreateDefinedBuiltInKernelDescriptor* returns a valid kernel +descriptor on success indicated by _errcode_ret_ which is set to +CL_SUCCESS. Otherwise, the returned object is NULL and the +_errcode_ret_ is set to one of the following codes: + +* CL_DBK_INVALID_ATTRIBUTE if one or more kernel attributes violate + the conditions described in the defined built-in kernel's entry in Appendix I. + +* CL_DBK_UNAVAILABLE if the kernel attributes are valid but the + kernel is not supported on one of the devices. + +* CL_DBK_UNSUPPORTED_MODE_PROPERTY if _kernel_config_ includes + at least one property not listed in the DBK's entry. + +* CL_DBK_UNMET_MAX_RELATIVE_ERROR if the DBK is available but does not + meet the requested constraint set by the + CL_DBK_PROPERTY_MAX_RELATIVE_ERROR property. + +* TODO: other error cases. + + +[cols="2,1,2",stripes=odd] |=== | *DBK Mode Property* | *Property Value* | *Description* + | CL_DBK_PROPERTY_MAX_RELATIVE_ERROR | float + a| Request a DBK whose maximum relative error is bounded by the given value measured in ULPs. + | CL_DBK_PROPERTY_NON_DETERMINISTIC | cl_bool + a| Allow results of the kernel to be non-reproducible. This allows the implementation to switch the algorithm of the kernel on each launch for possibly better performance. 
+// Idea from https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking + +| ... | - +a| +Ideas: + +* Accumulation with saturation. +* Finite math only. +* Flush denormals to zero. +* Data layout preferences (NHWC for convolution). +|=== + +=== Add new function to 5.8.1 Creating Program Objects + +To create a program with a set of defined built-in kernels use: + +[source,c] +---- +cl_program clCreateProgramWithDefinedKernels( + cl_context context, + size_t num_kernel_desc, + const void* kernel_desc_list, + cl_int* errcode_ret); +---- + +* _context_ must be a valid OpenCL context. + +* _num_kernel_desc_ is the number of kernel descriptors. + +* _kernel_desc_list_ is an array of valid + cl_dbk_descriptor objects. The array length must be at + least _num_kernel_desc_. The kernel descriptors must be created on + the same context. + +*clCreateProgramWithDefinedKernels* returns a valid program on success +indicated by _errcode_ret_ which is set to CL_SUCCESS. Otherwise, the +returned object is NULL and the _errcode_ret_ is set to one of the +following codes: + +* TODO. + +=== Add new function to 5.9.1 Creating Kernel Objects + +To get a kernel handle for a defined built-in kernel in a program use: + +[source,c] +---- +cl_kernel clCreateDefinedBuiltInKernel( + cl_program program, + cl_dbk_descriptor kernel_desc, + cl_int* errcode_ret); +---- + +* _program_ is a program object with a successfully built executable. + +* _kernel_desc_ is a defined built-in kernel descriptor in the program. + +* _errcode_ret_ will return an appropriate error code. If errcode_ret is + NULL, no error code is returned. + +*clCreateDefinedBuiltInKernel* returns a valid non-zero kernel object + and errcode_ret is set to CL_SUCCESS if the kernel object is created + successfully. Otherwise, it returns a NULL value with one of the + following error values returned in _errcode_ret_: + +* TODO. 
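The CL_DBK_PROPERTY_MAX_RELATIVE_ERROR property above bounds error in ULPs (units in the last place). As a hedged host-side sketch of what checking such a bound against a reference result could look like for binary32 values (the helper names are illustrative and not part of this extension; IEEE 754 single precision and finite inputs are assumed):

```c
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <assert.h>

/* Map a float's bit pattern to a monotonically increasing integer scale so
   that subtracting two mapped values yields their distance in ULPs.
   Negative floats are stored in sign-magnitude form, so they are mirrored
   below zero to keep the scale monotonic. */
static int64_t float_to_ordered(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (u & 0x80000000u) ? -(int64_t)(u & 0x7FFFFFFFu) : (int64_t)u;
}

/* Distance in ULPs between two finite floats (0 for +0.0 vs -0.0). */
static int64_t ulp_distance(float a, float b) {
    int64_t d = float_to_ordered(a) - float_to_ordered(b);
    return d < 0 ? -d : d;
}

/* Non-zero if 'actual' is within 'max_ulps' ULPs of 'reference'. */
static int within_max_relative_error(float reference, float actual,
                                     int64_t max_ulps) {
    return ulp_distance(reference, actual) <= max_ulps;
}
```

A client could run a DBK and a trusted software fallback on the same inputs and compare the outputs element-wise with such a helper to validate the requested bound.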
-[quote] -The semantics of the returned built-in kernels are undefined or defined in -vendor-specific documentation, unless the name starts with prefix `khr_', -which means it's a built-in kernel with semantics defined in Appendix I. === Add new appendix "Appendix I - Defined Built-in Kernels" to OpenCL API Specification -This chapter describes standard built-in kernels (BiK) with well-defined -semantics. A conformant device can report to support zero or more of the built-in -kernels via `CL_DEVICE_BUILT_IN_KERNELS` or `CL_DEVICE_BUILT_IN_KERNELS_WITH_VERSION` device queries. +This chapter describes standard defined built-in kernels (DBK) with +well-defined semantics. A conformant device can report the +availability of the built-in kernels listed in this section via a +`clCreateDefinedBuiltInKernelDescriptor` call. The availability of a +DBK is determined from the arguments passed to the +`clCreateDefinedBuiltInKernelDescriptor` and the unavailability of a DBK +is indicated by the CL_DBK_UNAVAILABLE error code. -The general client-side abstraction of the defined built-in kernels is similar to a call -to a C function of which implementation is hidden. The device driver can invoke one or -more physical hardware accelerators combined with firmware to implement the semantics -as efficiently as possible. +The general client-side abstraction of the DBKs is similar to a call +to a C function whose implementation is hidden. The device driver +can invoke one or more physical hardware accelerators combined with +firmware to implement the semantics as efficiently as possible. It is the driver's responsibility to handle efficient synchronization and communication to the hardware accelerator, the internal accelerator state management and resource sharing across multiple OpenCL contexts. 
-==== Standard Built-in Kernels ==== +Identical DBKs with identical inputs are not guaranteed to produce +identical results: + +* across vendors, -The following list of recognized built-ins is organized according to their application -domain and handled data types. It is expected to grow and update while preserving backwards -compatibility. +* across driver versions and + +* across devices. + +Otherwise, identical results are produced unless: + +* otherwise stated in DBK's description or + +* the DBK has CL_DBK_PROPERTY_NON_DETERMINISTIC property set to true. + +Two DBKs are considered identical if their descriptors are created +using identical kernel name, kernel attribute and kernel mode property +arguments. + +==== Standard Defined Built-in Kernels ==== + +The following is the list of recognized defined built-in kernels. It is +expected to grow and be updated while preserving backwards compatibility. + +Each defined built-in kernel entry is organized as follows: + +* *Name*: Name of the defined built-in kernel (an enumeration). + +* *Kernel attributes*: The kernel attributes required for creating the + defined built-in kernel via + clCreateDefinedBuiltInKernelDescriptor. Attribute values are + immutable. + +* *Kernel arguments*: The kernel arguments. + +* *Description*: The description of the kernel in detail. + +* *Attribute validation rules*: Conditions on the kernel attributes for + the kernel. The implementation must return CL_DBK_INVALID_ATTRIBUTE on a + clCreateDefinedBuiltInKernelDescriptor call if any of the conditions + are violated. + +* *Kernel mode properties*: List of kernel mode + properties (cl_dbk_mode_properties) the kernel recognizes. The + properties can be used to tweak certain implementation details and + behaviors in the kernel execution. If a property not listed in the + DBK entry is fed to a clCreateDefinedBuiltInKernelDescriptor call, + then the implementation must return CL_DBK_UNSUPPORTED_MODE_PROPERTY. [caption="Table A.I.1. 
"] .Standard Built-in Kernels and Their Semantics. *The table has been populated with a small set of non-trivial example entries which are subject to change and the list to expand during drafting.* -[cols="1,3,2,2"] |=== -4+| *General linear algebra* -// https://netlib.org/blas/blasqr.pdf -| Name | Description | NDRange Dimensions | Arguments -| *khr_blas_gemm_float* -| xGEMM: General matrix multiplication with real single precision floating point numbers as described in Basic Linear Algebra Subprograms. Performs C = alpha * trans(A) * trans(B) + beta*C, where A, B and C are matrices, and alpha and beta scalars. trans() is a configurable transpose operation. -a| -[start=1] -. The height. -. The width. +| Name: *khr_matmul_v1* +| *Kernel Attributes* a| -[start=0] -. int: transpose operation (trans) type for matrix A (0 = none, 1 = transpose, 2 = conjugate transpose) -. int: transpose type for matrix B (0 = none, 1 = transpose, 2 = conjugate transpose) -. float: scalar (alpha) to multiply the matrix multiplication result elements with -. float* (input): matrix A -. int: leading dimension of A (0 = row-major, 1 = column-major) -. float* (input): matrix B -. int: leading dimension of B (0 = row-major, 1 = column-major) -. float: scalar (beta) to multiply the C matrix elements with before adding it to the result -. float* (input&output): matrix C which is added to the matrix multiplication result, and stores the output -. int: leading dimension of C (0 = row-major, 1 = column-major) -4+| OpenCL C Semantics -4+a| -[source,c] ----- -__kernel void __khr_blas_gemm_float( - int transA, int transB, float alpha, const global float *A, int ldA, - const global float *B, int ldB, - float beta, global float *C, int ldC) { - // TBD: An example implementation that can be used for verification - // and as a fallback SW implementation. 
-} ---- -4+| *OpenVX Neural Network Extension Compatible Kernels* -// Copied from https://registry.khronos.org/OpenVX/extensions/vx_khr_nn/1.2/html/d6/d9a/group__group__cnn.html#ga69764625f436c14d739fc467515c1584 -| Name | Description | NDRange Dimensions | Arguments -| *khr_openvx_nn_extension_convolution_uchar* -| Convolution for 8bit unsigned integer inputs and weights. +Fields of the `cl_dkb_attributes_matmul_v1` structure: + +. cl_tensor_desc_t A: Tensor description for input matrix A. +. cl_tensor_desc_t B: Tensor description for input matrix B. +. cl_tensor_desc_t R: Tensor description for output matrix R. +. cl_int transposeA: Non-zero transposes matrix A. +. cl_int transposeB: Non-zero transposes matrix B. +| *Kernel Arguments* a| -[start=1] -. Batch size. -. Width. -. Height. +. cl_tensor_t A: Matrix A (read only). +. cl_tensor_t B: Matrix B (read only). +. cl_tensor_t R: Output matrix (write only). +| *Description* a| -[start=0] -. uchar* [in]: The input tensor data. 3 lower dimensions represent a single input, all following dimensions represent number of batches, possibly nested. The dimension order is [width, height, #IFM, #batches]. -. uchar* [in]: Weights, as a 4d tensor with dimensions [kernel_x, kernel_y, #IFM, #OFM]. -. uchar* [in]: Biases (optional, ignored if NULL). The biases, which may be shared (one per ofm) or unshared (one per ofm * output location). The possible layouts are either [#OFM] or [width, height, #OFM]. Biases data type must match the data type of the inputs. (Kernel parameter #2) -. size_t: (dilation_x) “inflate” the kernel by inserting zeros between the kernel elements in the x direction. The value is the number of zeros to insert. -. size_t: (dilation_y) “inflate” the kernel by inserting zeros between the kernel elements in the y direction. The value is the number of zeros to insert. -. int: Rounding method for calculating output dimensions. -. int: A VX_TYPE_ENUM of the vx_convert_policy_e enumeration. -. 
size_t: Number of elements padded at each side in the x dimension of the input. -. size_t: Number of elements padded at each side in the y dimension of the input. -. int: A VX_TYPE_ENUM of the vx_round_policy_e enumeration. -. uchar* [out]: The output tensor data. Output will have the same number and structure of dimensions as input. Output tensor data type must be same as the inputs. (Kernel parameter #4) - -4+| OpenCL C Semantics -4+a| -[source,c] ---- -__kernel void __khr_openvx_nn_extension_convolution_uchar( - const uchar *input, const uchar *weights, const uchar *biases, - size_t dilation_x, size_t dilation_y, - int down_scale_rounding, int overflow_policy, size_t padding_x, size_t padding_y, - int rounding_policy, uchar *output) { - // TBD. -} ---- +Performs (batched) matrix multiplication: `R = trans(A) * trans(B)`, +where `A`, `B` and `R` are tensors with at least rank two. The +`trans()` is a configurable transpose operation. + +The last two dimensions of the tensors are treated as operands to the +matrix multiplication and the rest of the dimensions are treated as batch +dimensions. + +Operations of the matrix multiplication are performed in the precision +of the `elementof\(R)`. -4+| *Direct Input/Output Operations* -4+| Kernels for accessing data sources and destinations directly without host involvement. -| Name | Description | NDRange Dimensions | Arguments -| *khr_io_stream_in_uchar* -| Non-blocking read of data from a sensor/stream associated with the device. -a| - +If an overflow occurs in the accumulation of the products, then `R` +tensor's result will be undefined. + +| *Attribute validation rules* a| -[start=0] -. uchar* [out]: The data. -. size_t* [in+out]: In: number of bytes to read. Out: Number of bytes that could be read (can be 0). (Compatible with the `cl_pocl_content_size` extension to optimize data transfers with.) 
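To make the khr_matmul_v1 semantics above concrete, the following is a hedged scalar reference for a single (non-batched) pair of float matrices. The flat row-major pointer interface is an illustrative assumption: the actual DBK consumes opaque tensor objects whose layout the implementation chooses.

```c
#include <stddef.h>
#include <assert.h>

/* Reference semantics of R = trans(A) * trans(B) for one matrix pair.
   With transA == 0, A is stored m x k; with transA non-zero, A is stored
   k x m and read transposed. Likewise B is k x n, or n x k when transB
   is non-zero. R is always m x n. All matrices are row-major. */
static void matmul_ref(const float *A, const float *B, float *R,
                       size_t m, size_t n, size_t k,
                       int transA, int transB) {
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f; /* accumulate in the element type of R */
            for (size_t p = 0; p < k; ++p) {
                float a = transA ? A[p * m + i] : A[i * k + p];
                float b = transB ? B[j * k + p] : B[p * n + j];
                acc += a * b;
            }
            R[i * n + j] = acc;
        }
}
```

A batched invocation would simply loop this routine over the leading (batch) dimensions, since those are defined above to be independent.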
-4+| OpenCL C Semantics -4+a| -[source,c] ---- -__kernel void __khr_io_stream_in_uchar( - uchar *output, size_t *num) { - // It is not feasible to describe this kernel in OpenCL C as I/O devices - // are not representable with it. -} ---- +* `rankof(A) == rankof(B) >= 2`. +* Let `shapeof(A~t~) == (b..., m, k)` and `shapeof(B~t~) == (b..., k, + n)` of tensors `A` and `B`, respectively, after possible transposing. + `shapeof\(R)` must be `(b..., m, n)`. +* `elementof(A) == elementof(B)` +* `elemkindof\(R) == elemkindof(A)` +* `elementof\(R) == elementof(A)` or `elementof(A)` is promotable to + `elementof\(R)` without loss of meaning. +// E.g. cl_int -> cl_uint: loses negative values +| *Kernel mode properties* +a| +This DBK accepts the following properties: -| *khr_io_stream_out_uchar* -| Non-blocking write of data to an output/sink associated with the device. -| - +* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR: Unset property defaults to positive infinity. +| +| Name: *khr_leaky_relu_v1* +| *Kernel Attributes* a| -[start=0] -. uchar* [in]: The data to write. -. size_t* [in+out]: In: Number of bytes to write. Out: Number of bytes that could be written (can be 0). -4+| OpenCL C Semantics -4+a| -[source,c] ---- -__kernel void __khr_io_stream_out_uchar( - uchar *input, size_t *num) { - // It is not feasible to describe this kernel in OpenCL C as I/O devices - // are not representable with it. -} ---- +Fields of the `cl_dbk_leaky_relu_v1` structure: +. cl_tensor_desc_t in: Input tensor description. +. cl_tensor_desc_t out: Output tensor description. +. cl_float alpha: Coefficient of leakage. +| *Kernel arguments* +a| +. cl_tensor_t in: The input tensor. +. cl_tensor_t out: The output tensor. +| *Description* +a| +Applies the operation `alpha * x if x < 0 else x` on all +elements of the `in` tensor. + +If the target device does not support denormals, then `alpha` is flushed +to zero before the operation is applied. 
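The khr_leaky_relu_v1 semantics above can be illustrated with a plain-C reference over a flattened tensor. This is a hedged sketch for illustration only: the DBK itself operates on opaque cl_tensor_t objects, and the flat-array form and function name below are assumptions.

```c
#include <stddef.h>
#include <assert.h>

/* Reference semantics of khr_leaky_relu_v1 on a flattened tensor of n
   elements: out[i] = alpha * in[i] when in[i] < 0, else in[i]. */
static void leaky_relu_ref(const float *in, float *out, size_t n,
                           float alpha) {
    for (size_t i = 0; i < n; ++i)
        out[i] = (in[i] < 0.0f) ? alpha * in[i] : in[i];
}
```

Per the attribute validation rules, a conforming descriptor would additionally require the input and output shapes and element types to match and alpha to be finite.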
-| *khr_io_stream_in_blocking_uchar* -| Blocking read of data from a sensor/stream associated with the device. -a| - +| *Kernel mode properties* +| N/A +| *Attribute validation rules* a| -[start=0] -. uchar* [out]: The data. -* size_t* [in]: How many bytes to read before returning. +* `shapeof(in) == shapeof(out)` +* `elementof(in) == elementof(out)` +* `alpha` must be a finite value. +|=== + +==== Launching DBKs from the Device Side ==== + +DBKs are primarily meant to be launched as kernel commands via +host-side command queues. Optionally, they can be callable from +device-side via `enqueue_kernel`: + +TBC. This probably needs device-side function corresponding to +clCreateDefinedBuiltInKernelDescriptor. + +==== Sample Code ==== -4+| OpenCL C Semantics -4+a| [source,c] ---- -__kernel void __khr_io_stream_in_blocking_uchar(uchar *output, size_t *num) { - while (*num) { - size_t num_read = *num; - __khr_io_stream_in_uchar(output, &num_read); - num -= num_read; - output += num_read; - } +// TBD. Similarly in cl_qcom_ml_ops, tensors have type +// (cl_channel_type) and a number of dimensions (rank) and dimension +// sizes (shape). Difference over the cl_qcom_ml_ops is that the rank is +// "unlimited". +cl_tensor_desc_t lhs_tensor_desc = TBD; +cl_tensor_desc_t rhs_tensor_desc = TBD; +cl_tensor_desc_t res_tensor_desc = TBD; + +cl_dkb_attributes_matmul_v1 matmul_attrs = { + lhs_tensor_desc, rhs_tensor_desc, res_tensor_desc, + 1, 0 // = Transpose lhs tensor } ----- -|=== - -==== Launching BiKs from the Device Side ==== - -BiKs are primarily meant to be launched as kernel commands via host-side command queues. -Optionally, they can be callable from device-side via -`enqueue_kernel`: This capability can be queried on per BiK basis at compile-time in OpenCL C by checking for macro definitions which has the following naming convention: `cl_khr_bik_BUILTIN_KERNEL_NAME`. 
In case a BiK macro is defined, a kernel with a naming convention `__khr_BUILTIN_KERNEL_NAME()` can be enqueued by the program at device side as software-defined kernels. +cl_dbk_mode_properties matmul_props = { + // Request a matmul implementation that meets this precision. + CL_DBK_PROPERTY_MAX_RELATIVE_ERROR, 100, // in ULPs. +}; +cl_int err; +std::vector<cl_dbk_descriptor> kernel_descriptions; +cl_dbk_descriptor matmul_desc = + clCreateDefinedBuiltInKernelDescriptor( + context, num_devices, device_list, + CL_DBK_MATMUL_V1, &matmul_attrs, &matmul_props, &err); + +if (err == CL_DBK_INVALID_ATTRIBUTE) { + // One or more kernel attributes violate the kernel's validation rules. +} else if (err == CL_DBK_UNAVAILABLE) { + // Kernel attributes are valid but the kernel is not supported in at least + // one of the devices. +} else if (err == CL_DBK_UNMET_MAX_RELATIVE_ERROR) { + // E.g. Kernel is supported but is not precise enough. +} else if (err == CL_DBK_UNSUPPORTED_MODE_PROPERTY) { + // cl_dbk_mode_properties has a property not listed in the description of the + // defined built-in kernel. +} else + kernel_descriptions.push_back(matmul_desc); + +... + +cl_program dbk_lib = clCreateProgramWithDefinedKernels( + context, kernel_descriptions.size(), kernel_descriptions.data(), &err); + +... + +cl_kernel matmul_kernel = clCreateDefinedBuiltInKernel( + dbk_lib, matmul_desc, &err); + +// TBD: allocate space for tensors. Perhaps like cl_qcom_ml_ops: query +// tensor sizes after the final program has been created or after +// command buffer (with DBKs within) is finalized. Implementation +// determines the optimal data layout (opaque to the application) for +// the tensors based on their usage. Application uses the tensor +// sizes to create cl_mem buffers which are bound to the tensors. 
+cl_tensor_t lhs_tensor = TBD; +cl_tensor_t rhs_tensor = TBD; +cl_tensor_t res_tensor = TBD; + +// Transfer data to input tensors + +clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor_t), &lhs_tensor); +clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor_t), &rhs_tensor); +clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor_t), &res_tensor); + +clEnqueueNDRangeKernel(cmd_q, matmul_kernel, 0, NULL, NULL, NULL, 0, NULL, NULL); +---- === Open questions -. Should we enable launching BiKs from the device side without requiring device-side enqueue? The main problem is those with NDRange as they are not simple single-WI helper functions. +. Should we enable launching DBKs from the device side without requiring device-side enqueue? The main problem is those with NDRange as they are not simple single-WI helper functions. + -- *UNRESOLVED* -- -. Should the NDRange be used at all in BiKs? It feels sort of unnatural as typically the NDRange is used to imply SPMD parallelism while the hardware/firmware is free to choose whatever parallelism degree to implement the function. On the other hand, similar applies to software kernel launches as the work-items can be executed serially if adhering to barrier semantics. +. Should the NDRange be used at all in DBKs? It feels sort of unnatural as typically the NDRange is used to imply SPMD parallelism while the hardware/firmware is free to choose whatever parallelism degree to implement the function. On the other hand, similar applies to software kernel launches as the work-items can be executed serially if adhering to barrier semantics. + -- *UNRESOLVED* -- -. Different accelerators prefer different channel orders (NHWC vs. NCHW...) for the processed data. Should the channel order be passed as a BiK argument (like in the example GEMM's row/column order) or is it better to have different BiK variations for each? +. Different accelerators prefer different channel orders (NHWC vs. NCHW...) for the processed data. 
Should the channel order be passed as a DBK argument (like in the example GEMM's row/column order) or is it better to have different DBK variations for each? + -- *UNRESOLVED* -- -. How to denote preference? Some of the BiKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with unoptimal channel order in NN accelerators) might be still beneficially accelerated. +. How to denote preference? Some of the DBKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with unoptimal channel order in NN accelerators) might be still beneficially accelerated. + -- *UNRESOLVED* @@ -327,4 +523,3 @@ Optionally, they can be callable from device-side via *UNRESOLVED* -- - diff --git a/ext/cl_khr_defined_builtin_kernels.html b/ext/cl_khr_defined_builtin_kernels.html index 3fda4c9dc..fc5837a44 100644 --- a/ext/cl_khr_defined_builtin_kernels.html +++ b/ext/cl_khr_defined_builtin_kernels.html @@ -1,1288 +1,986 @@ - - - - - - -cl_khr_defined_builtin_kernels - - - - - -
-
-

Khronos-Defined Built-in Kernels (Early Draft)

-
-

The purpose of this extension is to provide a standardized set of built-in -kernels with well-defined semantics useful for accelerating applications -from various domains. The extension specification is designed to rapidly -expand and "live" via addition of new well-defined built-in kernel -definitions and updating of previously defined ones.

-
-

General Information

-
-

Name Strings

-

cl_khr_defined_builtin_kernels

-
-
-

Version History

-
- ---- - - - - - - - - - - - - - -
Date Version Description

2022-12-13

0.1.0

First formulation as an extension specification like proposed by Ben Ashbaugh.

-
-
-
-

Dependencies

-

This extension is written against the OpenCL Specification version 3.0.12.

-

This extension requires OpenCL 1.2 or later.

-
-
-

Contributors

-

Pekka Jääskeläinen, Intel and Tampere University.
-Topi Leppänen, Tampere University.
-Jan Solanti, Tampere University.
-Ben Ashbaugh, Intel.

-
-
-
-

Overview

-

OpenCL 1.2 specifies a built-in kernel (BiK) as a kernel that is executed on -an OpenCL device or custom device by fixed-function hardware or in firmware. -Applications can query the built-in kernels supported by a device or custom -device.

-

BiKs are referred to by a name (a C string) without any semantics attached -to the functionality. The semantics behind the name is completely device -specific, typically documented in vendor-specific extension specifications.

-

The goal for this extension is to lower the bar for utilizing hardware -accelerated functions in drivers by providing a library of -well-defined BiKs with good coverage for common acceleration needs -and which is designed to easily evolve over time.

-

The device drivers that implement this extension can freely choose which -subset of defined BiKs they implement and advertise to the clients. The -clients can use the BiKs to accelerate their applications by manually -executing invoking the BiKs. The extension is designed to also support using -automated task graph lowering tooling later.

-
-

Background

-

ASIC-based coarse-grained hardware accelerators are specialized logic meant to -speed up execution of workloads of interest, or to provide improvements in -energy-efficiency. Examples of contemporary workloads that are beneficially hardware -accelerated over software-based implementations include video coding, deep learning, -cryptography, software-defined radio and graphics rendering.

-

FPGAs form a special case somewhere between instruction-set architectures and fixed -function hardware accelerators. While advances in high-level synthesis tools -have attempted to bridge the programmability gap between GPU and FPGA programming, -FPGAs are still considered as devices which are challenging to achieve efficient -implementations with. Due to extensive manual optimization work required for efficient -implementations of the accelerated functionality, defining FPGA designs as -a system of "hardware accelerator IPs" is still a widely used "application abstraction". -FPGAs can be thus seen as a platform that can realize and integrate any -hardware accelerator implementable with the programmable fabric.

-

The means to utilize hardware accelerators have typically been -vendor-specific and abstracted behind domain-specific libraries. -The overhead with the "bunch of libraries"-approach is seen in the lowest level -of integration: The libraries utilize a low level library (typically -vendor-specific) to interface with the actual hardware, and thus does not -integrate efficiently with other libraries or software-programmable processors -that might be available on the same chip.

-
-
-

Rationale

-

OpenCL’s built-in kernel abstraction allows pushing both hardware -accelerated and software defined kernels to the same command-queues, -providing a powerful means for asynchronous execution of heterogeneous -task graphs on diverse heterogeneous platforms. The ability to invoke hardware -accelerators while being able to synchronize and optimize data transfers at -the lowest levels of the driver stack can provide significant latency benefits, -especially when combined with the command-buffering mechanism.

-

However, the BiK abstraction works well only when it is widely adopted by -vendors, and when multiple vendors implement the same definitions. Otherwise -each vendor specifies and implements their own BiKs closely matching their -own hardware accelerator properties, resulting in lack of cross-vendor -portability in the API abstraction presented to the upper layers of -heterogeneous computing software stacks.

-

This extension standardizes a set of well-defined BiKs the clients can -call from higher level programming stacks built with different languages -and multiple libraries, possibly mix accelerator calls with calls to software kernel -commands, and rely on the driver stack to optimize the execution (especially -the synchronization and communication) as a low level heterogeneous task graph. -It aims to promote the use of BiKs as a programming model for hardware accelerated -functionality, to improve cross-vendor portability of hardware accelerated computing.

-
-
-
-

Modifications to section 4.2 of the OpenCL API Specification

-

Modify Table 5, Device Queries, of section 4.2, by adding the following -sentences to the description cell of CL_DEVICE_BUILT_IN_KERNELS:

-
-
The semantics of the returned built-in kernels are undefined or defined in -vendor-specific documentation, unless the name starts with prefix ‘khr_’, -which means it’s a built-in kernel with semantics defined in Appendix I.
-
-
-
-
-

Add new appendix "Appendix I - Defined Built-in Kernels" to OpenCL API Specification

-

This chapter describes standard built-in kernels (BiK) with well-defined -semantics. A conformant device can report to support zero or more of the built-in -kernels via CL_DEVICE_BUILT_IN_KERNELS or CL_DEVICE_BUILT_IN_KERNELS_WITH_VERSION device queries.

-

The general client-side abstraction of the defined built-in kernels is similar to a call to a C function whose implementation is hidden. The device driver can invoke one or more physical hardware accelerators combined with firmware to implement the semantics as efficiently as possible.

It is the driver's responsibility to handle efficient synchronization and communication to the hardware accelerator, the internal accelerator state management, and resource sharing across multiple OpenCL contexts.

Standard Built-in Kernels


The following list of recognized built-ins is organized according to their application domain and handled data types. It is expected to grow and update while preserving backwards compatibility.
Table A.I.1. Standard Built-in Kernels and Their Semantics. The table has been populated with a small set of non-trivial example entries which are subject to change and the list to expand during drafting.

General linear algebra

Name: khr_blas_gemm_float

Description: xGEMM: General matrix multiplication with real single precision floating point numbers as described in Basic Linear Algebra Subprograms. Performs C = alpha * trans(A) * trans(B) + beta * C, where A, B and C are matrices, and alpha and beta scalars. trans() is a configurable transpose operation.

NDRange Dimensions:

1. The height.
2. The width.

Arguments:

1. int: transpose operation (trans) type for matrix A (0 = none, 1 = transpose, 2 = conjugate transpose)
2. int: transpose type for matrix B (0 = none, 1 = transpose, 2 = conjugate transpose)
3. float: scalar (alpha) to multiply the matrix multiplication result elements with
4. float* (input): matrix A
5. int: leading dimension of A (0 = row-major, 1 = column-major)
6. float* (input): matrix B
7. int: leading dimension of B (0 = row-major, 1 = column-major)
8. float: scalar (beta) to multiply the C matrix elements with before adding it to the result
9. float* (input&output): matrix C which is added to the matrix multiplication result, and stores the output
10. int: leading dimension of C (0 = row-major, 1 = column-major)
OpenCL C Semantics

__kernel void __khr_blas_gemm_float(
    int transA, int transB, float alpha, const global float *A, int ldA,
    const global float *B, int ldB,
    float beta, global float *C, int ldC) {
    // TBD: An example implementation that can be used for verification
    // and as a fallback SW implementation.
}
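The TBD fallback could look roughly like the following plain-C sketch (a hypothetical helper, not part of the extension): it computes C = alpha * trans(A) * trans(B) + beta * C for row-major matrices, takes the M, N, K dimensions as explicit parameters instead of deriving the output shape from the NDRange, and omits the leading-dimension arguments for brevity.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical reference for the khr_blas_gemm_float semantics, row-major.
 * transX: 0 = none, 1 = transpose; conjugate transpose coincides with
 * transpose for real floats, so it is not handled separately here. */
static void gemm_ref(int transA, int transB, float alpha,
                     const float *A, const float *B,
                     float beta, float *C,
                     size_t M, size_t N, size_t K)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                /* trans(A)(i,k): A is stored M x K, or K x M when transposed. */
                float a = transA ? A[k * M + i] : A[i * K + k];
                /* trans(B)(k,j): B is stored K x N, or N x K when transposed. */
                float b = transB ? B[j * K + k] : B[k * N + j];
                acc += a * b;
            }
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

Such a scalar reference is useful both as a software fallback and as the "golden" result when verifying an accelerator implementation against a precision bound.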

OpenVX Neural Network Extension Compatible Kernels

Name: khr_openvx_nn_extension_convolution_uchar

Description: Convolution for 8-bit unsigned integer inputs and weights.

NDRange Dimensions:

1. Batch size.
2. Width.
3. Height.

Arguments:

1. uchar* [in]: The input tensor data. The 3 lower dimensions represent a single input, all following dimensions represent the number of batches, possibly nested. The dimension order is [width, height, #IFM, #batches].
2. uchar* [in]: Weights, as a 4D tensor with dimensions [kernel_x, kernel_y, #IFM, #OFM].
3. uchar* [in]: Biases (optional, ignored if NULL), which may be shared (one per OFM) or unshared (one per OFM * output location). The possible layouts are either [#OFM] or [width, height, #OFM]. The bias data type must match the data type of the inputs. (Kernel parameter #2)
4. size_t: (dilation_x) "inflate" the kernel by inserting zeros between the kernel elements in the x direction. The value is the number of zeros to insert.
5. size_t: (dilation_y) "inflate" the kernel by inserting zeros between the kernel elements in the y direction. The value is the number of zeros to insert.
6. int: Rounding method for calculating output dimensions.
7. int: A VX_TYPE_ENUM of the vx_convert_policy_e enumeration.
8. size_t: Number of elements padded at each side in the x dimension of the input.
9. size_t: Number of elements padded at each side in the y dimension of the input.
10. int: A VX_TYPE_ENUM of the vx_round_policy_e enumeration.
11. uchar* [out]: The output tensor data. The output will have the same number and structure of dimensions as the input. The output tensor data type must be the same as the inputs. (Kernel parameter #4)

OpenCL C Semantics

__kernel void __khr_openvx_nn_extension_convolution_uchar(
    const uchar *input, const uchar *weights, const uchar *biases,
    size_t dilation_x, size_t dilation_y,
    int down_scale_rounding, int overflow_policy, size_t padding_x, size_t padding_y,
    int rounding_policy, uchar *output) {
    // TBD.
}
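As a rough illustration of the dilation and padding semantics described above, here is a hypothetical plain-C sketch of a single-channel, single-batch 2D convolution. It accumulates into a 32-bit buffer and deliberately omits the IFM/OFM dimensions, biases, and the rounding/overflow policy arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-channel 2D convolution with zero padding and kernel
 * dilation. Dilation "inflates" the kernel by inserting `dilation` zeros
 * between adjacent taps, stretching the footprint without adding weights. */
static void conv2d_uchar(const uint8_t *input, size_t in_w, size_t in_h,
                         const uint8_t *weights, size_t k_w, size_t k_h,
                         size_t dilation_x, size_t dilation_y,
                         size_t padding_x, size_t padding_y,
                         uint32_t *output, size_t out_w, size_t out_h)
{
    for (size_t oy = 0; oy < out_h; ++oy) {
        for (size_t ox = 0; ox < out_w; ++ox) {
            uint32_t acc = 0;
            for (size_t ky = 0; ky < k_h; ++ky) {
                for (size_t kx = 0; kx < k_w; ++kx) {
                    /* Tap (kx, ky) samples the dilated footprint, shifted
                     * left/up by the padding amount. */
                    long ix = (long)(ox + kx * (dilation_x + 1)) - (long)padding_x;
                    long iy = (long)(oy + ky * (dilation_y + 1)) - (long)padding_y;
                    if (ix < 0 || iy < 0 || ix >= (long)in_w || iy >= (long)in_h)
                        continue; /* out-of-bounds taps read implicit zeros */
                    acc += (uint32_t)input[(size_t)iy * in_w + (size_t)ix]
                         * weights[ky * k_w + kx];
                }
            }
            output[oy * out_w + ox] = acc;
        }
    }
}
```

The output extent follows the usual relation out_w = in_w + 2 * padding_x - ((k_w - 1) * (dilation_x + 1) + 1) + 1; the rounding-method argument in the DBK controls how fractional extents are resolved, which this sketch does not model.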

Direct Input/Output Operations

Kernels for accessing data sources and destinations directly without host involvement.

Name: khr_io_stream_in_uchar

Description: Non-blocking read of data from a sensor/stream associated with the device.

Arguments:

1. uchar* [out]: The data.
2. size_t* [in+out]: In: number of bytes to read. Out: number of bytes that could be read (can be 0). (Compatible with the cl_pocl_content_size extension to optimize data transfers with.)

OpenCL C Semantics

__kernel void __khr_io_stream_in_uchar(
    uchar *output, size_t *num) {
    // It is not feasible to describe this kernel in OpenCL C as I/O devices
    // are not representable with it.
}

Name: khr_io_stream_out_uchar

Description: Non-blocking write of data to an output/sink associated with the device.

Arguments:

1. uchar* [in]: The data to write.
2. size_t* [in+out]: In: number of bytes to write. Out: number of bytes that could be written (can be 0).

OpenCL C Semantics

__kernel void __khr_io_stream_out_uchar(
    uchar *input, size_t *num) {
    // It is not feasible to describe this kernel in OpenCL C as I/O devices
    // are not representable with it.
}

Name: khr_io_stream_in_blocking_uchar

Description: Blocking read of data from a sensor/stream associated with the device.

Arguments:

1. uchar* [out]: The data.
2. size_t* [in]: How many bytes to read before returning.

OpenCL C Semantics

__kernel void __khr_io_stream_in_blocking_uchar(uchar *output, size_t *num) {
    while (*num) {
        size_t num_read = *num;
        __khr_io_stream_in_uchar(output, &num_read);
        *num -= num_read;  // decrement the remaining byte count, not the pointer
        output += num_read;
    }
}
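The retry loop above can be exercised host-side against a mock of the non-blocking primitive. In this hypothetical C sketch, `mock_stream_in` stands in for `__khr_io_stream_in_uchar` and delivers at most 3 bytes per call, forcing the blocking wrapper to iterate:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mock of the non-blocking primitive: delivers at most 3 bytes
 * per call from a fixed backing buffer, like a device-side short read. */
static const unsigned char stream_data[] = "0123456789";
static size_t stream_pos = 0;

static void mock_stream_in(unsigned char *out, size_t *num) {
    size_t avail = sizeof(stream_data) - 1 - stream_pos;
    size_t chunk = *num < 3 ? *num : 3;
    if (chunk > avail)
        chunk = avail;
    memcpy(out, stream_data + stream_pos, chunk);
    stream_pos += chunk;
    *num = chunk; /* out-parameter: bytes actually read */
}

/* Blocking read built on the non-blocking primitive, mirroring the loop in
 * the OpenCL C sketch (assumes the stream eventually delivers all bytes). */
static void blocking_stream_in(unsigned char *out, size_t num) {
    while (num) {
        size_t num_read = num;
        mock_stream_in(out, &num_read);
        num -= num_read;
        out += num_read;
    }
}
```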

Launching BiKs from the Device Side

BiKs are primarily meant to be launched as kernel commands via host-side command queues. Optionally, they can be callable from the device side via enqueue_kernel. This capability can be queried on a per-BiK basis at compile time in OpenCL C by checking for macro definitions that follow the naming convention cl_khr_bik_BUILTIN_KERNEL_NAME. If such a macro is defined, a kernel following the naming convention __khr_BUILTIN_KERNEL_NAME() can be enqueued by the program on the device side like software-defined kernels.

Open questions

1. Should we enable launching BiKs from the device side without requiring device-side enqueue? The main problem is those with an NDRange, as they are not simple single-WI helper functions.
+
UNRESOLVED

2. Should the NDRange be used at all in BiKs? It feels somewhat unnatural, as typically the NDRange is used to imply SPMD parallelism while the hardware/firmware is free to choose whatever degree of parallelism to implement the function with. On the other hand, something similar applies to software kernel launches, as the work-items can be executed serially if adhering to barrier semantics.
+
UNRESOLVED

3. Different accelerators prefer different channel orders (NHWC vs. NCHW, etc.) for the processed data. Should the channel order be passed as a BiK argument (like in the example GEMM's row/column order) or is it better to have different BiK variations for each?
+
UNRESOLVED

4. How to denote preference? Some of the BiKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with suboptimal channel order in NN accelerators) might still be beneficially accelerated.
+
UNRESOLVED

5. Since the defined built-in kernel concept is basically just a C-like API inside another API, should it be made more generic and thus directly usable for SYCL and Vulkan as well?
+
UNRESOLVED


Background



FPGAs form a special case somewhere between instruction-set architectures and fixed-function hardware accelerators. While advances in high-level synthesis tools have attempted to bridge the programmability gap between GPU and FPGA programming, FPGAs are still considered devices with which it is challenging to achieve efficient implementations. Due to the extensive manual optimization work required for efficient implementations of the accelerated functionality, defining FPGA designs as a system of "hardware accelerator IPs" is still a widely used "application abstraction". FPGAs can thus be seen as a platform that can realize and integrate any hardware accelerator implementable with the programmable fabric.


The means to utilize hardware accelerators have typically been vendor-specific and abstracted behind domain-specific libraries. The overhead of this "bunch of libraries" approach is seen at the lowest level of integration: each library utilizes a low-level, typically vendor-specific library to interface with the actual hardware, and thus does not integrate efficiently with other libraries or software-programmable processors that might be available on the same chip.


Rationale


OpenCL's built-in kernel abstraction allows pushing both hardware-accelerated and software-defined kernels to the same command-queues, providing a powerful means for asynchronous execution of heterogeneous task graphs on diverse heterogeneous platforms. The ability to invoke hardware accelerators while being able to synchronize and optimize data transfers at the lowest levels of the driver stack can provide significant latency benefits, especially when combined with the command-buffering mechanism.

+ + + \ No newline at end of file From 008015110c2fd62068b52fcc792e027ce43b78e7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Mon, 23 Oct 2023 11:25:10 +0300 Subject: [PATCH 3/9] Apply suggested changes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Pekka Jääskeläinen --- ext/cl_khr_defined_builtin_kernels.asciidoc | 117 ++-- ext/cl_khr_defined_builtin_kernels.html | 730 ++++++++++++++------ 2 files changed, 583 insertions(+), 264 deletions(-) diff --git a/ext/cl_khr_defined_builtin_kernels.asciidoc b/ext/cl_khr_defined_builtin_kernels.asciidoc index 164baac0a..028f021fa 100644 --- a/ext/cl_khr_defined_builtin_kernels.asciidoc +++ b/ext/cl_khr_defined_builtin_kernels.asciidoc @@ -26,7 +26,9 @@ definitions and updating of previously defined ones. |==== | *Date* | *Version* | *Description* | 2022-12-13 | 0.1.0 | First formulation as an extension specification like proposed by Ben Ashbaugh. -| TODO | TODO | TODO +| 2022-10-23 | 0.2.0 | +Add APIs for defined built-in kernel (DBK) creation. Model DBKs on +tensor type. Add sample code. |==== ==== Dependencies @@ -35,7 +37,12 @@ This extension is written against the OpenCL Specification version 3.0.12. This extension requires OpenCL 1.2 or later. -This extension requires cl_khr_tensor (Note: unpublished draft, work in progress) +This extension requires cl_khr_tensor. + +[NOTE] +cl_khr_tensor is unpublished, work-in-progress +extension. Briefly, It will bring concept of tensor, N-dimensional +data structure whose data layout is opaque to applications. ==== Contributors @@ -43,6 +50,7 @@ Pekka Jääskeläinen, Intel and Tampere University. + Topi Leppänen, Tampere University. + Jan Solanti, Tampere University. + Ben Ashbaugh, Intel. + +Henry Linjamäki, Intel. + === Overview @@ -62,7 +70,7 @@ well-defined built-in kernel with good coverage for common acceleration needs and which is designed to easily evolve over time. 
The device drivers that implement this extension can freely choose which -subset of DBKs they implement and advertise to the clients. The +subset of defined built-in-kernels (DBKs) they implement and advertise to the clients. The clients can use the DBKs to accelerate their applications by manually executing invoking the DBKs. The extension is designed to also support using automated task graph lowering tooling later. @@ -110,13 +118,18 @@ own hardware accelerator properties, resulting in lack of cross-vendor portability in the API abstraction presented to the upper layers of heterogeneous computing software stacks. -This extension standardizes a set of well-defined built-in kernels the clients can -call from higher level programming stacks built with different languages -and multiple libraries, possibly mix accelerator calls with calls to software kernel -commands, and rely on the driver stack to optimize the execution (especially -the synchronization and communication) as a low level heterogeneous task graph. -It aims to promote the use of built-in kernels as a programming model for hardware accelerated -functionality, to improve cross-vendor portability of hardware accelerated computing. +This extension standardizes a set of well-defined built-in kernels the +clients can call from higher level programming stacks built with +different languages and multiple libraries, possibly mix accelerator +calls with calls to software kernel commands, and rely on the driver +stack to optimize the execution (especially the synchronization and +communication) as a low level heterogeneous task graph. The +heterogeneous task graph can be described using multiple +command-queues and optionally cached using the command buffer +extension (cl_khr_command_buffer). It aims to promote the use of +built-in kernels as a programming model for hardware accelerated +functionality, to improve cross-vendor portability of hardware +accelerated computing. 
=== Add new section X.Y.Z Querying Defined Built-in Kernels @@ -166,7 +179,7 @@ _errcode_ret_ is set to one of following code: conditions descried in defined built-in kernel entry in Appendix I. * CL_DBK_UNAVAILABLE if kernel attributes are valid but the - kernel is not supported one of the devices. + kernel is not supported on one of the devices. * CL_DBK_UNSUPPORTED_MODE_PROPERTY if _cl_dbk_mode_properties_ includes at least one property not listed in DBK's entry. @@ -175,17 +188,15 @@ _errcode_ret_ is set to one of following code: meet the requested constraint set by CL_DBK_PROPERTY_MAX_RELATIVE_ERROR property. -* TODO: other error cases. - - [cols="2,1,2",stripes=odd] |=== | *DBK Mode Property* | *Property Value* | *Description* | CL_DBK_PROPERTY_MAX_RELATIVE_ERROR | float -a| Request a DBK whose maximum relative error is bounded by the given -value measured in ULPs. +a| Require that the DBK produces the results which do not deviate more +than the given amount value of ULPs (units in the last place) respect +to infnitely precise result. | CL_DBK_PROPERTY_NON_DETERMINISTIC | cl_bool @@ -194,14 +205,6 @@ implementation to switch algorithm of the kernel on each launch for possibly better performance. // Idea from https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking -| ... | - -a| -Ideas: - -* accumulation with saturation. -* Finite math only. -* Flush denormals to zero. -* data layout preferences (NHWC for convolution). |=== === Add new function to 5.8.1 Creating Program Objects @@ -263,7 +266,7 @@ cl_kernel clCreateDefinedBuiltInKernel( === Add new appendix "Appendix I - Defined Built-in Kernels" to OpenCL API Specification This chapter describes standard defined built-in kernels (DBK) with -well-defined semantics. A conformant devices can report to +well-defined semantics. Devices can report availability of the built-in kernels listed in this section with `clCreateDefinedBuiltInKernelDescriptor` call. 
The availability of a DBK is determined from the arguments passed to the @@ -272,38 +275,45 @@ is indicated by CL_DBK_UNAVAILABLE error code. The general client-side abstraction of the DBKs is similar to a call to a C function of which implementation is hidden. The device driver -can invoke one or more physical hardware accelerators combined with +are free to implement a DBK by invoking one or more coarse and fine grained hardware accelerators combined with firmware to implement the semantics as efficiently as possible. It is the driver's responsibility to handle efficient synchronization and communication to the hardware accelerator, the internal accelerator state management and resource sharing across multiple OpenCL contexts. -Identical DBKs with identical inputs, are not guaranteed to produce -identical results: - -* across vendors, +==== Reproducibility ==== -* across driver versions and +Identical DBKs or same DBKs executed repeatedly with identical inputs are +guaranteed to produce identical results, unless otherwise stated in +the DBK's description, when: -* across devices. +* enqueued to the same device, -Otherwise, identical results are produced unless: +* on the same platform, -* otherwise stated in DBK's description or +* on the same vendor with the same driver version and -* the DBK has CL_DBK_PROPERTY_NON_DETERMINISTIC property set to true. +* CL_DBK_PROPERTY_NON_DETERMINISTIC property is not set on. -Two DKBs are considered identical if their descriptors are created +Two DBK descriptors for a device are considered identical if they are created using identical kernel name, kernel attribute and kernel mode property -arguments. +arguments. In other cases, identical and inputs may produce different +results. The result difference may occur because, for example, +different algorithms being used across devices. 
-==== Standard Defined Built-in Kernels ==== +DBKs may produce approximate results, and the error, with respect to the +infinitely precise result, can optionally be controlled by +CL_DBK_PROPERTY_MAX_RELATIVE_ERROR when the property name is listed in +the DBK's description. DBKs without the CL_DBK_PROPERTY_MAX_RELATIVE_ERROR +property produce exact results. -The following list of recognized defined built-in kernels. It is -expected to grow and update while preserving backwards compatibility. +==== The Defined Built-in Kernels ==== -Each defined built-in kernel entry is organized as followed: +The following is a list of recognized defined built-in kernels. It is +expected to be expanded and updated over versions of this extension, while preserving backwards compatibility. + +Each defined built-in kernel entry is organized as follows: @@ -331,11 +341,11 @@ Each defined built-in kernel entry is organized as followed: [caption="Table A.I.1. "] .Standard Built-in Kernels and Their Semantics. *The table has been populated with a small set of non-trivial example entries which are subject to change and the list to expand during drafting.* |=== -| Name: *khr_matmul_v1* +| Name: *khr_matmul* | *Kernel Attributes* a| -Fields of the `cl_dkb_attributes_matmul_v1` structure: +Fields of the `cl_dkb_attributes_matmul` structure: . cl_tensor_desc_t A: Tensor description for input matrix A. . cl_tensor_desc_t B: Tensor description for input matrix B. @@ -381,10 +391,10 @@ This DBK accepts the following properties: * CL_DBK_PROPERTY_MAX_RELATIVE_ERROR: Unset property defaults to positive infinity. | -| Name: *khr_leaky_relu_v1* +| Name: *khr_leaky_relu* | *Kernel Attributes* a| -Fields of the `cl_dbk_leaky_relu_v1` structure: +Fields of the `cl_dbk_leaky_relu` structure: . cl_tensor_desc_t in: Input tensor description. . cl_tensor_desc_t out: Output tensor description. . cl_float alpha: Coefficient of leakage. 
@@ -430,13 +440,13 @@ cl_tensor_desc_t lhs_tensor_desc = TBD; cl_tensor_desc_t rhs_tensor_desc = TBD; cl_tensor_desc_t res_tensor_desc = TBD; -cl_dkb_attributes_matmul_v1 matmul_attrs = { +cl_dkb_attributes_matmul matmul_attrs = { lhs_tensor_desc, rhs_tensor_desc, res_tensor_desc, 1, 0 // = Transpose lhs tensor } cl_dbk_mode_properties matmul_props = { - // Request a matmul implementation that meets this precision. + // Request a matmul instance that meets this precision. CL_DBK_PROPERTY_MAX_RELATIVE_ERROR, 100, // in ULPs. } @@ -445,7 +455,7 @@ std::vector kernel_descriptions; cl_dbk_descriptor matmul_desc = clCreateDefinedBuiltInKernelDescriptor( context, num_devices, device_list, - CL_DBK_MATMUL_V1, &matmul_attrs, &matmul_props, &err); + CL_DBK_MATMUL, &matmul_attrs, &matmul_props, &err); } else if (err == CL_DBK_UNAVAILABLE) { // Kernel attributes are valid but the kernel is not supported in at least @@ -496,7 +506,7 @@ clEnqueueNDRangeKernel(cmd_q, matmul_kernel, 0, NULL, NULL, NULL, 0, NULL, NULL) -- -. Should the NDRange be used at all in DBKs? It feels sort of unnatural as typically the NDRange is used to imply SPMD parallelism while the hardware/firmware is free to choose whatever parallelism degree to implement the function. On the other hand, similar applies to software kernel launches as the work-items can be executed serially if adhering to barrier semantics. +. Should the NDRange be used at all in DBKs? It feels sort of unnatural as typically the NDRange is used to imply SPMD parallelism while the hardware/firmware is free to choose whatever parallelization strategy to implement the function. On the other hand, similar applies to software kernel launches as the NDRange-launched work-items can be executed serially if adhering to barrier semantics. + -- *UNRESOLVED* @@ -523,3 +533,12 @@ clEnqueueNDRangeKernel(cmd_q, matmul_kernel, 0, NULL, NULL, NULL, 0, NULL, NULL) *UNRESOLVED* -- + +. What other DBK mode properties we should have? 
Here are some ideas: +** Perform accumulation with saturation. +** Finite math only +** Flush denormals to zero. +** data layout preferences (NHWC for convolution). +-- +*UNRESOLVED* +-- diff --git a/ext/cl_khr_defined_builtin_kernels.html b/ext/cl_khr_defined_builtin_kernels.html index fc5837a44..7f17b7551 100644 --- a/ext/cl_khr_defined_builtin_kernels.html +++ b/ext/cl_khr_defined_builtin_kernels.html @@ -482,9 +482,10 @@

Version History

First formulation as an extension specification, as proposed by Ben Ashbaugh.

-

TODO

-

TODO

-

Reference new concept as "defined built-in kernel".

+

2022-10-23

+

0.2.0

+

Add APIs for defined built-in kernel (DBK) creation. Model DBKs on +tensor type. Add sample code.

@@ -497,6 +498,23 @@

Dependencies

This extension requires OpenCL 1.2 or later.

+
+

This extension requires cl_khr_tensor.

+
+
+ + + + + +
+
Note
+
+
+cl_khr_tensor is an unpublished, work-in-progress
+extension. Briefly, it will bring the concept of a tensor, an N-dimensional
+data structure whose data layout is opaque to applications.
+
+

Contributors

@@ -504,7 +522,8 @@

Contributors

Pekka Jääskeläinen, Intel and Tampere University.
Topi Leppänen, Tampere University.
Jan Solanti, Tampere University.
-Ben Ashbaugh, Intel.

+Ben Ashbaugh, Intel.
+Henry Linjamäki, Intel.

@@ -530,9 +549,9 @@

Overview

The device drivers that implement this extension can freely choose which -subset of defined BiKs they implement and advertise to the clients. The -clients can use the BiKs to accelerate their applications by manually -executing invoking the BiKs. The extension is designed to also support using +subset of defined built-in kernels (DBKs) they implement and advertise to the clients. The +clients can use the DBKs to accelerate their applications by manually +invoking the DBKs. The extension is designed to also support using automated task graph lowering tooling later.

@@ -585,41 +604,232 @@

Rationale

heterogeneous computing software stacks.

-

This extension standardizes a set of well-defined built-in kernels the clients can -call from higher level programming stacks built with different languages -and multiple libraries, possibly mix accelerator calls with calls to software kernel -commands, and rely on the driver stack to optimize the execution (especially -the synchronization and communication) as a low level heterogeneous task graph. -It aims to promote the use of built-in kernels as a programming model for hardware accelerated -functionality, to improve cross-vendor portability of hardware accelerated computing.

+

This extension standardizes a set of well-defined built-in kernels the +clients can call from higher level programming stacks built with +different languages and multiple libraries, possibly mix accelerator +calls with calls to software kernel commands, and rely on the driver +stack to optimize the execution (especially the synchronization and +communication) as a low level heterogeneous task graph. The +heterogeneous task graph can be described using multiple +command-queues and optionally cached using the command buffer +extension (cl_khr_command_buffer). It aims to promote the use of +built-in kernels as a programming model for hardware accelerated +functionality, to improve cross-vendor portability of hardware +accelerated computing.

+
+ + +
+

Add new section X.Y.Z Querying Defined Built-in Kernels

+
+

To request a defined built-in kernel to be executed on the given +devices, use:

+
+
+
+
cl_dbk_descriptor clCreateDefinedBuiltInKernelDescriptor(
+    cl_context context,
+    cl_uint num_devices,
+    const cl_device_id* device_list,
+    cl_dbk_name kernel_name,
+    const void *kernel_attributes,
+    const cl_dbk_mode_properties* kernel_config,
+    cl_int *errcode_ret);
+
+
+
+
    +
  • +

    context must be a valid OpenCL context.

    +
  • +
  • +

    num_devices is the number of devices listed in +device_list. num_devices must be non-zero.

    +
  • +
  • +

    device_list is a pointer to a list of devices that are in +context. device_list must be a non-NULL value. The defined built-in kernels +are loaded for devices specified in this list.

    +
  • +
  • +

    kernel_name is the name of the defined built-in kernel listed in Appendix I.

    +
  • +
  • +

    kernel_attributes is a pointer to the structure declared in +description of the kernel in Appendix I. The structure holds +kernel’s attributes.

    +
  • +
  • +

    cl_dbk_mode_properties is a pointer to a list of defined built-in +kernel mode properties. The supported mode properties are listed in +DBK’s entry with default settings in Appendix I. It is valid to set +this argument to NULL in which case default properties apply (if +any).

    +
  • +
+
+
+

clCreateDefinedBuiltInKernelDescriptor returns a valid kernel +descriptor on success, indicated by errcode_ret, which is set to +CL_SUCCESS. Otherwise, the returned object is NULL and +errcode_ret is set to one of the following codes:

+
+
+
    +
  • +

CL_DBK_INVALID_ATTRIBUTE if one or more kernel attributes violate the +conditions described in the defined built-in kernel entry in Appendix I.

    +
  • +
  • +

    CL_DBK_UNAVAILABLE if kernel attributes are valid but the +kernel is not supported on one of the devices.

    +
  • +
  • +

    CL_DBK_UNSUPPORTED_MODE_PROPERTY if cl_dbk_mode_properties includes +at least one property not listed in DBK’s entry.

    +
  • +
  • +

    CL_DBK_UNMET_MAX_RELATIVE_ERROR if the DBK is available but does not +meet the requested constraint set by +CL_DBK_PROPERTY_MAX_RELATIVE_ERROR property.

    +
  • +
+
+ +++++ + + + + + + + + + + + + + + + + + + + +
DBK Mode PropertyProperty ValueDescription

CL_DBK_PROPERTY_MAX_RELATIVE_ERROR

float

+

Require that the DBK produce results that do not deviate by more +than the given number of ULPs (units in the last place) with respect +to the infinitely precise result.

+

CL_DBK_PROPERTY_NON_DETERMINISTIC

cl_bool

+

Allow results of the kernel to be non-reproducible. This allows the +implementation to switch the kernel's algorithm on each launch for +possibly better performance.

+
+
+
+

Add new function to 5.8.1 Creating Program Objects

+
+

To create a program with a set of defined built-in kernels, use:

+
+
+
+
cl_program clCreateProgramWithDefinedKernels(
+    cl_context context,
+    size_t num_kernel_desc,
+    const void* kernel_desc_list,
+    cl_int* errcode_ret);
+
+
+
+
    +
  • +

    context must be a valid OpenCL context.

    +
  • +
  • +

    num_kernel_desc is the number of kernel descriptors.

    +
  • +
  • +

kernel_desc_list is an array of valid +cl_dbk_descriptor objects. The array length must be at +least num_kernel_desc. The kernel descriptors must have been created on +the same context.

    +
  • +
+
+
+

clCreateProgramWithDefinedKernels returns a valid program on success, +indicated by errcode_ret, which is set to CL_SUCCESS. Otherwise, the +returned object is NULL and errcode_ret is set to one of the +following codes:

+
+
    +
  • +

    TODO.

    +
  • +
-

Modifications to section 4.2 of the OpenCL API Specification

+

Add new function to 5.9.1 Creating Kernel Objects

-

Modify Table 5, Device Queries, of section 4.2, by adding the following -sentences to the description cell of CL_DEVICE_BUILT_IN_KERNELS:

+

To get a kernel handle for a defined built-in kernel in a program use:

-
-
-The semantics of the returned built-in kernels are undefined or defined in -vendor-specific documentation, unless the name starts with prefix `khr_', -which means it’s a defined built-in kernel with semantics defined in Appendix I. -
+
+
+
cl_kernel clCreateDefinedBuiltInKernel(
+    cl_program program,
+    cl_dbk_descriptor kernel_desc,
+    cl_int* errcode_ret);
+
+
+
+
    +
  • +

    program is a program object with a successfully built executable.

    +
  • +
  • +

    kernel_desc is a defined built-in kernel descriptor in the program.

    +
  • +
  • +

    errcode_ret will return an appropriate error code. If errcode_ret is +NULL, no error code is returned.

    +
  • +
+
+
+

clCreateDefinedBuiltInKernel returns a valid non-zero kernel object + and errcode_ret is set to CL_SUCCESS if the kernel object is created + successfully. Otherwise, it returns a NULL value with one of the + following error values returned in errcode_ret:

+
+
+
    +
  • +

    TODO.

    +
  • +

Add new appendix "Appendix I - Defined Built-in Kernels" to OpenCL API Specification

-

This chapter describes standard defined built-in kernels (DBK) with well-defined -semantics. A conformant device can report to support zero or more of the built-in -kernels via CL_DEVICE_BUILT_IN_KERNELS or CL_DEVICE_BUILT_IN_KERNELS_WITH_VERSION device queries.

+

This chapter describes standard defined built-in kernels (DBK) with +well-defined semantics. Devices can report +availability of the built-in kernels listed in this section with the +clCreateDefinedBuiltInKernelDescriptor call. The availability of a +DBK is determined from the arguments passed to +clCreateDefinedBuiltInKernelDescriptor, and unavailability of a DBK +is indicated by the CL_DBK_UNAVAILABLE error code.

The general client-side abstraction of the DBKs is similar to a call to a C function of which implementation is hidden. The device driver -can invoke one or more physical hardware accelerators combined with +is free to implement a DBK by invoking one or more coarse- and fine-grained hardware accelerators combined with firmware to implement the semantics as efficiently as possible.

@@ -628,292 +838,356 @@

-

Standard Defined Built-in Kernels

+

Reproducibility

-

The following list of recognized defined built-ins is organized -according to their application domain and handled data types. It is -expected to grow and update while preserving backwards compatibility.

+

Identical DBKs, or the same DBK executed repeatedly, with identical inputs are +guaranteed to produce identical results, unless otherwise stated in +the DBK’s description, when:

- - ------ - - - - - - - - - - - - - - - - - - - - - - +
Table A.I.1. Standard Built-in Kernels and Their Semantics. The table has been populated with a small set of non-trivial example entries which are subject to change and the list to expand during drafting.

General linear algebra

Name

Description

NDRange Dimensions

Arguments

khr_blas_gemm_float

xGEMM: General matrix multiplication with real single precision floating point numbers as described in Basic Linear Algebra Subprograms. Performs C = alpha * trans(A) * trans(B) + beta*C, where A, B and C are matrices, and alpha and beta scalars. trans() is a configurable transpose operation.

-
    -
  1. -

    The height.

    -
  2. -
  3. -

    The width.

    -
  4. -
-
-
    +
    +
    • -

      int: transpose operation (trans) type for matrix A (0 = none, 1 = transpose, 2 = conjugate transpose)

      +

      enqueued to the same device,

    • -

      int: transpose type for matrix B (0 = none, 1 = transpose, 2 = conjugate transpose)

      +

      on the same platform,

    • -

      float: scalar (alpha) to multiply the matrix multiplication result elements with

      +

from the same vendor with the same driver version, and

    • -

      float* (input): matrix A

      +

the CL_DBK_PROPERTY_NON_DETERMINISTIC property is not set.

    • +
    +
    +
    +

Two DBK descriptors for a device are considered identical if they are created +using identical kernel name, kernel attribute and kernel mode property +arguments. In other cases, identical inputs may produce different +results because, for example, +different algorithms are used across devices.

    +
    +
    +

DBKs may produce approximate results, and the error, with respect to the +infinitely precise result, can optionally be controlled by +CL_DBK_PROPERTY_MAX_RELATIVE_ERROR when the property name is listed in +the DBK’s description. DBKs without the CL_DBK_PROPERTY_MAX_RELATIVE_ERROR +property produce exact results.

    +
    +
+
+

The Defined Built-in Kernels

+
+

The following is a list of recognized defined built-in kernels. It is +expected to be expanded and updated over versions of this extension, while preserving backwards compatibility.

+
+
+

Each defined built-in kernel entry is organized as follows:

+
+
+
  • -

    int: leading dimension of A (0 = row-major, 1 = column-major)

    +

    Name: Name of the defined built-in kernel (an enumeration).

  • -

    float* (input): matrix B

    +

    Kernel attributes: The kernel attributes required for creating the +defined built-in kernel via +clCreateDefinedBuiltInKernelDescriptor. Attribute values are +immutable.

  • -

    int: leading dimension of B (0 = row-major, 1 = column-major)

    +

    Kernel arguments: The kernel arguments.

  • -

    float: scalar (beta) to multiply the C matrix elements with before adding it to the result

    +

    Description: The description of the kernel in detail.

  • -

    float* (input&output): matrix C which is added to the matrix multiplication result, and stores the output

    +

Attribute validation rules: Conditions on the kernel attributes of +the kernel. The implementation must return CL_DBK_INVALID_ATTRIBUTE on a +clCreateDefinedBuiltInKernelDescriptor call if any of the conditions +are violated.

  • -

    int: leading dimension of C (0 = row-major, 1 = column-major)

    +

Kernel mode properties: List of kernel mode +properties (cl_dbk_mode_properties) the kernel recognizes. The +properties can be used to tweak certain implementation details and +behaviors in the kernel execution. If a property not listed in the +DBK entry is fed to a clCreateDefinedBuiltInKernelDescriptor call, +then the implementation must return CL_DBK_UNSUPPORTED_MODE_PROPERTY.

  • - -

OpenCL C Semantics

-
-
__kernel void __khr_blas_gemm_float(
-   int transA, int transB, float alpha, const global float *A, int ldA,
-   const global float *B, int ldB,
-   float beta, global float *C, int ldC) {
-   // TBD: An example implementation that can be used for verification
-   // and as a fallback SW implementation.
-}
+
-
+ +++ + - + - - - - + - - - - + + + + + + + + + + + + + + + + + + - + - - + - + - - - - + - - + + + + + - + - - - - - + - + - + - - - - - - - - - -
Table A.I.1. Standard Built-in Kernels and Their Semantics. The table has been populated with a small set of non-trivial example entries which are subject to change and the list to expand during drafting.

OpenVX Neural Network Extension Compatible Kernels

Name: khr_matmul

Name

Description

NDRange Dimensions

Arguments

Kernel Attributes

khr_openvx_nn_extension_convolution_uchar

Convolution for 8bit unsigned integer inputs and weights.

-
    -
  1. -

    Batch size.

    -
  2. +
+

Fields of the cl_dkb_attributes_matmul structure:

+
+
+
  1. -

    Width.

    +

    cl_tensor_desc_t A: Tensor description for input matrix A.

  2. -

    Height.

    +

    cl_tensor_desc_t B: Tensor description for input matrix B.

  3. -
-
-
  1. -

    uchar* [in]: The input tensor data. 3 lower dimensions represent a single input, all following dimensions represent number of batches, possibly nested. The dimension order is [width, height, #IFM, #batches].

    +

    cl_tensor_desc_t R: Tensor description for output matrix C.

  2. -

    uchar* [in]: Weights, as a 4d tensor with dimensions [kernel_x, kernel_y, #IFM, #OFM].

    +

    cl_int transposeA: Non-zero transposes A matrix.

  3. -

    uchar* [in]: Biases (optional, ignored if NULL). The biases, which may be shared (one per ofm) or unshared (one per ofm * output location). The possible layouts are either [#OFM] or [width, height, #OFM]. Biases data type must match the data type of the inputs. (Kernel parameter #2)

    +

    cl_int transposeB: Non-zero transposes B matrix.

  4. +
+

Kernel Arguments

+
  1. -

    size_t: (dilation_x) “inflate” the kernel by inserting zeros between the kernel elements in the x direction. The value is the number of zeros to insert.

    +

    cl_tensor_t A: Matrix A (read only).

  2. -

    size_t: (dilation_y) “inflate” the kernel by inserting zeros between the kernel elements in the y direction. The value is the number of zeros to insert.

    +

    cl_tensor_t B: Matrix B (read only).

  3. -

    int: Rounding method for calculating output dimensions.

    +

    cl_tensor_t R: Output matrix. (write only).

  4. +
+

Description

+

Performs (batched) matrix multiplication: R = trans(A) * trans(B), +where A, B and R are tensors of rank at least two, and +trans() is a configurable transpose operation.

+
+
+

The last two dimensions of the tensors are treated as operands to the +matrix multiplication and the rest of the dimensions are treated as batch +dimensions.

+
+
+

Operations of the matrix multiplication are performed in the precision +of elementof(R).

+
+
+

If an overflow occurs in the accumulation of the products, then the R +tensor’s result is undefined.

+

Attribute validation rules

+
  • -

    int: A VX_TYPE_ENUM of the vx_convert_policy_e enumeration.

    +

    rankof(A) == rankof(B) >= 2.

  • -

    size_t: Number of elements padded at each side in the x dimension of the input.

    +

Let shapeof(At) == (b…​, m, k) and shapeof(Bt) == (b…​, k, +n) be the shapes of tensors A and B, respectively, after possible transposing. +shapeof(R) must be (b…​, m, n).

  • -

    size_t: Number of elements padded at each side in the y dimension of the input.

    +

    elementof(A) == elementof(B)

  • -

    int: A VX_TYPE_ENUM of the vx_round_policy_e enumeration.

    +

    elemkindof(R) == elemkindof(A)

  • -

    uchar* [out]: The output tensor data. Output will have the same number and structure of dimensions as input. Output tensor data type must be same as the inputs. (Kernel parameter #4)

    +

    elementof(R) == elementof(A) or elementof(A) is promotable to +elementof(R) without loss of meaning.

  • - +

OpenCL C Semantics

Kernel mode properties

-
-
__kernel void __khr_openvx_nn_extension_convolution_uchar(
-   const uchar *input, const uchar *weights, const uchar *biases,
-   size_t dilation_x, size_t dilation_y,
-   int down_scale_rounding, int overflow_policy, size_t padding_x, size_t padding_y,
-   int rounding_policy, uchar *output) {
-   // TBD.
-}
+
+

This DBK accepts the following properties:

+
+
    +
  • +

    CL_DBK_PROPERTY_MAX_RELATIVE_ERROR: Unset property defaults to positive infinity.

    +
  • +

Direct Input/Output Operations

Kernels for accessing data sources and destinations directly without host involvement.

Name: khr_leaky_relu

Name

Description

NDRange Dimensions

Arguments

Kernel Attributes

khr_io_stream_in_uchar

Non-blocking read of data from a sensor/stream associated with the device.

-

-

+

Fields of the cl_dbk_leaky_relu structure: +. cl_tensor_desc_t in: Input tensor description. +. cl_tensor_desc_t out: Output tensor description. +. cl_float alpha: Coefficient of leakage.

Kernel arguments

-
    +
    1. -

      uchar* [out]: The data.

      +

      cl_tensor_t in: The input tensor.

    2. -

      size_t* [in+out]: In: number of bytes to read. Out: Number of bytes that could be read (can be 0). (Compatible with the cl_pocl_content_size extension to optimize data transfers with.)

      +

      cl_tensor_t out: The output tensor.

OpenCL C Semantics

Description

-
-
__kernel void __khr_io_stream_in_uchar(
-   uchar *output, size_t *num) {
-   // It is not feasible to describe this kernel in OpenCL C as I/O devices
-   // are not representable with it.
-}
+
+

Applies the operation alpha * x if x < 0 else x to all +elements of the in tensor.

+
+

If the target device does not support denormals, then alpha is flushed +to zero before the operation is applied.

khr_io_stream_out_uchar

Non-blocking write of data to an output/sink associated with the device.

-

-
    -
  1. -

    uchar* [in]: The data to write.

    -
  2. -
  3. -

    size_t* [in+out]: In: Number of bytes to write. Out: Number of bytes that could be written (can be 0).

    -
  4. -
-

Kernel mode properties

OpenCL C Semantics

N/A

-
-
__kernel void __khr_io_stream_out_uchar(
-   uchar *input, size_t *num) {
-   // It is not feasible to describe this kernel in OpenCL C as I/O devices
-   // are not representable with it.
-}
-
-

Attribute validation rules

khr_io_stream_in_blocking_uchar

Blocking read of data from a sensor/stream associated with the device.

-

-

-
-
    -
  1. -

    uchar* [out]: The data.

    -
    +
  • -

    size_t* [in]: How many bytes to read before returning.

    +

    shapeof(in) == shapeof(out)

  • -
-
+
  • +

    elementof(in) == elementof(out)

  • - -

    OpenCL C Semantics

    -
    -
    __kernel void __khr_io_stream_in_blocking_uchar(uchar *output, size_t *num) {
    -   while (*num) {
    -       size_t num_read = *num;
    -       __khr_io_stream_in_uchar(output, &num_read);
    -       num -= num_read;
    -       output += num_read;
    -   }
    -}
    -
    +
  • +

    alpha must be a finite value.

    +
  • +
    -

    Launching BiKs from the Device Side

    +

    Launching DBKs from the Device Side

    -

    BiKs are primarily meant to be launched as kernel commands via host-side command queues. -Optionally, they can be callable from device-side via -enqueue_kernel: This capability can be queried on per BiK basis at compile-time in OpenCL C by checking for macro definitions which has the following naming convention: cl_khr_bik_BUILTIN_KERNEL_NAME. In case a BiK macro is defined, a kernel with a naming convention __khr_BUILTIN_KERNEL_NAME() can be enqueued by the program at device side as software-defined kernels.

    +

DBKs are primarily meant to be launched as kernel commands via +host-side command queues. Optionally, they can be called from the +device side via enqueue_kernel:

    +
    +
    +

TBC. This probably needs a device-side function corresponding to +clCreateDefinedBuiltInKernelDescriptor.

    +
    +
    +
    +

    Sample Code

    +
    +
    +
    // TBD. Similarly in cl_qcom_ml_ops, tensors have type
    +// (cl_channel_type) and a number of dimensions (rank) and dimension
    +// sizes (shape). Difference over the cl_qcom_ml_ops is that the rank is
    +// "unlimited".
    +cl_tensor_desc_t lhs_tensor_desc = TBD;
    +cl_tensor_desc_t rhs_tensor_desc = TBD;
    +cl_tensor_desc_t res_tensor_desc = TBD;
    +
    +cl_dkb_attributes_matmul matmul_attrs = {
    +  lhs_tensor_desc, rhs_tensor_desc, res_tensor_desc,
    +  1, 0 // = Transpose lhs tensor
    +}
    +
    +cl_dbk_mode_properties matmul_props = {
    +  // Request a matmul instance that meets this precision.
    +  CL_DBK_PROPERTY_MAX_RELATIVE_ERROR, 100, // in ULPs.
    +}
    +
    +cl_uint err;
    +std::vector<cl_dbk_descriptor> kernel_descriptions;
    +cl_dbk_descriptor matmul_desc =
    +  clCreateDefinedBuiltInKernelDescriptor(
    +  context, num_devices, device_list,
    +  CL_DBK_MATMUL, &matmul_attrs, &matmul_props, &err);
    +
+if (err == CL_DBK_UNAVAILABLE) {
    +  // Kernel attributes are valid but the kernel is not supported in at least
    +  // one of the devices.
    +} else if (err == CL_DBK_UNMET_MAX_RELATIVE_ERROR) {
    +  // E.g. Kernel is supported but is not precise enough.
    +} else if (err == CL_DBK_UNSUPPORTED_MODE_PROPERTY) {
    +  // cl_dbk_mode_properties has a property not listed in the description of the
    +  // defined built-in kernel.
    +} else
    +  kernel_descriptions.push_back(matmul_desc);
    +
    +...
    +
+cl_program dbk_lib = clCreateProgramWithDefinedKernels(
+  context, kernel_descriptions.size(), kernel_descriptions.data(), &err);
    +
    +...
    +
+cl_kernel matmul_kernel = clCreateDefinedBuiltInKernel(
+  dbk_lib, matmul_desc, &err);
    +
    +// TBD: allocate space for tensors. Perhaps like cl_qcom_ml_ops: query
    +// tensor sizes after the final program has been created or after
    +// command buffer (with DBKs within) is finalized. Implementation
    +// determines the optimal data layout (opaque to the application) for
    +// the tensors based on their usage.  Application uses the tensor
    +// sizes to create cl_mem buffers which are bound to the tensors.
    +cl_tensor_t lhs_tensor = TBD;
    +cl_tensor_t rhs_tensor = TBD;
    +cl_tensor_t res_tensor = TBD;
    +
    +// Transfer data to input tensors
    +
    +clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor_t), &lhs_tensor);
    +clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor_t), &rhs_tensor);
    +clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor_t), &res_tensor);
    +
    +clEnqueueNDRangeKernel(cmd_q, matmul_kernel, 0, NULL, NULL, NULL, 0, NULL, NULL);
    +
    @@ -922,7 +1196,7 @@

    Open questions

    1. -

      Should we enable launching BiKs from the device side without requiring device-side enqueue? The main problem is those with NDRange as they are not simple single-WI helper functions.

      +

      Should we enable launching DBKs from the device side without requiring device-side enqueue? The main problem is those with NDRange as they are not simple single-WI helper functions.

      @@ -932,7 +1206,7 @@

      Open questions

    2. -

      Should the NDRange be used at all in BiKs? It feels sort of unnatural as typically the NDRange is used to imply SPMD parallelism while the hardware/firmware is free to choose whatever parallelism degree to implement the function. On the other hand, similar applies to software kernel launches as the work-items can be executed serially if adhering to barrier semantics.

      +

      Should the NDRange be used at all in DBKs? It feels sort of unnatural as typically the NDRange is used to imply SPMD parallelism while the hardware/firmware is free to choose whatever parallelization strategy to implement the function. On the other hand, similar applies to software kernel launches as the NDRange-launched work-items can be executed serially if adhering to barrier semantics.

      @@ -942,7 +1216,7 @@

      Open questions

    3. -

      Different accelerators prefer different channel orders (NHWC vs. NCHW…​) for the processed data. Should the channel order be passed as a BiK argument (like in the example GEMM’s row/column order) or is it better to have different BiK variations for each?

      +

      Different accelerators prefer different channel orders (NHWC vs. NCHW…​) for the processed data. Should the channel order be passed as a DBK argument (like in the example GEMM’s row/column order) or is it better to have different DBK variations for each?

      @@ -952,7 +1226,7 @@

      Open questions

    4. -

      How to denote preference? Some of the BiKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with unoptimal channel order in NN accelerators) might be still beneficially accelerated.

      +

      How to denote preference? Some of the DBKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with unoptimal channel order in NN accelerators) might be still beneficially accelerated.

      @@ -971,15 +1245,41 @@

      Open questions

    5. +
    6. +

What other DBK mode properties should we have? Here are some ideas:

      +
      +
        +
      • +

        Perform accumulation with saturation.

        +
      • +
      • +

        Finite math only

        +
      • +
      • +

        Flush denormals to zero.

        +
      • +
      • +

        data layout preferences (NHWC for convolution).

        +
      • +
      +
      +
    +
    +
    +
    +

    UNRESOLVED

    +
    +
    +
    From c086f4d81980cdc0dccbeb33eaaec46b80e3a4f7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Wed, 25 Oct 2023 10:24:52 +0300 Subject: [PATCH 4/9] Strip '_t' suffix on the new OpenCL API types --- ext/cl_khr_defined_builtin_kernels.asciidoc | 32 +++++++++---------- ext/cl_khr_defined_builtin_kernels.html | 34 ++++++++++----------- 2 files changed, 33 insertions(+), 33 deletions(-) diff --git a/ext/cl_khr_defined_builtin_kernels.asciidoc b/ext/cl_khr_defined_builtin_kernels.asciidoc index 028f021fa..838473b58 100644 --- a/ext/cl_khr_defined_builtin_kernels.asciidoc +++ b/ext/cl_khr_defined_builtin_kernels.asciidoc @@ -347,16 +347,16 @@ a| Fields of the `cl_dkb_attributes_matmul` structure: -. cl_tensor_desc_t A: Tensor description for input matrix A. -. cl_tensor_desc_t B: Tensor description for input matrix B. -. cl_tensor_desc_t R: Tensor description for output matrix C. +. cl_tensor_desc A: Tensor description for input matrix A. +. cl_tensor_desc B: Tensor description for input matrix B. +. cl_tensor_desc R: Tensor description for output matrix C. . cl_int transposeA: Non-zero transposes A matrix. . cl_int transposeB: Non-zero transposes B matrix. | *Kernel Arguments* a| -. cl_tensor_t A: Matrix A (read only). -. cl_tensor_t B: Matrix B (read only). -. cl_tensor_t R: Output matrix. (write only). +. cl_tensor A: Matrix A (read only). +. cl_tensor B: Matrix B (read only). +. cl_tensor R: Output matrix. (write only). | *Description* a| Performs (batched) matrix multiplication: `R = trans(A) * trans(B)`, @@ -395,13 +395,13 @@ This DBK accepts the following properties: | *Kernel Attributes* a| Fields of the `cl_dbk_leaky_relu` structure: -. cl_tensor_desc_t in: Input tensor description. -. cl_tensor_desc_t out: Output tensor description. +. cl_tensor_desc in: Input tensor description. +. cl_tensor_desc out: Output tensor description. . cl_float alpha: Coefficient of leakage. | *Kernel arguments* a| -. cl_tensor_t in: The input tensor. -. 
cl_tensor_t out: The output tensor. +. cl_tensor in: The input tensor. +. cl_tensor out: The output tensor. | *Description* a| Applies operation `alpha * x if x < 0 else x` on all @@ -436,9 +436,9 @@ clCreateDefinedBuiltInKernelDescriptor. // (cl_channel_type) and a number of dimensions (rank) and dimension // sizes (shape). Difference over the cl_qcom_ml_ops is that the rank is // "unlimited". -cl_tensor_desc_t lhs_tensor_desc = TBD; -cl_tensor_desc_t rhs_tensor_desc = TBD; -cl_tensor_desc_t res_tensor_desc = TBD; +cl_tensor_desc lhs_tensor_desc = TBD; +cl_tensor_desc rhs_tensor_desc = TBD; +cl_tensor_desc res_tensor_desc = TBD; cl_dkb_attributes_matmul matmul_attrs = { lhs_tensor_desc, rhs_tensor_desc, res_tensor_desc, @@ -484,9 +484,9 @@ cl_kernel matmul_kernel = clCreateDefinedBuiltinKernel( // determines the optimal data layout (opaque to the application) for // the tensors based on their usage. Application uses the tensor // sizes to create cl_mem buffers which are bound to the tensors. -cl_tensor_t lhs_tensor = TBD; -cl_tensor_t rhs_tensor = TBD; -cl_tensor_t res_tensor = TBD; +cl_tensor lhs_tensor = TBD; +cl_tensor rhs_tensor = TBD; +cl_tensor res_tensor = TBD; // Transfer data to input tensors diff --git a/ext/cl_khr_defined_builtin_kernels.html b/ext/cl_khr_defined_builtin_kernels.html index 7f17b7551..6681f064a 100644 --- a/ext/cl_khr_defined_builtin_kernels.html +++ b/ext/cl_khr_defined_builtin_kernels.html @@ -936,13 +936,13 @@

    The Defined Built-in Kernels

1. -cl_tensor_desc_t A: Tensor description for input matrix A.
   +cl_tensor_desc A: Tensor description for input matrix A.
2. -cl_tensor_desc_t B: Tensor description for input matrix B.
   +cl_tensor_desc B: Tensor description for input matrix B.
3. -cl_tensor_desc_t R: Tensor description for output matrix C.
   +cl_tensor_desc R: Tensor description for output matrix C.

    4. cl_int transposeA: Non-zero transposes A matrix.

      @@ -960,13 +960,13 @@

      The Defined Built-in Kernels

1. -cl_tensor_t A: Matrix A (read only).
   +cl_tensor A: Matrix A (read only).
2. -cl_tensor_t B: Matrix B (read only).
   +cl_tensor B: Matrix B (read only).
3. -cl_tensor_t R: Output matrix. (write only).
   +cl_tensor R: Output matrix. (write only).

      @@ -1048,8 +1048,8 @@

      The Defined Built-in Kernels

Fields of the cl_dbk_leaky_relu structure:
-. cl_tensor_desc_t in: Input tensor description.
-. cl_tensor_desc_t out: Output tensor description.
+. cl_tensor_desc in: Input tensor description.
+. cl_tensor_desc out: Output tensor description.
. cl_float alpha: Coefficient of leakage.

      @@ -1060,10 +1060,10 @@

      The Defined Built-in Kernels

1. -cl_tensor_t in: The input tensor.
   +cl_tensor in: The input tensor.
2. -cl_tensor_t out: The output tensor.
   +cl_tensor out: The output tensor.
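The leaky-ReLU DBK whose arguments are renamed here applies `alpha * x if x < 0 else x` elementwise, per its description in the patch. A minimal illustrative sketch of those semantics in C (the real DBK reads and writes opaque `cl_tensor` objects):

```c
#include <stddef.h>

/* Illustrative reference semantics for the khr_leaky_relu DBK
 * (a sketch, not part of the extension API). Applies
 * alpha * x if x < 0 else x to each element. */
static void leaky_relu_reference(const float *in, float *out,
                                 size_t num_elements, float alpha)
{
    for (size_t i = 0; i < num_elements; ++i)
        out[i] = (in[i] < 0.0f) ? alpha * in[i] : in[i];
}
```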

      @@ -1128,9 +1128,9 @@

      Sample Code

// (cl_channel_type) and a number of dimensions (rank) and dimension
// sizes (shape). Difference over the cl_qcom_ml_ops is that the rank is
// "unlimited".
-cl_tensor_desc_t lhs_tensor_desc = TBD;
-cl_tensor_desc_t rhs_tensor_desc = TBD;
-cl_tensor_desc_t res_tensor_desc = TBD;
+cl_tensor_desc lhs_tensor_desc = TBD;
+cl_tensor_desc rhs_tensor_desc = TBD;
+cl_tensor_desc res_tensor_desc = TBD;

cl_dkb_attributes_matmul matmul_attrs = {
  lhs_tensor_desc, rhs_tensor_desc, res_tensor_desc,
@@ -1176,9 +1176,9 @@

      Sample Code

// determines the optimal data layout (opaque to the application) for
// the tensors based on their usage. Application uses the tensor
// sizes to create cl_mem buffers which are bound to the tensors.
-cl_tensor_t lhs_tensor = TBD;
-cl_tensor_t rhs_tensor = TBD;
-cl_tensor_t res_tensor = TBD;
+cl_tensor lhs_tensor = TBD;
+cl_tensor rhs_tensor = TBD;
+cl_tensor res_tensor = TBD;

// Transfer data to input tensors
@@ -1279,7 +1279,7 @@

      Open questions

    From 13d085d9413b24844864422089d8caf120159c46 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Fri, 27 Oct 2023 11:17:47 +0300 Subject: [PATCH 5/9] * Reflect cl_exp_tensor contents in the dependency section and the DBK code sample. * Fixes and tweaks in the DBK code sample. * Update date. --- ext/cl_khr_defined_builtin_kernels.asciidoc | 69 +++++++++++-------- ext/cl_khr_defined_builtin_kernels.html | 73 ++++++++++++--------- 2 files changed, 84 insertions(+), 58 deletions(-) diff --git a/ext/cl_khr_defined_builtin_kernels.asciidoc b/ext/cl_khr_defined_builtin_kernels.asciidoc index 838473b58..4842034c8 100644 --- a/ext/cl_khr_defined_builtin_kernels.asciidoc +++ b/ext/cl_khr_defined_builtin_kernels.asciidoc @@ -26,7 +26,7 @@ definitions and updating of previously defined ones. |==== | *Date* | *Version* | *Description* | 2022-12-13 | 0.1.0 | First formulation as an extension specification like proposed by Ben Ashbaugh. -| 2022-10-23 | 0.2.0 | +| 2023-11-21 | 0.2.0 | Add APIs for defined built-in kernel (DBK) creation. Model DBKs on tensor type. Add sample code. |==== @@ -37,10 +37,10 @@ This extension is written against the OpenCL Specification version 3.0.12. This extension requires OpenCL 1.2 or later. -This extension requires cl_khr_tensor. +This extension requires cl_exp_tensor. [NOTE] -cl_khr_tensor is unpublished, work-in-progress +cl_exp_tensor is unpublished, work-in-progress extension. Briefly, It will bring concept of tensor, N-dimensional data structure whose data layout is opaque to applications. @@ -432,25 +432,21 @@ clCreateDefinedBuiltInKernelDescriptor. [source,c] ---- -// TBD. Similarly in cl_qcom_ml_ops, tensors have type -// (cl_channel_type) and a number of dimensions (rank) and dimension -// sizes (shape). Difference over the cl_qcom_ml_ops is that the rank is -// "unlimited". 
-cl_tensor_desc lhs_tensor_desc = TBD; -cl_tensor_desc rhs_tensor_desc = TBD; -cl_tensor_desc res_tensor_desc = TBD; +constexpr size_t b = 64, m = 100, n = 200, k = 50; +cl_int err; +cl_tensor lhs_tensor = clCreateTensor(context, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, err); +cl_tensor rhs_tensor = clCreateTensor(context, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, err); +cl_tensor res_tensor = clCreateTensor(context, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err); cl_dkb_attributes_matmul matmul_attrs = { - lhs_tensor_desc, rhs_tensor_desc, res_tensor_desc, - 1, 0 // = Transpose lhs tensor -} + lhs_tensor, rhs_tensor, res_tensor, 1, 0 // = Transpose lhs tensor +}; cl_dbk_mode_properties matmul_props = { // Request a matmul instance that meets this precision. CL_DBK_PROPERTY_MAX_RELATIVE_ERROR, 100, // in ULPs. -} +}; -cl_uint err; std::vector kernel_descriptions; cl_dbk_descriptor matmul_desc = clCreateDefinedBuiltInKernelDescriptor( @@ -460,11 +456,14 @@ cl_dbk_descriptor matmul_desc = } else if (err == CL_DBK_UNAVAILABLE) { // Kernel attributes are valid but the kernel is not supported in at least // one of the devices. + ... } else if (err == CL_DBK_UNMET_MAX_RELATIVE_ERROR) { // E.g. Kernel is supported but is not precise enough. + ... } else if (err == CL_DBK_UNSUPPORTED_MODE_PROPERTY) { // cl_dbk_mode_properties has a property not listed in the description of the // defined built-in kernel. + ... } else kernel_descriptions.push_back(matmul_desc); @@ -476,25 +475,39 @@ cl_program dbk_lib = clCreateProgramWithDefinedBuiltInKernels( ... cl_kernel matmul_kernel = clCreateDefinedBuiltinKernel( - dkb_lib, matmul_desc, err); - -// TBD: allocate space for tensors. Perhaps like cl_qcom_ml_ops: query -// tensor sizes after the final program has been created or after -// command buffer (with DBKs within) is finalized. Implementation -// determines the optimal data layout (opaque to the application) for -// the tensors based on their usage. 
Application uses the tensor -// sizes to create cl_mem buffers which are bound to the tensors. -cl_tensor lhs_tensor = TBD; -cl_tensor rhs_tensor = TBD; -cl_tensor res_tensor = TBD; - -// Transfer data to input tensors + dkb_lib, matmul_desc, &err); +// Set tensor kernel arguments before binding storage to the tensors. This +// gives clCreateBufferWithProperties() opportunity to reason about tensors' +// uses for determining the optimal memory layout (opaque to application) and +// the space neededfor the tensors. clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor_t), &lhs_tensor); clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor_t), &rhs_tensor); clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor_t), &res_tensor); +// Allocate storage for tensors. +cl_mem lhs_mem = clCreateBufferWithProperties( + context, {CL_MEM_BIND_TO_TENSOR, lhs_tensor, 0}, CL_MEM_READ_ONLY, 0, nullptr, &err); +cl_mem rhs_mem = clCreateBufferWithProperties( + context, {CL_MEM_BIND_TO_TENSOR, rhs_tensor, 0}, CL_MEM_READ_ONLY, 0, nullptr, &err); +cl_mem res_mem = clCreateBufferWithProperties( + context, {CL_MEM_BIND_TO_TENSOR, res_tensor, 0}, CL_MEM_WRITE_ONLY, 0, nullptr, &err); + +// Transfer data to input tensors, execute DBK, and import results +// from the output tensor. 
+ +std::vector lhs_data = ...; +std::vector rhs_data = ...; +std::vector res_data(b * m * n); + +clEnqueueExportToTensor(cmd_q, lhs_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, m, k}, + nullptr, nullptr, lhs_data.data(), 0, nullptr, nullptr) +clEnqueueExportToTensor(cmd_q, rhs_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, k, n}, + nullptr, nullptr, rhs_data.data(), 0, nullptr, nullptr) clEnqueueNDRangeKernel(cmd_q, matmul_kernel, 0, NULL, NULL, NULL, 0, NULL, NULL); +clEnqueueImportFromTensor( + cmd_q, res_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, m, n}, + nullptr, nullptr, res_data.data(), 0, nullptr, nullptr); ---- === Open questions diff --git a/ext/cl_khr_defined_builtin_kernels.html b/ext/cl_khr_defined_builtin_kernels.html index 6681f064a..c045b144e 100644 --- a/ext/cl_khr_defined_builtin_kernels.html +++ b/ext/cl_khr_defined_builtin_kernels.html @@ -482,7 +482,7 @@

    Version History

    First formulation as an extension specification like proposed by Ben Ashbaugh.

-2022-10-23
+2023-11-21

    0.2.0

    Add APIs for defined built-in kernel (DBK) creation. Model DBKs on tensor type. Add sample code.

    @@ -499,7 +499,7 @@

    Dependencies

    This extension requires OpenCL 1.2 or later.

-This extension requires cl_khr_tensor.
+This extension requires cl_exp_tensor.

    @@ -508,7 +508,7 @@

    Dependencies

    Note
    @@ -1124,25 +1124,21 @@

    Launching DBKs from the Device Sid

    Sample Code

    -
    // TBD. Similarly in cl_qcom_ml_ops, tensors have type
    -// (cl_channel_type) and a number of dimensions (rank) and dimension
    -// sizes (shape). Difference over the cl_qcom_ml_ops is that the rank is
    -// "unlimited".
    -cl_tensor_desc lhs_tensor_desc = TBD;
    -cl_tensor_desc rhs_tensor_desc = TBD;
    -cl_tensor_desc res_tensor_desc = TBD;
    +
    constexpr size_t b = 64, m = 100, n = 200, k = 50;
    +cl_int err;
    +cl_tensor lhs_tensor = clCreateTensor(context, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, err);
    +cl_tensor rhs_tensor = clCreateTensor(context, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, err);
    +cl_tensor res_tensor = clCreateTensor(context, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err);
     
     cl_dkb_attributes_matmul matmul_attrs = {
    -  lhs_tensor_desc, rhs_tensor_desc, res_tensor_desc,
    -  1, 0 // = Transpose lhs tensor
    -}
    +  lhs_tensor, rhs_tensor, res_tensor, 1, 0 // = Transpose lhs tensor
    +};
     
     cl_dbk_mode_properties matmul_props = {
       // Request a matmul instance that meets this precision.
       CL_DBK_PROPERTY_MAX_RELATIVE_ERROR, 100, // in ULPs.
    -}
    +};
     
    -cl_uint err;
     std::vector<cl_dbk_descriptor> kernel_descriptions;
     cl_dbk_descriptor matmul_desc =
       clCreateDefinedBuiltInKernelDescriptor(
    @@ -1152,11 +1148,14 @@ 

    Sample Code

    } else if (err == CL_DBK_UNAVAILABLE) { // Kernel attributes are valid but the kernel is not supported in at least // one of the devices. + ... } else if (err == CL_DBK_UNMET_MAX_RELATIVE_ERROR) { // E.g. Kernel is supported but is not precise enough. + ... } else if (err == CL_DBK_UNSUPPORTED_MODE_PROPERTY) { // cl_dbk_mode_properties has a property not listed in the description of the // defined built-in kernel. + ... } else kernel_descriptions.push_back(matmul_desc); @@ -1168,25 +1167,39 @@

    Sample Code

    ... cl_kernel matmul_kernel = clCreateDefinedBuiltinKernel( - dkb_lib, matmul_desc, err); - -// TBD: allocate space for tensors. Perhaps like cl_qcom_ml_ops: query -// tensor sizes after the final program has been created or after -// command buffer (with DBKs within) is finalized. Implementation -// determines the optimal data layout (opaque to the application) for -// the tensors based on their usage. Application uses the tensor -// sizes to create cl_mem buffers which are bound to the tensors. -cl_tensor lhs_tensor = TBD; -cl_tensor rhs_tensor = TBD; -cl_tensor res_tensor = TBD; - -// Transfer data to input tensors + dkb_lib, matmul_desc, &err); +// Set tensor kernel arguments before binding storage to the tensors. This +// gives clCreateBufferWithProperties() opportunity to reason about tensors' +// uses for determining the optimal memory layout (opaque to application) and +// the space neededfor the tensors. clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor_t), &lhs_tensor); clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor_t), &rhs_tensor); clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor_t), &res_tensor); -clEnqueueNDRangeKernel(cmd_q, matmul_kernel, 0, NULL, NULL, NULL, 0, NULL, NULL);
    +// Allocate storage for tensors. +cl_mem lhs_mem = clCreateBufferWithProperties( + context, {CL_MEM_BIND_TO_TENSOR, lhs_tensor, 0}, CL_MEM_READ_ONLY, 0, nullptr, &err); +cl_mem rhs_mem = clCreateBufferWithProperties( + context, {CL_MEM_BIND_TO_TENSOR, rhs_tensor, 0}, CL_MEM_READ_ONLY, 0, nullptr, &err); +cl_mem res_mem = clCreateBufferWithProperties( + context, {CL_MEM_BIND_TO_TENSOR, res_tensor, 0}, CL_MEM_WRITE_ONLY, 0, nullptr, &err); + +// Transfer data to input tensors, execute DBK, and import results +// from the output tensor. + +std::vector<float> lhs_data = ...; +std::vector<float> rhs_data = ...; +std::vector<float> res_data(b * m * n); + +clEnqueueExportToTensor(cmd_q, lhs_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, m, k}, + nullptr, nullptr, lhs_data.data(), 0, nullptr, nullptr) +clEnqueueExportToTensor(cmd_q, rhs_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, k, n}, + nullptr, nullptr, rhs_data.data(), 0, nullptr, nullptr) +clEnqueueNDRangeKernel(cmd_q, matmul_kernel, 0, NULL, NULL, NULL, 0, NULL, NULL); +clEnqueueImportFromTensor( + cmd_q, res_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, m, n}, + nullptr, nullptr, res_data.data(), 0, nullptr, nullptr);
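The updated sample wires three tensors shaped `{b, m, k}`, `{b, k, n}` and `{b, m, n}` into the matmul DBK. As a sanity check on that shape arithmetic, a small illustrative C helper (hypothetical, not part of the extension API) can verify the dimensions agree and compute the element counts used to size the host-side vectors:

```c
#include <stddef.h>

/* Hypothetical helper, purely illustrative: checks that batched matmul
 * tensor shapes {b, m, k}, {b, k, n}, {b, m, n} are compatible. */
static int matmul_shapes_ok(const size_t lhs[3], const size_t rhs[3],
                            const size_t res[3])
{
    return lhs[0] == rhs[0] && rhs[0] == res[0] && /* same batch size  */
           lhs[2] == rhs[1] &&                     /* inner dims match */
           res[1] == lhs[1] && res[2] == rhs[2];   /* m, n carried over */
}

/* Element count of a rank-3 tensor; the sample sizes res_data this way. */
static size_t tensor_elements(const size_t shape[3])
{
    return shape[0] * shape[1] * shape[2];
}
```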
    @@ -1279,7 +1292,7 @@

    Open questions

    From e0905aaf0998159ea90b5a16bebc1f574656c50c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Wed, 22 Nov 2023 12:22:01 +0200 Subject: [PATCH 6/9] Remove note about unpublished extension --- ext/cl_khr_defined_builtin_kernels.asciidoc | 7 +------ ext/cl_khr_defined_builtin_kernels.html | 18 ++---------------- 2 files changed, 3 insertions(+), 22 deletions(-) diff --git a/ext/cl_khr_defined_builtin_kernels.asciidoc b/ext/cl_khr_defined_builtin_kernels.asciidoc index 4842034c8..5b1c3564f 100644 --- a/ext/cl_khr_defined_builtin_kernels.asciidoc +++ b/ext/cl_khr_defined_builtin_kernels.asciidoc @@ -26,7 +26,7 @@ definitions and updating of previously defined ones. |==== | *Date* | *Version* | *Description* | 2022-12-13 | 0.1.0 | First formulation as an extension specification like proposed by Ben Ashbaugh. -| 2023-11-21 | 0.2.0 | +| 2023-11-22 | 0.2.0 | Add APIs for defined built-in kernel (DBK) creation. Model DBKs on tensor type. Add sample code. |==== @@ -39,11 +39,6 @@ This extension requires OpenCL 1.2 or later. This extension requires cl_exp_tensor. -[NOTE] -cl_exp_tensor is unpublished, work-in-progress -extension. Briefly, It will bring concept of tensor, N-dimensional -data structure whose data layout is opaque to applications. - ==== Contributors Pekka Jääskeläinen, Intel and Tampere University. + diff --git a/ext/cl_khr_defined_builtin_kernels.html b/ext/cl_khr_defined_builtin_kernels.html index c045b144e..20b209040 100644 --- a/ext/cl_khr_defined_builtin_kernels.html +++ b/ext/cl_khr_defined_builtin_kernels.html @@ -482,7 +482,7 @@

    Version History

    - + @@ -501,20 +501,6 @@

    Dependencies

    This extension requires cl_exp_tensor.

    -
    -
    -cl_khr_tensor is unpublished, work-in-progress +cl_exp_tensor is unpublished, work-in-progress extension. Briefly, It will bring concept of tensor, N-dimensional data structure whose data layout is opaque to applications.

    First formulation as an extension specification like proposed by Ben Ashbaugh.

-2023-11-21
+2023-11-22

    0.2.0

    Add APIs for defined built-in kernel (DBK) creation. Model DBKs on tensor type. Add sample code.

    - - - - -
    -
    Note
    -
    -cl_exp_tensor is unpublished, work-in-progress -extension. Briefly, It will bring concept of tensor, N-dimensional -data structure whose data layout is opaque to applications. -
    -

    Contributors

    @@ -1292,7 +1278,7 @@

    Open questions

    From e8ce3b079941fc69e501fc9ac0ee9c9a1a956707 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 23 Nov 2023 10:05:13 +0200 Subject: [PATCH 7/9] * Missed cl_tensor_desc -> cl_tensor * Fix typo. --- ext/cl_khr_defined_builtin_kernels.asciidoc | 16 +++++------ ext/cl_khr_defined_builtin_kernels.html | 30 ++++++++++++++------- 2 files changed, 28 insertions(+), 18 deletions(-) diff --git a/ext/cl_khr_defined_builtin_kernels.asciidoc b/ext/cl_khr_defined_builtin_kernels.asciidoc index 5b1c3564f..da5c811d2 100644 --- a/ext/cl_khr_defined_builtin_kernels.asciidoc +++ b/ext/cl_khr_defined_builtin_kernels.asciidoc @@ -26,7 +26,7 @@ definitions and updating of previously defined ones. |==== | *Date* | *Version* | *Description* | 2022-12-13 | 0.1.0 | First formulation as an extension specification like proposed by Ben Ashbaugh. -| 2023-11-22 | 0.2.0 | +| 2023-11-23 | 0.2.0 | Add APIs for defined built-in kernel (DBK) creation. Model DBKs on tensor type. Add sample code. |==== @@ -339,12 +339,11 @@ Each defined built-in kernel entry is organized as follows: | Name: *khr_matmul* | *Kernel Attributes* a| - Fields of the `cl_dkb_attributes_matmul` structure: -. cl_tensor_desc A: Tensor description for input matrix A. -. cl_tensor_desc B: Tensor description for input matrix B. -. cl_tensor_desc R: Tensor description for output matrix C. +. cl_tensor A: Tensor description for input matrix A. +. cl_tensor B: Tensor description for input matrix B. +. cl_tensor R: Tensor description for output matrix C. . cl_int transposeA: Non-zero transposes A matrix. . cl_int transposeB: Non-zero transposes B matrix. | *Kernel Arguments* @@ -390,8 +389,9 @@ This DBK accepts the following properties: | *Kernel Attributes* a| Fields of the `cl_dbk_leaky_relu` structure: -. cl_tensor_desc in: Input tensor description. -. cl_tensor_desc out: Output tensor description. + +. cl_tensor in: Input tensor description. +. cl_tensor out: Output tensor description. . 
cl_float alpha: Coefficient of leakage. | *Kernel arguments* a| @@ -475,7 +475,7 @@ cl_kernel matmul_kernel = clCreateDefinedBuiltinKernel( // Set tensor kernel arguments before binding storage to the tensors. This // gives clCreateBufferWithProperties() opportunity to reason about tensors' // uses for determining the optimal memory layout (opaque to application) and -// the space neededfor the tensors. +// the space needed for the tensors. clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor_t), &lhs_tensor); clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor_t), &rhs_tensor); clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor_t), &res_tensor); diff --git a/ext/cl_khr_defined_builtin_kernels.html b/ext/cl_khr_defined_builtin_kernels.html index 20b209040..6d780d29e 100644 --- a/ext/cl_khr_defined_builtin_kernels.html +++ b/ext/cl_khr_defined_builtin_kernels.html @@ -482,7 +482,7 @@

    Version History

    First formulation as an extension specification like proposed by Ben Ashbaugh.

-2023-11-22
+2023-11-23

    0.2.0

    Add APIs for defined built-in kernel (DBK) creation. Model DBKs on tensor type. Add sample code.

    @@ -922,13 +922,13 @@

    The Defined Built-in Kernels

1. -cl_tensor_desc A: Tensor description for input matrix A.
   +cl_tensor A: Tensor description for input matrix A.
2. -cl_tensor_desc B: Tensor description for input matrix B.
   +cl_tensor B: Tensor description for input matrix B.
3. -cl_tensor_desc R: Tensor description for output matrix C.
   +cl_tensor R: Tensor description for output matrix C.

    4. cl_int transposeA: Non-zero transposes A matrix.

      @@ -1033,10 +1033,20 @@

      The Defined Built-in Kernels

-Fields of the cl_dbk_leaky_relu structure: -. cl_tensor_desc in: Input tensor description. -. cl_tensor_desc out: Output tensor description. -. cl_float alpha: Coefficient of leakage.
+Fields of the cl_dbk_leaky_relu structure:
+1. cl_tensor in: Input tensor description.
+2. cl_tensor out: Output tensor description.
+3. cl_float alpha: Coefficient of leakage.
      @@ -1158,7 +1168,7 @@

      Sample Code

// Set tensor kernel arguments before binding storage to the tensors. This
// gives clCreateBufferWithProperties() opportunity to reason about tensors'
// uses for determining the optimal memory layout (opaque to application) and
-// the space neededfor the tensors.
+// the space needed for the tensors.
clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor_t), &lhs_tensor);
clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor_t), &rhs_tensor);
clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor_t), &res_tensor);
@@ -1278,7 +1288,7 @@

      Open questions

    From 452597e36c080e325f1cc57c88ad6a3160d0eb76 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Mon, 19 Aug 2024 17:25:35 +0300 Subject: [PATCH 8/9] Update cl_khr_defined_builtin_kernels --- ext/cl_khr_defined_builtin_kernels.asciidoc | 552 ----- ext/cl_khr_defined_builtin_kernels.html | 1295 ----------- .../cl_khr_defined_builtin_kernels.asciidoc | 925 ++++++++ .../cl_khr_defined_builtin_kernels.html | 1888 +++++++++++++++++ 4 files changed, 2813 insertions(+), 1847 deletions(-) delete mode 100644 ext/cl_khr_defined_builtin_kernels.asciidoc delete mode 100644 ext/cl_khr_defined_builtin_kernels.html create mode 100644 extensions/cl_khr_defined_builtin_kernels.asciidoc create mode 100644 extensions/cl_khr_defined_builtin_kernels.html diff --git a/ext/cl_khr_defined_builtin_kernels.asciidoc b/ext/cl_khr_defined_builtin_kernels.asciidoc deleted file mode 100644 index da5c811d2..000000000 --- a/ext/cl_khr_defined_builtin_kernels.asciidoc +++ /dev/null @@ -1,552 +0,0 @@ -// Copyright 2018-2022 The Khronos Group. This work is licensed under a -// Creative Commons Attribution 4.0 International License; see -// http://creativecommons.org/licenses/by/4.0/ -= cl_khr_defined_builtin_kernels = - -:source-highlighter: coderay - -[[cl_khr_defined_builtin_kernels]] -== Khronos-Defined Built-in Kernels (Early Draft) - -The purpose of this extension is to provide a standardized set of built-in -kernels with well-defined semantics useful for accelerating applications -from various domains. The extension specification is designed to rapidly -expand and "live" via addition of new well-defined built-in kernel -definitions and updating of previously defined ones. - -=== General Information - -==== Name Strings - -`cl_khr_defined_builtin_kernels` - -==== Version History - -[cols="1,1,3",options="header",] -|==== -| *Date* | *Version* | *Description* -| 2022-12-13 | 0.1.0 | First formulation as an extension specification like proposed by Ben Ashbaugh. 
-| 2023-11-23 | 0.2.0 | -Add APIs for defined built-in kernel (DBK) creation. Model DBKs on -tensor type. Add sample code. -|==== - -==== Dependencies - -This extension is written against the OpenCL Specification version 3.0.12. - -This extension requires OpenCL 1.2 or later. - -This extension requires cl_exp_tensor. - -==== Contributors - -Pekka Jääskeläinen, Intel and Tampere University. + -Topi Leppänen, Tampere University. + -Jan Solanti, Tampere University. + -Ben Ashbaugh, Intel. + -Henry Linjamäki, Intel. + - -=== Overview - -OpenCL 1.2 specifies a built-in kernel as a kernel that is executed on -an OpenCL device or custom device by fixed-function hardware or in firmware. -Applications can query the built-in kernels supported by a device or custom -device. - -Built-in kernels are referred to by a name (a C string) without any -semantics attached to the functionality. The semantics behind the name -is completely device specific, typically documented in vendor-specific -extension specifications. - -The goal for this extension is to lower the bar for utilizing hardware -accelerated functions in drivers by providing a library of -well-defined built-in kernel with good coverage for common acceleration needs -and which is designed to easily evolve over time. - -The device drivers that implement this extension can freely choose which -subset of defined built-in-kernels (DBKs) they implement and advertise to the clients. The -clients can use the DBKs to accelerate their applications by manually -executing invoking the DBKs. The extension is designed to also support using -automated task graph lowering tooling later. - -==== Background - -ASIC-based coarse-grained hardware accelerators are specialized logic meant to -speed up execution of workloads of interest, or to provide improvements in -energy-efficiency. 
Examples of contemporary workloads that are beneficially hardware -accelerated over software-based implementations include video coding, deep learning, -cryptography, software-defined radio and graphics rendering. - -FPGAs form a special case somewhere between instruction-set architectures and fixed -function hardware accelerators. While advances in high-level synthesis tools -have attempted to bridge the programmability gap between GPU and FPGA programming, -FPGAs are still considered as devices which are challenging to achieve efficient -implementations with. Due to extensive manual optimization work required for efficient -implementations of the accelerated functionality, defining FPGA designs as -a system of "hardware accelerator IPs" is still a widely used "application abstraction". -FPGAs can be thus seen as a platform that can realize and integrate any -hardware accelerator implementable with the programmable fabric. - -The means to utilize hardware accelerators have typically been -vendor-specific and abstracted behind domain-specific libraries. -The overhead with the "bunch of libraries"-approach is seen in the lowest level -of integration: The libraries utilize a low level library (typically -vendor-specific) to interface with the actual hardware, and thus does not -integrate efficiently with other libraries or software-programmable processors -that might be available on the same chip. - -==== Rationale - -OpenCL's built-in kernel abstraction allows pushing both hardware -accelerated and software defined kernels to the same command-queues, -providing a powerful means for asynchronous execution of heterogeneous -task graphs on diverse heterogeneous platforms. The ability to invoke hardware -accelerators while being able to synchronize and optimize data transfers at -the lowest levels of the driver stack can provide significant latency benefits, -especially when combined with the command-buffering mechanism. 
- -However, the built-in kernel abstraction works well only when it is widely adopted by -vendors, and when multiple vendors implement the same definitions. Otherwise -each vendor specifies and implements their own built-in kernels closely matching their -own hardware accelerator properties, resulting in lack of cross-vendor -portability in the API abstraction presented to the upper layers of -heterogeneous computing software stacks. - -This extension standardizes a set of well-defined built-in kernels the -clients can call from higher level programming stacks built with -different languages and multiple libraries, possibly mix accelerator -calls with calls to software kernel commands, and rely on the driver -stack to optimize the execution (especially the synchronization and -communication) as a low level heterogeneous task graph. The -heterogeneous task graph can be described using multiple -command-queues and optionally cached using the command buffer -extension (cl_khr_command_buffer). It aims to promote the use of -built-in kernels as a programming model for hardware accelerated -functionality, to improve cross-vendor portability of hardware -accelerated computing. - - -=== Add new section X.Y.Z Querying Defined Built-in Kernels - -To request a defined built-in kernel to be executed in the given -devices use: - -[source,c] ----- -cl_dbk_descriptor clCreateDefinedBuiltInKernelDescriptor( - cl_context context, - cl_uint num_devices, - const cl_device_id* device_list, - cl_dbk_name kernel_name, - const void *kernel_attributes, - const cl_dbk_mode_properties* kernel_config, - cl_int *errcode_ret); ----- - -* _context_ must be a valid OpenCL context. - -* _num_devices_ is the number of devices listed in - device_list. _num_devices_ must be non-zero. - -* _device_list_ is a pointer to a list of devices that are in - context. _device_list_ must be a non-NULL value. The defined built-in kernels - are loaded for devices specified in this list. 
- -* _kernel_name_ is the name of the defined built-in kernel listed in Appendix I. - -* _kernel_attributes_ is a pointer to the structure declared in - description of the kernel in Appendix I. The structure holds - kernel's attributes. - -* _cl_dbk_mode_properties_ is a pointer to a list of defined built-in - kernel mode properties. The supported mode properties are listed in - DBK's entry with default settings in Appendix I. It is valid to set - this argument to NULL in which case default properties apply (if - any). - -*clCreateDefinedBuiltInKernelDescriptor* returns a valid kernel -descriptor on success indicated by _errcode_ret_ which is set to -CL_SUCCESS. Otherwise, the returned object is NULL and the -_errcode_ret_ is set to one of following code: - -* CL_DBK_INVALID_ATTRIBUTE if one or more kernel attributes violates - conditions descried in defined built-in kernel entry in Appendix I. - -* CL_DBK_UNAVAILABLE if kernel attributes are valid but the - kernel is not supported on one of the devices. - -* CL_DBK_UNSUPPORTED_MODE_PROPERTY if _cl_dbk_mode_properties_ includes - at least one property not listed in DBK's entry. - -* CL_DBK_UNMET_MAX_RELATIVE_ERROR if the DBK is available but does not - meet the requested constraint set by - CL_DBK_PROPERTY_MAX_RELATIVE_ERROR property. - -[cols="2,1,2",stripes=odd] -|=== -| *DBK Mode Property* | *Property Value* | *Description* - -| CL_DBK_PROPERTY_MAX_RELATIVE_ERROR | float - -a| Require that the DBK produces the results which do not deviate more -than the given amount value of ULPs (units in the last place) respect -to infnitely precise result. - -| CL_DBK_PROPERTY_NON_DETERMINISTIC | cl_bool - -a| Allow results of the kernel to be non-reproducible. This allows -implementation to switch algorithm of the kernel on each launch for -possibly better performance. 
-// Idea from https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking - -|=== - -=== Add new function to 5.8.1 Creating Program Objects - -To create a program with a set of defined built-in kernel use: - -[source,c] ----- -cl_program clCreateProgramWithDefinedKernels( - cl_context context, - size_t num_kernel_desc, - const void* kernel_desc_list, - cl_int* errcode_ret); ----- - -* _context_ must be a valid OpenCL context. - -* _num_kernel_desc_ is the number of kernel descriptors. - -* _kernel_desc_list_ is the array of valid - cl_dbk_descriptor objects. The array length must be at - least _num_kernel_desc_. The kernel descriptors must be created on - the same context. - -*clCreateProgramWithDefinedKernels* returns a valid program on success -indicated by _errcode_ret_ which is set to CL_SUCCESS. Otherwise, the -returned object is NULL and the _errcode_ret_ is set to one of -following code: - -* TODO. - -=== Add new function to 5.9.1 Creating Kernel Objects - -To get a kernel handle for a defined built-in kernel in a program use: - -[source,c] ----- -cl_kernel clCreateDefinedBuiltInKernel( - cl_program program, - cl_dbk_descriptor kernel_desc, - cl_int* errcode_ret); ----- - -* _program_ is a program object with a successfully built executable. - -* _kernel_desc_ is a defined built-in kernel descriptor in the program. - -* _errcode_ret_ will return an appropriate error code. If errcode_ret is - NULL, no error code is returned. - -*clCreateDefinedBuiltInKernel* returns a valid non-zero kernel object - and errcode_ret is set to CL_SUCCESS if the kernel object is created - successfully. Otherwise, it returns a NULL value with one of the - following error values returned in _errcode_ret_: - -* TODO. - - -=== Add new appendix "Appendix I - Defined Built-in Kernels" to OpenCL API Specification - -This chapter describes standard defined built-in kernels (DBK) with -well-defined semantics. 
Devices can report the
availability of the built-in kernels listed in this section via the
`clCreateDefinedBuiltInKernelDescriptor` call. The availability of a
DBK is determined from the arguments passed to
`clCreateDefinedBuiltInKernelDescriptor`, and unavailability of a DBK
is indicated by the CL_DBK_UNAVAILABLE error code.

The general client-side abstraction of a DBK is similar to a call to a
C function whose implementation is hidden. Device drivers are free to
implement a DBK by invoking one or more coarse- and fine-grained
hardware accelerators, combined with firmware, to implement the
semantics as efficiently as possible.

It is the driver's responsibility to handle efficient synchronization
and communication with the hardware accelerator, the internal
accelerator state management, and resource sharing across multiple
OpenCL contexts.

==== Reproducibility ====

Identical DBKs, or the same DBK executed repeatedly with identical
inputs, are guaranteed to produce identical results, unless otherwise
stated in the DBK's description, when:

* enqueued to the same device,

* on the same platform,

* on the same vendor with the same driver version, and

* the CL_DBK_PROPERTY_NON_DETERMINISTIC property is not set.

Two DBK descriptors for a device are considered identical if they are
created using identical kernel name, kernel attribute and kernel mode
property arguments. In other cases, identical inputs may produce
different results. The results may differ because, for example,
different algorithms are used across devices.

DBKs may produce approximate results, and the error with respect to
the infinitely precise result can be optionally controlled by
CL_DBK_PROPERTY_MAX_RELATIVE_ERROR when that property name is listed
in the DBK's description. DBKs without the
CL_DBK_PROPERTY_MAX_RELATIVE_ERROR property produce exact results.

==== The Defined Built-in Kernels ====

The following is a list of recognized defined built-in kernels.
The list is
expected to be expanded and updated over future versions of this
extension, while preserving backwards compatibility.

Each defined built-in kernel entry is organized as follows:

* *Name*: Name of the defined built-in kernel (an enumeration).

* *Kernel attributes*: The kernel attributes required for creating the
  defined built-in kernel via
  clCreateDefinedBuiltInKernelDescriptor. Attribute values are
  immutable.

* *Kernel arguments*: The kernel arguments.

* *Description*: The description of the kernel in detail.

* *Attribute validation rules*: Conditions on the kernel attributes of
  the kernel. The implementation must return CL_DBK_INVALID_ATTRIBUTE
  on a clCreateDefinedBuiltInKernelDescriptor call if any of the
  conditions are violated.

* *Kernel mode properties*: List of kernel mode
  properties (cl_dbk_mode_properties) the kernel recognizes. The
  properties can be used to tweak certain implementation details and
  behaviors of the kernel execution. If a property not listed in the
  DBK entry is fed to a clCreateDefinedBuiltInKernelDescriptor call,
  the implementation must return CL_DBK_UNSUPPORTED_MODE_PROPERTY.

[caption="Table A.I.1. "]
.Standard Built-in Kernels and Their Semantics. *The table has been populated with a small set of non-trivial example entries which are subject to change and the list to expand during drafting.*
|===
| Name: *khr_matmul*
| *Kernel Attributes*
a|
Fields of the `cl_dbk_attributes_matmul` structure:

. cl_tensor A: Tensor description for input matrix A.
. cl_tensor B: Tensor description for input matrix B.
. cl_tensor R: Tensor description for output matrix R.
. cl_int transposeA: Non-zero transposes matrix A.
. cl_int transposeB: Non-zero transposes matrix B.
| *Kernel Arguments*
a|
. cl_tensor A: Matrix A (read only).
. cl_tensor B: Matrix B (read only).
. cl_tensor R: Output matrix (write only).
| *Description*
a|
Performs (batched) matrix multiplication: `R = trans(A) * trans(B)`,
where `A`, `B` and `R` are tensors of at least rank two. `trans()` is
a configurable transpose operation.

The last two dimensions of the tensors are treated as the operands of
the matrix multiplication and the rest of the dimensions are treated
as batch dimensions.

The operations of the matrix multiplication are performed in the
precision of `elementof\(R)`.

If an overflow occurs in the accumulation of the products, the result
in the `R` tensor is undefined.

| *Attribute validation rules*
a|

* `rankof(A) == rankof(B) >= 2`.
* Let `shapeof(A~t~) == (b..., m, k)` and `shapeof(B~t~) == (b..., k,
  n)` be the shapes of tensors `A` and `B`, respectively, after
  possible transposing. `shapeof\(R)` must be `(b..., m, n)`.
* `elementof(A) == elementof(B)`
* `elemkindof\(R) == elemkindof(A)`
* `elementof\(R) == elementof(A)`, or `elementof(A)` is promotable to
  `elementof\(R)` without loss of meaning.
// E.g. cl_int -> cl_uint: loses negative values
| *Kernel mode properties*
a|
This DBK accepts the following properties:

* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR: An unset property defaults to positive infinity.
|
| Name: *khr_leaky_relu*
| *Kernel Attributes*
a|
Fields of the `cl_dbk_leaky_relu` structure:

. cl_tensor in: Input tensor description.
. cl_tensor out: Output tensor description.
. cl_float alpha: Coefficient of leakage.
| *Kernel arguments*
a|
. cl_tensor in: The input tensor.
. cl_tensor out: The output tensor.
| *Description*
a|
Applies the operation `alpha * x if x < 0 else x` to all elements of
the `in` tensor.

If the target device does not support denormals, `alpha` is flushed to
zero before the operation is applied.

| *Kernel mode properties*
| N/A
| *Attribute validation rules*
a|
* `shapeof(in) == shapeof(out)`
* `elementof(in) == elementof(out)`
* `alpha` must be a finite value.
|===

==== Launching DBKs from the Device Side ====

DBKs are primarily meant to be launched as kernel commands via
host-side command queues. Optionally, they can be callable from the
device side via `enqueue_kernel`:

TBC. This probably needs a device-side function corresponding to
clCreateDefinedBuiltInKernelDescriptor.

==== Sample Code ====

[source,c]
----
constexpr size_t b = 64, m = 100, n = 200, k = 50;
cl_int err;
cl_tensor lhs_tensor = clCreateTensor(context, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, err);
cl_tensor rhs_tensor = clCreateTensor(context, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, err);
cl_tensor res_tensor = clCreateTensor(context, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err);

cl_dbk_attributes_matmul matmul_attrs = {
  lhs_tensor, rhs_tensor, res_tensor, 1, 0 // = Transpose lhs tensor.
};

cl_dbk_mode_properties matmul_props = {
  // Request a matmul instance that meets this precision.
  CL_DBK_PROPERTY_MAX_RELATIVE_ERROR, 100, // in ULPs.
};

std::vector<cl_dbk_descriptor> kernel_descriptions;
cl_dbk_descriptor matmul_desc =
  clCreateDefinedBuiltInKernelDescriptor(
    context, num_devices, device_list,
    CL_DBK_MATMUL, &matmul_attrs, &matmul_props, &err);

if (err == CL_DBK_UNAVAILABLE) {
  // Kernel attributes are valid but the kernel is not supported on at least
  // one of the devices.
  ...
} else if (err == CL_DBK_UNMET_MAX_RELATIVE_ERROR) {
  // E.g. the kernel is supported but is not precise enough.
  ...
} else if (err == CL_DBK_UNSUPPORTED_MODE_PROPERTY) {
  // cl_dbk_mode_properties has a property not listed in the description of the
  // defined built-in kernel.
  ...
} else
  kernel_descriptions.push_back(matmul_desc);

...

cl_program dbk_lib = clCreateProgramWithDefinedKernels(
  context, kernel_descriptions.size(), kernel_descriptions.data(), &err);

...
cl_kernel matmul_kernel = clCreateDefinedBuiltInKernel(
  dbk_lib, matmul_desc, &err);

// Set tensor kernel arguments before binding storage to the tensors. This
// gives clCreateBufferWithProperties() an opportunity to reason about the
// tensors' uses for determining the optimal memory layout (opaque to the
// application) and the space needed for the tensors.
clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor_t), &lhs_tensor);
clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor_t), &rhs_tensor);
clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor_t), &res_tensor);

// Allocate storage for the tensors.
cl_mem lhs_mem = clCreateBufferWithProperties(
  context, {CL_MEM_BIND_TO_TENSOR, lhs_tensor, 0}, CL_MEM_READ_ONLY, 0, nullptr, &err);
cl_mem rhs_mem = clCreateBufferWithProperties(
  context, {CL_MEM_BIND_TO_TENSOR, rhs_tensor, 0}, CL_MEM_READ_ONLY, 0, nullptr, &err);
cl_mem res_mem = clCreateBufferWithProperties(
  context, {CL_MEM_BIND_TO_TENSOR, res_tensor, 0}, CL_MEM_WRITE_ONLY, 0, nullptr, &err);

// Transfer data to the input tensors, execute the DBK, and import the results
// from the output tensor.

std::vector<float> lhs_data = ...;
std::vector<float> rhs_data = ...;
std::vector<float> res_data(b * m * n);

clEnqueueExportToTensor(cmd_q, lhs_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, m, k},
  nullptr, nullptr, lhs_data.data(), 0, nullptr, nullptr);
clEnqueueExportToTensor(cmd_q, rhs_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, k, n},
  nullptr, nullptr, rhs_data.data(), 0, nullptr, nullptr);
clEnqueueNDRangeKernel(cmd_q, matmul_kernel, 0, NULL, NULL, NULL, 0, NULL, NULL);
clEnqueueImportFromTensor(
  cmd_q, res_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, m, n},
  nullptr, nullptr, res_data.data(), 0, nullptr, nullptr);
----

=== Open questions

. Should we enable launching DBKs from the device side without requiring device-side enqueue? The main problem is the DBKs with an NDRange, as they are not simple single-WI helper functions.
+
--
*UNRESOLVED*

--
. Should the NDRange be used at all in DBKs? It feels somewhat unnatural, as the NDRange is typically used to imply SPMD parallelism while the hardware/firmware is free to choose whatever parallelization strategy to implement the function with. On the other hand, something similar applies to software kernel launches, as NDRange-launched work-items can be executed serially when adhering to barrier semantics.
+
--
*UNRESOLVED*

--

. Different accelerators prefer different channel orders (NHWC vs. NCHW, ...) for the processed data. Should the channel order be passed as a DBK argument (like the example GEMM's row/column order), or is it better to have a different DBK variation for each?
+
--
*UNRESOLVED*

--

. How to denote preference? Some of the DBKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with a suboptimal channel order in NN accelerators) might still be beneficially accelerated.
+
--
*UNRESOLVED*

--

. Since the defined built-in kernel concept is basically just a C-like API inside another API, should it be made more generic and thus directly usable for SYCL and Vulkan as well?
+
--
*UNRESOLVED*

--

. What other DBK mode properties should we have? Here are some ideas:
** Perform accumulation with saturation.
** Finite math only.
** Flush denormals to zero.
** Data layout preferences (NHWC for convolution).
--
*UNRESOLVED*
--
diff --git a/ext/cl_khr_defined_builtin_kernels.html b/ext/cl_khr_defined_builtin_kernels.html
deleted file mode 100644
index 6d780d29e..000000000
--- a/ext/cl_khr_defined_builtin_kernels.html
+++ /dev/null
@@ -1,1295 +0,0 @@
Khronos-Defined Built-in Kernels (Early Draft)

The purpose of this extension is to provide a standardized set of built-in
kernels with well-defined semantics useful for accelerating applications
from various domains. The extension specification is designed to rapidly
expand and "live" via addition of new well-defined built-in kernel
definitions and updating of previously defined ones.

General Information

Name Strings

cl_khr_defined_builtin_kernels

Version History

Date       | Version | Description
2022-12-13 | 0.1.0   | First formulation as an extension specification, as proposed by Ben Ashbaugh.
2023-11-23 | 0.2.0   | Add APIs for defined built-in kernel (DBK) creation. Model DBKs on the tensor type. Add sample code.

Dependencies

This extension is written against the OpenCL Specification version 3.0.12.

This extension requires OpenCL 1.2 or later.

This extension requires cl_exp_tensor.

Contributors

Pekka Jääskeläinen, Intel and Tampere University.
Topi Leppänen, Tampere University.
Jan Solanti, Tampere University.
Ben Ashbaugh, Intel.
Henry Linjamäki, Intel.
Overview

OpenCL 1.2 specifies a built-in kernel as a kernel that is executed on
an OpenCL device or custom device by fixed-function hardware or in
firmware. Applications can query the built-in kernels supported by a
device or custom device.

Built-in kernels are referred to by a name (a C string) without any
semantics attached to the functionality. The semantics behind the name
are completely device specific, typically documented in vendor-specific
extension specifications.

The goal of this extension is to lower the bar for utilizing hardware
accelerated functions in drivers by providing a library of well-defined
built-in kernels with good coverage of common acceleration needs,
designed to easily evolve over time.

The device drivers that implement this extension can freely choose
which subset of the defined built-in kernels (DBKs) they implement and
advertise to the clients. The clients can use the DBKs to accelerate
their applications by manually invoking the DBKs. The extension is
designed to also support automated task graph lowering tooling later.

Background

ASIC-based coarse-grained hardware accelerators are specialized logic
meant to speed up the execution of workloads of interest, or to provide
improvements in energy-efficiency. Examples of contemporary workloads
that are beneficially hardware accelerated over software-based
implementations include video coding, deep learning, cryptography,
software-defined radio and graphics rendering.

FPGAs form a special case somewhere between instruction-set
architectures and fixed-function hardware accelerators. While advances
in high-level synthesis tools have attempted to bridge the
programmability gap between GPU and FPGA programming, FPGAs are still
considered devices with which it is challenging to achieve efficient
implementations. Due to the extensive manual optimization work required
for efficient implementations of the accelerated functionality,
defining FPGA designs as a system of "hardware accelerator IPs" is
still a widely used "application abstraction". FPGAs can thus be seen
as a platform that can realize and integrate any hardware accelerator
implementable with the programmable fabric.

The means to utilize hardware accelerators have typically been
vendor-specific and abstracted behind domain-specific libraries. The
overhead of the "bunch of libraries" approach is seen at the lowest
level of integration: the libraries utilize a low-level library
(typically vendor-specific) to interface with the actual hardware, and
thus do not integrate efficiently with other libraries or
software-programmable processors that might be available on the same
chip.

Rationale

OpenCL's built-in kernel abstraction allows pushing both hardware
accelerated and software defined kernels to the same command-queues,
providing a powerful means for asynchronous execution of heterogeneous
task graphs on diverse heterogeneous platforms. The ability to invoke
hardware accelerators while being able to synchronize and optimize data
transfers at the lowest levels of the driver stack can provide
significant latency benefits, especially when combined with the
command-buffering mechanism.

However, the built-in kernel abstraction works well only when it is
widely adopted by vendors, and when multiple vendors implement the same
definitions. Otherwise each vendor specifies and implements their own
built-in kernels closely matching their own hardware accelerator
properties, resulting in a lack of cross-vendor portability in the API
abstraction presented to the upper layers of heterogeneous computing
software stacks.

This extension standardizes a set of well-defined built-in kernels that
clients can call from higher level programming stacks built with
different languages and multiple libraries, possibly mixing accelerator
calls with calls to software kernel commands, and relying on the driver
stack to optimize the execution (especially the synchronization and
communication) as a low-level heterogeneous task graph. The
heterogeneous task graph can be described using multiple command-queues
and optionally cached using the command buffer extension
(cl_khr_command_buffer). The extension aims to promote the use of
built-in kernels as a programming model for hardware accelerated
functionality, to improve the cross-vendor portability of hardware
accelerated computing.
Add new section X.Y.Z Querying Defined Built-in Kernels

To request a defined built-in kernel to be executed on the given
devices, use:

cl_dbk_descriptor clCreateDefinedBuiltInKernelDescriptor(
    cl_context context,
    cl_uint num_devices,
    const cl_device_id* device_list,
    cl_dbk_name kernel_name,
    const void *kernel_attributes,
    const cl_dbk_mode_properties* kernel_config,
    cl_int *errcode_ret);

* context must be a valid OpenCL context.

* num_devices is the number of devices listed in device_list.
  num_devices must be non-zero.

* device_list is a pointer to a list of devices that are in context.
  device_list must be a non-NULL value. The defined built-in kernels
  are loaded for the devices specified in this list.

* kernel_name is the name of the defined built-in kernel, as listed in
  Appendix I.

* kernel_attributes is a pointer to the structure declared in the
  description of the kernel in Appendix I. The structure holds the
  kernel's attributes.

* kernel_config is a pointer to a list of defined built-in kernel mode
  properties (cl_dbk_mode_properties). The supported mode properties
  are listed, with their default settings, in the DBK's entry in
  Appendix I. It is valid to set this argument to NULL, in which case
  the default properties apply (if any).

clCreateDefinedBuiltInKernelDescriptor returns a valid kernel
descriptor on success, indicated by errcode_ret being set to
CL_SUCCESS. Otherwise, the returned object is NULL and errcode_ret is
set to one of the following codes:

* CL_DBK_INVALID_ATTRIBUTE if one or more kernel attributes violate
  the conditions described in the defined built-in kernel's entry in
  Appendix I.

* CL_DBK_UNAVAILABLE if the kernel attributes are valid but the kernel
  is not supported on one of the devices.

* CL_DBK_UNSUPPORTED_MODE_PROPERTY if kernel_config includes at least
  one property not listed in the DBK's entry.

* CL_DBK_UNMET_MAX_RELATIVE_ERROR if the DBK is available but does not
  meet the constraint requested with the
  CL_DBK_PROPERTY_MAX_RELATIVE_ERROR property.

DBK Mode Property                  | Property Value | Description
CL_DBK_PROPERTY_MAX_RELATIVE_ERROR | float          | Require that the DBK produce results that deviate by no more than the given number of ULPs (units in the last place) from the infinitely precise result.
CL_DBK_PROPERTY_NON_DETERMINISTIC  | cl_bool        | Allow the results of the kernel to be non-reproducible. This allows the implementation to switch the kernel's algorithm on each launch for possibly better performance.
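CL_DBK_PROPERTY_MAX_RELATIVE_ERROR bounds the deviation in ULPs. As an illustration of what a ULP bound means, a host-side check could be sketched as follows; the helpers `float_to_monotonic` and `ulp_distance` are hypothetical names for this illustration and are not part of the extension:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern onto a scale that is monotonic in the
 * float's value, so subtracting two mapped values counts the number
 * of representable floats between them. */
static int64_t float_to_monotonic(float x) {
    int32_t bits;
    memcpy(&bits, &x, sizeof bits);
    /* Negative floats are stored in sign-magnitude order; flip them
     * so that the mapping increases with the value. */
    return bits < 0 ? -(int64_t)(bits & 0x7fffffff) : (int64_t)bits;
}

/* ULP distance between two finite floats; 0 means the values are
 * identical (treating +0.0 and -0.0 as equal). */
static int64_t ulp_distance(float computed, float reference) {
    int64_t d = float_to_monotonic(computed) - float_to_monotonic(reference);
    return d < 0 ? -d : d;
}
```

With such a helper, a result element would satisfy a max-relative-error bound of N ULPs when `ulp_distance(computed, reference) <= N`.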

Add new function to 5.8.1 Creating Program Objects

To create a program with a set of defined built-in kernels, use:

cl_program clCreateProgramWithDefinedKernels(
    cl_context context,
    size_t num_kernel_desc,
    const void* kernel_desc_list,
    cl_int* errcode_ret);

* context must be a valid OpenCL context.

* num_kernel_desc is the number of kernel descriptors.

* kernel_desc_list is an array of valid cl_dbk_descriptor objects. The
  array length must be at least num_kernel_desc. The kernel
  descriptors must have been created on the same context.

clCreateProgramWithDefinedKernels returns a valid program on success,
indicated by errcode_ret being set to CL_SUCCESS. Otherwise, the
returned object is NULL and errcode_ret is set to one of the following
codes:

* TODO.

Add new function to 5.9.1 Creating Kernel Objects

To get a kernel handle for a defined built-in kernel in a program, use:

cl_kernel clCreateDefinedBuiltInKernel(
    cl_program program,
    cl_dbk_descriptor kernel_desc,
    cl_int* errcode_ret);

* program is a program object with a successfully built executable.

* kernel_desc is a descriptor of a defined built-in kernel in the program.

* errcode_ret will return an appropriate error code. If errcode_ret is
  NULL, no error code is returned.

clCreateDefinedBuiltInKernel returns a valid non-zero kernel object,
and errcode_ret is set to CL_SUCCESS, if the kernel object is created
successfully. Otherwise, it returns a NULL value with one of the
following error values returned in errcode_ret:

* TODO.

Add new appendix "Appendix I - Defined Built-in Kernels" to OpenCL API Specification

This appendix describes standard defined built-in kernels (DBKs) with
well-defined semantics. Devices can report the availability of the
built-in kernels listed in this section via the
clCreateDefinedBuiltInKernelDescriptor call. The availability of a DBK
is determined from the arguments passed to
clCreateDefinedBuiltInKernelDescriptor, and unavailability of a DBK is
indicated by the CL_DBK_UNAVAILABLE error code.

The general client-side abstraction of a DBK is similar to a call to a
C function whose implementation is hidden. Device drivers are free to
implement a DBK by invoking one or more coarse- and fine-grained
hardware accelerators, combined with firmware, to implement the
semantics as efficiently as possible.

It is the driver's responsibility to handle efficient synchronization
and communication with the hardware accelerator, the internal
accelerator state management, and resource sharing across multiple
OpenCL contexts.

Reproducibility

Identical DBKs, or the same DBK executed repeatedly with identical
inputs, are guaranteed to produce identical results, unless otherwise
stated in the DBK's description, when:

* enqueued to the same device,

* on the same platform,

* on the same vendor with the same driver version, and

* the CL_DBK_PROPERTY_NON_DETERMINISTIC property is not set.

Two DBK descriptors for a device are considered identical if they are
created using identical kernel name, kernel attribute and kernel mode
property arguments. In other cases, identical inputs may produce
different results. The results may differ because, for example,
different algorithms are used across devices.

DBKs may produce approximate results, and the error with respect to
the infinitely precise result can be optionally controlled by
CL_DBK_PROPERTY_MAX_RELATIVE_ERROR when that property name is listed
in the DBK's description. DBKs without the
CL_DBK_PROPERTY_MAX_RELATIVE_ERROR property produce exact results.

The Defined Built-in Kernels

The following is a list of recognized defined built-in kernels. The
list is expected to be expanded and updated over future versions of
this extension, while preserving backwards compatibility.

Each defined built-in kernel entry is organized as follows:

* Name: Name of the defined built-in kernel (an enumeration).

* Kernel attributes: The kernel attributes required for creating the
  defined built-in kernel via clCreateDefinedBuiltInKernelDescriptor.
  Attribute values are immutable.

* Kernel arguments: The kernel arguments.

* Description: The description of the kernel in detail.

* Attribute validation rules: Conditions on the kernel attributes of
  the kernel. The implementation must return CL_DBK_INVALID_ATTRIBUTE
  on a clCreateDefinedBuiltInKernelDescriptor call if any of the
  conditions are violated.

* Kernel mode properties: List of kernel mode properties
  (cl_dbk_mode_properties) the kernel recognizes. The properties can
  be used to tweak certain implementation details and behaviors of the
  kernel execution. If a property not listed in the DBK entry is fed
  to a clCreateDefinedBuiltInKernelDescriptor call, the implementation
  must return CL_DBK_UNSUPPORTED_MODE_PROPERTY.
Table A.I.1. Standard Built-in Kernels and Their Semantics. The table
has been populated with a small set of non-trivial example entries
which are subject to change and the list to expand during drafting.

Name: khr_matmul

Kernel Attributes

Fields of the cl_dbk_attributes_matmul structure:

1. cl_tensor A: Tensor description for input matrix A.
2. cl_tensor B: Tensor description for input matrix B.
3. cl_tensor R: Tensor description for output matrix R.
4. cl_int transposeA: Non-zero transposes matrix A.
5. cl_int transposeB: Non-zero transposes matrix B.

Kernel Arguments

1. cl_tensor A: Matrix A (read only).
2. cl_tensor B: Matrix B (read only).
3. cl_tensor R: Output matrix (write only).

Description

Performs (batched) matrix multiplication: R = trans(A) * trans(B),
where A, B and R are tensors of at least rank two. trans() is a
configurable transpose operation.

The last two dimensions of the tensors are treated as the operands of
the matrix multiplication and the rest of the dimensions are treated
as batch dimensions.

The operations of the matrix multiplication are performed in the
precision of elementof(R).

If an overflow occurs in the accumulation of the products, the result
in the R tensor is undefined.

Attribute validation rules

* rankof(A) == rankof(B) >= 2.

* Let shapeof(At) == (b..., m, k) and shapeof(Bt) == (b..., k, n) be
  the shapes of tensors A and B, respectively, after possible
  transposing. shapeof(R) must be (b..., m, n).

* elementof(A) == elementof(B)

* elemkindof(R) == elemkindof(A)

* elementof(R) == elementof(A), or elementof(A) is promotable to
  elementof(R) without loss of meaning.

Kernel mode properties

This DBK accepts the following properties:

* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR: An unset property defaults to
  positive infinity.
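The khr_matmul semantics above can be modeled with a plain scalar reference implementation. The sketch below assumes row-major storage and rank-3 tensors; the `ref_matmul` helper name is invented for this illustration and is not part of the extension:

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference model of khr_matmul for rank-3, row-major tensors:
 * R = trans(A) * trans(B), batched over the leading dimension.
 * A is (batch, m, k) untransposed or (batch, k, m) transposed;
 * B is (batch, k, n) untransposed or (batch, n, k) transposed;
 * R is (batch, m, n). */
static void ref_matmul(const float *A, const float *B, float *R,
                       size_t batch, size_t m, size_t n, size_t k,
                       int transposeA, int transposeB) {
    for (size_t bi = 0; bi < batch; ++bi)
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (size_t p = 0; p < k; ++p) {
                    float av = transposeA ? A[(bi * k + p) * m + i]
                                          : A[(bi * m + i) * k + p];
                    float bv = transposeB ? B[(bi * n + j) * k + p]
                                          : B[(bi * k + p) * n + j];
                    acc += av * bv;  /* Accumulate in elementof(R) precision. */
                }
                R[(bi * m + i) * n + j] = acc;
            }
}
```

An accelerated implementation is free to use any algorithm that matches these semantics within the requested error bound.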

Name: khr_leaky_relu

Kernel Attributes

Fields of the cl_dbk_leaky_relu structure:

1. cl_tensor in: Input tensor description.
2. cl_tensor out: Output tensor description.
3. cl_float alpha: Coefficient of leakage.

Kernel arguments

1. cl_tensor in: The input tensor.
2. cl_tensor out: The output tensor.

Description

Applies the operation alpha * x if x < 0 else x to all elements of the
in tensor.

If the target device does not support denormals, alpha is flushed to
zero before the operation is applied.

Kernel mode properties

N/A

Attribute validation rules

* shapeof(in) == shapeof(out)

* elementof(in) == elementof(out)

* alpha must be a finite value.
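The khr_leaky_relu semantics amount to an elementwise select, and a scalar reference model could look like this (`ref_leaky_relu` is an illustrative helper name, not part of the extension):

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference model of khr_leaky_relu: out[i] = alpha * in[i]
 * for negative elements, in[i] otherwise.  The tensor shape does not
 * matter here; the operation is elementwise over n elements. */
static void ref_leaky_relu(const float *in, float *out, size_t n, float alpha) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] < 0.0f ? alpha * in[i] : in[i];
}
```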

Launching DBKs from the Device Side

DBKs are primarily meant to be launched as kernel commands via
host-side command queues. Optionally, they can be callable from the
device side via enqueue_kernel:

TBC. This probably needs a device-side function corresponding to
clCreateDefinedBuiltInKernelDescriptor.

Sample Code
constexpr size_t b = 64, m = 100, n = 200, k = 50;
cl_int err;
cl_tensor lhs_tensor = clCreateTensor(context, nullptr, 3, {b, m, k}, CL_TENSOR_FLOAT, err);
cl_tensor rhs_tensor = clCreateTensor(context, nullptr, 3, {b, k, n}, CL_TENSOR_FLOAT, err);
cl_tensor res_tensor = clCreateTensor(context, nullptr, 3, {b, m, n}, CL_TENSOR_FLOAT, err);

cl_dbk_attributes_matmul matmul_attrs = {
  lhs_tensor, rhs_tensor, res_tensor, 1, 0 // = Transpose lhs tensor.
};

cl_dbk_mode_properties matmul_props = {
  // Request a matmul instance that meets this precision.
  CL_DBK_PROPERTY_MAX_RELATIVE_ERROR, 100, // in ULPs.
};

std::vector<cl_dbk_descriptor> kernel_descriptions;
cl_dbk_descriptor matmul_desc =
  clCreateDefinedBuiltInKernelDescriptor(
    context, num_devices, device_list,
    CL_DBK_MATMUL, &matmul_attrs, &matmul_props, &err);

if (err == CL_DBK_UNAVAILABLE) {
  // Kernel attributes are valid but the kernel is not supported on at least
  // one of the devices.
  ...
} else if (err == CL_DBK_UNMET_MAX_RELATIVE_ERROR) {
  // E.g. the kernel is supported but is not precise enough.
  ...
} else if (err == CL_DBK_UNSUPPORTED_MODE_PROPERTY) {
  // cl_dbk_mode_properties has a property not listed in the description of the
  // defined built-in kernel.
  ...
} else
  kernel_descriptions.push_back(matmul_desc);

...

cl_program dbk_lib = clCreateProgramWithDefinedKernels(
  context, kernel_descriptions.size(), kernel_descriptions.data(), &err);

...

cl_kernel matmul_kernel = clCreateDefinedBuiltInKernel(
  dbk_lib, matmul_desc, &err);

// Set tensor kernel arguments before binding storage to the tensors. This
// gives clCreateBufferWithProperties() an opportunity to reason about the
// tensors' uses for determining the optimal memory layout (opaque to the
// application) and the space needed for the tensors.
clSetKernelArg(matmul_kernel, 0, sizeof(cl_tensor_t), &lhs_tensor);
clSetKernelArg(matmul_kernel, 1, sizeof(cl_tensor_t), &rhs_tensor);
clSetKernelArg(matmul_kernel, 2, sizeof(cl_tensor_t), &res_tensor);

// Allocate storage for the tensors.
cl_mem lhs_mem = clCreateBufferWithProperties(
  context, {CL_MEM_BIND_TO_TENSOR, lhs_tensor, 0}, CL_MEM_READ_ONLY, 0, nullptr, &err);
cl_mem rhs_mem = clCreateBufferWithProperties(
  context, {CL_MEM_BIND_TO_TENSOR, rhs_tensor, 0}, CL_MEM_READ_ONLY, 0, nullptr, &err);
cl_mem res_mem = clCreateBufferWithProperties(
  context, {CL_MEM_BIND_TO_TENSOR, res_tensor, 0}, CL_MEM_WRITE_ONLY, 0, nullptr, &err);

// Transfer data to the input tensors, execute the DBK, and import the results
// from the output tensor.

std::vector<float> lhs_data = ...;
std::vector<float> rhs_data = ...;
std::vector<float> res_data(b * m * n);

clEnqueueExportToTensor(cmd_q, lhs_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, m, k},
  nullptr, nullptr, lhs_data.data(), 0, nullptr, nullptr);
clEnqueueExportToTensor(cmd_q, rhs_tensor, false, {0, 0, 0}, {0, 0, 0}, {b, k, n},
  nullptr, nullptr, rhs_data.data(), 0, nullptr, nullptr);
clEnqueueNDRangeKernel(cmd_q, matmul_kernel, 0, NULL, NULL, NULL, 0, NULL, NULL);
    -clEnqueueImportFromTensor(
    -  cmd_q, res_tensor, false,  {0, 0, 0}, {0, 0, 0}, {b, m, n},
    -  nullptr, nullptr, res_data.data(), 0, nullptr, nullptr);
    -
    -
    -
    -
    -
    -
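For reference, the batched matrix multiplication the matmul DBK above is expected to perform can be sketched as a plain host-side loop. The `matmul_ref` helper below is a hypothetical illustration of the semantics on row-major data, not part of the extension:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics of the matmul DBK on row-major data:
//   C[b,m,n] = sum_k trans(A)[b,m,k] * trans(B)[b,k,n]
// A is stored as (batch, M, K) when trans_a is false, (batch, K, M) otherwise;
// B is stored as (batch, K, N) when trans_b is false, (batch, N, K) otherwise.
std::vector<float> matmul_ref(const std::vector<float>& A,
                              const std::vector<float>& B,
                              size_t batch, size_t M, size_t N, size_t K,
                              bool trans_a, bool trans_b) {
  std::vector<float> C(batch * M * N, 0.0f);
  for (size_t b = 0; b < batch; ++b)
    for (size_t m = 0; m < M; ++m)
      for (size_t n = 0; n < N; ++n) {
        float acc = 0.0f;
        for (size_t k = 0; k < K; ++k) {
          float av = trans_a ? A[(b * K + k) * M + m] : A[(b * M + m) * K + k];
          float bv = trans_b ? B[(b * N + n) * K + k] : B[(b * K + k) * N + n];
          acc += av * bv;
        }
        C[(b * M + m) * N + n] = acc;
      }
  return C;
}
```

An accelerator implementing the DBK is free to pick any parallelization strategy, as long as the results match this arithmetic within the requested precision.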

Open questions

1. Should we enable launching DBKs from the device side without requiring device-side enqueue? The main problem is those with NDRange as they are not simple single-WI helper functions.

   UNRESOLVED

2. Should the NDRange be used at all in DBKs? It feels sort of unnatural as typically the NDRange is used to imply SPMD parallelism while the hardware/firmware is free to choose whatever parallelization strategy to implement the function. On the other hand, something similar applies to software kernel launches, as NDRange-launched work-items can be executed serially if adhering to barrier semantics.

   UNRESOLVED

3. Different accelerators prefer different channel orders (NHWC vs. NCHW…) for the processed data. Should the channel order be passed as a DBK argument (like in the example GEMM's row/column order) or is it better to have different DBK variations for each?

   UNRESOLVED

4. How to denote preference? Some of the DBKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with unoptimal channel order in NN accelerators) might still be beneficially accelerated.

   UNRESOLVED

5. Since the defined built-in kernel concept is basically just a C-like API inside another API, should it be made more generic and thus directly usable for SYCL and Vulkan as well?

   UNRESOLVED

6. What other DBK mode properties should we have? Here are some ideas:

   * Perform accumulation with saturation.
   * Finite math only.
   * Flush denormals to zero.
   * Data layout preferences (NHWC for convolution).

   UNRESOLVED
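For concreteness while these questions are resolved, the arithmetic prescribed for the GEMM DBK (COUT = beta · CIN + alpha · A×B) can be written out directly. The `gemm_ref` helper below is a hypothetical illustration for a single batch without transposes, not part of the extension:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference arithmetic of the GEMM DBK, single batch, no transposes,
// row-major storage:
//   COUT[m,n] = beta * CIN[m,n] + alpha * sum_k A[m,k] * B[k,n]
std::vector<float> gemm_ref(const std::vector<float>& A,
                            const std::vector<float>& B,
                            const std::vector<float>& CIN,
                            size_t M, size_t N, size_t K,
                            float alpha, float beta) {
  std::vector<float> COUT(M * N);
  for (size_t m = 0; m < M; ++m)
    for (size_t n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (size_t k = 0; k < K; ++k)
        acc += A[m * K + k] * B[k * N + n];
      COUT[m * N + n] = beta * CIN[m * N + n] + alpha * acc;
    }
  return COUT;
}
```

Whatever the outcome on channel-order arguments, the defined arithmetic stays the same; only the storage interpretation of the operand tensors changes.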
    - - - \ No newline at end of file diff --git a/extensions/cl_khr_defined_builtin_kernels.asciidoc b/extensions/cl_khr_defined_builtin_kernels.asciidoc new file mode 100644 index 000000000..a44aa8b9e --- /dev/null +++ b/extensions/cl_khr_defined_builtin_kernels.asciidoc @@ -0,0 +1,925 @@ +// Copyright 2018-2022 The Khronos Group. This work is licensed under a +// Creative Commons Attribution 4.0 International License; see +// http://creativecommons.org/licenses/by/4.0/ + +:data-uri: +:icons: font +include::../config/attribs.txt[] +:source-highlighter: coderay +:stem: + += cl_khr_defined_builtin_kernels + +The purpose of this extension is to provide a standardized set of built-in +kernels with well-defined semantics useful for accelerating applications +from various domains. The extension specification is designed to rapidly +expand and "live" via addition of new well-defined built-in kernel +definitions and updating of previously defined ones. + +[float] +== XXX - Not complete yet!!! + + +== Name Strings + +`cl_khr_defined_builtin_kernels` + +== Contact + +TODO + +== Contributors + +Pekka Jääskeläinen, Intel and Tampere University. + +Topi Leppänen, Tampere University. + +Jan Solanti, Tampere University. + +Ben Ashbaugh, Intel. + +Henry Linjamäki, Intel. + + +== Notice + +TODO + +== Status + +Draft spec, NOT APPROVED!! + +== Version +Built On: {docdate} + +Version: 0.3.0 + +== Dependencies + +This extension is written against the OpenCL Specification version 3.0.12. + +This extension requires OpenCL 1.2 or later. + +This extension requires cl_exp_tensor. + +== Overview + +OpenCL 1.2 specifies a built-in kernel as a kernel that is executed on +an OpenCL device or custom device by fixed-function hardware or in firmware. +Applications can query the built-in kernels supported by a device or custom +device. + +Built-in kernels are referred to by a name (a C string) without any +semantics attached to the functionality. 
The semantics behind the name +is completely device specific, typically documented in vendor-specific +extension specifications. + +The goal for this extension is to lower the bar for utilizing hardware +accelerated functions in drivers by providing a library of +well-defined built-in kernel with good coverage for common acceleration needs +and which is designed to easily evolve over time. + +The device drivers that implement this extension can freely choose which +subset of defined built-in-kernels (DBKs) they implement and advertise to the clients. The +clients can use the DBKs to accelerate their applications by manually +executing invoking the DBKs. The extension is designed to also support using +automated task graph lowering tooling later. + +=== Background + +ASIC-based coarse-grained hardware accelerators are specialized logic meant to +speed up execution of workloads of interest, or to provide improvements in +energy-efficiency. Examples of contemporary workloads that are beneficially hardware +accelerated over software-based implementations include video coding, deep learning, +cryptography, software-defined radio and graphics rendering. + +FPGAs form a special case somewhere between instruction-set architectures and fixed +function hardware accelerators. While advances in high-level synthesis tools +have attempted to bridge the programmability gap between GPU and FPGA programming, +FPGAs are still considered as devices which are challenging to achieve efficient +implementations with. Due to extensive manual optimization work required for efficient +implementations of the accelerated functionality, defining FPGA designs as +a system of "hardware accelerator IPs" is still a widely used "application abstraction". +FPGAs can be thus seen as a platform that can realize and integrate any +hardware accelerator implementable with the programmable fabric. 
+ +The means to utilize hardware accelerators have typically been +vendor-specific and abstracted behind domain-specific libraries. +The overhead with the "bunch of libraries"-approach is seen in the lowest level +of integration: The libraries utilize a low level library (typically +vendor-specific) to interface with the actual hardware, and thus does not +integrate efficiently with other libraries or software-programmable processors +that might be available on the same chip. + +=== Rationale + +OpenCL's built-in kernel abstraction allows pushing both hardware +accelerated and software defined kernels to the same command-queues, +providing a powerful means for asynchronous execution of heterogeneous +task graphs on diverse heterogeneous platforms. The ability to invoke hardware +accelerators while being able to synchronize and optimize data transfers at +the lowest levels of the driver stack can provide significant latency benefits, +especially when combined with the command-buffering mechanism. + +However, the built-in kernel abstraction works well only when it is widely adopted by +vendors, and when multiple vendors implement the same definitions. Otherwise +each vendor specifies and implements their own built-in kernels closely matching their +own hardware accelerator properties, resulting in lack of cross-vendor +portability in the API abstraction presented to the upper layers of +heterogeneous computing software stacks. + +This extension standardizes a set of well-defined built-in kernels the +clients can call from higher level programming stacks built with +different languages and multiple libraries, possibly mix accelerator +calls with calls to software kernel commands, and rely on the driver +stack to optimize the execution (especially the synchronization and +communication) as a low level heterogeneous task graph. 
The +heterogeneous task graph can be described using multiple +command-queues and optionally cached using the command buffer +extension (cl_khr_command_buffer). It aims to promote the use of +built-in kernels as a programming model for hardware accelerated +functionality, to improve cross-vendor portability of hardware +accelerated computing. + + +== New API Functions + +[source,c] +---- +#define CL_MAX_DBK_PROPERTIES 16 + +clCreateProgramWithDefinedBuiltInKernels( + cl_context context, + cl_uint num_devices, + const cl_device_id* device_list, + cl_uint num_kernels, + const char** kernel_names, + const cl_dbk_id_khr* kernel_ids, + const void** kernel_attributes, + cl_int* device_support_ret, + cl_int* errcode_ret); +---- + +== New API Types + +[source,c] +---- +typedef cl_uint cl_dbk_id_khr; +typedef cl_properties cl_dbk_properties_khr; + +typedef union { + cl_char sc; + cl_uchar uc; + cl_short ss; + cl_ushort us; + cl_int si; + cl_uint ui; + cl_long sl; + cl_ulong ul; + cl_half fh; + cl_float ff; + cl_double fd; + void* raw; +} cl_tensor_datatype_union_khr; + +typedef struct cl_dbk_attributes_matmul_khr { + cl_tensor_desc a; + cl_tensor_desc b; + cl_tensor_desc c; + cl_int trans_a; + cl_int trans_b; + cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES]; +} cl_dbk_attributes_matmul_khr; + +typedef struct cl_dbk_attributes_gemm_khr { + cl_tensor_desc a; + cl_tensor_desc b; + cl_tensor_desc c_in; + cl_tensor_desc c_out; + cl_bool trans_a; + cl_bool trans_b; + cl_tensor_datatype_union_khr alpha; + cl_tensor_datatype_union_khr beta; + cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES]; +} cl_dbk_attributes_gemm_khr; + +typedef struct cl_dbk_attributes_leaky_relu_khr { + cl_tensor_datatype_union_khr coefficient; + cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES]; +} cl_dbk_attributes_leaky_relu_khr; +---- + +== New API Enums + + +Accepted values to *cl_dbk_id_khr*: +[source,c] +---- +CL_DBK_MATMUL_KHR 0x???? +CL_DBK_GEMM_KHR 0x???? 
+CL_DBK_LEAKY_RELU_KHR 0x???? +---- + +accepted values to *cl_dbk_properties_khr*: + +[source,c] +---- +CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR 0x???? +CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR 0x???? +---- + +New error codes: + +[source,c] +---- +CL_DBK_UNSUPPORTED_KHR 0x???? +CL_DBK_UNSUPPORTED_PROPERTY_KHR 0x???? +CL_DBK_INVALID_ATTRIBUTE_KHR 0x???? +CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR 0x???? +---- + +== Modifications to the OpenCL Specification + +(Add the following to section 5.8.1, *Creating Program Objects*) :: ++ +-- + +To create a program object for a context and to load the information +related to the defined built-in kernels into that object, call the +function: + +[source,c] +---- +clCreateProgramWithDefinedBuiltInKernels( + cl_context context, + cl_uint num_devices, + const cl_device_id* device_list, + cl_uint num_kernels, + const cl_dbk_id* kernel_ids, + const char** kernel_names, + const void** kernel_attributes, + cl_int* device_errcode_ret, + cl_int* errcode_ret); +---- + +* _context_ must be a valid OpenCL context. + +* _num_devices_ is the number of elements in _device_list_ and + _device_errcode_ret_ lists. + +* _device_list_ is a pointer to a list of devices that are in + _context_. _device_list_ must be a non-NULL value. The defined built-in + kernels are loaded for devices specified in this list. + +* _num_kernels_ is the number of elements in _kernel_ids_, + _kernel_attributes_, _kernel_names_ret_ and _device_errcode_ret_ lists. + +* _kernel_ids_ is the list of defined built-in kernels to + be loaded into the program. + +* _kernel_names_ is a list of names given for each kernel listed in + _kernel_ids_. Each string in the list must be non-NULL and unique. + +* _kernel_attributes_ is a list of pointers that point to the + respective attribute structure of each defined built-in kernel in + the _kernel_ids_ list. The respective attribute structures for each + kernel identifiers are listed in <>. 
+ +* _device_errcode_ret_ will return an appropriate error code per + device. if _device_errcode_ret_ is NULL, no error code is returned. + +* _errcode_ret_ will return an appropriate error code. If + _errcode_ret_ is NULL, no error code is returned. + +The devices associated with the program object will be the list of +devices specified by _device_list_ or subset of it. The list of +devices specified by _device_list_ must be devices associated with +_context_. + +*clCreateProgramWithDefinedBuiltInKernels* returns a valid non-zero +program object and _errcode_ret_ is set to *CL_SUCCESS* if the program +object is created successfully. The returned program is created for +the devices that supports the requested built-in kernels indicated by +*CL_SUCCESS* in the _device_errcode_ret_ list. In case of a failure to +create program for a device, one of the following errors code is set +in _device_errcode_ret_ list for the respective device: + +* *CL_DBK_UNSUPPORTED_KHR* if the device does not support one of the + built-in kernels listed in _kernel_ids_. + +* *CL_INVALID_PROPERTY* if a property list for a defined built-in + kernel description is invalid. + +* *CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR* if a defined built-in kernel + does not meet the requested precision. + +* *CL_OUT_OF_RESOURCES* if there is a failure to allocate resources + required by the OpenCL implementation on the device. + +// TODO: if _device_errcode_ret_ is NULL should should an error be +// returned in _errcode_ret_ if a kernel is not supported in any +// device? + +If a program object is not created, +*clCreateProgramWithDefinedBuiltInKernels* returns a NULL value with +one of the following error codes returned in _errcode_ret_: + +* *CL_INVALID_CONTEXT* if _context_ is not a valid context. + +* *CL_INVALID_VALUE* if _device_list_ is NULL or _num_devices_ is zero. + +* *CL_INVALID_VALUE* if a kernel name is not unique within _kernel_names_. 
+ +* *CL_INVALID_VALUE* if there is a NULL value in _kernel_names_. + +* *CL_INVALID_DBK_ID_KHR* if any value in the _kernel_ids_ is not a known + identifier for a built-in kernel. + +* *CL_INVALID_DBK_ATTRIBUTE_KHR* if a kernel attribute structure is + invalid for a built-in kernel. + +* *CL_DBK_UNSUPPORTED_KHR* if _device_errcode_ret_ is NULL and any + device in _device_list_ does not support a defined built-in kernel. + +* *CL_DBK_UNSUPPORTED_KHR* if _device_errcode_ret_ is non-NULL and all + devices in _device_list_ does not support a defined built-in kernel. + +* *CL_DBK_UNSUPPORTED_PROPERTY_KHR* If a kernel does not accept a + valid kernel property. + +* *CL_INVALID_DEVICE* if any device in _device_list_ is not in the list of + devices associated with _context_. + +* *CL_OUT_OF_RESOURCES* if there is a failure to allocate resources + required by the OpenCL implementation on the device. + +* *CL_OUT_OF_HOST_MEMORY* if there is a failure to allocate resources + required by the OpenCL implementation on the host. + +-- +// End (Add the following to section 5.8.1, *Creating Program Objects*) + +(Modify section 5.10, *Executing Kernels*) :: ++ +-- + +(Add following to *clEnqueueNDRangeKernel*) :: ++ +-- +For defined built-in kernels _work_dim_, _global_work_offset_, +_global_work_size_ and _local_work_size_ parameters are meaningless +and must be set to zero and NULL, respectively. OpenCL implementations +decide how they distribute the workloads of the defined built-in +kernels. +-- + +(Add the following to the list of error codes returned by *clEnqueueNDRangeKernel*) :: ++ +-- + +* *CL_INVALID_GLOBAL_WORK_SIZE* if the _kernel_ is a defined built-in + kernel and _global_work_size_ is not NULL. + +* *CL_INVALID_GLOBAL_WORK_OFFSET* if the _kernel_ is a defined built-in + kernel and _global_work_offset_ is not NULL. + +* *CL_INVALID_LOCAL_WORK_SIZE* if the _kernel_ is a defined built-in + kernel and _local_work_size_ is not NULL. 
+--
+--
+// End (Modify section 5.10, *Executing Kernels*)
+
+
+[[appendix-dbk]]
+=== Add new appendix "Defined Built-in Kernels" to OpenCL API Specification
+
+This chapter describes standard defined built-in kernels (DBKs) with
+well-defined semantics. They are loaded into a program using
+*clCreateProgramWithDefinedBuiltinKernels* and the kernels in it are
+launched using *clEnqueueNDRangeKernel* with _work_dim_ set to zero
+and _global_work_offset_, _global_work_size_ and _local_work_size_ set
+to NULL.
+
+The general client-side abstraction of a DBK is similar to a call to a
+C function whose implementation is hidden. Device drivers are free to
+implement a DBK by invoking one or more coarse- and fine-grained
+hardware accelerators combined with firmware to implement the
+semantics as efficiently as possible.
+
+It is the driver's responsibility to handle efficient synchronization
+and communication with the hardware accelerator, internal accelerator
+state management, and resource sharing across multiple OpenCL contexts.
+
+==== Reproducibility
+
+Identical DBKs, or the same DBK executed repeatedly with identical
+inputs, are guaranteed to produce identical results, unless otherwise
+stated in the DBK's description, when:
+
+* enqueued to the same device,
+
+* on the same platform,
+
+* with the same vendor driver version, and
+
+* the CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR property is not set.
+
+In other cases, the DBKs may produce different results. Two DBKs for a
+device are considered identical if they are created using identical
+kernel identifiers, kernel attributes and kernel properties. The result
+difference may occur because of different algorithms being used across
+devices, for example.
+
+DBKs may produce approximate results and the error, with respect to
+the infinitely precise result, can optionally be controlled by
+CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR when the property name is listed in
+the DBK's description.
When the precision is not controlled by the
+application using CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR, the
+precision of the results is
+
+* chosen by the implementation for floating-point based tasks.
+
+* exact for integer based tasks.
+
+==== Kernel Interface
+
+DBKs operate on tensor objects, created with
+*clCreateBufferWithProperties* using the `CL_MEM_TENSOR` property,
+generally in single-static-assignment fashion. The kernel arguments
+used for reading and writing tensors may not reference the same tensor
+object unless otherwise stated in the <<dbk-description-table>>.
+
+==== The Defined Built-in Kernels
+
+The recognized defined built-in kernels are listed in the following
+<<dbk-description-table>>. The list is expected to be expanded and
+updated over versions of this extension, while preserving backwards
+compatibility.
+
+Each defined built-in kernel entry is organized as follows:
+
+* *Name*: Name of the defined built-in kernel (an enumeration).
+
+* *Kernel attributes*: The kernel attributes required for creating the
+  defined built-in kernel via
+  *clCreateProgramWithDefinedBuiltinKernels*. Attribute values are
+  immutable.
+
+* *Kernel arguments*: The kernel arguments.
+
+* *Description*: The description of the kernel in detail.
+
+* *Attribute validation rules*: Conditions on the kernel attributes.
+  The implementation must return CL_DBK_INVALID_ATTRIBUTE_KHR from the
+  *clCreateProgramWithDefinedBuiltinKernels* call if any of the
+  conditions are violated.
+
+* *Kernel mode properties*: List of <<dbk-propery-table,kernel mode
+  properties>> (`cl_dbk_properties_khr`) the kernel may accept. The
+  properties can be used to tweak certain implementation details and
+  behaviors of the kernel execution. If a property not listed in the
+  DBK description is fed to the
+  *clCreateProgramWithDefinedBuiltinKernels* call, the implementation
+  must return
+  `CL_DBK_UNSUPPORTED_PROPERTY_KHR`.
+ +[[dbk-propery-table]] +.Table of defined built-in kernel properties +[cols="2,1,2",stripes=odd] +|=== +| *DBK Mode Property* | *Property Value* | *Description* + +| CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR | float + +a| Require that the DBK produces the results which do not deviate more +than the given amount value of ULPs (units in the last place) respect +to infnitely precise result. + +| CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR | cl_bool + +a| Allow results of the kernel to be non-reproducible. This allows +implementation to switch algorithm of the kernel on each launch for +possibly better performance. +// Idea from https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking + +|=== + + +[[dbk-description-table]] +.Standard Built-in Kernels and Their Semantics. *The table has been populated with a small set of non-trivial example entries which are subject to change and the list to expand during drafting.* +|=== +| Name: *CL_DBK_GEMM_KHR* +| *Kernel Attributes* +a| + +[source,c] +---- +typedef struct cl_dbk_attributes_gemm_khr { + cl_tensor_desc a; + cl_tensor_desc b; + cl_tensor_desc c_in; + cl_tensor_desc c_out; + cl_bool trans_a; + cl_bool trans_b; + cl_tensor_datatype_union_khr alpha; + cl_tensor_datatype_union_khr beta; + cl_dbk_properties kernel_props[CL_MAX_DBK_PROPERTIES]; +} cl_dbk_attributes_gemm_khr; +---- + +* _a_ is a tensor description for input matrix A. + +* _b_ is a tensor description for input matrix B. + +* _c_in_ is a tensor description for output matrix CIN. + +* _c_out_ is a tensor description for output matrix COUT. + +* _trans_a_ instruct to transpose the A matrix if the value is CL_TRUE. + +* _trans_b_ instruct to transpose the B matrix if the value is CL_TRUE. + +* _alpha_ is a value or pointer to value corresponponding to the + element type of _c_out_. + +* _beta_ is a value or pointer to value corresponponding to the + element type of _c_out_. + +* _kernel_props_ defined additional kernel properties. 
+ +| *Kernel Arguments* +a| +. cl_mem: a tensor object for matrix A (read only). +. cl_mem: a tensor object for matrix B (read only). +. cl_mem: a tensor object for matrix C_IN (read only). +. cl_mem: a tensor object for matrix C_OUT (write only). + +| *Description* a| Performs (batched) general matrix multiplication: + +[stem] +++++ +bb"COUT"_(b,m,n) = "beta" * bb"CIN"_(b,m,n) + "alpha" * sum_(k)trans(bb"A", "trans_a")_(b,m,k)trans(bb"B", "trans_b") _(b,k,n) +++++ + +Where: + +[stem] +++++ +trans(X_(b,i,j), tr) = {(X_(b,j,i), "if tr" = "CL_TRUE"), (X_(b,i,j), "otherwise") :} +++++ + +Second degree tensors of shape `(a, b)` are treated as third degree +tensors of shape `(1, a, b)`. + +Operations of the matrix muliplication are performed in the precision +of the `elementof\(COUT)`. + +If an overflow occurs in the accumulation of the products, then `R` +tensor's result will be undefined. + +`CIN` and `COUT` tensors may be the same object. + +| *Attribute validation rules* +a| + +* `rankof(A) == rankof(B) == rankof(CIN) == rankof(COUT)`. +* Let `shapeof(A~t~) == (b..., m, k)` and `shapeof(B~t~) = (b..., k, + n)` of tensors `A` and `B`, respectively, after possible tranposing. + `shapeof\(COUT)` must be `(b..., m, n)`. +* `shapeof(CIN) == shapeof(COUT)`. +* `elementof(A) == elementof(B)`. +* `elemkindof\(COUT) == elemkindof(A)`. +* `elementof\(COUT) == elementof(A)` or `elementof(A)` is promotable to + `elementof\(COUT)` without a loss of meaning. +// E.g. cl_int -> cl_uint: loses meaning of negative values. 
+| *Kernel mode properties* +a| +This DBK accepts the following kernel properties: + +* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR +* CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR +| + +| Name: *CL_DBK_MATMUL_KHR* +| *Kernel Attributes* +a| + +[source,c] +---- +typedef struct cl_dbk_attributes_matmul_khr { + cl_tensor_desc a; + cl_tensor_desc b; + cl_tensor_desc c; + cl_bool trans_a; + cl_bool trans_b; + cl_dbk_properties kernel_props[CL_MAX_DBK_PROPERTIES]; +} cl_dbk_attributes_matmul_khr; +---- + +* _a_ is a tensor description for input matrix A. + +* _b_ is a tensor description for input matrix B. + +* _c_ is a tensor description for output matrix C. + +* _trans_a_ instruct to transpose the A matrix if the value is CL_TRUE. + +* _trans_b_ instruct to transpose the B matrix if the value is CL_TRUE. + +* _kernel_props_ defined additional kernel properties. + +| *Kernel Arguments* +a| +. cl_mem: a tensor object for matrix A (read only). +. cl_mem: a tensor object for matrix B (read only). +. cl_mem: a tensor object for matrix C (write only). + +| *Description* a| Performs (batched) matrix multiplication: + +[stem] +++++ +bb"C"_(b,m,n) = sum_(k)trans(bb"A", "trans_a")_(b,m,k)trans(bb"B", "trans_b") _(b,k,n) +++++ + +Where: + +[stem] +++++ +trans(X_(b,i,j), tr) = {(X_(b,j,i), "if tr" = "CL_TRUE"), (X_(b,i,j), "otherwise") :} +++++ + +Second degree tensors of shape `(a, b)` are treated as third degree +tensors of shape `(1, a, b)`. + +Operations of the matrix muliplication are performed in the precision +of the `elementof\(COUT)`. + +If an overflow occurs in the accumulation of the products, then `R` +tensor's result will be undefined. + +| *Attribute validation rules* +a| + +* `rankof(A) == rankof(B) == rankof\(C)`. +* Let `shapeof(A~t~) == (b..., m, k)` and `shapeof(B~t~) = (b..., k, + n)` of tensors `A` and `B`, respectively, after possible tranposing. + `shapeof\(C)` must be `(b..., m, n)`. +* `elementof(A) == elementof(B)`. +* `elemkindof\(C) == elemkindof(A)`. 
+* `elementof\(C) == elementof(A)` or `elementof(A)` is promotable to + `elementof\(C)` without a loss of meaning. +// E.g. cl_int -> cl_uint: loses meaning of negative values. +| *Kernel mode properties* +a| +This DBK accepts the following kernel properties: + +* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR +| + +| Name: *khr_leaky_relu* +| *Kernel Attributes* +a| + +[source,c] +---- +typedef struct cl_dbk_attributes_leaky_relu_khr { + cl_tensor_datatype_union_khr coefficient; + cl_dbk_properties kernel_props[CL_MAX_DBK_PROPERTIES]; +} cl_dbk_attributes_leaky_relu_khr; +---- +* _alpha_ is a coefficient of leakage, a positive value. +| *Kernel arguments* +a| +. cl_mem: a tensor object IN for input values. +. cl_mem: a tensor object OUT for output value. +| *Description* +a| + +This element-wise built-in kernel performs a leaky ReLU operation as followed: + +[stem] +++++ +"OUT"_(i) = {( -"alpha" * "IN"_(i), "if IN"_(i) \lt 0), ("IN"_(i), " otherwise") :} +++++ + +If target device does not support denormals, then the `alpha` value is +flushed to zero before the operation is applied. This DBK accepts +tensors of arbitrary rank. + +The `IN` and `OUT` tensors may be the same object. + +| *Kernel mode properties* +| This DBK accepts the following kernel properties: + +* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR +* CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR + +| *Attribute validation rules* +a| +* `shapeof(in) == shapeof(out)`. +* `elementof(in) == elementof(out)`. +* `coefficient` must be a positive, finite value. +|=== + +==== Launching DBKs from the Device Side + +DBKs are primarily meant to be launched as kernel commands via +host-side command queues. Optionally, they can be callable from +device-side via `enqueue_kernel`: + +TBC. This probably needs device-side function corresponding to +*clCreateProgramWithDefinedBuiltinKernels*. 
+ +== Sample Code + +[source,c] +---- +constexpr size_t b = 64, m = 100, n = 200, k = 50; +cl_int err; + +std::vector lhs_data = ...; +std::vector rhs_data = ...; +std::vector bias_data = ...; +std::vector out_data(b * m * n); + +cl_tensor_layout_blas_exp row_major; +row_major.leading_dims[0] = 2, +row_major.leading_dims[1] = 1, + +cl_tensor_desc_exp lhs_desc; +lhs_desc.rank = 3; +lhs_desc.dtype = CL_TENSOR_FP32_EXP; +lhs_desc.properties[0] = 0; +lhs_desc.shape[0] = b; +lhs_desc.shape[1] = m; +lhs_desc.shape[2] = k; +lhs_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP; +lhs_desc.layout = &row_major; + +cl_tensor_desc_exp rhs_desc; +rhs_desc.rank = 3; +rhs_desc.dtype = CL_TENSOR_FP32_EXP; +rhs_desc.properties[0] = 0; +rhs_desc.shape[0] = b; +rhs_desc.shape[1] = k; +rhs_desc.shape[2] = n; +rhs_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP; +rhs_desc.layout = &row_major; + +cl_tensor_desc_exp out_desc; +out_desc.rank = 3; +out_desc.dtype = CL_TENSOR_FP32_EXP; +out_desc.properties[0] = 0; +out_desc.shape[0] = b; +out_desc.shape[1] = m; +out_desc.shape[2] = n; +out_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP; +out_desc.layout = &row_major; + +cl_mem lhs_tensor = clCreateBufferWithProperties( + ctx, {CL_MEM_TENSOR_EXP, lhs_desc, 0}, + CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, lhs_data.data(), &err); +cl_mem rhs_tensor = clCreateBufferWithProperties( + ctx, {CL_MEM_TENSOR_EXP, rhs_desc, 0}, + CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, rhs_data.data(), &err); +cl_mem bias_tensor = clCreateBufferWithProperties( + ctx, {CL_MEM_TENSOR_EXP, out_desc, 0}, + CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, rhs_data.data(), &err); +cl_mem out_tensor = clCreateBufferWithProperties( + ctx, {CL_MEM_TENSOR_EXP, out_desc, 0}, + CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE, 0, out_data.data(), &err); + +cl_tensor_datatype_union_khr alpha, beta, relu_coeff; +alpha.sf = 2.0f; +beta.sf = -1.0f; +relu_coeff.sf = 0.01f; + +cl_dkb_attributes_gemm_khr gemm_attrs = { + lhs_desc, rhs_desc, out_desc, 
out_desc, 0, 0, alpha, beta, {} +}; +gemm_attrs.kernel_props[0] = CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR; +gemm_attrs.kernel_props[1] = 100; // in ILPs +gemm_attrs.kernel_props[2] = 0; + +cl_dkb_attributes_leaky_relu_khr relu_attrs = { + out_desc, out_desc, relu_coeffs, {0} +}; + +cl_device_id target_devices[2] = {dev1, dev2}; +cl_int device_errcodes[2]; +auto prog = clCreateProgramWithDefinedBuiltInKernels( + ctx, 2, target_devices, 2, + {CL_DBK_GEMM_KHR, CL_DBK_LEAKY_RELU_KHR}, {"my_gemm", "my_relu"}, + {&gemm_attrs, &relu_attrs}, &device_errcodes, &err); + +std::vector supported_devs; +for (unsigned i = 0; i < 2; i++) { + if (device_errcodes[i] == CL_SUCCESS) { + supported_devs.push_back(target_devices[i]); + } else { + // Handle errors. Possible error cases (non-exhaustive): + // + // * CL_DBK_UNSUPPORTED_KHR: The DBK is not supported on the device. + // * CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR The DBK implementation does not + // meet the requested precision. + } +} + +err = clBuildProgram( + prog, supported_devs.size(), supported_devs.data(), "", nullptr, nullptr); + +auto gemm_kernel = clCreateKernel(prog, "my_gemm", &err); +clSetKernelArg(gemm_kernel, 0, sizeof(cl_mem), &lhs_tensor); +clSetKernelArg(gemm_kernel, 1, sizeof(cl_mem), &rhs_tensor); +clSetKernelArg(gemm_kernel, 2, sizeof(cl_mem), &bias_tensor); +clSetKernelArg(gemm_kernel, 3, sizeof(cl_mem), &out_tensor); + +auto relu_kernel = clCreateKernel(prog, "my_relu", &err); +clSetKernelArg(relu_kernel, 0, sizeof(cl_mem), &out_tensor); +clSetKernelArg(relu_kernel, 1, sizeof(cl_mem), &out_tensor); + +cmq_q = /* Create an in-order command queue. */; + +clEnqueueNDRangeKernel( + cmd_q, 0, nullptr, nullptr, nullptr, gemm_kernel, 0, nullptr, nullptr); +clEnqueueNDRangeKernel( + cmd_q, 0, nullptr, nullptr, nullptr, relu_kernel, 0, nullptr, nullptr); +clEnqueueMapBuffer( + cmd_q, out_tensor, CL_TRUE, CL_MAP_READ, 0, b * m * n, 0, nullptr, nullptr); +---- + +=== Open questions + +. 
Should we enable launching DBKs from the device side without requiring device-side enqueue? The main problem is DBKs with an NDRange, as they are not simple single-work-item helper functions.
+
--
*UNRESOLVED*

--

. Should the NDRange be used at all in DBKs? It feels somewhat unnatural: the NDRange typically implies SPMD parallelism, whereas the hardware/firmware is free to choose whatever parallelization strategy it uses to implement the function. On the other hand, the same applies to software kernel launches, since NDRange-launched work-items can be executed serially as long as barrier semantics are adhered to.
+
--
*UNRESOLVED*

--

. Different accelerators prefer different channel orders (NHWC vs. NCHW...) for the processed data. Should the channel order be passed as a DBK argument (like in the example GEMM's row/column order) or is it better to have a different DBK variation for each?
+
--
*UNRESOLVED*

--

. How to denote preference? Some of the DBKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with a suboptimal channel order in NN accelerators) might still be beneficially accelerated.
+
--
*UNRESOLVED*

--

. Since the defined built-in kernel concept is basically just a C-like API inside another API, should it be made more generic and thus directly usable for SYCL and Vulkan as well?
+
--
*UNRESOLVED*

--

. What other DBK mode properties should we have? Here are some ideas:
** Perform accumulation with saturation.
** Finite math only.
** Flush denormals to zero.

+
--
*UNRESOLVED*
--

. Should we reuse clEnqueueTask (and remove its "deprecated" status)
for launching DBKs, since DBKs make no use of the global offset,
global size and local size parameters? 
++
--
*UNRESOLVED*
--

== Version History

[cols="5,10,15,40",options="header",grid="rows"]
|====
| *Version* | *Date* | *Author* | *Description*
| 0.1.0 | 2022-12-13 |
Pekka Jääskeläinen +
Ben Ashbaugh a|
First formulation as an extension specification, as proposed by Ben Ashbaugh.

| 0.2.0 | 2023-11-23 |
Henry Linjamäki +
Pekka Jääskeläinen +
Ben Ashbaugh
a|
Add APIs for defined built-in kernel (DBK) creation. Model DBKs on
the tensor type. Add sample code.

| 0.3.0 | 2024-08-20 |
Henry Linjamäki +
Pekka Jääskeläinen +
Freddie Witherden a|
* Rework the document structure to match the cl_khr_extension_template.
* Reflect changes of the `cl_exp_tensor` extension here.
* Add "Kernel Interface" section into the DBK Appendix.
* Add GEMM DBK.
* Change DBK creation interface.
|====
diff --git a/extensions/cl_khr_defined_builtin_kernels.html b/extensions/cl_khr_defined_builtin_kernels.html
new file mode 100644
index 000000000..e4188cbc7
--- /dev/null
+++ b/extensions/cl_khr_defined_builtin_kernels.html
@@ -0,0 +1,1888 @@
+
+
+
+
+
+
+
+cl_khr_defined_builtin_kernels
+
+
+
+
+
+
+
    +
    +
    +
    +

    The purpose of this extension is to provide a standardized set of built-in +kernels with well-defined semantics useful for accelerating applications +from various domains. The extension specification is designed to rapidly +expand and "live" via addition of new well-defined built-in kernel +definitions and updating of previously defined ones.

    +
    +

    XXX - Not complete yet!!!

    +
    +
    +
    +

    Name Strings

    +
    +
    +

    cl_khr_defined_builtin_kernels

    +
    +
    +
    +
    +

    Contact

    +
    +
    +

    TODO

    +
    +
    +
    +
    +

    Contributors

    +
    +
    +

    Pekka Jääskeläinen, Intel and Tampere University.
    +Topi Leppänen, Tampere University.
    +Jan Solanti, Tampere University.
    +Ben Ashbaugh, Intel.
    +Henry Linjamäki, Intel.

    +
    +
    +
    +
    +

    Notice

    +
    +
    +

    TODO

    +
    +
    +
    +
    +

    Status

    +
    +
    +

    Draft spec, NOT APPROVED!!

    +
    +
    +
    +
    +

    Version

    +
    +
    +

    Built On: 2024-08-20
    +Version: 0.3.0

    +
    +
    +
    +
    +

    Dependencies

    +
    +
    +

    This extension is written against the OpenCL Specification version 3.0.12.

    +
    +
    +

    This extension requires OpenCL 1.2 or later.

    +
    +
    +

    This extension requires cl_exp_tensor.

    +
    +
    +
    +
    +

    Overview

    +
    +
    +

    OpenCL 1.2 specifies a built-in kernel as a kernel that is executed on +an OpenCL device or custom device by fixed-function hardware or in firmware. +Applications can query the built-in kernels supported by a device or custom +device.

    +
    +
    +

Built-in kernels are referred to by a name (a C string) without any +semantics attached to the functionality. The semantics behind the name +are completely device specific, typically documented in vendor-specific +extension specifications.

    +
    +
    +

The goal for this extension is to lower the bar for utilizing hardware +accelerated functions in drivers by providing a library of +well-defined built-in kernels with good coverage for common acceleration needs +and which is designed to easily evolve over time.

    +
    +
    +

The device drivers that implement this extension can freely choose which +subset of defined built-in-kernels (DBKs) they implement and advertise to the clients. The +clients can use the DBKs to accelerate their applications by manually +invoking the DBKs. The extension is designed to also support using +automated task graph lowering tooling later.

    +
    +
    +

    Background

    +
    +

    ASIC-based coarse-grained hardware accelerators are specialized logic meant to +speed up execution of workloads of interest, or to provide improvements in +energy-efficiency. Examples of contemporary workloads that are beneficially hardware +accelerated over software-based implementations include video coding, deep learning, +cryptography, software-defined radio and graphics rendering.

    +
    +
    +

FPGAs form a special case somewhere between instruction-set architectures and fixed +function hardware accelerators. While advances in high-level synthesis tools +have attempted to bridge the programmability gap between GPU and FPGA programming, +FPGAs are still considered devices with which it is challenging to achieve efficient +implementations. Due to extensive manual optimization work required for efficient +implementations of the accelerated functionality, defining FPGA designs as +a system of "hardware accelerator IPs" is still a widely used "application abstraction". +FPGAs can be thus seen as a platform that can realize and integrate any +hardware accelerator implementable with the programmable fabric.

    +
    +
    +

The means to utilize hardware accelerators have typically been +vendor-specific and abstracted behind domain-specific libraries. +The overhead with the "bunch of libraries"-approach is seen in the lowest level +of integration: The libraries utilize a low level library (typically +vendor-specific) to interface with the actual hardware, and thus do not +integrate efficiently with other libraries or software-programmable processors +that might be available on the same chip.

    +
    +
    +
    +

    Rationale

    +
    +

    OpenCL’s built-in kernel abstraction allows pushing both hardware +accelerated and software defined kernels to the same command-queues, +providing a powerful means for asynchronous execution of heterogeneous +task graphs on diverse heterogeneous platforms. The ability to invoke hardware +accelerators while being able to synchronize and optimize data transfers at +the lowest levels of the driver stack can provide significant latency benefits, +especially when combined with the command-buffering mechanism.

    +
    +
    +

    However, the built-in kernel abstraction works well only when it is widely adopted by +vendors, and when multiple vendors implement the same definitions. Otherwise +each vendor specifies and implements their own built-in kernels closely matching their +own hardware accelerator properties, resulting in lack of cross-vendor +portability in the API abstraction presented to the upper layers of +heterogeneous computing software stacks.

    +
    +
    +

    This extension standardizes a set of well-defined built-in kernels the +clients can call from higher level programming stacks built with +different languages and multiple libraries, possibly mix accelerator +calls with calls to software kernel commands, and rely on the driver +stack to optimize the execution (especially the synchronization and +communication) as a low level heterogeneous task graph. The +heterogeneous task graph can be described using multiple +command-queues and optionally cached using the command buffer +extension (cl_khr_command_buffer). It aims to promote the use of +built-in kernels as a programming model for hardware accelerated +functionality, to improve cross-vendor portability of hardware +accelerated computing.

    +
    +
    +
    +
    +
    +

    New API Functions

    +
    +
    +
    +
    #define CL_MAX_DBK_PROPERTIES 16
    +
    +clCreateProgramWithDefinedBuiltInKernels(
    +    cl_context           context,
    +    cl_uint              num_devices,
    +    const cl_device_id*  device_list,
    +    cl_uint              num_kernels,
+    const cl_dbk_id_khr* kernel_ids,
+    const char**         kernel_names,
    +    const void**         kernel_attributes,
+    cl_int*              device_errcode_ret,
    +    cl_int*              errcode_ret);
    +
    +
    +
    +
    +
    +

    New API Types

    +
    +
    +
    +
    typedef cl_uint       cl_dbk_id_khr;
    +typedef cl_properties cl_dbk_properties_khr;
    +
    +typedef union {
    +    cl_char    sc;
    +    cl_uchar   uc;
    +    cl_short   ss;
    +    cl_ushort  us;
    +    cl_int     si;
    +    cl_uint    ui;
    +    cl_long    sl;
    +    cl_ulong   ul;
    +    cl_half    fh;
    +    cl_float   ff;
    +    cl_double  fd;
    +    void*      raw;
    +} cl_tensor_datatype_union_khr;
    +
    +typedef struct cl_dbk_attributes_matmul_khr {
    +    cl_tensor_desc                a;
    +    cl_tensor_desc                b;
    +    cl_tensor_desc                c;
+    cl_bool                       trans_a,
+    cl_bool                       trans_b,
    +    cl_dbk_properties_khr         kernel_props[CL_MAX_DBK_PROPERTIES];
    +} cl_dbk_attributes_matmul_khr;
    +
    +typedef struct cl_dbk_attributes_gemm_khr {
    +    cl_tensor_desc                a;
    +    cl_tensor_desc                b;
    +    cl_tensor_desc                c_in;
    +    cl_tensor_desc                c_out;
    +    cl_bool                       trans_a;
    +    cl_bool                       trans_b;
    +    cl_tensor_datatype_union_khr  alpha;
    +    cl_tensor_datatype_union_khr  beta;
    +    cl_dbk_properties_khr         kernel_props[CL_MAX_DBK_PROPERTIES];
    +} cl_dbk_attributes_gemm_khr;
    +
    +typedef struct cl_dbk_attributes_leaky_relu_khr {
    +   cl_tensor_datatype_union_khr   coefficient;
    +   cl_dbk_properties_khr          kernel_props[CL_MAX_DBK_PROPERTIES];
    +} cl_dbk_attributes_leaky_relu_khr;
    +
    +
    +
    +
    +
    +

    New API Enums

    +
    +
    +

Accepted values for cl_dbk_id_khr:

    +
    +
    +
    +
    CL_DBK_MATMUL_KHR      0x????
    +CL_DBK_GEMM_KHR        0x????
    +CL_DBK_LEAKY_RELU_KHR  0x????
    +
    +
    +
    +

Accepted values for cl_dbk_properties_khr:

    +
    +
    +
    +
    CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR  0x????
    +CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR   0x????
    +
    +
    +
    +

    New error codes:

    +
    +
    +
    +
    CL_DBK_UNSUPPORTED_KHR                0x????
    +CL_DBK_UNSUPPORTED_PROPERTY_KHR       0x????
    +CL_DBK_INVALID_ATTRIBUTE_KHR          0x????
    +CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR   0x????
    +
    +
    +
    +
    +
    +

    Modifications to the OpenCL Specification

    +
    +
    +
    +
    (Add the following to section 5.8.1, Creating Program Objects)
    +
    +
    +
    +
    +

    To create a program object for a context and to load the information +related to the defined built-in kernels into that object, call the +function:

    +
    +
    +
    +
    clCreateProgramWithDefinedBuiltInKernels(
    +    cl_context          context,
    +    cl_uint             num_devices,
    +    const cl_device_id* device_list,
    +    cl_uint             num_kernels,
+    const cl_dbk_id_khr* kernel_ids,
    +    const char**        kernel_names,
    +    const void**        kernel_attributes,
    +    cl_int*             device_errcode_ret,
    +    cl_int*             errcode_ret);
    +
    +
    +
    +
      +
    • +

      context must be a valid OpenCL context.

      +
    • +
    • +

      num_devices is the number of elements in device_list and +device_errcode_ret lists.

      +
    • +
    • +

      device_list is a pointer to a list of devices that are in +context. device_list must be a non-NULL value. The defined built-in +kernels are loaded for devices specified in this list.

      +
    • +
    • +

num_kernels is the number of elements in the kernel_ids, +kernel_attributes and kernel_names lists.

      +
    • +
    • +

      kernel_ids is the list of defined built-in kernels to +be loaded into the program.

      +
    • +
    • +

      kernel_names is a list of names given for each kernel listed in +kernel_ids. Each string in the list must be non-NULL and unique.

      +
    • +
    • +

      kernel_attributes is a list of pointers that point to the +respective attribute structure of each defined built-in kernel in +the kernel_ids list. The respective attribute structures for each +kernel identifiers are listed in Appendix TODO.

      +
    • +
    • +

device_errcode_ret will return an appropriate error code per +device. If device_errcode_ret is NULL, no error code is returned.

      +
    • +
    • +

      errcode_ret will return an appropriate error code. If +errcode_ret is NULL, no error code is returned.

      +
    • +
    +
    +
    +

The devices associated with the program object will be the list of +devices specified by device_list or a subset of it. The list of +devices specified by device_list must be devices associated with +context.

    +
    +
    +

clCreateProgramWithDefinedBuiltInKernels returns a valid non-zero +program object and errcode_ret is set to CL_SUCCESS if the program +object is created successfully. The returned program is created for +the devices that support the requested built-in kernels, indicated by +CL_SUCCESS in the device_errcode_ret list. In case of a failure to +create the program for a device, one of the following error codes is set +in the device_errcode_ret list for the respective device:

    +
    +
    +
      +
    • +

      CL_DBK_UNSUPPORTED_KHR if the device does not support one of the +built-in kernels listed in kernel_ids.

      +
    • +
    • +

      CL_INVALID_PROPERTY if a property list for a defined built-in +kernel description is invalid.

      +
    • +
    • +

      CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR if a defined built-in kernel +does not meet the requested precision.

      +
    • +
    • +

      CL_OUT_OF_RESOURCES if there is a failure to allocate resources +required by the OpenCL implementation on the device.

      +
    • +
    +
    +
    +

    If a program object is not created, +clCreateProgramWithDefinedBuiltInKernels returns a NULL value with +one of the following error codes returned in errcode_ret:

    +
    +
    +
      +
    • +

      CL_INVALID_CONTEXT if context is not a valid context.

      +
    • +
    • +

      CL_INVALID_VALUE if device_list is NULL or num_devices is zero.

      +
    • +
    • +

      CL_INVALID_VALUE if a kernel name is not unique within kernel_names.

      +
    • +
    • +

      CL_INVALID_VALUE if there is a NULL value in kernel_names.

      +
    • +
    • +

      CL_INVALID_DBK_ID_KHR if any value in the kernel_ids is not a known +identifier for a built-in kernel.

      +
    • +
    • +

      CL_INVALID_DBK_ATTRIBUTE_KHR if a kernel attribute structure is +invalid for a built-in kernel.

      +
    • +
    • +

      CL_DBK_UNSUPPORTED_KHR if device_errcode_ret is NULL and any +device in device_list does not support a defined built-in kernel.

      +
    • +
    • +

CL_DBK_UNSUPPORTED_KHR if device_errcode_ret is non-NULL and none +of the devices in device_list supports a defined built-in kernel.

      +
    • +
    • +

CL_DBK_UNSUPPORTED_PROPERTY_KHR if a kernel does not accept a +valid kernel property.

      +
    • +
    • +

      CL_INVALID_DEVICE if any device in device_list is not in the list of +devices associated with context.

      +
    • +
    • +

      CL_OUT_OF_RESOURCES if there is a failure to allocate resources +required by the OpenCL implementation on the device.

      +
    • +
    • +

      CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources +required by the OpenCL implementation on the host.

      +
    • +
    +
    +
    +
    +
    +
    (Modify section 5.10, Executing Kernels)
    +
    +
    +
    +
    +
    +
    (Add following to clEnqueueNDRangeKernel)
    +
    +
    +
    +
    +
    +

For defined built-in kernels, the work_dim, global_work_offset, +global_work_size and local_work_size parameters are meaningless +and must be set to zero and NULL, respectively. OpenCL implementations +decide how they distribute the workloads of the defined built-in +kernels.
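The launch-parameter rule above can be sketched as a small validation helper. This is an illustrative sketch only, not part of the extension: the two `CL_INVALID_*` codes introduced by this draft carry placeholder values here (hypothetical; the real values are still shown as 0x???? in this document), while CL_INVALID_WORK_DIMENSION and CL_INVALID_GLOBAL_WORK_SIZE use their existing OpenCL values.

```c
#include <stddef.h>

#define CL_SUCCESS                     0
#define CL_INVALID_WORK_DIMENSION     -53    /* existing OpenCL code */
#define CL_INVALID_GLOBAL_WORK_SIZE   -63    /* existing OpenCL code */
#define CL_INVALID_GLOBAL_WORK_OFFSET -1063  /* hypothetical placeholder */
#define CL_INVALID_LOCAL_WORK_SIZE    -1064  /* hypothetical placeholder */

/* For a defined built-in kernel the NDRange parameters carry no meaning:
 * work_dim must be zero and the three work-size pointers must be NULL.
 * The implementation distributes the workload itself. */
static int validate_dbk_launch(unsigned work_dim,
                               const size_t *global_work_offset,
                               const size_t *global_work_size,
                               const size_t *local_work_size)
{
    if (work_dim != 0)
        return CL_INVALID_WORK_DIMENSION;
    if (global_work_size != NULL)
        return CL_INVALID_GLOBAL_WORK_SIZE;
    if (global_work_offset != NULL)
        return CL_INVALID_GLOBAL_WORK_OFFSET;
    if (local_work_size != NULL)
        return CL_INVALID_LOCAL_WORK_SIZE;
    return CL_SUCCESS;
}
```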

    +
    +
    +
    +
    +
    +
    +
    +
    +
    (Add the following to the list of error codes returned by clEnqueueNDRangeKernel)
    +
    +
    +
    +
    +
    +
      +
    • +

      CL_INVALID_GLOBAL_WORK_SIZE if the kernel is a defined built-in +kernel and global_work_size is not NULL.

      +
    • +
    • +

      CL_INVALID_GLOBAL_WORK_OFFSET if the kernel is a defined built-in +kernel and global_work_offset is not NULL.

      +
    • +
    • +

      CL_INVALID_LOCAL_WORK_SIZE if the kernel is a defined built-in +kernel and local_work_size is not NULL.

      +
    • +
    +
    +
    +
    + +
    +
    +
    +

    Add new appendix "Defined Built-in Kernels" to OpenCL API Specification

    +
    +

    This chapter describes standard defined built-in kernels (DBK) with +well-defined semantics. They are loaded into a program using +clCreateProgramWithDefinedBuiltinKernels and the kernels in it are +launched using clEnqueueNDRangeKernel with work_dim set to zero +and global_work_offset, global_work_size and local_work_size set +to NULL.

    +
    +
    +

The general client-side abstraction of the DBKs is similar to a call +to a C function whose implementation is hidden. Device drivers +are free to implement a DBK by invoking one or more coarse- and fine-grained hardware accelerators combined with +firmware to implement the semantics as efficiently as possible.

    +
    +
    +

    It is the driver’s responsibility to handle efficient synchronization and communication +to the hardware accelerator, the internal accelerator state management and resource sharing +across multiple OpenCL contexts.

    +
    +
    +

    Reproducibility

    +
    +

Identical DBKs, or the same DBK executed repeatedly with identical inputs, are +guaranteed to produce identical results, unless otherwise stated in +the DBK’s description, when:

    +
    +
    +
      +
    • +

      enqueued to the same device,

      +
    • +
    • +

      on the same platform,

      +
    • +
    • +

      on the same vendor with the same driver version and

      +
    • +
    • +

the CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR property is not set.

      +
    • +
    +
    +
    +

In other cases, the DBKs may produce different results. Two DBKs for a +device are considered identical if they are created using identical +kernel identifiers, kernel attributes and kernel properties. Result +differences may occur because of different algorithms being used across +devices, for example.

    +
    +
    +

DBKs may produce approximate results, and the error, with respect to the +infinitely precise result, can optionally be controlled by +CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR when the property name is listed in +the DBK’s description. When the precision is not controlled by the +application using CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR, the +precision of results is

    +
    +
    +
      +
    • +

      chosen by the implementation for floating-point based tasks.

      +
    • +
    • +

      exact for integer based tasks.

      +
    • +
    +
    +
    +
    +

    Kernel Interface

    +
    +

DBKs operate on tensor objects, created with +clCreateBufferWithProperties using the CL_MEM_TENSOR property, +generally in static single-assignment fashion. The kernel arguments +used for reading and writing tensors may not reference the same tensor +object unless otherwise stated in the DBK descriptions.

    +
    +
    +
    +

    The Defined Built-in Kernels

    +
    +

The recognized defined built-in kernels are listed in the +following table. The list is expected to be +expanded and updated over the versions of this extension, while +preserving backwards compatibility.

    +
    +
    +

    Each defined built-in kernel entry is organized as follows:

    +
    +
    +
      +
    • +

      Name: Name of the defined built-in kernel (an enumeration).

      +
    • +
    • +

      Kernel attributes: The kernel attributes required for creating the +defined built-in kernel via +clCreateProgramWithDefinedBuiltinKernels. Attribute values are +immutable.

      +
    • +
    • +

      Kernel arguments: The kernel arguments.

      +
    • +
    • +

      Description: The description of the kernel in detail.

      +
    • +
    • +

Attribute validation rules: Conditions the kernel attributes of the +kernel must satisfy. The implementation must return CL_DBK_INVALID_ATTRIBUTE_KHR on the +clCreateProgramWithDefinedBuiltinKernels call if any of the conditions +are violated.

      +
    • +
    • +

Kernel mode properties: List of kernel properties +(cl_dbk_properties_khr) the kernel may accept. The properties can +be used to tweak certain implementation details and behaviors in +the kernel execution. If a property not listed in the DBK +description is passed to the clCreateProgramWithDefinedBuiltinKernels +call, then the implementation must return +CL_DBK_UNSUPPORTED_PROPERTY_KHR.

      +
    • +
    +
    + + +++++ + + + + + + + + + + + + + + + + + + + +
    Table 1. Table of defined built-in kernel properties
    DBK Mode PropertyProperty ValueDescription

    CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR

    float

    +

Require that the DBK produces results that do not deviate by more +than the given number of ULPs (units in the last place) with respect +to the infinitely precise result.

    +

    CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR

    cl_bool

    +

Allow the results of the kernel to be non-reproducible. This allows the +implementation to switch the kernel’s algorithm on each launch for +possibly better performance.

    +
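Mirroring the pattern used in the sample code at the end of this document, a kernel_props list carries (property, value) pairs followed by a zero terminator. A minimal sketch of building and scanning such a list; the enum value is a placeholder (the real one is still shown as 0x???? in this draft), and the stand-in typedef is for self-containment only:

```c
#include <stddef.h>

typedef unsigned long cl_dbk_properties_khr;  /* stand-in for the real typedef */

#define CL_MAX_DBK_PROPERTIES 16
#define CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR 0x1234  /* placeholder value */

/* Count the (property, value) pairs in a zero-terminated property list. */
static size_t count_dbk_props(const cl_dbk_properties_khr *props)
{
    size_t n = 0;
    while (props[2 * n] != 0)
        ++n;
    return n;
}
```

A caller would fill the array as in the sample code: `props[0] = CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR; props[1] = 100; props[2] = 0;` (note that the sample stores the ULP count as an integer, although the table above gives the property's value type as float).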
    + + +++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Table 2. Standard Built-in Kernels and Their Semantics. The table has been populated with a small set of non-trivial example entries which are subject to change; the list is expected to expand during drafting.

    Name: CL_DBK_GEMM_KHR

    Kernel Attributes

    +
    +
    typedef struct cl_dbk_attributes_gemm_khr {
    +    cl_tensor_desc a;
    +    cl_tensor_desc b;
    +    cl_tensor_desc c_in;
    +    cl_tensor_desc c_out;
    +    cl_bool trans_a;
    +    cl_bool trans_b;
    +    cl_tensor_datatype_union_khr alpha;
    +    cl_tensor_datatype_union_khr beta;
+    cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES];
    +} cl_dbk_attributes_gemm_khr;
    +
    +
    +
    +
      +
    • +

      a is a tensor description for input matrix A.

      +
    • +
    • +

      b is a tensor description for input matrix B.

      +
    • +
    • +

c_in is a tensor description for input matrix CIN.

      +
    • +
    • +

      c_out is a tensor description for output matrix COUT.

      +
    • +
    • +

trans_a instructs the kernel to transpose the A matrix if the value is CL_TRUE.

      +
    • +
    • +

trans_b instructs the kernel to transpose the B matrix if the value is CL_TRUE.

      +
    • +
    • +

alpha is a value, or a pointer to a value, corresponding to the +element type of c_out.

      +
    • +
    • +

beta is a value, or a pointer to a value, corresponding to the +element type of c_out.

      +
    • +
    • +

kernel_props defines additional kernel properties.

      +
    • +
    +

    Kernel Arguments

    +
      +
    1. +

      cl_mem: a tensor object for matrix A (read only).

      +
    2. +
    3. +

      cl_mem: a tensor object for matrix B (read only).

      +
    4. +
    5. +

      cl_mem: a tensor object for matrix C_IN (read only).

      +
    6. +
    7. +

      cl_mem: a tensor object for matrix C_OUT (write only).

      +
    8. +
    +

    Description

    +

    Performs (batched) general matrix multiplication:

    +
    +
    +
    +`{\mathbf{\text{COUT}}}_{b , m , n} = \text{beta} \cdot {\mathbf{\text{CIN}}}_{b , m , n} + \text{alpha} \cdot \sum_{k} t r a n s {\left ( \mathbf{\text{A}} , \text{trans\_a} \right )}_{b , m , k} t r a n s {\left ( \mathbf{\text{B}} , \text{trans\_b} \right )}_{b , k , n}` +
    +
    +
    +

    Where:

    +
    +
    +
    +`t r a n s ( X_{b , i , j} , t r ) = \left \{ \begin{matrix} X_{b , j , i} & \text{if tr} = \text{CL\_TRUE} \\ X_{b , i , j} & \text{otherwise} \end{matrix} \right .` +
    +
    +
    +

    Second degree tensors of shape (a, b) are treated as third degree +tensors of shape (1, a, b).

    +
    +
    +

Operations of the matrix multiplication are performed in the precision +of elementof(COUT).

    +
    +
    +

If an overflow occurs in the accumulation of the products, then the COUT +tensor’s result will be undefined.

    +
    +
    +

    CIN and COUT tensors may be the same object.
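The formulas above can be captured as a plain reference implementation. This is an illustrative sketch only, not the extension's API: it assumes row-major FP32 tensors of shape (B, M, K), (B, K, N) and (B, M, N), and omits the trans_a/trans_b transposes for brevity.

```c
#include <stddef.h>

/* Reference semantics for CL_DBK_GEMM_KHR:
 * COUT[i,m,n] = beta * CIN[i,m,n] + alpha * sum_k A[i,m,k] * B[i,k,n] */
static void gemm_ref(size_t B, size_t M, size_t N, size_t K,
                     const float *a, const float *b, const float *c_in,
                     float *c_out, float alpha, float beta)
{
    for (size_t i = 0; i < B; ++i)
        for (size_t m = 0; m < M; ++m)
            for (size_t n = 0; n < N; ++n) {
                float acc = 0.0f;
                for (size_t k = 0; k < K; ++k)
                    acc += a[(i * M + m) * K + k] * b[(i * K + k) * N + n];
                c_out[(i * M + m) * N + n] =
                    beta * c_in[(i * M + m) * N + n] + alpha * acc;
            }
}
```

Per the GEMM description, a conforming implementation performs the accumulation in the precision of COUT's element type, which the float accumulator above reflects for the FP32 case.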

    +

    Attribute validation rules

    +
      +
    • +

      rankof(A) == rankof(B) == rankof(CIN) == rankof(COUT).

      +
    • +
    • +

Let shapeof(At) == (b…, m, k) and shapeof(Bt) == (b…, k, +n) of tensors A and B, respectively, after possible transposing. +shapeof(COUT) must be (b…, m, n).

      +
    • +
    • +

      shapeof(CIN) == shapeof(COUT).

      +
    • +
    • +

      elementof(A) == elementof(B).

      +
    • +
    • +

elemkindof(COUT) == elemkindof(A).

      +
    • +
    • +

elementof(COUT) == elementof(A) or elementof(A) is promotable to +elementof(COUT) without a loss of meaning.

      +
    • +
    +

    Kernel mode properties

    +

    This DBK accepts the following kernel properties:

    +
    +
    +
      +
    • +

      CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR

      +
    • +
    • +

      CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR

      +
    • +
    +

    Name: CL_DBK_MATMUL_KHR

    Kernel Attributes

    +
    +
    typedef struct cl_dbk_attributes_matmul_khr {
    +    cl_tensor_desc a;
    +    cl_tensor_desc b;
    +    cl_tensor_desc c;
    +    cl_bool trans_a;
    +    cl_bool trans_b;
+    cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES];
    +} cl_dbk_attributes_matmul_khr;
    +
    +
    +
    +
      +
    • +

      a is a tensor description for input matrix A.

      +
    • +
    • +

      b is a tensor description for input matrix B.

      +
    • +
    • +

      c is a tensor description for output matrix C.

      +
    • +
    • +

trans_a instructs the kernel to transpose the A matrix if the value is CL_TRUE.

      +
    • +
    • +

trans_b instructs the kernel to transpose the B matrix if the value is CL_TRUE.

      +
    • +
    • +

kernel_props defines additional kernel properties.

      +
    • +
    +

    Kernel Arguments

    +
      +
    1. +

      cl_mem: a tensor object for matrix A (read only).

      +
    2. +
    3. +

      cl_mem: a tensor object for matrix B (read only).

      +
    4. +
    5. +

      cl_mem: a tensor object for matrix C (write only).

      +
    6. +
    +

    Description

    +

    Performs (batched) matrix multiplication:

    +
    +
    +
    +`{\mathbf{\text{C}}}_{b , m , n} = \sum_{k} t r a n s {\left ( \mathbf{\text{A}} , \text{trans\_a} \right )}_{b , m , k} t r a n s {\left ( \mathbf{\text{B}} , \text{trans\_b} \right )}_{b , k , n}` +
    +
    +
    +

    Where:

    +
    +
    +
    +`t r a n s ( X_{b , i , j} , t r ) = \left \{ \begin{matrix} X_{b , j , i} & \text{if tr} = \text{CL\_TRUE} \\ X_{b , i , j} & \text{otherwise} \end{matrix} \right .` +
    +
    +
    +

    Second degree tensors of shape (a, b) are treated as third degree +tensors of shape (1, a, b).

    +
    +
    +

Operations of the matrix multiplication are performed in the precision +of elementof(C).

    +
    +
    +

If an overflow occurs in the accumulation of the products, then the C +tensor’s result will be undefined.

    +

    Attribute validation rules

    +
      +
    • +

      rankof(A) == rankof(B) == rankof(C).

      +
    • +
    • +

Let shapeof(At) == (b…, m, k) and shapeof(Bt) == (b…, k, +n) of tensors A and B, respectively, after possible transposing. +shapeof(C) must be (b…, m, n).

      +
    • +
    • +

      elementof(A) == elementof(B).

      +
    • +
    • +

      elemkindof(C) == elemkindof(A).

      +
    • +
    • +

      elementof(C) == elementof(A) or elementof(A) is promotable to +elementof(C) without a loss of meaning.

      +
    • +
    +

    Kernel mode properties

    +

    This DBK accepts the following kernel properties:

    +
    +
    +
      +
    • +

      CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR

      +
    • +
    +

Name: CL_DBK_LEAKY_RELU_KHR

    Kernel Attributes

    +
    +
    typedef struct cl_dbk_attributes_leaky_relu_khr {
    +   cl_tensor_datatype_union_khr coefficient;
+   cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES];
    +} cl_dbk_attributes_leaky_relu_khr;
    +
    +
    +
    +
      +
    • +

coefficient is the coefficient of leakage (alpha), a positive value.

      +
    • +
    +

    Kernel arguments

    +
      +
    1. +

      cl_mem: a tensor object IN for input values.

      +
    2. +
    3. +

      cl_mem: a tensor object OUT for output value.

      +
    4. +
    +

    Description

    +

This element-wise built-in kernel performs a leaky ReLU operation as follows:

    +
    +
    +
+`\text{OUT}_{i} = \left \{ \begin{matrix} \text{alpha} \cdot \text{IN}_{i} & \text{if IN}_{i} < 0 \\ \text{IN}_{i} & \text{ otherwise} \end{matrix} \right .` +
    +
    +
    +

If the target device does not support denormals, then the alpha value is +flushed to zero before the operation is applied. This DBK accepts +tensors of arbitrary rank.

    +
    +
    +

    The IN and OUT tensors may be the same object.
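The element-wise operation corresponds to the standard leaky ReLU, in which negative inputs are scaled by the positive leakage coefficient and non-negative inputs pass through unchanged. A minimal scalar sketch, illustrative only and not part of the extension's API:

```c
/* Standard leaky ReLU: OUT = coefficient * IN for IN < 0, else OUT = IN.
 * The coefficient is assumed positive, per the attribute validation rules. */
static float leaky_relu(float in, float coefficient)
{
    return in < 0.0f ? coefficient * in : in;
}
```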

    +

    Kernel mode properties

    This DBK accepts the following kernel properties:

    +

* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR
* CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR

    Attribute validation rules

    +
      +
    • +

      shapeof(in) == shapeof(out).

      +
    • +
    • +

      elementof(in) == elementof(out).

      +
    • +
    • +

      coefficient must be a positive, finite value.

      +
    • +
    +
    +
    +
    +

    Launching DBKs from the Device Side

    DBKs are primarily meant to be launched as kernel commands via host-side
    command queues. Optionally, they can be callable from the device side via
    enqueue_kernel:

    TBC. This probably needs a device-side function corresponding to
    clCreateProgramWithDefinedBuiltinKernels.

    Sample Code

    constexpr size_t b = 64, m = 100, n = 200, k = 50;
    cl_int err;

    std::vector<float> lhs_data = ...;
    std::vector<float> rhs_data = ...;
    std::vector<float> bias_data = ...;
    std::vector<float> out_data(b * m * n);

    cl_tensor_layout_blas_exp row_major;
    row_major.leading_dims[0] = 2;
    row_major.leading_dims[1] = 1;

    cl_tensor_desc_exp lhs_desc;
    lhs_desc.rank = 3;
    lhs_desc.dtype = CL_TENSOR_FP32_EXP;
    lhs_desc.properties[0] = 0;
    lhs_desc.shape[0] = b;
    lhs_desc.shape[1] = m;
    lhs_desc.shape[2] = k;
    lhs_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;
    lhs_desc.layout = &row_major;

    cl_tensor_desc_exp rhs_desc;
    rhs_desc.rank = 3;
    rhs_desc.dtype = CL_TENSOR_FP32_EXP;
    rhs_desc.properties[0] = 0;
    rhs_desc.shape[0] = b;
    rhs_desc.shape[1] = k;
    rhs_desc.shape[2] = n;
    rhs_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;
    rhs_desc.layout = &row_major;

    cl_tensor_desc_exp out_desc;
    out_desc.rank = 3;
    out_desc.dtype = CL_TENSOR_FP32_EXP;
    out_desc.properties[0] = 0;
    out_desc.shape[0] = b;
    out_desc.shape[1] = m;
    out_desc.shape[2] = n;
    out_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;
    out_desc.layout = &row_major;

    cl_mem lhs_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, lhs_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, lhs_data.data(), &err);
    cl_mem rhs_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, rhs_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, rhs_data.data(), &err);
    cl_mem bias_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, out_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, bias_data.data(), &err);
    cl_mem out_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, out_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE, 0, out_data.data(), &err);

    cl_tensor_datatype_union_khr alpha, beta, relu_coeff;
    alpha.ff = 2.0f;
    beta.ff = -1.0f;
    relu_coeff.ff = 0.01f;

    cl_dbk_attributes_gemm_khr gemm_attrs = {
      lhs_desc, rhs_desc, out_desc, out_desc, 0, 0, alpha, beta, {}
    };
    gemm_attrs.kernel_props[0] = CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR;
    gemm_attrs.kernel_props[1] = 100; // in ULPs
    gemm_attrs.kernel_props[2] = 0;

    cl_dbk_attributes_leaky_relu_khr relu_attrs = { relu_coeff, {0} };

    cl_device_id target_devices[2] = {dev1, dev2};
    cl_int device_errcodes[2];
    auto prog = clCreateProgramWithDefinedBuiltInKernels(
      ctx, 2, target_devices, 2,
      {"my_gemm", "my_relu"}, {CL_DBK_GEMM_KHR, CL_DBK_LEAKY_RELU_KHR},
      {&gemm_attrs, &relu_attrs}, device_errcodes, &err);

    std::vector<cl_device_id> supported_devs;
    for (unsigned i = 0; i < 2; i++) {
      if (device_errcodes[i] == CL_SUCCESS) {
        supported_devs.push_back(target_devices[i]);
      } else {
        // Handle errors. Possible error cases (non-exhaustive):
        //
        // * CL_DBK_UNSUPPORTED_KHR: The DBK is not supported on the device.
        // * CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR: The DBK implementation does
        //   not meet the requested precision.
      }
    }

    err = clBuildProgram(
      prog, supported_devs.size(), supported_devs.data(), "", nullptr, nullptr);

    auto gemm_kernel = clCreateKernel(prog, "my_gemm", &err);
    clSetKernelArg(gemm_kernel, 0, sizeof(cl_mem), &lhs_tensor);
    clSetKernelArg(gemm_kernel, 1, sizeof(cl_mem), &rhs_tensor);
    clSetKernelArg(gemm_kernel, 2, sizeof(cl_mem), &bias_tensor);
    clSetKernelArg(gemm_kernel, 3, sizeof(cl_mem), &out_tensor);

    auto relu_kernel = clCreateKernel(prog, "my_relu", &err);
    clSetKernelArg(relu_kernel, 0, sizeof(cl_mem), &out_tensor);
    clSetKernelArg(relu_kernel, 1, sizeof(cl_mem), &out_tensor);

    auto cmd_q = /* Create an in-order command queue. */;

    clEnqueueNDRangeKernel(
      cmd_q, gemm_kernel, 0, nullptr, nullptr, nullptr, 0, nullptr, nullptr);
    clEnqueueNDRangeKernel(
      cmd_q, relu_kernel, 0, nullptr, nullptr, nullptr, 0, nullptr, nullptr);
    clEnqueueMapBuffer(
      cmd_q, out_tensor, CL_TRUE, CL_MAP_READ, 0, b * m * n * sizeof(float),
      0, nullptr, nullptr, &err);
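    For checking the mapped results, the fused GEMM + leaky ReLU computation
    of the sample can be expressed as a host-side reference, independent of
    the OpenCL API (an illustrative sketch; the function name is ours):

    ```cpp
    #include <cstddef>
    #include <vector>

    // Batched GEMM reference matching the sample: per batch, compute
    // OUT = alpha * (LHS x RHS) + beta * BIAS, then apply leaky ReLU.
    // Shapes: LHS (b,m,k), RHS (b,k,n), BIAS and OUT (b,m,n), row-major.
    std::vector<float> gemm_relu_reference(const std::vector<float>& lhs,
                                           const std::vector<float>& rhs,
                                           const std::vector<float>& bias,
                                           std::size_t b, std::size_t m,
                                           std::size_t n, std::size_t k,
                                           float alpha, float beta,
                                           float relu_coeff) {
        std::vector<float> out(b * m * n);
        for (std::size_t bi = 0; bi < b; ++bi)
            for (std::size_t i = 0; i < m; ++i)
                for (std::size_t j = 0; j < n; ++j) {
                    float acc = 0.0f;
                    for (std::size_t p = 0; p < k; ++p)
                        acc += lhs[(bi * m + i) * k + p] *
                               rhs[(bi * k + p) * n + j];
                    float v = alpha * acc + beta * bias[(bi * m + i) * n + j];
                    out[(bi * m + i) * n + j] = v < 0.0f ? relu_coeff * v : v;
                }
        return out;
    }
    ```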

    Open questions

    1. Should we enable launching DBKs from the device side without requiring
       device-side enqueue? The main problem is those with an NDRange as they
       are not simple single-WI helper functions.

       UNRESOLVED

    2. Should the NDRange be used at all in DBKs? It feels sort of unnatural
       as typically the NDRange is used to imply SPMD parallelism while the
       hardware/firmware is free to choose whatever parallelization strategy
       to implement the function. On the other hand, similar applies to
       software kernel launches as the NDRange-launched work-items can be
       executed serially if adhering to barrier semantics.

       UNRESOLVED

    3. Different accelerators prefer different channel orders (NHWC vs.
       NCHW, ...) for the processed data. Should the channel order be passed
       as a DBK argument (like in the example GEMM's row/column order) or is
       it better to have different DBK variations for each?

       UNRESOLVED

    4. How to denote preference? Some of the DBKs are more efficient on a
       given device as they map more naturally to the underlying HW
       accelerator, but the slower variations (for example, with unoptimal
       channel order in NN accelerators) might still be beneficially
       accelerated.

       UNRESOLVED

    5. Since the defined built-in kernel concept is basically just a C-like
       API inside another API, should it be made more generic and thus
       directly usable for SYCL and Vulkan as well?

       UNRESOLVED

    6. What other DBK mode properties should we have? Here are some ideas:

       • Perform accumulation with saturation.
       • Finite math only.
       • Flush denormals to zero.

       UNRESOLVED

    7. Should we reuse (and remove the "deprecation" status of) clEnqueueTask
       for launching DBKs, as DBKs don't make use of the global offset,
       global size, and local size parameters?

       UNRESOLVED

    Version History

    Version 0.1.0, 2022-12-13 (Pekka Jääskeläinen, Ben Ashbaugh):
    First formulation as an extension specification, as proposed by Ben
    Ashbaugh.

    Version 0.2.0, 2023-11-23 (Henry Linjamäki, Pekka Jääskeläinen,
    Ben Ashbaugh): Add APIs for defined built-in kernel (DBK) creation.
    Model DBKs on the tensor type. Add sample code.

    Version 0.3.0, 2024-08-20 (Henry Linjamäki, Pekka Jääskeläinen,
    Freddie Witherden):

      • Rework the document structure to match the cl_khr_extension_template.
      • Reflect changes of the cl_exp_tensor extension here.
      • Add "Kernel Interface" section into the DBK Appendix.
      • Add GEMM DBK.
      • Change DBK creation interface.
    + + + \ No newline at end of file From 9edee145f2f2a97240492ef7b56596dc5366f229 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Henry=20Linjam=C3=A4ki?= Date: Thu, 22 Aug 2024 13:19:37 +0300 Subject: [PATCH 9/9] khr -> exp, resolve open questions, small fixes --- ...> cl_exp_defined_builtin_kernels.asciidoc} | 172 +- .../cl_exp_defined_builtin_kernels.html | 1936 +++++++++++++++++ .../cl_khr_defined_builtin_kernels.html | 1888 ---------------- 3 files changed, 2028 insertions(+), 1968 deletions(-) rename extensions/{cl_khr_defined_builtin_kernels.asciidoc => cl_exp_defined_builtin_kernels.asciidoc} (86%) create mode 100644 extensions/cl_exp_defined_builtin_kernels.html delete mode 100644 extensions/cl_khr_defined_builtin_kernels.html diff --git a/extensions/cl_khr_defined_builtin_kernels.asciidoc b/extensions/cl_exp_defined_builtin_kernels.asciidoc similarity index 86% rename from extensions/cl_khr_defined_builtin_kernels.asciidoc rename to extensions/cl_exp_defined_builtin_kernels.asciidoc index a44aa8b9e..4ce4f52a1 100644 --- a/extensions/cl_khr_defined_builtin_kernels.asciidoc +++ b/extensions/cl_exp_defined_builtin_kernels.asciidoc @@ -8,7 +8,7 @@ include::../config/attribs.txt[] :source-highlighter: coderay :stem: -= cl_khr_defined_builtin_kernels += cl_exp_defined_builtin_kernels The purpose of this extension is to provide a standardized set of built-in kernels with well-defined semantics useful for accelerating applications @@ -22,7 +22,7 @@ definitions and updating of previously defined ones. 
== Name Strings -`cl_khr_defined_builtin_kernels` +`cl_exp_defined_builtin_kernels` == Contact @@ -148,7 +148,7 @@ clCreateProgramWithDefinedBuiltInKernels( const cl_device_id* device_list, cl_uint num_kernels, const char** kernel_names, - const cl_dbk_id_khr* kernel_ids, + const cl_dbk_id_exp* kernel_ids, const void** kernel_attributes, cl_int* device_support_ret, cl_int* errcode_ret); @@ -158,8 +158,8 @@ clCreateProgramWithDefinedBuiltInKernels( [source,c] ---- -typedef cl_uint cl_dbk_id_khr; -typedef cl_properties cl_dbk_properties_khr; +typedef cl_uint cl_dbk_id_exp; +typedef cl_properties cl_dbk_properties_exp; typedef union { cl_char sc; @@ -174,62 +174,62 @@ typedef union { cl_float ff; cl_double fd; void* raw; -} cl_tensor_datatype_union_khr; +} cl_tensor_datatype_union_exp; -typedef struct cl_dbk_attributes_matmul_khr { +typedef struct cl_dbk_attributes_matmul_exp { cl_tensor_desc a; cl_tensor_desc b; cl_tensor_desc c; cl_int trans_a; cl_int trans_b; - cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES]; -} cl_dbk_attributes_matmul_khr; + cl_dbk_properties_exp kernel_props[CL_MAX_DBK_PROPERTIES]; +} cl_dbk_attributes_matmul_exp; -typedef struct cl_dbk_attributes_gemm_khr { +typedef struct cl_dbk_attributes_gemm_exp { cl_tensor_desc a; cl_tensor_desc b; cl_tensor_desc c_in; cl_tensor_desc c_out; cl_bool trans_a; cl_bool trans_b; - cl_tensor_datatype_union_khr alpha; - cl_tensor_datatype_union_khr beta; - cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES]; -} cl_dbk_attributes_gemm_khr; - -typedef struct cl_dbk_attributes_leaky_relu_khr { - cl_tensor_datatype_union_khr coefficient; - cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES]; -} cl_dbk_attributes_leaky_relu_khr; + cl_tensor_datatype_union_exp alpha; + cl_tensor_datatype_union_exp beta; + cl_dbk_properties_exp kernel_props[CL_MAX_DBK_PROPERTIES]; +} cl_dbk_attributes_gemm_exp; + +typedef struct cl_dbk_attributes_leaky_relu_exp { + cl_tensor_datatype_union_exp coefficient; + 
cl_dbk_properties_exp kernel_props[CL_MAX_DBK_PROPERTIES]; +} cl_dbk_attributes_leaky_relu_exp; ---- == New API Enums -Accepted values to *cl_dbk_id_khr*: +Accepted values to *cl_dbk_id_exp*: [source,c] ---- -CL_DBK_MATMUL_KHR 0x???? -CL_DBK_GEMM_KHR 0x???? -CL_DBK_LEAKY_RELU_KHR 0x???? +CL_DBK_MATMUL_EXP 0x???? +CL_DBK_GEMM_EXP 0x???? +CL_DBK_LEAKY_RELU_EXP 0x???? ---- -accepted values to *cl_dbk_properties_khr*: +accepted values to *cl_dbk_properties_exp*: [source,c] ---- -CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR 0x???? -CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR 0x???? +CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP 0x???? +CL_DBK_PROPERTY_NON_DETERMINISTIC_EXP 0x???? ---- New error codes: [source,c] ---- -CL_DBK_UNSUPPORTED_KHR 0x???? -CL_DBK_UNSUPPORTED_PROPERTY_KHR 0x???? -CL_DBK_INVALID_ATTRIBUTE_KHR 0x???? -CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR 0x???? +CL_DBK_UNSUPPORTED_EXP 0x???? +CL_DBK_UNSUPPORTED_PROPERTY_EXP 0x???? +CL_DBK_INVALID_ATTRIBUTE_EXP 0x???? +CL_DBK_UNMET_MAX_RELATIVE_ERROR_EXP 0x???? ---- == Modifications to the OpenCL Specification @@ -298,13 +298,13 @@ the devices that supports the requested built-in kernels indicated by create program for a device, one of the following errors code is set in _device_errcode_ret_ list for the respective device: -* *CL_DBK_UNSUPPORTED_KHR* if the device does not support one of the +* *CL_DBK_UNSUPPORTED_EXP* if the device does not support one of the built-in kernels listed in _kernel_ids_. * *CL_INVALID_PROPERTY* if a property list for a defined built-in kernel description is invalid. -* *CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR* if a defined built-in kernel +* *CL_DBK_UNMET_MAX_RELATIVE_ERROR_EXP* if a defined built-in kernel does not meet the requested precision. * *CL_OUT_OF_RESOURCES* if there is a failure to allocate resources @@ -326,19 +326,19 @@ one of the following error codes returned in _errcode_ret_: * *CL_INVALID_VALUE* if there is a NULL value in _kernel_names_. 
-* *CL_INVALID_DBK_ID_KHR* if any value in the _kernel_ids_ is not a known +* *CL_INVALID_DBK_ID_EXP* if any value in the _kernel_ids_ is not a known identifier for a built-in kernel. -* *CL_INVALID_DBK_ATTRIBUTE_KHR* if a kernel attribute structure is +* *CL_INVALID_DBK_ATTRIBUTE_EXP* if a kernel attribute structure is invalid for a built-in kernel. -* *CL_DBK_UNSUPPORTED_KHR* if _device_errcode_ret_ is NULL and any +* *CL_DBK_UNSUPPORTED_EXP* if _device_errcode_ret_ is NULL and any device in _device_list_ does not support a defined built-in kernel. -* *CL_DBK_UNSUPPORTED_KHR* if _device_errcode_ret_ is non-NULL and all +* *CL_DBK_UNSUPPORTED_EXP* if _device_errcode_ret_ is non-NULL and all devices in _device_list_ does not support a defined built-in kernel. -* *CL_DBK_UNSUPPORTED_PROPERTY_KHR* If a kernel does not accept a +* *CL_DBK_UNSUPPORTED_PROPERTY_EXP* If a kernel does not accept a valid kernel property. * *CL_INVALID_DEVICE* if any device in _device_list_ is not in the list of @@ -415,7 +415,7 @@ the DBK's description, when: * on the same vendor with the same driver version and -* CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR property is not set on. +* CL_DBK_PROPERTY_NON_DETERMINISTIC_EXP property is not set on. In other cases, the DBKs may produce different results. Two DBKs for a device are considered identical if they are created using identical @@ -425,9 +425,9 @@ devices, for example. DBKs may produce approximated results and the error, respect to infinitely precise result, can be optionally controlled by -CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR when the property name is listed in +CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP when the property name is listed in the DBK's description. When the precision is not controlled by the -application using CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR, the OpenCL +application using CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP, the OpenCL precision of results are * chosen by the implementation for floating-point based tasks. 
@@ -463,17 +463,17 @@ Each defined built-in kernel entry is organized as follows: * *Description*: The description of the kernel in detail. * *Attribute validation rules*: Conditions of the kernel attribute for - the kernel. Implementation must return CL_DBK_INVALID_ATTRIBUTE_KHR on + the kernel. Implementation must return CL_DBK_INVALID_ATTRIBUTE_EXP on *clCreateProgramWithDefinedBuiltinKernels* call if any of the conditions are violated. * *Kernel mode properties*: List of <> - (`cl_dbk_properties_khr`) the kernel may accept. The properties can + (`cl_dbk_properties_exp`) the kernel may accept. The properties can be used to tweak certain implementation details and behaviors in the kernel execution. If a property not listed in the DBK description is fed to *clCreateProgramWithDefinedBuiltinKernels* call, then implementation must return - `CL_DBK_UNSUPPORTED_PROPERTY_KHR`. + `CL_DBK_UNSUPPORTED_PROPERTY_EXP`. [[dbk-propery-table]] .Table of defined built-in kernel properties @@ -481,13 +481,13 @@ Each defined built-in kernel entry is organized as follows: |=== | *DBK Mode Property* | *Property Value* | *Description* -| CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR | float +| CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP | float a| Require that the DBK produces the results which do not deviate more than the given amount value of ULPs (units in the last place) respect to infnitely precise result. -| CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR | cl_bool +| CL_DBK_PROPERTY_NON_DETERMINISTIC_EXP | cl_bool a| Allow results of the kernel to be non-reproducible. This allows implementation to switch algorithm of the kernel on each launch for @@ -500,23 +500,23 @@ possibly better performance. [[dbk-description-table]] .Standard Built-in Kernels and Their Semantics. 
*The table has been populated with a small set of non-trivial example entries which are subject to change and the list to expand during drafting.* |=== -| Name: *CL_DBK_GEMM_KHR* +| Name: *CL_DBK_GEMM_EXP* | *Kernel Attributes* a| [source,c] ---- -typedef struct cl_dbk_attributes_gemm_khr { +typedef struct cl_dbk_attributes_gemm_exp { cl_tensor_desc a; cl_tensor_desc b; cl_tensor_desc c_in; cl_tensor_desc c_out; cl_bool trans_a; cl_bool trans_b; - cl_tensor_datatype_union_khr alpha; - cl_tensor_datatype_union_khr beta; + cl_tensor_datatype_union_exp alpha; + cl_tensor_datatype_union_exp beta; cl_dbk_properties kernel_props[CL_MAX_DBK_PROPERTIES]; -} cl_dbk_attributes_gemm_khr; +} cl_dbk_attributes_gemm_exp; ---- * _a_ is a tensor description for input matrix A. @@ -564,7 +564,7 @@ Second degree tensors of shape `(a, b)` are treated as third degree tensors of shape `(1, a, b)`. Operations of the matrix muliplication are performed in the precision -of the `elementof\(COUT)`. +of the `elementof(COUT)`. If an overflow occurs in the accumulation of the products, then `R` tensor's result will be undefined. @@ -577,35 +577,35 @@ a| * `rankof(A) == rankof(B) == rankof(CIN) == rankof(COUT)`. * Let `shapeof(A~t~) == (b..., m, k)` and `shapeof(B~t~) = (b..., k, n)` of tensors `A` and `B`, respectively, after possible tranposing. - `shapeof\(COUT)` must be `(b..., m, n)`. + `shapeof(COUT)` must be `(b..., m, n)`. * `shapeof(CIN) == shapeof(COUT)`. * `elementof(A) == elementof(B)`. -* `elemkindof\(COUT) == elemkindof(A)`. -* `elementof\(COUT) == elementof(A)` or `elementof(A)` is promotable to - `elementof\(COUT)` without a loss of meaning. +* `elemkindof(COUT) == elemkindof(A)`. +* `elementof(COUT) == elementof(A)` or `elementof(A)` is promotable to + `elementof(COUT)` without a loss of meaning. // E.g. cl_int -> cl_uint: loses meaning of negative values. 
| *Kernel mode properties* a| This DBK accepts the following kernel properties: -* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR -* CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR +* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP +* CL_DBK_PROPERTY_NON_DETERMINISTIC_EXP | -| Name: *CL_DBK_MATMUL_KHR* +| Name: *CL_DBK_MATMUL_EXP* | *Kernel Attributes* a| [source,c] ---- -typedef struct cl_dbk_attributes_matmul_khr { +typedef struct cl_dbk_attributes_matmul_exp { cl_tensor_desc a; cl_tensor_desc b; cl_tensor_desc c; cl_bool trans_a; cl_bool trans_b; cl_dbk_properties kernel_props[CL_MAX_DBK_PROPERTIES]; -} cl_dbk_attributes_matmul_khr; +} cl_dbk_attributes_matmul_exp; ---- * _a_ is a tensor description for input matrix A. @@ -644,7 +644,7 @@ Second degree tensors of shape `(a, b)` are treated as third degree tensors of shape `(1, a, b)`. Operations of the matrix muliplication are performed in the precision -of the `elementof\(COUT)`. +of the `elementof(COUT)`. If an overflow occurs in the accumulation of the products, then `R` tensor's result will be undefined. @@ -665,19 +665,19 @@ a| a| This DBK accepts the following kernel properties: -* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR +* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP | -| Name: *khr_leaky_relu* +| Name: *CL_DBK_LEAKY_RELU_DBK* | *Kernel Attributes* a| [source,c] ---- -typedef struct cl_dbk_attributes_leaky_relu_khr { - cl_tensor_datatype_union_khr coefficient; +typedef struct cl_dbk_attributes_leaky_relu_exp { + cl_tensor_datatype_union_exp coefficient; cl_dbk_properties kernel_props[CL_MAX_DBK_PROPERTIES]; -} cl_dbk_attributes_leaky_relu_khr; +} cl_dbk_attributes_leaky_relu_exp; ---- * _alpha_ is a coefficient of leakage, a positive value. | *Kernel arguments* @@ -703,8 +703,8 @@ The `IN` and `OUT` tensors may be the same object. 
| *Kernel mode properties* | This DBK accepts the following kernel properties: -* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR -* CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR +* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP +* CL_DBK_PROPERTY_NON_DETERMINISTIC_EXP | *Attribute validation rules* a| @@ -781,19 +781,19 @@ cl_mem out_tensor = clCreateBufferWithProperties( ctx, {CL_MEM_TENSOR_EXP, out_desc, 0}, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE, 0, out_data.data(), &err); -cl_tensor_datatype_union_khr alpha, beta, relu_coeff; +cl_tensor_datatype_union_exp alpha, beta, relu_coeff; alpha.sf = 2.0f; beta.sf = -1.0f; relu_coeff.sf = 0.01f; -cl_dkb_attributes_gemm_khr gemm_attrs = { +cl_dkb_attributes_gemm_exp gemm_attrs = { lhs_desc, rhs_desc, out_desc, out_desc, 0, 0, alpha, beta, {} }; -gemm_attrs.kernel_props[0] = CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR; +gemm_attrs.kernel_props[0] = CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP; gemm_attrs.kernel_props[1] = 100; // in ILPs gemm_attrs.kernel_props[2] = 0; -cl_dkb_attributes_leaky_relu_khr relu_attrs = { +cl_dkb_attributes_leaky_relu_exp relu_attrs = { out_desc, out_desc, relu_coeffs, {0} }; @@ -801,7 +801,7 @@ cl_device_id target_devices[2] = {dev1, dev2}; cl_int device_errcodes[2]; auto prog = clCreateProgramWithDefinedBuiltInKernels( ctx, 2, target_devices, 2, - {CL_DBK_GEMM_KHR, CL_DBK_LEAKY_RELU_KHR}, {"my_gemm", "my_relu"}, + {CL_DBK_GEMM_EXP, CL_DBK_LEAKY_RELU_EXP}, {"my_gemm", "my_relu"}, {&gemm_attrs, &relu_attrs}, &device_errcodes, &err); std::vector supported_devs; @@ -811,8 +811,8 @@ for (unsigned i = 0; i < 2; i++) { } else { // Handle errors. Possible error cases (non-exhaustive): // - // * CL_DBK_UNSUPPORTED_KHR: The DBK is not supported on the device. - // * CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR The DBK implementation does not + // * CL_DBK_UNSUPPORTED_EXP: The DBK is not supported on the device. + // * CL_DBK_UNMET_MAX_RELATIVE_ERROR_EXP The DBK implementation does not // meet the requested precision. 
} } @@ -852,21 +852,25 @@ clEnqueueMapBuffer( . Should the NDRange be used at all in DBKs? It feels sort of unnatural as typically the NDRange is used to imply SPMD parallelism while the hardware/firmware is free to choose whatever parallelization strategy to implement the function. On the other hand, similar applies to software kernel launches as the NDRange-launched work-items can be executed serially if adhering to barrier semantics. + -- -*UNRESOLVED* +*RESOLVED*. Decided to go forward without NDRange (and global offset + as consequence), as there are currently no known uses for the + NDRange, and let OpenCL implementations decide the parallelization + strategy. -- . Different accelerators prefer different channel orders (NHWC vs. NCHW...) for the processed data. Should the channel order be passed as a DBK argument (like in the example GEMM's row/column order) or is it better to have different DBK variations for each? + -- -*UNRESOLVED* +*RESOLVED*. The memory layout information is a property of the tensors so + there is no need for DBK arguments for the layout or DBK variants. -- -. How to denote preference? Some of the DBKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with unoptimal channel order in NN accelerators) might be still beneficially accelerated. +. How to denote tensors' memory layout preference? Some of the DBKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with unoptimal channel order in NN accelerators) might be still beneficially accelerated. + -- -*UNRESOLVED* +*UNRESOLVED*. -- @@ -917,9 +921,17 @@ tensor type. Add sample code. Henry Linjamäki + Pekka Jääskeläinen + Freddie Witherden a| -* Rework document structure match to the cl_khr_extension_template. +* Rework document structure match to the cl_exp_extension_template. 
* Reflect changes of the `cl_exp_tensor` extension here. * Add "Kernel Interface" section into the DBK Appendix. * Add GEMM DBK. * Change DBK creation interface. + +| 0.3.1 | 2024-8-22 | +Henry Linjamäki + +Pekka Jääskekäinen + +RABijl (@GitHub) a| +* Rename extension name from 'khr' to 'exp'. +* Resolve two open questions. +* Small fixes. |==== diff --git a/extensions/cl_exp_defined_builtin_kernels.html b/extensions/cl_exp_defined_builtin_kernels.html new file mode 100644 index 000000000..49807f978 --- /dev/null +++ b/extensions/cl_exp_defined_builtin_kernels.html @@ -0,0 +1,1936 @@ + + + + + + + +cl_exp_defined_builtin_kernels + + + + + + + +
    The purpose of this extension is to provide a standardized set of
    built-in kernels with well-defined semantics useful for accelerating
    applications from various domains. The extension specification is
    designed to rapidly expand and "live" via addition of new well-defined
    built-in kernel definitions and updating of previously defined ones.

    XXX - Not complete yet!!!

    Name Strings

    cl_exp_defined_builtin_kernels

    Contact

    TODO

    Contributors

    Pekka Jääskeläinen, Intel and Tampere University.
    Topi Leppänen, Tampere University.
    Jan Solanti, Tampere University.
    Ben Ashbaugh, Intel.
    Henry Linjamäki, Intel.

    Notice

    TODO

    Status

    Draft spec, NOT APPROVED!!

    Version

    Built On: 2024-08-22
    Version: 0.3.0

    Dependencies

    This extension is written against the OpenCL Specification version 3.0.12.

    This extension requires OpenCL 1.2 or later.

    This extension requires cl_exp_tensor.

    Overview

    OpenCL 1.2 specifies a built-in kernel as a kernel that is executed on an
    OpenCL device or custom device by fixed-function hardware or in firmware.
    Applications can query the built-in kernels supported by a device or
    custom device.

    Built-in kernels are referred to by a name (a C string) without any
    semantics attached to the functionality. The semantics behind the name
    are completely device specific, typically documented in vendor-specific
    extension specifications.

    The goal for this extension is to lower the bar for utilizing hardware
    accelerated functions in drivers by providing a library of well-defined
    built-in kernels with good coverage for common acceleration needs and
    which is designed to easily evolve over time.

    The device drivers that implement this extension can freely choose which
    subset of defined built-in kernels (DBKs) they implement and advertise to
    the clients. The clients can use the DBKs to accelerate their
    applications by manually invoking the DBKs. The extension is designed to
    also support automated task graph lowering tooling later.

    Background

    ASIC-based coarse-grained hardware accelerators are specialized logic
    meant to speed up execution of workloads of interest, or to provide
    improvements in energy-efficiency. Examples of contemporary workloads
    that are beneficially hardware accelerated over software-based
    implementations include video coding, deep learning, cryptography,
    software-defined radio and graphics rendering.

    FPGAs form a special case somewhere between instruction-set architectures
    and fixed-function hardware accelerators. While advances in high-level
    synthesis tools have attempted to bridge the programmability gap between
    GPU and FPGA programming, FPGAs are still considered devices with which
    it is challenging to achieve efficient implementations. Due to the
    extensive manual optimization work required for efficient implementations
    of the accelerated functionality, defining FPGA designs as a system of
    "hardware accelerator IPs" is still a widely used "application
    abstraction". FPGAs can thus be seen as a platform that can realize and
    integrate any hardware accelerator implementable with the programmable
    fabric.

    The means to utilize hardware accelerators have typically been
    vendor-specific and abstracted behind domain-specific libraries. The
    overhead of the "bunch of libraries" approach is seen in the lowest level
    of integration: the libraries utilize a low-level library (typically
    vendor-specific) to interface with the actual hardware, and thus do not
    integrate efficiently with other libraries or software-programmable
    processors that might be available on the same chip.

    Rationale

    OpenCL's built-in kernel abstraction allows pushing both hardware
    accelerated and software-defined kernels to the same command-queues,
    providing a powerful means for asynchronous execution of heterogeneous
    task graphs on diverse heterogeneous platforms. The ability to invoke
    hardware accelerators while being able to synchronize and optimize data
    transfers at the lowest levels of the driver stack can provide
    significant latency benefits, especially when combined with the
    command-buffering mechanism.

    However, the built-in kernel abstraction works well only when it is
    widely adopted by vendors, and when multiple vendors implement the same
    definitions. Otherwise each vendor specifies and implements their own
    built-in kernels closely matching their own hardware accelerator
    properties, resulting in a lack of cross-vendor portability in the API
    abstraction presented to the upper layers of heterogeneous computing
    software stacks.

    This extension standardizes a set of well-defined built-in kernels the
    clients can call from higher level programming stacks built with
    different languages and multiple libraries, possibly mixing accelerator
    calls with calls to software kernel commands, and relying on the driver
    stack to optimize the execution (especially the synchronization and
    communication) as a low level heterogeneous task graph. The heterogeneous
    task graph can be described using multiple command-queues and optionally
    cached using the command buffer extension (cl_khr_command_buffer). The
    extension aims to promote the use of built-in kernels as a programming
    model for hardware accelerated functionality, to improve cross-vendor
    portability of hardware accelerated computing.

    New API Functions

    +
    +
    +
    +
    #define CL_MAX_DBK_PROPERTIES 16

    cl_program clCreateProgramWithDefinedBuiltInKernels(
        cl_context           context,
        cl_uint              num_devices,
        const cl_device_id*  device_list,
        cl_uint              num_kernels,
        const cl_dbk_id_exp* kernel_ids,
        const char**         kernel_names,
        const void**         kernel_attributes,
        cl_int*              device_errcode_ret,
        cl_int*              errcode_ret);
    +
    +
    +
    +
    +
    +

    New API Types

    +
    +
    +
    +
    typedef cl_uint       cl_dbk_id_exp;
    typedef cl_properties cl_dbk_properties_exp;

    typedef union {
        cl_char    sc;
        cl_uchar   uc;
        cl_short   ss;
        cl_ushort  us;
        cl_int     si;
        cl_uint    ui;
        cl_long    sl;
        cl_ulong   ul;
        cl_half    fh;
        cl_float   ff;
        cl_double  fd;
        void*      raw;
    } cl_tensor_datatype_union_exp;

    typedef struct cl_dbk_attributes_matmul_exp {
        cl_tensor_desc                a;
        cl_tensor_desc                b;
        cl_tensor_desc                c;
        cl_bool                       trans_a;
        cl_bool                       trans_b;
        cl_dbk_properties_exp         kernel_props[CL_MAX_DBK_PROPERTIES];
    } cl_dbk_attributes_matmul_exp;

    typedef struct cl_dbk_attributes_gemm_exp {
        cl_tensor_desc                a;
        cl_tensor_desc                b;
        cl_tensor_desc                c_in;
        cl_tensor_desc                c_out;
        cl_bool                       trans_a;
        cl_bool                       trans_b;
        cl_tensor_datatype_union_exp  alpha;
        cl_tensor_datatype_union_exp  beta;
        cl_dbk_properties_exp         kernel_props[CL_MAX_DBK_PROPERTIES];
    } cl_dbk_attributes_gemm_exp;

    typedef struct cl_dbk_attributes_leaky_relu_exp {
        cl_tensor_datatype_union_exp   coefficient;
        cl_dbk_properties_exp          kernel_props[CL_MAX_DBK_PROPERTIES];
    } cl_dbk_attributes_leaky_relu_exp;
    +
    +
    +
    +
    +
    +
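    The datatype union above lets a single attribute field (such as GEMM's alpha and beta) carry a scalar in whatever element type the relevant tensor uses; the member written must match that element type. Below is a minimal sketch of the pattern in plain C, using standard types as stand-ins for the `cl_*` typedefs. All names in this sketch are illustrative and not part of the extension.

    ```c
    #include <stdint.h>

    /* Illustrative analog of cl_tensor_datatype_union_exp; the real union
       uses the OpenCL scalar typedefs (cl_char, cl_float, ...). */
    typedef union {
        int8_t   sc;
        int32_t  si;
        float    ff;
        double   fd;
        void    *raw;
    } datatype_union;

    /* Write and read back through the member matching an FP32 tensor.
       The implementation reads the same member the client wrote. */
    float roundtrip_fp32(float v) {
        datatype_union u;
        u.ff = v; /* FP32 element type -> the .ff member */
        return u.ff;
    }
    ```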

    New API Enums

    +
    +
    +

    Accepted values for cl_dbk_id_exp:

    +
    +
    +
    +
    CL_DBK_MATMUL_EXP      0x????
    CL_DBK_GEMM_EXP        0x????
    CL_DBK_LEAKY_RELU_EXP  0x????
    +
    +
    +
    +

    Accepted values for cl_dbk_properties_exp:

    +
    +
    +
    +
    CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP  0x????
    CL_DBK_PROPERTY_NON_DETERMINISTIC_EXP   0x????
    +
    +
    +
    +

    New error codes:

    +
    +
    +
    +
    CL_DBK_UNSUPPORTED_EXP                0x????
    CL_DBK_UNSUPPORTED_PROPERTY_EXP       0x????
    CL_DBK_INVALID_ATTRIBUTE_EXP          0x????
    CL_DBK_UNMET_MAX_RELATIVE_ERROR_EXP   0x????
    +
    +
    +
    +
    +
    +

    Modifications to the OpenCL Specification

    +
    +
    +
    +
    (Add the following to section 5.8.1, Creating Program Objects)
    +
    +
    +
    +
    +

    To create a program object for a context and to load the information related to the defined built-in kernels into that object, call the function:

    +
    +
    +
    +
    cl_program clCreateProgramWithDefinedBuiltInKernels(
        cl_context           context,
        cl_uint              num_devices,
        const cl_device_id*  device_list,
        cl_uint              num_kernels,
        const cl_dbk_id_exp* kernel_ids,
        const char**         kernel_names,
        const void**         kernel_attributes,
        cl_int*              device_errcode_ret,
        cl_int*              errcode_ret);
    +
    +
    +
    +
      +
    • +

      context must be a valid OpenCL context.

      +
    • +
    • +

      num_devices is the number of elements in the device_list and device_errcode_ret lists.

      +
    • +
    • +

      device_list is a pointer to a list of devices that are in context. device_list must be a non-NULL value. The defined built-in kernels are loaded for the devices specified in this list.

      +
    • +
    • +

      num_kernels is the number of elements in the kernel_ids, kernel_names and kernel_attributes lists.

      +
    • +
    • +

      kernel_ids is the list of defined built-in kernels to be loaded into the program.

      +
    • +
    • +

      kernel_names is a list of names given for each kernel listed in kernel_ids. Each string in the list must be non-NULL and unique.

      +
    • +
    • +

      kernel_attributes is a list of pointers that point to the respective attribute structure of each defined built-in kernel in the kernel_ids list. The respective attribute structures for each kernel identifier are listed in Appendix TODO.

      +
    • +
    • +

      device_errcode_ret will return an appropriate error code per device. If device_errcode_ret is NULL, no per-device error codes are returned.

      +
    • +
    • +

      errcode_ret will return an appropriate error code. If errcode_ret is NULL, no error code is returned.

      +
    • +
    +
    +
    +

    The devices associated with the program object will be the list of devices specified by device_list or a subset of it. The devices specified by device_list must be associated with context.

    +
    +
    +

    clCreateProgramWithDefinedBuiltInKernels returns a valid non-zero program object and errcode_ret is set to CL_SUCCESS if the program object is created successfully. The returned program is created for the devices that support the requested built-in kernels, indicated by CL_SUCCESS in the device_errcode_ret list. In case of a failure to create the program for a device, one of the following error codes is set in the device_errcode_ret list for the respective device:

    +
    +
    +
      +
    • +

      CL_DBK_UNSUPPORTED_EXP if the device does not support one of the built-in kernels listed in kernel_ids.

      +
    • +
    • +

      CL_INVALID_PROPERTY if a property list for a defined built-in kernel description is invalid.

      +
    • +
    • +

      CL_DBK_UNMET_MAX_RELATIVE_ERROR_EXP if a defined built-in kernel does not meet the requested precision.

      +
    • +
    • +

      CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.

      +
    • +
    +
    +
    +

    If a program object is not created, clCreateProgramWithDefinedBuiltInKernels returns a NULL value with one of the following error codes returned in errcode_ret:

    +
    +
    +
      +
    • +

      CL_INVALID_CONTEXT if context is not a valid context.

      +
    • +
    • +

      CL_INVALID_VALUE if device_list is NULL or num_devices is zero.

      +
    • +
    • +

      CL_INVALID_VALUE if a kernel name is not unique within kernel_names.

      +
    • +
    • +

      CL_INVALID_VALUE if there is a NULL value in kernel_names.

      +
    • +
    • +

      CL_INVALID_DBK_ID_EXP if any value in kernel_ids is not a known identifier for a built-in kernel.

      +
    • +
    • +

      CL_DBK_INVALID_ATTRIBUTE_EXP if a kernel attribute structure is invalid for a built-in kernel.

      +
    • +
    • +

      CL_DBK_UNSUPPORTED_EXP if device_errcode_ret is NULL and any device in device_list does not support a defined built-in kernel.

      +
    • +
    • +

      CL_DBK_UNSUPPORTED_EXP if device_errcode_ret is non-NULL and no device in device_list supports all of the requested defined built-in kernels.

      +
    • +
    • +

      CL_DBK_UNSUPPORTED_PROPERTY_EXP if a kernel is passed a property it does not accept.

      +
    • +
    • +

      CL_INVALID_DEVICE if any device in device_list is not in the list of devices associated with context.

      +
    • +
    • +

      CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.

      +
    • +
    • +

      CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.

      +
    • +
    +
    +
    +
    +
    +
    (Modify section 5.10, Executing Kernels)
    +
    +
    +
    +
    +
    +
    (Add following to clEnqueueNDRangeKernel)
    +
    +
    +
    +
    +
    +

    For defined built-in kernels, the work_dim, global_work_offset, global_work_size and local_work_size parameters are meaningless and must be set to zero and NULL, respectively. OpenCL implementations decide how to distribute the workloads of the defined built-in kernels.

    +
    +
    +
    +
    +
    +
    +
    +
    +
    (Add the following to the list of error codes returned by clEnqueueNDRangeKernel)
    +
    +
    +
    +
    +
    +
      +
    • +

      CL_INVALID_GLOBAL_WORK_SIZE if the kernel is a defined built-in kernel and global_work_size is not NULL.

      +
    • +
    • +

      CL_INVALID_GLOBAL_WORK_OFFSET if the kernel is a defined built-in kernel and global_work_offset is not NULL.

      +
    • +
    • +

      CL_INVALID_LOCAL_WORK_SIZE if the kernel is a defined built-in kernel and local_work_size is not NULL.

      +
    • +
    +
    +
    +
    + +
    +
    +
    +

    Add new appendix "Defined Built-in Kernels" to OpenCL API Specification

    +
    +

    This appendix describes standard defined built-in kernels (DBKs) with well-defined semantics. They are loaded into a program using clCreateProgramWithDefinedBuiltInKernels, and the kernels in it are launched using clEnqueueNDRangeKernel with work_dim set to zero and global_work_offset, global_work_size and local_work_size set to NULL.

    +
    +
    +

    The general client-side abstraction of a DBK is similar to a call to a C function whose implementation is hidden. Device drivers are free to implement a DBK by invoking one or more coarse- and fine-grained hardware accelerators, combined with firmware, to implement the semantics as efficiently as possible.

    +
    +
    +

    It is the driver’s responsibility to handle efficient synchronization and communication with the hardware accelerator, internal accelerator state management, and resource sharing across multiple OpenCL contexts.

    +
    +
    +

    Reproducibility

    +
    +

    Identical DBKs, or the same DBK executed repeatedly with identical inputs, are guaranteed to produce identical results, unless otherwise stated in the DBK’s description, when:

    +
    +
    +
      +
    • +

      enqueued to the same device,

      +
    • +
    • +

      on the same platform,

      +
    • +
    • +

      with the same vendor driver version, and

      +
    • +
    • +

      the CL_DBK_PROPERTY_NON_DETERMINISTIC_EXP property is not set.

      +
    • +
    +
    +
    +

    In other cases, the DBKs may produce different results; for example, different algorithms may be used across devices. Two DBKs for a device are considered identical if they are created with identical kernel identifiers, kernel attributes and kernel properties.

    +
    +
    +

    DBKs may produce approximate results, and the error with respect to the infinitely precise result can optionally be controlled with CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP when the property name is listed in the DBK’s description. When the application does not control the precision using CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP, the precision of the results is

    +
    +
    +
      +
    • +

      chosen by the implementation for floating-point based tasks.

      +
    • +
    • +

      exact for integer based tasks.

      +
    • +
    +
    +
    +
    +

    Kernel Interface

    +
    +

    DBKs operate on tensor objects, created with clCreateBufferWithProperties using the CL_MEM_TENSOR property, generally in single static assignment fashion. Kernel arguments used for reading and writing tensors may not reference the same tensor object unless otherwise stated in the DBK descriptions.

    +
    +
    +
    +

    The Defined Built-in Kernels

    +
    +

    The recognized defined built-in kernels are listed in the following table. The list is expected to expand and be updated over future versions of this extension while preserving backwards compatibility.

    +
    +
    +

    Each defined built-in kernel entry is organized as follows:

    +
    +
    +
      +
    • +

      Name: Name of the defined built-in kernel (an enumeration).

      +
    • +
    • +

      Kernel attributes: The kernel attributes required for creating the defined built-in kernel via clCreateProgramWithDefinedBuiltInKernels. Attribute values are immutable.

      +
    • +
    • +

      Kernel arguments: The kernel arguments.

      +
    • +
    • +

      Description: The description of the kernel in detail.

      +
    • +
    • +

      Attribute validation rules: Conditions on the kernel attributes. The implementation must return CL_DBK_INVALID_ATTRIBUTE_EXP from the clCreateProgramWithDefinedBuiltInKernels call if any of the conditions are violated.

      +
    • +
    • +

      Kernel mode properties: List of kernel properties (cl_dbk_properties_exp) the kernel may accept. The properties can be used to tweak certain implementation details and behaviors of the kernel execution. If a property not listed in the DBK description is passed to the clCreateProgramWithDefinedBuiltInKernels call, the implementation must return CL_DBK_UNSUPPORTED_PROPERTY_EXP.

      +
    • +
    +
    Table 1. Table of defined built-in kernel properties

    [cols="2,1,3",options="header"]
    |====
    | DBK Mode Property | Property Value | Description
    | CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP | float | Require that the DBK produces results which deviate no more than the given number of ULPs (units in the last place) from the infinitely precise result.
    | CL_DBK_PROPERTY_NON_DETERMINISTIC_EXP | cl_bool | Allow the results of the kernel to be non-reproducible. This allows the implementation to switch the kernel’s algorithm on each launch for possibly better performance.
    |====
    Table 2. Standard Built-in Kernels and Their Semantics. The table has been populated with a small set of non-trivial example entries, which are subject to change; the list is expected to expand during drafting.

    Name: CL_DBK_GEMM_EXP

    Kernel Attributes

    +
    +
    typedef struct cl_dbk_attributes_gemm_exp {
        cl_tensor_desc a;
        cl_tensor_desc b;
        cl_tensor_desc c_in;
        cl_tensor_desc c_out;
        cl_bool trans_a;
        cl_bool trans_b;
        cl_tensor_datatype_union_exp alpha;
        cl_tensor_datatype_union_exp beta;
        cl_dbk_properties_exp kernel_props[CL_MAX_DBK_PROPERTIES];
    } cl_dbk_attributes_gemm_exp;
    +
    +
    +
    +
      +
    • +

      a is a tensor description for input matrix A.

      +
    • +
    • +

      b is a tensor description for input matrix B.

      +
    • +
    • +

      c_in is a tensor description for input matrix CIN.

      +
    • +
    • +

      c_out is a tensor description for output matrix COUT.

      +
    • +
    • +

      trans_a instructs the implementation to transpose matrix A if the value is CL_TRUE.

      +
    • +
    • +

      trans_b instructs the implementation to transpose matrix B if the value is CL_TRUE.

      +
    • +
    • +

      alpha is a value, or a pointer to a value, corresponding to the element type of c_out.

      +
    • +
    • +

      beta is a value, or a pointer to a value, corresponding to the element type of c_out.

      +
    • +
    • +

      kernel_props defines additional kernel properties.

      +
    • +
    +

    Kernel Arguments

    +
      +
    1. +

      cl_mem: a tensor object for matrix A (read only).

      +
    2. +
    3. +

      cl_mem: a tensor object for matrix B (read only).

      +
    4. +
    5. +

      cl_mem: a tensor object for matrix C_IN (read only).

      +
    6. +
    7. +

      cl_mem: a tensor object for matrix C_OUT (write only).

      +
    8. +
    +

    Description

    +

    Performs (batched) general matrix multiplication:

    +
    +
    +
    \$bb"COUT"_(b,m,n) = "beta" * bb"CIN"_(b,m,n) + "alpha" * sum_(k)trans(bb"A", "trans_a")_(b,m,k)trans(bb"B", "trans_b")_(b,k,n)\$
    +
    +
    +

    Where:

    +
    +
    +
    \$trans(X_(b,i,j), tr) = {(X_(b,j,i), "if tr" = "CL_TRUE"), (X_(b,i,j), "otherwise") :}\$
    +
    +
    +

    Second degree tensors of shape (a, b) are treated as third degree tensors of shape (1, a, b).

    +
    +
    +

    Operations of the matrix multiplication are performed in the precision of elementof(COUT).

    +
    +
    +

    If an overflow occurs in the accumulation of the products, the COUT tensor’s result will be undefined.

    +
    +
    +

    CIN and COUT tensors may be the same object.
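    For clarity, the formula above can be mirrored by a plain scalar loop. The following is an illustrative C reference sketch of the GEMM semantics, assuming dense row-major float tensors; the function and parameter names are mine and not part of the extension.

    ```c
    #include <stddef.h>

    /* Element of X (logical shape (batch, rows, cols)) at [b,i,j]. When
       trans != 0, the stored tensor has its last two dims swapped, so the
       logical element [b,i,j] lives at stored index [b,j,i]. */
    float elem(const float *x, size_t rows, size_t cols,
               size_t b, size_t i, size_t j, int trans) {
        return trans ? x[(b * cols + j) * rows + i]   /* stored (b, cols, rows) */
                     : x[(b * rows + i) * cols + j];  /* stored (b, rows, cols) */
    }

    /* COUT[b,m,n] = beta * CIN[b,m,n] + alpha * sum_k At[b,m,k] * Bt[b,k,n] */
    void gemm_ref(const float *A, const float *B, const float *CIN, float *COUT,
                  size_t batch, size_t M, size_t N, size_t K,
                  int trans_a, int trans_b, float alpha, float beta) {
        for (size_t b = 0; b < batch; ++b)
            for (size_t m = 0; m < M; ++m)
                for (size_t n = 0; n < N; ++n) {
                    float acc = 0.0f;
                    for (size_t k = 0; k < K; ++k)
                        acc += elem(A, M, K, b, m, k, trans_a) *
                               elem(B, K, N, b, k, n, trans_b);
                    COUT[(b * M + m) * N + n] =
                        beta * CIN[(b * M + m) * N + n] + alpha * acc;
                }
    }
    ```

    A device implementation is of course free to use any algorithm and parallelization strategy that meets the precision requirements; the loop only pins down the mathematical result.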

    +

    Attribute validation rules

    +
      +
    • +

      rankof(A) == rankof(B) == rankof(CIN) == rankof(COUT).

      +
    • +
    • +

      Let shapeof(At) == (b…​, m, k) and shapeof(Bt) == (b…​, k, n) be the shapes of tensors A and B, respectively, after possible transposing. shapeof(COUT) must be (b…​, m, n).

      +
    • +
    • +

      shapeof(CIN) == shapeof(COUT).

      +
    • +
    • +

      elementof(A) == elementof(B).

      +
    • +
    • +

      elemkindof(COUT) == elemkindof(A).

      +
    • +
    • +

      elementof(COUT) == elementof(A) or elementof(A) is promotable to elementof(COUT) without a loss of meaning.

      +
    • +
    +

    Kernel mode properties

    +

    This DBK accepts the following kernel properties:

    +
    +
    +
      +
    • +

      CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP

      +
    • +
    • +

      CL_DBK_PROPERTY_NON_DETERMINISTIC_EXP

      +
    • +
    +

    Name: CL_DBK_MATMUL_EXP

    Kernel Attributes

    +
    +
    typedef struct cl_dbk_attributes_matmul_exp {
        cl_tensor_desc a;
        cl_tensor_desc b;
        cl_tensor_desc c;
        cl_bool trans_a;
        cl_bool trans_b;
        cl_dbk_properties_exp kernel_props[CL_MAX_DBK_PROPERTIES];
    } cl_dbk_attributes_matmul_exp;
    +
    +
    +
    +
      +
    • +

      a is a tensor description for input matrix A.

      +
    • +
    • +

      b is a tensor description for input matrix B.

      +
    • +
    • +

      c is a tensor description for output matrix C.

      +
    • +
    • +

      trans_a instructs the implementation to transpose matrix A if the value is CL_TRUE.

      +
    • +
    • +

      trans_b instructs the implementation to transpose matrix B if the value is CL_TRUE.

      +
    • +
    • +

      kernel_props defines additional kernel properties.

      +
    • +
    +

    Kernel Arguments

    +
      +
    1. +

      cl_mem: a tensor object for matrix A (read only).

      +
    2. +
    3. +

      cl_mem: a tensor object for matrix B (read only).

      +
    4. +
    5. +

      cl_mem: a tensor object for matrix C (write only).

      +
    6. +
    +

    Description

    +

    Performs (batched) matrix multiplication:

    +
    +
    +
    \$bb"C"_(b,m,n) = sum_(k)trans(bb"A", "trans_a")_(b,m,k)trans(bb"B", "trans_b")_(b,k,n)\$
    +
    +
    +

    Where:

    +
    +
    +
    \$trans(X_(b,i,j), tr) = {(X_(b,j,i), "if tr" = "CL_TRUE"), (X_(b,i,j), "otherwise") :}\$
    +
    +
    +

    Second degree tensors of shape (a, b) are treated as third degree tensors of shape (1, a, b).

    +
    +
    +

    Operations of the matrix multiplication are performed in the precision of elementof(C).

    +
    +
    +

    If an overflow occurs in the accumulation of the products, the C tensor’s result will be undefined.

    +

    Attribute validation rules

    +
      +
    • +

      rankof(A) == rankof(B) == rankof(C).

      +
    • +
    • +

      Let shapeof(At) == (b…​, m, k) and shapeof(Bt) == (b…​, k, n) be the shapes of tensors A and B, respectively, after possible transposing. shapeof(C) must be (b…​, m, n).

      +
    • +
    • +

      elementof(A) == elementof(B).

      +
    • +
    • +

      elemkindof(C) == elemkindof(A).

      +
    • +
    • +

      elementof(C) == elementof(A) or elementof(A) is promotable to elementof(C) without a loss of meaning.

      +
    • +
    +

    Kernel mode properties

    +

    This DBK accepts the following kernel properties:

    +
    +
    +
      +
    • +

      CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP

      +
    • +
    +

    Name: CL_DBK_LEAKY_RELU_EXP

    Kernel Attributes

    +
    +
    typedef struct cl_dbk_attributes_leaky_relu_exp {
       cl_tensor_datatype_union_exp coefficient;
       cl_dbk_properties_exp kernel_props[CL_MAX_DBK_PROPERTIES];
    } cl_dbk_attributes_leaky_relu_exp;
    +
    +
    +
    +
      +
    • +

      coefficient is the coefficient of leakage ("alpha" in the formula below), a positive value.

      +
    • +
    +

    Kernel arguments

    +
      +
    1. +

      cl_mem: a tensor object IN for input values.

      +
    2. +
    3. +

      cl_mem: a tensor object OUT for output value.

      +
    4. +
    +

    Description

    +

    This element-wise built-in kernel performs a leaky ReLU operation as follows:

    +
    +
    +
    \$"OUT"_(i) = {("alpha" * "IN"_(i), "if IN"_(i) \lt 0), ("IN"_(i), " otherwise") :}\$
    +
    +
    +

    If the target device does not support denormals, the alpha value is flushed to zero before the operation is applied. This DBK accepts tensors of arbitrary rank.

    +
    +
    +

    The IN and OUT tensors may be the same object.
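    The element-wise semantics above can likewise be written as a scalar reference loop. An illustrative C sketch follows; the names are mine and not part of the extension, and IN and OUT may alias, matching the description above.

    ```c
    #include <stddef.h>

    /* OUT[i] = alpha * IN[i] if IN[i] < 0, else IN[i].
       `in` and `out` may point to the same storage. */
    void leaky_relu_ref(const float *in, float *out, size_t count, float alpha) {
        for (size_t i = 0; i < count; ++i)
            out[i] = (in[i] < 0.0f) ? alpha * in[i] : in[i];
    }
    ```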

    +

    Kernel mode properties

    This DBK accepts the following kernel properties:

    +

    * CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP
    * CL_DBK_PROPERTY_NON_DETERMINISTIC_EXP

    Attribute validation rules

    +
      +
    • +

      shapeof(in) == shapeof(out).

      +
    • +
    • +

      elementof(in) == elementof(out).

      +
    • +
    • +

      coefficient must be a positive, finite value.

      +
    • +
    +
    +
    +
    +

    Launching DBKs from the Device Side

    +
    +

    DBKs are primarily meant to be launched as kernel commands via host-side command-queues. Optionally, they can be callable from the device side via enqueue_kernel:

    +
    +
    +

    TBC. This probably needs a device-side function corresponding to clCreateProgramWithDefinedBuiltInKernels.

    +
    +
    +
    +
    +
    +
    +

    Sample Code

    +
    +
    +
    +
    constexpr size_t b = 64, m = 100, n = 200, k = 50;
    cl_int err;

    std::vector<float> lhs_data = ...;
    std::vector<float> rhs_data = ...;
    std::vector<float> bias_data = ...;
    std::vector<float> out_data(b * m * n);

    cl_tensor_layout_blas_exp row_major;
    row_major.leading_dims[0] = 2;
    row_major.leading_dims[1] = 1;

    cl_tensor_desc_exp lhs_desc;
    lhs_desc.rank = 3;
    lhs_desc.dtype = CL_TENSOR_FP32_EXP;
    lhs_desc.properties[0] = 0;
    lhs_desc.shape[0] = b;
    lhs_desc.shape[1] = m;
    lhs_desc.shape[2] = k;
    lhs_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;
    lhs_desc.layout = &row_major;

    cl_tensor_desc_exp rhs_desc;
    rhs_desc.rank = 3;
    rhs_desc.dtype = CL_TENSOR_FP32_EXP;
    rhs_desc.properties[0] = 0;
    rhs_desc.shape[0] = b;
    rhs_desc.shape[1] = k;
    rhs_desc.shape[2] = n;
    rhs_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;
    rhs_desc.layout = &row_major;

    cl_tensor_desc_exp out_desc;
    out_desc.rank = 3;
    out_desc.dtype = CL_TENSOR_FP32_EXP;
    out_desc.properties[0] = 0;
    out_desc.shape[0] = b;
    out_desc.shape[1] = m;
    out_desc.shape[2] = n;
    out_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;
    out_desc.layout = &row_major;

    cl_mem lhs_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, lhs_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, lhs_data.data(), &err);
    cl_mem rhs_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, rhs_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, rhs_data.data(), &err);
    cl_mem bias_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, out_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, bias_data.data(), &err);
    cl_mem out_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, out_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE, 0, out_data.data(), &err);

    cl_tensor_datatype_union_exp alpha, beta, relu_coeff;
    alpha.ff = 2.0f;
    beta.ff = -1.0f;
    relu_coeff.ff = 0.01f;

    cl_dbk_attributes_gemm_exp gemm_attrs = {
      lhs_desc, rhs_desc, out_desc, out_desc, CL_FALSE, CL_FALSE, alpha, beta, {}
    };
    gemm_attrs.kernel_props[0] = CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_EXP;
    gemm_attrs.kernel_props[1] = 100; // in ULPs
    gemm_attrs.kernel_props[2] = 0;

    cl_dbk_attributes_leaky_relu_exp relu_attrs = {relu_coeff, {0}};

    cl_device_id target_devices[2] = {dev1, dev2};
    cl_int device_errcodes[2];
    cl_dbk_id_exp kernel_ids[2] = {CL_DBK_GEMM_EXP, CL_DBK_LEAKY_RELU_EXP};
    const char* kernel_names[2] = {"my_gemm", "my_relu"};
    const void* kernel_attrs[2] = {&gemm_attrs, &relu_attrs};
    auto prog = clCreateProgramWithDefinedBuiltInKernels(
      ctx, 2, target_devices, 2, kernel_ids, kernel_names,
      kernel_attrs, device_errcodes, &err);

    std::vector<cl_device_id> supported_devs;
    for (unsigned i = 0; i < 2; i++) {
      if (device_errcodes[i] == CL_SUCCESS) {
        supported_devs.push_back(target_devices[i]);
      } else {
         // Handle errors. Possible error cases (non-exhaustive):
         //
         // * CL_DBK_UNSUPPORTED_EXP: The DBK is not supported on the device.
         // * CL_DBK_UNMET_MAX_RELATIVE_ERROR_EXP: The DBK implementation does
         //   not meet the requested precision.
      }
    }

    err = clBuildProgram(
      prog, supported_devs.size(), supported_devs.data(), "", nullptr, nullptr);

    auto gemm_kernel = clCreateKernel(prog, "my_gemm", &err);
    clSetKernelArg(gemm_kernel, 0, sizeof(cl_mem), &lhs_tensor);
    clSetKernelArg(gemm_kernel, 1, sizeof(cl_mem), &rhs_tensor);
    clSetKernelArg(gemm_kernel, 2, sizeof(cl_mem), &bias_tensor);
    clSetKernelArg(gemm_kernel, 3, sizeof(cl_mem), &out_tensor);

    auto relu_kernel = clCreateKernel(prog, "my_relu", &err);
    clSetKernelArg(relu_kernel, 0, sizeof(cl_mem), &out_tensor);
    clSetKernelArg(relu_kernel, 1, sizeof(cl_mem), &out_tensor);

    cl_command_queue cmd_q = /* Create an in-order command queue. */;

    clEnqueueNDRangeKernel(
      cmd_q, gemm_kernel, 0, nullptr, nullptr, nullptr, 0, nullptr, nullptr);
    clEnqueueNDRangeKernel(
      cmd_q, relu_kernel, 0, nullptr, nullptr, nullptr, 0, nullptr, nullptr);
    clEnqueueMapBuffer(
      cmd_q, out_tensor, CL_TRUE, CL_MAP_READ, 0,
      b * m * n * sizeof(float), 0, nullptr, nullptr, &err);
    +
    +
    +
    +

    Open questions

    +
    +
      +
    1. +

      Should we enable launching DBKs from the device side without requiring device-side enqueue? The main problem is with NDRange kernels, as they are not simple single-work-item helper functions.

      +
      +
      +
      +

      UNRESOLVED

      +
      +
      +
      +
    2. +
    3. +

      Should the NDRange be used at all in DBKs? It feels somewhat unnatural, as the NDRange typically implies SPMD parallelism, while the hardware/firmware is free to choose whatever parallelization strategy to implement the function. On the other hand, the same applies to software kernel launches, as NDRange-launched work-items can be executed serially while adhering to barrier semantics.

      +
      +
      +
      +

      RESOLVED. Decided to go forward without the NDRange (and the global offset as a consequence), as there are currently no known uses for the NDRange, and to let OpenCL implementations decide the parallelization strategy.

      +
      +
      +
      +
    4. +
    5. +

      Different accelerators prefer different channel orders (NHWC vs. NCHW…​) for the processed data. Should the channel order be passed as a DBK argument (like in the example GEMM’s row/column order) or is it better to have different DBK variations for each?

      +
      +
      +
      +

      RESOLVED. The memory layout information is a property of the tensors, so there is no need for DBK arguments for the layout or DBK variants.

      +
      +
      +
      +
    6. +
    7. +

      How to denote tensors' memory layout preference? Some of the DBKs are more efficient on a given device as they map more naturally to the underlying HW accelerator, but the slower variations (for example, with unoptimal channel order in NN accelerators) might be still beneficially accelerated.

      +
      +
      +
      +

      UNRESOLVED.

      +
      +
      +
      +
    8. +
    9. +

      Since the defined built-in kernel concept is basically just a C-like API inside another API, should it be made more generic and thus directly usable for SYCL and Vulkan as well?

      +
      +
      +
      +

      UNRESOLVED

      +
      +
      +
      +
    10. +
    11. +

      What other DBK mode properties should we have? Here are some ideas:

      +
      +
        +
      • +

        Perform accumulation with saturation.

        +
      • +
      • +

        Finite math only

        +
      • +
      • +

        Flush denormals to zero.

        +
      • +
      +
      +
      +
      +
      +

      UNRESOLVED

      +
      +
      +
      +
    12. +
    13. +

      Should we reuse (and remove the "deprecation" status of) clEnqueueTask for launching DBKs, as DBKs don’t make use of the global offset, global size and local size parameters?

      +
      +
      +
      +

      UNRESOLVED

      +
      +
      +
      +
    14. +
    +
    +
    +
    +
    +
    +

    Version History

    +
    Version | Date | Author | Description

    0.1.0

    2022-12-13

    Pekka Jääskeläinen
    +Ben Ashbaugh

    +

    First formulation as an extension specification, as proposed by Ben Ashbaugh.

    +

    0.2.0

    2023-11-23

    Henry Linjamäki
    +Pekka Jääskeläinen
    +Ben Ashbaugh

    +

    Add APIs for defined built-in kernel (DBK) creation. Model DBKs on tensor type. Add sample code.

    +

    0.3.0

    2024-08-20

    Henry Linjamäki
    +Pekka Jääskeläinen
    +Freddie Witherden

    +
      +
    • +

      Rework the document structure to match the cl_exp_extension_template.

      +
    • +
    • +

      Reflect changes of the cl_exp_tensor extension here.

      +
    • +
    • +

      Add "Kernel Interface" section into the DBK Appendix.

      +
    • +
    • +

      Add GEMM DBK.

      +
    • +
    • +

      Change DBK creation interface.

      +
    • +
    +

    0.3.1

    2024-8-22

    Henry Linjamäki
    +Pekka Jääskeläinen
    +RABijl (@GitHub)

    +
      +
    • +

      Rename the extension prefix from 'khr' to 'exp'.

      +
    • +
    • +

      Resolve two open questions.

      +
    • +
    • +

      Small fixes.

      +
    • +
    +
    +
    +
    +
    + + + + + \ No newline at end of file diff --git a/extensions/cl_khr_defined_builtin_kernels.html b/extensions/cl_khr_defined_builtin_kernels.html deleted file mode 100644 index e4188cbc7..000000000 --- a/extensions/cl_khr_defined_builtin_kernels.html +++ /dev/null @@ -1,1888 +0,0 @@ - - - - - - - -cl_khr_defined_builtin_kernels - - - - - - - -
= cl_khr_defined_builtin_kernels

The purpose of this extension is to provide a standardized set of built-in
kernels with well-defined semantics useful for accelerating applications
from various domains. The extension specification is designed to rapidly
expand and "live" via the addition of new well-defined built-in kernel
definitions and the updating of previously defined ones.

XXX - Not complete yet!!!
=== Name Strings

`cl_khr_defined_builtin_kernels`
=== Contact

TODO
=== Contributors

Pekka Jääskeläinen, Intel and Tampere University. +
Topi Leppänen, Tampere University. +
Jan Solanti, Tampere University. +
Ben Ashbaugh, Intel. +
Henry Linjamäki, Intel.

=== Notice

TODO

=== Status

Draft spec, NOT APPROVED!!

=== Version

Built On: 2024-08-20 +
Version: 0.3.0

=== Dependencies

This extension is written against the OpenCL Specification version 3.0.12.

This extension requires OpenCL 1.2 or later.

This extension requires cl_exp_tensor.

=== Overview

OpenCL 1.2 specifies a built-in kernel as a kernel that is executed on an
OpenCL device or custom device by fixed-function hardware or in firmware.
Applications can query the built-in kernels supported by a device or custom
device.

Built-in kernels are referred to by a name (a C string) without any semantics
attached to the functionality. The semantics behind the name are completely
device specific, typically documented in vendor-specific extension
specifications.

The goal of this extension is to lower the bar for utilizing hardware
accelerated functions in drivers by providing a library of well-defined
built-in kernels with good coverage of common acceleration needs, designed to
easily evolve over time.

Device drivers that implement this extension can freely choose which subset of
the defined built-in kernels (DBKs) they implement and advertise to clients.
Clients can use the DBKs to accelerate their applications by manually invoking
them. The extension is also designed to support automated task graph lowering
tooling later.

==== Background

ASIC-based coarse-grained hardware accelerators are specialized logic meant to
speed up the execution of workloads of interest, or to provide improvements in
energy efficiency. Examples of contemporary workloads that benefit from
hardware acceleration over software-based implementations include video
coding, deep learning, cryptography, software-defined radio and graphics
rendering.

FPGAs form a special case somewhere between instruction-set architectures and
fixed-function hardware accelerators. While advances in high-level synthesis
tools have attempted to bridge the programmability gap between GPU and FPGA
programming, FPGAs are still considered devices with which it is challenging
to achieve efficient implementations. Due to the extensive manual optimization
work required for efficient implementations of the accelerated functionality,
defining FPGA designs as a system of "hardware accelerator IPs" is still a
widely used application abstraction. FPGAs can thus be seen as a platform that
can realize and integrate any hardware accelerator implementable with the
programmable fabric.

The means to utilize hardware accelerators have typically been vendor-specific
and abstracted behind domain-specific libraries. The overhead of this
"bunch of libraries" approach is seen at the lowest level of integration: each
library utilizes a (typically vendor-specific) low-level library to interface
with the actual hardware, and thus does not integrate efficiently with other
libraries or software-programmable processors that might be available on the
same chip.

==== Rationale

OpenCL's built-in kernel abstraction allows pushing both hardware-accelerated
and software-defined kernels to the same command-queues, providing a powerful
means for asynchronous execution of heterogeneous task graphs on diverse
heterogeneous platforms. The ability to invoke hardware accelerators while
being able to synchronize and optimize data transfers at the lowest levels of
the driver stack can provide significant latency benefits, especially when
combined with the command-buffering mechanism.

However, the built-in kernel abstraction works well only when it is widely
adopted by vendors, and when multiple vendors implement the same definitions.
Otherwise each vendor specifies and implements their own built-in kernels
closely matching their own hardware accelerator properties, resulting in a
lack of cross-vendor portability in the API abstraction presented to the upper
layers of heterogeneous computing software stacks.

This extension standardizes a set of well-defined built-in kernels that
clients can call from higher-level programming stacks built with different
languages and multiple libraries, possibly mixing accelerator calls with calls
to software kernel commands, while relying on the driver stack to optimize the
execution (especially the synchronization and communication) as a low-level
heterogeneous task graph. The heterogeneous task graph can be described using
multiple command-queues and optionally cached using the command buffer
extension (cl_khr_command_buffer). The extension aims to promote the use of
built-in kernels as a programming model for hardware-accelerated
functionality, to improve the cross-vendor portability of hardware-accelerated
computing.

=== New API Functions

[source,c]
----
#define CL_MAX_DBK_PROPERTIES 16

cl_program clCreateProgramWithDefinedBuiltInKernels(
    cl_context           context,
    cl_uint              num_devices,
    const cl_device_id*  device_list,
    cl_uint              num_kernels,
    const cl_dbk_id_khr* kernel_ids,
    const char**         kernel_names,
    const void**         kernel_attributes,
    cl_int*              device_errcode_ret,
    cl_int*              errcode_ret);
----

=== New API Types

[source,c]
----
typedef cl_uint       cl_dbk_id_khr;
typedef cl_properties cl_dbk_properties_khr;

typedef union {
    cl_char    sc;
    cl_uchar   uc;
    cl_short   ss;
    cl_ushort  us;
    cl_int     si;
    cl_uint    ui;
    cl_long    sl;
    cl_ulong   ul;
    cl_half    fh;
    cl_float   ff;
    cl_double  fd;
    void*      raw;
} cl_tensor_datatype_union_khr;

typedef struct cl_dbk_attributes_matmul_khr {
    cl_tensor_desc                a;
    cl_tensor_desc                b;
    cl_tensor_desc                c;
    cl_bool                       trans_a;
    cl_bool                       trans_b;
    cl_dbk_properties_khr         kernel_props[CL_MAX_DBK_PROPERTIES];
} cl_dbk_attributes_matmul_khr;

typedef struct cl_dbk_attributes_gemm_khr {
    cl_tensor_desc                a;
    cl_tensor_desc                b;
    cl_tensor_desc                c_in;
    cl_tensor_desc                c_out;
    cl_bool                       trans_a;
    cl_bool                       trans_b;
    cl_tensor_datatype_union_khr  alpha;
    cl_tensor_datatype_union_khr  beta;
    cl_dbk_properties_khr         kernel_props[CL_MAX_DBK_PROPERTIES];
} cl_dbk_attributes_gemm_khr;

typedef struct cl_dbk_attributes_leaky_relu_khr {
    cl_tensor_datatype_union_khr  coefficient;
    cl_dbk_properties_khr         kernel_props[CL_MAX_DBK_PROPERTIES];
} cl_dbk_attributes_leaky_relu_khr;
----
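Scalar attribute values such as GEMM's alpha and beta travel through `cl_tensor_datatype_union_khr`, with the active member selected by the element type of the output tensor. A minimal host-side sketch of this pattern, using stand-in typedefs since the real `cl_*` scalar types come from `CL/cl.h` (the helper name is hypothetical, not part of the extension):

```cpp
#include <cstdint>

// Stand-in typedefs; the real ones come from CL/cl.h.
typedef float   cl_float;
typedef int32_t cl_int;

// Abbreviated mirror of cl_tensor_datatype_union_khr: one member per
// supported scalar element type, only one of which is active at a time.
typedef union {
    cl_int   si; // active when the output element type is a signed 32-bit int
    cl_float ff; // active when the output element type is fp32
    void*    raw;
} tensor_datatype_union;

// For an fp32 GEMM, alpha/beta would be passed through the 'ff' member.
inline tensor_datatype_union make_fp32_scalar(cl_float v) {
    tensor_datatype_union u;
    u.ff = v;
    return u;
}
```

An fp32 alpha of 2.0 would be populated via the `ff` member as above; an integer-typed DBK would populate `si` instead.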

=== New API Enums

Accepted values for cl_dbk_id_khr:

[source,c]
----
CL_DBK_MATMUL_KHR      0x????
CL_DBK_GEMM_KHR        0x????
CL_DBK_LEAKY_RELU_KHR  0x????
----

Accepted values for cl_dbk_properties_khr:

[source,c]
----
CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR  0x????
CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR   0x????
----

New error codes:

[source,c]
----
CL_DBK_UNSUPPORTED_KHR                0x????
CL_DBK_UNSUPPORTED_PROPERTY_KHR       0x????
CL_DBK_INVALID_ATTRIBUTE_KHR          0x????
CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR   0x????
----

=== Modifications to the OpenCL Specification

(Add the following to section 5.8.1, Creating Program Objects)

To create a program object for a context and to load the information related
to the defined built-in kernels into that object, call the function:

[source,c]
----
cl_program clCreateProgramWithDefinedBuiltInKernels(
    cl_context           context,
    cl_uint              num_devices,
    const cl_device_id*  device_list,
    cl_uint              num_kernels,
    const cl_dbk_id_khr* kernel_ids,
    const char**         kernel_names,
    const void**         kernel_attributes,
    cl_int*              device_errcode_ret,
    cl_int*              errcode_ret);
----

* context must be a valid OpenCL context.
* num_devices is the number of elements in the device_list and
  device_errcode_ret lists.
* device_list is a pointer to a list of devices that are in context.
  device_list must be a non-NULL value. The defined built-in kernels are
  loaded for the devices specified in this list.
* num_kernels is the number of elements in the kernel_ids, kernel_names and
  kernel_attributes lists.
* kernel_ids is the list of defined built-in kernels to be loaded into the
  program.
* kernel_names is a list of names given for each kernel listed in kernel_ids.
  Each string in the list must be non-NULL and unique.
* kernel_attributes is a list of pointers that point to the respective
  attribute structure of each defined built-in kernel in the kernel_ids list.
  The respective attribute structures for each kernel identifier are listed in
  Appendix TODO.
* device_errcode_ret will return an appropriate error code per device. If
  device_errcode_ret is NULL, no error code is returned.
* errcode_ret will return an appropriate error code. If errcode_ret is NULL,
  no error code is returned.

The devices associated with the program object will be the list of devices
specified by device_list, or a subset of it. The list of devices specified by
device_list must be devices associated with context.

clCreateProgramWithDefinedBuiltInKernels returns a valid non-zero program
object and errcode_ret is set to CL_SUCCESS if the program object is created
successfully. The returned program is created for the devices that support the
requested built-in kernels, indicated by CL_SUCCESS in the device_errcode_ret
list. In case of a failure to create the program for a device, one of the
following error codes is set in the device_errcode_ret list for the respective
device:

* CL_DBK_UNSUPPORTED_KHR if the device does not support one of the built-in
  kernels listed in kernel_ids.
* CL_INVALID_PROPERTY if a property list for a defined built-in kernel
  description is invalid.
* CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR if a defined built-in kernel does not
  meet the requested precision.
* CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by
  the OpenCL implementation on the device.

If a program object is not created,
clCreateProgramWithDefinedBuiltInKernels returns a NULL value with one of the
following error codes returned in errcode_ret:

* CL_INVALID_CONTEXT if context is not a valid context.
* CL_INVALID_VALUE if device_list is NULL or num_devices is zero.
* CL_INVALID_VALUE if a kernel name is not unique within kernel_names.
* CL_INVALID_VALUE if there is a NULL value in kernel_names.
* CL_INVALID_DBK_ID_KHR if any value in kernel_ids is not a known identifier
  for a built-in kernel.
* CL_DBK_INVALID_ATTRIBUTE_KHR if a kernel attribute structure is invalid for
  a built-in kernel.
* CL_DBK_UNSUPPORTED_KHR if device_errcode_ret is NULL and any device in
  device_list does not support a defined built-in kernel.
* CL_DBK_UNSUPPORTED_KHR if device_errcode_ret is non-NULL and no device in
  device_list supports a defined built-in kernel.
* CL_DBK_UNSUPPORTED_PROPERTY_KHR if a kernel does not accept an otherwise
  valid kernel property.
* CL_INVALID_DEVICE if any device in device_list is not in the list of devices
  associated with context.
* CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by
  the OpenCL implementation on the device.
* CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
  by the OpenCL implementation on the host.
(Modify section 5.10, Executing Kernels)

(Add the following to the description of clEnqueueNDRangeKernel)

For defined built-in kernels the work_dim, global_work_offset,
global_work_size and local_work_size parameters are meaningless and must be
set to zero and NULL, respectively. OpenCL implementations decide how they
distribute the workloads of the defined built-in kernels.
(Add the following to the list of error codes returned by clEnqueueNDRangeKernel)

* CL_INVALID_GLOBAL_WORK_SIZE if the kernel is a defined built-in kernel and
  global_work_size is not NULL.
* CL_INVALID_GLOBAL_WORK_OFFSET if the kernel is a defined built-in kernel and
  global_work_offset is not NULL.
* CL_INVALID_LOCAL_WORK_SIZE if the kernel is a defined built-in kernel and
  local_work_size is not NULL.

=== Add new appendix "Defined Built-in Kernels" to the OpenCL API Specification

This chapter describes the standard defined built-in kernels (DBKs) with
well-defined semantics. They are loaded into a program using
clCreateProgramWithDefinedBuiltInKernels, and the kernels in it are launched
using clEnqueueNDRangeKernel with work_dim set to zero and global_work_offset,
global_work_size and local_work_size set to NULL.

The general client-side abstraction of a DBK is similar to a call to a C
function whose implementation is hidden. Device drivers are free to implement
a DBK by invoking one or more coarse- and fine-grained hardware accelerators
combined with firmware to implement the semantics as efficiently as possible.

It is the driver's responsibility to handle efficient synchronization and
communication with the hardware accelerator, the internal accelerator state
management, and resource sharing across multiple OpenCL contexts.

==== Reproducibility

Identical DBKs, or the same DBK executed repeatedly with identical inputs, are
guaranteed to produce identical results, unless otherwise stated in the DBK's
description, when they are:

* enqueued to the same device,
* on the same platform,
* on the same vendor with the same driver version, and
* the CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR property is not set on.

In other cases, the DBKs may produce different results. Two DBKs for a device
are considered identical if they are created using identical kernel
identifiers, kernel attributes and kernel properties. The difference in
results may occur because of different algorithms being used across devices,
for example.

DBKs may produce approximated results, and the error with respect to the
infinitely precise result can optionally be controlled by
CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR when that property name is listed in
the DBK's description. When the precision is not controlled by the application
using CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR, the precision of the results is:

* chosen by the implementation for floating-point based tasks,
* exact for integer based tasks.

==== Kernel Interface

DBKs operate on tensor objects, created with clCreateBufferWithProperties
using the CL_MEM_TENSOR property, generally in single static assignment
fashion. Kernel arguments used for reading and writing tensors may not
reference the same tensor object unless otherwise stated in the DBK
descriptions.

==== The Defined Built-in Kernels

The recognized defined built-in kernels are listed in the following table. The
list is expected to be expanded and updated over the versions of this
extension, while preserving backwards compatibility.

Each defined built-in kernel entry is organized as follows:

* Name: Name of the defined built-in kernel (an enumeration).
* Kernel attributes: The kernel attributes required for creating the defined
  built-in kernel via clCreateProgramWithDefinedBuiltInKernels. Attribute
  values are immutable.
* Kernel arguments: The kernel arguments.
* Description: The description of the kernel in detail.
* Attribute validation rules: Conditions on the kernel attributes for the
  kernel. The implementation must return CL_DBK_INVALID_ATTRIBUTE_KHR from the
  clCreateProgramWithDefinedBuiltInKernels call if any of the conditions are
  violated.
* Kernel mode properties: List of kernel properties (cl_dbk_properties_khr)
  the kernel may accept. The properties can be used to tweak certain
  implementation details and behaviors in the kernel execution. If a property
  not listed in the DBK description is fed to the
  clCreateProgramWithDefinedBuiltInKernels call, the implementation must
  return CL_DBK_UNSUPPORTED_PROPERTY_KHR.
.Table of defined built-in kernel properties
[cols="2,1,3",options="header",]
|====
| *DBK Mode Property* | *Property Value* | *Description*

| CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR
| float
| Requires that the DBK produces results which do not deviate by more than the
given number of ULPs (units in the last place) with respect to the infinitely
precise result.

| CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR
| cl_bool
| Allows the results of the kernel to be non-reproducible. This allows the
implementation to switch the algorithm of the kernel on each launch for
possibly better performance.
|====
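The ULP-based error bound above can be made concrete with a small host-side sketch (not part of the extension): map each finite fp32 value onto a monotonic integer scale so that adjacent representable floats differ by exactly one, then count the steps between a DBK's result and the exact result.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Map a finite float's bit pattern onto a monotonic integer scale so that
// adjacent representable floats differ by exactly 1.
inline int64_t float_lex(float x) {
    int32_t i;
    std::memcpy(&i, &x, sizeof i);
    return i < 0 ? -(int64_t)(i & 0x7fffffff) : (int64_t)i;
}

// Number of representable fp32 values between a and b ("ULP distance").
// A CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR bound of N would require the
// DBK's result to stay within N ULPs of the infinitely precise result.
inline int64_t ulp_distance(float a, float b) {
    return std::llabs(float_lex(a) - float_lex(b));
}
```

For example, `ulp_distance(1.0f, std::nextafterf(1.0f, 2.0f))` is 1; an implementation advertising a 100-ULP bound for an fp32 DBK would need every output element to satisfy `ulp_distance(result, exact) <= 100`.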
Table 2. Standard Built-in Kernels and Their Semantics. The table has been
populated with a small set of non-trivial example entries which are subject to
change; the list is expected to expand during drafting.

*Name: CL_DBK_GEMM_KHR*

*Kernel Attributes*

[source,c]
----
typedef struct cl_dbk_attributes_gemm_khr {
    cl_tensor_desc a;
    cl_tensor_desc b;
    cl_tensor_desc c_in;
    cl_tensor_desc c_out;
    cl_bool trans_a;
    cl_bool trans_b;
    cl_tensor_datatype_union_khr alpha;
    cl_tensor_datatype_union_khr beta;
    cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES];
} cl_dbk_attributes_gemm_khr;
----

* a is a tensor description for input matrix A.
* b is a tensor description for input matrix B.
* c_in is a tensor description for input matrix CIN.
* c_out is a tensor description for output matrix COUT.
* trans_a instructs to transpose the A matrix if the value is CL_TRUE.
* trans_b instructs to transpose the B matrix if the value is CL_TRUE.
* alpha is a value, or a pointer to a value, corresponding to the element type
  of c_out.
* beta is a value, or a pointer to a value, corresponding to the element type
  of c_out.
* kernel_props defines additional kernel properties.

*Kernel Arguments*

. cl_mem: a tensor object for matrix A (read only).
. cl_mem: a tensor object for matrix B (read only).
. cl_mem: a tensor object for matrix C_IN (read only).
. cl_mem: a tensor object for matrix C_OUT (write only).

*Description*

Performs (batched) general matrix multiplication:

[latexmath]
++++
\mathrm{COUT}_{b,m,n} = \text{beta} \cdot \mathrm{CIN}_{b,m,n}
  + \text{alpha} \cdot \sum_{k}
    \mathrm{trans}(\mathbf{A}, \text{trans\_a})_{b,m,k} \,
    \mathrm{trans}(\mathbf{B}, \text{trans\_b})_{b,k,n}
++++

Where:

[latexmath]
++++
\mathrm{trans}(X_{b,i,j}, tr) = \begin{cases}
  X_{b,j,i} & \text{if } tr = \text{CL\_TRUE} \\
  X_{b,i,j} & \text{otherwise}
\end{cases}
++++

Second degree tensors of shape (a, b) are treated as third degree tensors of
shape (1, a, b).

Operations of the matrix multiplication are performed in the precision of the
element type of COUT.

If an overflow occurs in the accumulation of the products, the COUT tensor's
result is undefined.

CIN and COUT tensors may be the same object.
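As a host-side reference for the semantics above (a sketch, not the accelerator implementation), a naive fp32 batched GEMM over row-major (batch, rows, cols) tensors flattened into vectors could look like the following; transposition is omitted for brevity:

```cpp
#include <cstddef>
#include <vector>

// Naive fp32 reference for the batched GEMM DBK semantics:
// COUT[b,m,n] = beta * CIN[b,m,n] + alpha * sum_k A[b,m,k] * B[b,k,n]
// Tensors are flattened row-major; trans_a/trans_b are assumed CL_FALSE.
std::vector<float> gemm_ref(const std::vector<float>& A,
                            const std::vector<float>& B,
                            const std::vector<float>& CIN,
                            std::size_t batch, std::size_t M, std::size_t N,
                            std::size_t K, float alpha, float beta) {
    std::vector<float> COUT(batch * M * N);
    for (std::size_t b = 0; b < batch; ++b)
        for (std::size_t m = 0; m < M; ++m)
            for (std::size_t n = 0; n < N; ++n) {
                float acc = 0.0f;
                for (std::size_t k = 0; k < K; ++k)
                    acc += A[(b * M + m) * K + k] * B[(b * K + k) * N + n];
                COUT[(b * M + m) * N + n] =
                    beta * CIN[(b * M + m) * N + n] + alpha * acc;
            }
    return COUT;
}
```

With one batch of 1x1 matrices, A = [3], B = [4], CIN = [5], alpha = 2 and beta = -1, the reference yields -1*5 + 2*12 = 19.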

*Attribute validation rules*

* rankof(A) == rankof(B) == rankof(CIN) == rankof(COUT).
* Let shapeof(At) == (b..., m, k) and shapeof(Bt) == (b..., k, n) be the
  shapes of tensors A and B, respectively, after possible transposing.
  shapeof(COUT) must be (b..., m, n).
* shapeof(CIN) == shapeof(COUT).
* elementof(A) == elementof(B).
* elemkindof(COUT) == elemkindof(A).
* elementof(COUT) == elementof(A), or elementof(A) is promotable to
  elementof(COUT) without a loss of meaning.

*Kernel mode properties*

This DBK accepts the following kernel properties:

* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR
* CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR

*Name: CL_DBK_MATMUL_KHR*

*Kernel Attributes*

[source,c]
----
typedef struct cl_dbk_attributes_matmul_khr {
    cl_tensor_desc a;
    cl_tensor_desc b;
    cl_tensor_desc c;
    cl_bool trans_a;
    cl_bool trans_b;
    cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES];
} cl_dbk_attributes_matmul_khr;
----

* a is a tensor description for input matrix A.
* b is a tensor description for input matrix B.
* c is a tensor description for output matrix C.
* trans_a instructs to transpose the A matrix if the value is CL_TRUE.
* trans_b instructs to transpose the B matrix if the value is CL_TRUE.
* kernel_props defines additional kernel properties.

*Kernel Arguments*

. cl_mem: a tensor object for matrix A (read only).
. cl_mem: a tensor object for matrix B (read only).
. cl_mem: a tensor object for matrix C (write only).

*Description*

Performs (batched) matrix multiplication:

[latexmath]
++++
\mathbf{C}_{b,m,n} = \sum_{k}
    \mathrm{trans}(\mathbf{A}, \text{trans\_a})_{b,m,k} \,
    \mathrm{trans}(\mathbf{B}, \text{trans\_b})_{b,k,n}
++++

Where:

[latexmath]
++++
\mathrm{trans}(X_{b,i,j}, tr) = \begin{cases}
  X_{b,j,i} & \text{if } tr = \text{CL\_TRUE} \\
  X_{b,i,j} & \text{otherwise}
\end{cases}
++++

Second degree tensors of shape (a, b) are treated as third degree tensors of
shape (1, a, b).

Operations of the matrix multiplication are performed in the precision of the
element type of C.

If an overflow occurs in the accumulation of the products, the C tensor's
result is undefined.

*Attribute validation rules*

* rankof(A) == rankof(B) == rankof(C).
* Let shapeof(At) == (b..., m, k) and shapeof(Bt) == (b..., k, n) be the
  shapes of tensors A and B, respectively, after possible transposing.
  shapeof(C) must be (b..., m, n).
* elementof(A) == elementof(B).
* elemkindof(C) == elemkindof(A).
* elementof(C) == elementof(A), or elementof(A) is promotable to elementof(C)
  without a loss of meaning.
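The rank and shape rules above can be checked on the host before program creation. A sketch, with shapes as plain vectors and hypothetical helper names (`transpose_shape`, `matmul_shapes_valid` are illustrative, not part of the extension):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Shape = std::vector<std::size_t>;  // e.g. {batch..., rows, cols}

// Swap the two innermost dimensions, mirroring trans(X, CL_TRUE).
inline Shape transpose_shape(Shape s, bool trans) {
    if (trans && s.size() >= 2)
        std::swap(s[s.size() - 2], s[s.size() - 1]);
    return s;
}

// Hypothetical host-side check of the matmul attribute validation rules:
// rankof(A) == rankof(B) == rankof(C), and shapeof(C) == (b..., m, n)
// given shapeof(At) == (b..., m, k) and shapeof(Bt) == (b..., k, n).
inline bool matmul_shapes_valid(const Shape& a, const Shape& b, const Shape& c,
                                bool trans_a, bool trans_b) {
    if (a.size() != b.size() || a.size() != c.size() || a.size() < 2)
        return false;
    Shape at = transpose_shape(a, trans_a);
    Shape bt = transpose_shape(b, trans_b);
    std::size_t r = at.size();
    // Batch dimensions must match across all tensors.
    for (std::size_t i = 0; i + 2 < r; ++i)
        if (at[i] != bt[i] || at[i] != c[i])
            return false;
    std::size_t m = at[r - 2], k = at[r - 1];
    if (bt[r - 2] != k) return false;       // inner dimensions must agree
    std::size_t n = bt[r - 1];
    return c[r - 2] == m && c[r - 1] == n;  // shapeof(C) == (b..., m, n)
}
```

For instance, shapes (4, 100, 50) x (4, 50, 200) -> (4, 100, 200) validate, and with trans_b set, B may instead be given as (4, 200, 50).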

*Kernel mode properties*

This DBK accepts the following kernel properties:

* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR

*Name: CL_DBK_LEAKY_RELU_KHR*

*Kernel Attributes*

[source,c]
----
typedef struct cl_dbk_attributes_leaky_relu_khr {
   cl_tensor_datatype_union_khr coefficient;
   cl_dbk_properties_khr kernel_props[CL_MAX_DBK_PROPERTIES];
} cl_dbk_attributes_leaky_relu_khr;
----

* coefficient is the coefficient of leakage, a positive value.

*Kernel arguments*

. cl_mem: a tensor object IN for input values.
. cl_mem: a tensor object OUT for output values.

*Description*

This element-wise built-in kernel performs a leaky ReLU operation as follows:

[latexmath]
++++
\text{OUT}_{i} = \begin{cases}
  \text{coefficient} \cdot \text{IN}_{i} & \text{if } \text{IN}_{i} < 0 \\
  \text{IN}_{i} & \text{otherwise}
\end{cases}
++++

If the target device does not support denormals, the coefficient value is
flushed to zero before the operation is applied. This DBK accepts tensors of
arbitrary rank.

The IN and OUT tensors may be the same object.
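A host-side reference for this element-wise semantics (a sketch over flattened fp32 tensors, not the accelerator implementation):

```cpp
#include <cstddef>
#include <vector>

// Reference for the leaky ReLU DBK semantics:
// OUT[i] = coefficient * IN[i] if IN[i] < 0, else IN[i].
std::vector<float> leaky_relu_ref(const std::vector<float>& in,
                                  float coefficient) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] < 0.0f ? coefficient * in[i] : in[i];
    return out;
}
```

For a coefficient of 0.01, an input of {-2, 0, 3} maps to {-2 * 0.01, 0, 3}: negative elements are scaled, non-negative elements pass through unchanged.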

*Kernel mode properties*

This DBK accepts the following kernel properties:

* CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR
* CL_DBK_PROPERTY_NON_DETERMINISTIC_KHR

*Attribute validation rules*

* shapeof(IN) == shapeof(OUT).
* elementof(IN) == elementof(OUT).
* coefficient must be a positive, finite value.

==== Launching DBKs from the Device Side

DBKs are primarily meant to be launched as kernel commands via host-side
command queues. Optionally, they can be callable from the device side via
enqueue_kernel:

TBC. This probably needs a device-side function corresponding to
clCreateProgramWithDefinedBuiltInKernels.

    Sample Code

    -
    -
    -
    -
    constexpr size_t b = 64, m = 100, n = 200, k = 50;
    -cl_int err;
    -
    -std::vector<float> lhs_data = ...;
    -std::vector<float> rhs_data = ...;
    -std::vector<float> bias_data = ...;
    -std::vector<float> out_data(b * m * n);
    -
    -cl_tensor_layout_blas_exp row_major;
    -row_major.leading_dims[0] = 2,
    -row_major.leading_dims[1] = 1,
    -
    -cl_tensor_desc_exp lhs_desc;
    -lhs_desc.rank = 3;
    -lhs_desc.dtype = CL_TENSOR_FP32_EXP;
    -lhs_desc.properties[0] = 0;
    -lhs_desc.shape[0] = b;
    -lhs_desc.shape[1] = m;
    -lhs_desc.shape[2] = k;
    -lhs_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;
    -lhs_desc.layout = &row_major;
    -
    -cl_tensor_desc_exp rhs_desc;
    -rhs_desc.rank = 3;
    -rhs_desc.dtype = CL_TENSOR_FP32_EXP;
    -rhs_desc.properties[0] = 0;
    -rhs_desc.shape[0] = b;
    rhs_desc.shape[1] = k;
    rhs_desc.shape[2] = n;
    rhs_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;
    rhs_desc.layout = &row_major;

    cl_tensor_desc_exp out_desc;
    out_desc.rank = 3;
    out_desc.dtype = CL_TENSOR_FP32_EXP;
    out_desc.properties[0] = 0;
    out_desc.shape[0] = b;
    out_desc.shape[1] = m;
    out_desc.shape[2] = n;
    out_desc.layout_type = CL_TENSOR_LAYOUT_BLAS_EXP;
    out_desc.layout = &row_major;

    cl_mem lhs_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, lhs_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, lhs_data.data(), &err);
    cl_mem rhs_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, rhs_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, rhs_data.data(), &err);
    cl_mem bias_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, out_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, 0, bias_data.data(), &err);
    cl_mem out_tensor = clCreateBufferWithProperties(
      ctx, {CL_MEM_TENSOR_EXP, out_desc, 0},
      CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE, 0, out_data.data(), &err);

    cl_tensor_datatype_union_khr alpha, beta, relu_coeff;
    alpha.sf = 2.0f;
    beta.sf = -1.0f;
    relu_coeff.sf = 0.01f;

    cl_dbk_attributes_gemm_khr gemm_attrs = {
      lhs_desc, rhs_desc, out_desc, out_desc, 0, 0, alpha, beta, {}
    };
    gemm_attrs.kernel_props[0] = CL_DBK_PROPERTY_MAX_RELATIVE_ERROR_KHR;
    gemm_attrs.kernel_props[1] = 100; // In ULPs.
    gemm_attrs.kernel_props[2] = 0;

    cl_dbk_attributes_leaky_relu_khr relu_attrs = {
      out_desc, out_desc, relu_coeff, {0}
    };

    cl_device_id target_devices[2] = {dev1, dev2};
    cl_int device_errcodes[2];
    auto prog = clCreateProgramWithDefinedBuiltInKernels(
      ctx, 2, target_devices, 2,
      {CL_DBK_GEMM_KHR, CL_DBK_LEAKY_RELU_KHR}, {"my_gemm", "my_relu"},
      {&gemm_attrs, &relu_attrs}, &device_errcodes, &err);

    std::vector<cl_device_id> supported_devs;
    for (unsigned i = 0; i < 2; i++) {
      if (device_errcodes[i] == CL_SUCCESS) {
        supported_devs.push_back(target_devices[i]);
      } else {
        // Handle errors. Possible error cases (non-exhaustive):
        //
        // * CL_DBK_UNSUPPORTED_KHR: The DBK is not supported on the device.
        // * CL_DBK_UNMET_MAX_RELATIVE_ERROR_KHR: The DBK implementation does
        //   not meet the requested precision.
      }
    }

    err = clBuildProgram(
      prog, supported_devs.size(), supported_devs.data(), "", nullptr, nullptr);

    auto gemm_kernel = clCreateKernel(prog, "my_gemm", &err);
    clSetKernelArg(gemm_kernel, 0, sizeof(cl_mem), &lhs_tensor);
    clSetKernelArg(gemm_kernel, 1, sizeof(cl_mem), &rhs_tensor);
    clSetKernelArg(gemm_kernel, 2, sizeof(cl_mem), &bias_tensor);
    clSetKernelArg(gemm_kernel, 3, sizeof(cl_mem), &out_tensor);

    auto relu_kernel = clCreateKernel(prog, "my_relu", &err);
    clSetKernelArg(relu_kernel, 0, sizeof(cl_mem), &out_tensor);
    clSetKernelArg(relu_kernel, 1, sizeof(cl_mem), &out_tensor);

    cl_command_queue cmd_q = /* Create an in-order command queue. */;

    clEnqueueNDRangeKernel(
      cmd_q, gemm_kernel, 0, nullptr, nullptr, nullptr, 0, nullptr, nullptr);
    clEnqueueNDRangeKernel(
      cmd_q, relu_kernel, 0, nullptr, nullptr, nullptr, 0, nullptr, nullptr);
    clEnqueueMapBuffer(
      cmd_q, out_tensor, CL_TRUE, CL_MAP_READ, 0, b * m * n * sizeof(float),
      0, nullptr, nullptr, &err);
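The sample above chains the two DBKs so that, per batch, out = leaky_relu(alpha * (lhs x rhs) + beta * bias) is computed on row-major tensors. The following plain C++ host-side reference is a sketch of that intended semantics, useful for validating a device's DBK output; the function name and loop-nest implementation are illustrative and not part of the extension:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference of the math the two chained DBKs are expected to
// perform on row-major tensors, batched over the leading dimension b:
//   out = leaky_relu(alpha * (lhs x rhs) + beta * bias)
// Illustrative sketch only; not part of the extension.
std::vector<float> gemm_leaky_relu_ref(
    const std::vector<float> &lhs,  // shape [b, m, k]
    const std::vector<float> &rhs,  // shape [b, k, n]
    const std::vector<float> &bias, // shape [b, m, n]
    size_t b, size_t m, size_t k, size_t n,
    float alpha, float beta, float relu_coeff) {
  std::vector<float> out(b * m * n, 0.0f);
  for (size_t bi = 0; bi < b; ++bi)
    for (size_t mi = 0; mi < m; ++mi)
      for (size_t ni = 0; ni < n; ++ni) {
        float acc = 0.0f;
        for (size_t ki = 0; ki < k; ++ki)
          acc += lhs[(bi * m + mi) * k + ki] * rhs[(bi * k + ki) * n + ni];
        // Scaled GEMM accumulation, followed by the leaky ReLU activation.
        float v = alpha * acc + beta * bias[(bi * m + mi) * n + ni];
        out[(bi * m + mi) * n + ni] = v > 0.0f ? v : relu_coeff * v;
      }
  return out;
}
```

A device implementation would be expected to match this reference within the precision negotiated via the max-relative-error kernel property.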

    Open questions

    1. Should we enable launching DBKs from the device side without requiring
       device-side enqueue? The main problem is the DBKs with an NDRange, as
       they are not simple single-work-item helper functions.

       UNRESOLVED

    2. Should the NDRange be used at all in DBKs? It feels somewhat unnatural,
       as the NDRange is typically used to imply SPMD parallelism, while the
       hardware/firmware is free to choose whatever parallelization strategy
       it uses to implement the function. On the other hand, something similar
       applies to software kernel launches, as NDRange-launched work-items can
       be executed serially while still adhering to barrier semantics.

       UNRESOLVED

    3. Different accelerators prefer different channel orders (NHWC vs.
       NCHW, ...) for the processed data. Should the channel order be passed
       as a DBK argument (like the row/column order in the example GEMM), or
       is it better to have a different DBK variation for each?

       UNRESOLVED

    4. How to denote preference? Some of the DBKs are more efficient on a
       given device because they map more naturally to the underlying hardware
       accelerator, but the slower variations (for example, ones with a
       suboptimal channel order in NN accelerators) might still be
       beneficially accelerated.

       UNRESOLVED

    5. Since the defined built-in kernel concept is basically a C-like API
       inside another API, should it be made more generic and thus directly
       usable for SYCL and Vulkan as well?

       UNRESOLVED

    6. What other DBK mode properties should we have? Some ideas:

       • Perform accumulation with saturation.
       • Finite math only.
       • Flush denormals to zero.

       UNRESOLVED

    7. Should we reuse (and remove the "deprecated" status of) clEnqueueTask
       for launching DBKs, as DBKs make no use of the global offset, global
       size, and local size parameters?

       UNRESOLVED
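To make the channel-order question concrete: the same logical element [n][c][h][w] lives at a different linear offset under NCHW than under NHWC. A minimal sketch with hypothetical helper names (not part of the extension):

```cpp
#include <cassert>
#include <cstddef>

// Linear offset of logical element [n][c][h][w] in an NCHW-laid-out tensor
// with C channels and HxW spatial dimensions. Hypothetical helper for
// illustration; the spatial index w varies fastest.
size_t nchw_offset(size_t n, size_t c, size_t h, size_t w,
                   size_t C, size_t H, size_t W) {
  return ((n * C + c) * H + h) * W + w;
}

// The same element's offset when the tensor is laid out as NHWC: the
// channel index c now varies fastest.
size_t nhwc_offset(size_t n, size_t c, size_t h, size_t w,
                   size_t C, size_t H, size_t W) {
  return ((n * H + h) * W + w) * C + c;
}
```

An accelerator wired for one of these orders must either reorder data on the fly or receive it in its native layout, which is the trade-off behind passing the order as a DBK attribute versus defining a per-order DBK variant.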

    Version History

    0.1.0 (2022-12-13)
      Authors: Pekka Jääskeläinen, Ben Ashbaugh
      First formulation as an extension specification, as proposed by Ben
      Ashbaugh.

    0.2.0 (2023-11-23)
      Authors: Henry Linjamäki, Pekka Jääskeläinen, Ben Ashbaugh
      Add APIs for defined built-in kernel (DBK) creation. Model DBKs on the
      tensor type. Add sample code.

    0.3.0 (2024-08-20)
      Authors: Henry Linjamäki, Pekka Jääskeläinen, Freddie Witherden
      • Rework the document structure to match the cl_khr_extension_template.
      • Reflect the changes of the cl_exp_tensor extension here.
      • Add a "Kernel Interface" section to the DBK Appendix.
      • Add the GEMM DBK.
      • Change the DBK creation interface.