OpenCL

I. OpenCL Host Development Flow

  1. Set up the platform environment (Platform, Device, Context)
    • Platform: one machine can expose several platforms, e.g. a GPU and an FPGA
      • cl_platform_id X = findPlatform("Intel(R) FPGA");
      • clGetPlatformIDs(1, &myp, NULL);
    • Device: each platform can contain multiple devices; query the platform for the number of devices and their IDs:
      • clGetDeviceIDs(cl_platform_id, CL_DEVICE_TYPE_ALL, cl_uint, cl_device_id*, cl_uint*);
    • Context: a context selects one or more devices as the current targets; it manages command-queues, memory, programs, and kernels, and specifies which device(s) in the context a kernel executes on
      • cl_context = clCreateContext(0, cl_uint, cl_device_id*, callback, NULL, status);
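A minimal host-side sketch of step 1 (error handling mostly omitted; it assumes the first platform and device found are the ones we want, whereas a multi-platform system would match the platform name first, e.g. with a findPlatform helper as above). This only compiles against an OpenCL SDK, so treat it as a sketch:

```c
#include <CL/cl.h>

cl_int setup(cl_platform_id *platform, cl_device_id *device, cl_context *context) {
    cl_int status;

    /* Take the first available platform and its first device */
    status = clGetPlatformIDs(1, platform, NULL);
    if (status != CL_SUCCESS) return status;
    status = clGetDeviceIDs(*platform, CL_DEVICE_TYPE_ALL, 1, device, NULL);
    if (status != CL_SUCCESS) return status;

    /* The context ties the device(s), memory, programs, and kernels together */
    *context = clCreateContext(NULL, 1, device, NULL, NULL, &status);
    return status;
}
```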
  2. Specify the Program and Kernel: define what program runs on the device and what its interface looks like
    • Create the program object
    • Create the kernel object
  3. Host-Kernel interaction (Host Buffer, Kernel Buffer, Read/Write Buffer): how data gets written to and read back from the device
    • Create a command queue, e.g. cl_command_queue clCreateCommandQueue(cl_context, cl_device_id, 0, status);
    • Create kernel-side memory, e.g. cl_mem = clCreateBuffer(context, CL_MEM_READ_WRITE, size, void*, status);
    • Map arguments, i.e. bind kernel-side memory to the kernel's parameters, e.g. status=clSetKernelArg(kernel_0, 0, sizeof(cl_mem), &in_0); status=clSetKernelArg(kernel_0, 1, sizeof(cl_mem), &out_0);
    • Create host memory (ordinary C allocation), e.g. unsigned int *in_buf_0=(unsigned int*) aligned_alloc(64, n*sizeof(unsigned int)); unsigned int *out_buf_0=(unsigned int*) aligned_alloc(64, n*sizeof(unsigned int));
    • Write the host memory into the kernel-side buffer, e.g. clEnqueueWriteBuffer(queue0[0], in_0, CL_TRUE, 0, n*sizeof(unsigned int), in_buf_0, 0, NULL, NULL);
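Putting step 3 together in one sketch (assumes context, device, n, and kernel_0 from the earlier steps; error checks omitted, and the buffer flags are illustrative):

```c
cl_int status;
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &status);

/* Kernel-side buffers */
cl_mem in_0  = clCreateBuffer(context, CL_MEM_READ_ONLY,  n * sizeof(unsigned int), NULL, &status);
cl_mem out_0 = clCreateBuffer(context, CL_MEM_WRITE_ONLY, n * sizeof(unsigned int), NULL, &status);

/* Bind the buffers to the kernel's parameters */
status = clSetKernelArg(kernel_0, 0, sizeof(cl_mem), &in_0);
status = clSetKernelArg(kernel_0, 1, sizeof(cl_mem), &out_0);

/* Host-side buffers (64-byte alignment helps DMA on many platforms) */
unsigned int *in_buf_0  = (unsigned int *)aligned_alloc(64, n * sizeof(unsigned int));
unsigned int *out_buf_0 = (unsigned int *)aligned_alloc(64, n * sizeof(unsigned int));

/* ... fill in_buf_0 ... then copy host -> device (CL_TRUE = blocking) */
clEnqueueWriteBuffer(queue, in_0, CL_TRUE, 0, n * sizeof(unsigned int), in_buf_0, 0, NULL, NULL);
```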
  4. Kernel execution: run the device program, then read the results back and release memory
    • Option 1: task execution (single work-item)
    • Option 2: NDRange (multiple work-items)
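Steps 4 and 5 as a sketch, continuing the names above (the work-group size of 64 is an arbitrary example; in OpenCL 1.x it must divide the global size):

```c
/* Option 1: single work-item */
clEnqueueTask(queue, kernel_0, 0, NULL, NULL);

/* Option 2: NDRange, n work-items in one dimension */
size_t global_size = n;
size_t local_size  = 64;
clEnqueueNDRangeKernel(queue, kernel_0, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

/* Read results back (blocking), then release everything */
clEnqueueReadBuffer(queue, out_0, CL_TRUE, 0, n * sizeof(unsigned int), out_buf_0, 0, NULL, NULL);
clReleaseMemObject(in_0);
clReleaseMemObject(out_0);
clReleaseKernel(kernel_0);
clReleaseCommandQueue(queue);
clReleaseContext(context);
```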
  5. Release memory

II. OpenCL: High-Level Overview

  1. OpenCL Components:
    • C Host API: declares which devices to use, what they should do, which functions to call, and where memory should live
      • What you call from the host
      • Directs devices
    • OpenCL C
      • Used to program the device
      • Based on C99
      • Many built-in functions
    • Models
      • Memory, execution, etc

The host calls the host API to manage the devices, and the devices themselves are programmed in OpenCL C. Underneath both sit the models (memory, execution, etc.), which guide everything.

  2. OpenCL Models:
    • Device Model: what devices look like inside
      • Inside the Device:
        The device is broken down into smaller pieces: each piece is a compute unit (CU), and each CU in turn contains several processing elements (for example, a device with 15 compute units and 8 processing elements per CU)
      • Inside the Compute Unit:
        PE stands for processing element. Each PE is paired with its own private memory; let's take apart one of these PE-plus-private-memory blocks and see what that's about.
      • Inside the Processing Element (PE):
        Think of the PE as a very simple processor. In particular, all instructions are executed on the processing element: everything you do in terms of actually making a device do work, the PE is responsible for.
    • Execution Model: How work gets done on devices
      • Kernel Functions:
        • OpenCL executes kernel functions on the device. The kernel functions are just ordinary functions with a special signature
        • Kernel calls have two parts:
          • Ordinary function argument list
          • External execution parameters that control parallelism
      • Role of the host in kernel execution
        • Coordinates execution (the host tells the device to call this function, but does not participate itself)
        • Provides arguments to the kernel (the host tells the device what to do and provides it arguments)
        • Provides execution parameters to launch the kernel
      • NDRange: execution strategy
        • The same kernel function will be invoked many times
          • The argument list is identical for all invocations
        • Basically we call the same function over and over
          • How many times we do this is dictated by the execution parameters.
        • Host sets extra execution parameters prior to launch
      • NDRange: Identifying the call
        • How do kernel functions know what to work on?
          • The argument list is identical
        • Insight: execution parameters provide an index space
          • each function invocation can access its index
        • The index space is n-dimensional
      • NDRange: Some Definitions
        • Work-item: invocation of the kernel for a particular index
        • Global ID: globally unique id for a work-item (from index space)
        • Global Work Size: the number of work-items (per dimension)
        • Work Dimension: dimension of the index space
        • Work-groups: Partition the global work into smaller pieces. Work-groups execute on compute units, work-items (inside a work-group) mapped to CU PEs. All work-items in a work-group share local memory
          • work-group size has a physical meaning: it is device specific
            • Maximum work-group size is a device characteristic: you can query a device to determine this value
            • Maximum work-group size is an integer: Handle n-dimensional work-groups in a special way
            • How to determine the best work-group size: this is too advanced for now
          • work-items can find out: their work-group id, size of work-groups, global id, global work size
        • The work-item perspective: each work-item has its own private memory; all work-items within a work-group (i.e. on one compute unit) can share local memory; and every work-item on the device can access constant memory and global memory
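The id relations above can be emulated in plain host C. This is a sketch with hypothetical helper names (they are not part of the OpenCL API); it mirrors what get_group_id, get_local_id, and get_global_id return for a 1-D range with a global offset of zero:

```c
#include <stddef.h>

/* group id and local id a work-item would see, given its global id */
size_t group_id_of(size_t global_id, size_t local_size) {
    return global_id / local_size;
}
size_t local_id_of(size_t global_id, size_t local_size) {
    return global_id % local_size;
}

/* the inverse relation:
   get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0) */
size_t global_id_of(size_t group_id, size_t local_id, size_t local_size) {
    return group_id * local_size + local_id;
}
```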
      • Some Kernel Call Points
        • Host will provide execution dimensions to the device, this creates an index space
        • Parameters can be values or global memory objects
        • Global memory is persistent between calls. But constant, local, and private memory are just scratch space; they are reset for each kernel call
        • OpenCL implementation has considerable flexibility:
          • How to map work-items to PEs?
          • How to schedule work?
    • Memory Model: How devices and host see data
      • Global Memory: where you load data and run functions
        • Shared with all processing elements
        • Host can access this memory too
          • Memory map
          • Copy data to/from global memory
        • This is OpenCL persistent storage (the memory remains across subsequent executions)
          • Other memory regions are scratch space
      • Constant Memory:
        • Shared with all processing elements
        • Read-only memory
        • Very different way to share data with all device PEs
        • Not persistent (will change over time)
      • Local Memory:
        • Shared with all PEs in a CU
        • Very efficient way to share data with all CU PEs
        • Cannot be accessed by other compute units
        • Not persistent
      • Private Memory:
        • Accessible by a single processing element (PE)
        • No other PE can access this memory
        • Not persistent
    • Host API: How the host controls the devices
      • Platform

        • A platform is an implementation of OpenCL
        • Platforms are like drivers for particular devices: platforms expose devices to you
        • Example: A system with two GPUs (AMD+nVIDIA) and a Xeon Phi (Intel)
          • A platform from AMD for one GPU and the CPU
          • A platform from Intel for the Xeon Phi
          • A platform from nVIDIA for the other GPU
        • Use the platform to discover devices available to you
      • Context: when you write an OpenCL program, creating a context is the first thing you do. What you’re going to do is: discover the platform -> get a context -> start locating memory -> start controlling devices

        • You create a context for a particular platform (you cannot have multiple platforms in a context)
        • A context is a container:
          • Contains devices
          • Contains memory
        • Most operations are related to a context (Implicitly or explicitly)
      • Program:

        • Programs are just collections of kernels (you extract kernels from your program to call them)
        • OpenCL applications have to load kernels
          • Compile OpenCL C source code
          • Load binary representation
        • Programs are device specific
      • Asynchronous Device Calls:
        The host manages devices asynchronously. You can have multiple devices attached to your host (for example you may have a Xeon Phi, an AMD GPU, an Nvidia GPU and you can use a CPU as another device). Now you want to manage all of these devices asynchronously for best performance. OpenCL has an asynchronous interface to do this.

        • Asynchronous Device Management
          • Host issues commands to device
          • Commands tell the device to do something
          • Devices take commands and do as they say
          • Host waits for commands to complete: this means the device has completed that action
          • Commands can be dependent on other commands
          • OpenCL commands are issued by clEnqueue* calls:
            • A cl_event object returned by clEnqueue* calls is used for dependencies
        • Command overview:
          • clEnqueueFoo enqueues the command "Foo" to run on a particular device
          • e1 is a handle (an event) representing this command
          • {deps}: a set of previously issued commands that must finish first; commands take a list of dependencies
        • An example:
          Suppose two Foo commands and one bar command are issued. e1 and e2 have no dependencies (their dependency sets are empty), but bar cannot start until both calls to Foo have finished. In real life, Foo might be doing memory copies and bar might be a kernel
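The Foo/Foo/bar pattern in host code might look like the following sketch (it assumes queue, buffers, host arrays, and kernel_0 already exist; the writes are non-blocking so the two copies can overlap):

```c
cl_event e1, e2;

/* two independent "Foo" commands: non-blocking memory copies */
clEnqueueWriteBuffer(queue, in_0, CL_FALSE, 0, size, host_a, 0, NULL, &e1);
clEnqueueWriteBuffer(queue, in_1, CL_FALSE, 0, size, host_b, 0, NULL, &e2);

/* "bar": a kernel that may only start after both writes complete */
cl_event deps[2] = { e1, e2 };
clEnqueueNDRangeKernel(queue, kernel_0, 1, NULL, &global_size, NULL, 2, deps, NULL);
```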
        • Where do commands go? We have talked about enqueueing things, but not yet where they go or what they do
          • OpenCL has command-queues
          • A command-queue is attached to a single device
          • You can create as many command-queues as you want
          • clEnqueue* commands have a command-queue parameter

      • Host API Summary:

        • Host API controls the device (Devices can’t do anything themselves)
        • Asynchronous execution model
          • Important for speed
          • A bit different from traditional asynchronous APIs (because of the command queue system and everything else)
  3. Mapping NDRange to Devices
    • Remember the PE runs instructions
      • So work-items should run on PEs
    • Assign multiple work-items to each PE
      • Need to handle the case that global work size > number PEs
    • Partition the global work into smaller pieces (work-groups)
      • Work-groups execute on compute units. All work-items in a work-group share local memory and are mapped to the CU's PEs.
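The partitioning arithmetic can be sketched in plain C (hypothetical helper names, not OpenCL API): how many work-groups cover a global size, and the padded global size a host might pick when the work does not divide evenly:

```c
#include <stddef.h>

/* work-groups needed to cover global_size items, local_size items each */
size_t num_groups(size_t global_size, size_t local_size) {
    return (global_size + local_size - 1) / local_size;  /* ceiling division */
}

/* global size rounded up to a whole number of work-groups; the extra
   work-items would bounds-check and do nothing */
size_t padded_global_size(size_t global_size, size_t local_size) {
    return num_groups(global_size, local_size) * local_size;
}
```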
  4. Conceptual Work-Group Launching (figure omitted; work-groups are dispatched onto the compute units)
  5. Geometric Visualization (figures omitted): the NDRange index space can be 1-, 2-, or 3-dimensional

III. OpenCL C

What is OpenCL C:

  • OpenCL device programming language: OpenCL C is a modified version of the C programming language that targets the devices
  • The main actor in OpenCL programming
  • OpenCL C is like C99
  • The other part of the OpenCL specification

OpenCL C != C:

  • No function pointers
  • No recursion
  • Function calls might be inlined
  • OpenCL C is not a subset of C: OpenCL C has features not in C
  • The specification outlines the full set of differences
  1. Types:
    • OpenCL C vs C
      • OpenCL C provides a concrete representation
        • signed integers are two's complement
        • types have fixed sizes
      • OpenCL C provides vector types and operations
      • OpenCL C provides image types: an example of an opaque type
        • An opaque type is one whose memory representation you cannot access directly; you use other functions to extract information from it.
      • OpenCL C types are mostly C types
    • Host and Device Types:
      On the device we know exactly what an int is (32-bit, two's complement), but on the host its size and representation are not pinned down, so you cannot just copy the bytes across. Be careful with host-device data exchange!
    • Types restricted to device: means that you can’t transfer them between the host and the device
  2. Memory regions
    • OpenCL C memory pointers: __global int* x
      • __global: specifies which memory region the pointer points into
      • __global int*: pointer to an int in global memory
      • If x and y are both __global int*, then x = y is allowed (x now points where y points). If x is __global int* and y is __private int*, then x = y is not allowed, but we can still copy values (i.e. *x = *y)
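A small OpenCL C sketch of the pointer rules above (the kernel name and arguments are made up for illustration):

```c
__kernel void regions(__global int *g) {
    __private int p = 7;

    __global int *x = g;
    __global int *y = g + 1;
    x = y;        /* OK: both pointers live in the global address space */

    __private int *q = &p;
    /* x = q;        error: cannot mix address spaces */
    *x = *q;      /* OK: copying the pointed-to value is fine */
}
```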
  3. Vector operations: built-in vectors, a bit like small fixed-size arrays in C++
    • OpenCL Vector Types: e.g. float4, int8 (element counts of 2, 3, 4, 8, and 16)
    • Vector operations:
      • vector-vector: a component-wise operation, as in:
        float4 x, y, z;
        z = x + y;

      • scalar-vector: when scalars and vectors are mixed, the scalar is broadcast ("padded out") to every component, so here z = (float4)(x) + y:
        float x;
        float4 y, z;
        z = x + y;
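The two cases can be emulated in plain C (a struct stands in for OpenCL C's float4; the helper names are hypothetical):

```c
typedef struct { float s[4]; } float4_t;  /* stand-in for OpenCL C float4 */

/* vector-vector: component-wise add, i.e. z = x + y */
float4_t vv_add(float4_t x, float4_t y) {
    float4_t z;
    for (int i = 0; i < 4; i++) z.s[i] = x.s[i] + y.s[i];
    return z;
}

/* scalar-vector: the scalar is broadcast to every lane, i.e. z = (float4)(x) + y */
float4_t sv_add(float x, float4_t y) {
    float4_t xv;
    for (int i = 0; i < 4; i++) xv.s[i] = x;
    return vv_add(xv, y);
}
```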
        
    • Vector Components: accessed as vec.<component>, e.g. v.x or v.s0
    • Why use OpenCL vector types / OpenCL C vector type advantages:
      • Clear communication of vector operations (you and the compiler both know these are vectors, i.e. a bundle of data here)
      • Simplifies code
      • Excellent performance: the compiler can do a great job of vectorizing when you use vector types in this context
  4. Structures
    • OpenCL C has structures and unions, just like C
    • But there are good reasons to not use them (performance)
    • Be careful of data exchange
      • Binary layout of struct must be same on device and host
      • Almost impossible to get right
  5. Functions
    • Overview
      • Ordinary C functions: nothing special
      • Recursion is forbidden
      • Functions might be expanded and inlined (by the compiler; it doesn't directly affect you, but it's worth knowing)
  6. Kernels: this is really what you call to do work on the device. The time spent studying the execution model pays off here
    • Introducing Kernels:
      • Kernels are entry points to device execution (like int main(int argc, char** argv), except the entry point can have any name)
      • Kernels are called by the host
        • Host will setup parameters for the call
        • Host will supply execution parameters for the call
        • Device runs function
      • Kernel arguments are pointers to __global (something in the global space) or just values
    • Kernel example: adds two arrays together
      • The 0 in get_global_id(0) selects dimension zero of the index space
      • The __kernel qualifier in front of the function is always required
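A plausible version of the array-add kernel described here (the names are assumed, since the original figure is not available):

```c
__kernel void add(__global const int *a,
                  __global const int *b,
                  __global int *c) {
    size_t i = get_global_id(0);   /* 0 = dimension zero of the index space */
    c[i] = a[i] + b[i];            /* one element per work-item */
}
```

The host would launch this with a global work size equal to the array length, so each work-item handles exactly one element.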
    • Review of Execution Model: the following concepts are very important for writing kernel functions, and there are built-in functions to access them
      • Execution has dimensions
      • Global work size
      • Global offset
      • Work-group size
    • Relevant functions
      • get_global_id(n): give us the work-item id in dimension n
      • get_global_offset(n)
      • get_local_id(n): returns the work-item's id within its work-group
    • Local memory
      • Memory shared by the work-items within a work-group (may be backed by dedicated hardware)
      • Often key to top performance. Local memory can be declared either inside the kernel body or as a __local kernel parameter.
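The two declaration styles, sketched in OpenCL C (kernel bodies abbreviated; the names are made up):

```c
/* 1) fixed-size __local array declared inside the kernel */
__kernel void hist(__global const uchar *in, __global uint *out, uint n) {
    __local uint counts[256];   /* shared by the whole work-group */
    /* ... cooperative histogram using counts ... */
}

/* 2) __local pointer parameter; the host picks the size at launch with
   clSetKernelArg(kernel, 1, bytes, NULL) */
__kernel void reduce(__global const float *in, __local float *scratch) {
    scratch[get_local_id(0)] = in[get_global_id(0)];
    /* ... tree reduction in scratch ... */
}
```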
    • Constant memory
      • Read-only memory shared by all work-items
        • Very fast to read
        • But relatively small amount of space
      • Useful in some circumstances: e.g. lookup tables
    • Kernel limitations
      • Kernels might execute concurrently on the device, but there is no mechanism for them to cooperate
      • A single kernel is limited in what it can do, so you might need to launch several kernels to get a job done
      • Kernels cannot allocate memory: everything is fixed prior to kernel execution
    • Kernel attributes
      • vec_type_hint: hint to the compiler for vectorization
      • reqd_work_group_size: forces a work-group size (very useful for performance): knowing the size exactly lets the compiler perform very particular optimizations and do a very good job of things like register allocation
  7. Quick Topics
    • OpenCL supports image operations: Load an image, do something, write an image
    • Built-in OpenCL C functions (kind of like a standard library)
      • Work-item functions: figure out the kernel launch parameters
      • Math functions
      • Integer functions
      • Geometric functions
      • see the documentation for details
    • Synchronization: complex topic, need to watch for another video
    • Extensions: These are extra features that you can enable with #pragma
