1. OpenCL Host Development Flow
- Set up the platform environment (Platform, Device, Context)
- Platform: a single server can expose multiple platforms, e.g. a GPU platform and an FPGA platform
cl_platform_id X = findPlatform("Intel(R) FPGA");
or clGetPlatformIDs(1, &myp, NULL);
- Device: use the platform to get the number of devices; each platform can have multiple devices. Get the device IDs:
clGetDeviceIDs(cl_platform_id, CL_DEVICE_TYPE_ALL, cl_uint num_entries, cl_device_id* devices, cl_uint* num_devices);
- Context: a context designates one or more devices as the current operation targets. The context manages command queues, memory, programs, and kernels, and specifies that kernels execute on one or more devices within the context (a consolidated sketch follows below)
cl_context = clCreateContext(0, cl_uint num_devices, cl_device_id* devices, callback_function, NULL, &status);
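Putting these three steps together, a minimal host-side sketch (assuming one device; `create_context` is an illustrative helper, not an OpenCL API function):

```c
#include <CL/cl.h>

/* Sketch: first platform -> first device -> context. */
cl_context create_context(cl_device_id *device_out) {
    cl_platform_id platform;
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS)
        return NULL;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, device_out, NULL) != CL_SUCCESS)
        return NULL;
    cl_int status;
    cl_context ctx = clCreateContext(NULL, 1, device_out, NULL, NULL, &status);
    return (status == CL_SUCCESS) ? ctx : NULL;
}
```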
- Specify the Program and Kernel: you need to know what program runs on the GPU and what its interface looks like
- Create the program object
- Create the kernel object (a sketch of both steps follows)
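A minimal sketch of both steps, reusing the context and device from above (the `build_kernel` helper and the kernel name "my_kernel" are placeholders):

```c
#include <CL/cl.h>
#include <string.h>

/* Sketch: build a program from OpenCL C source, then extract one kernel.
 * On FPGA platforms the program is typically created from a precompiled
 * binary (clCreateProgramWithBinary) instead of from source. */
cl_kernel build_kernel(cl_context ctx, cl_device_id dev, const char *src) {
    cl_int status;
    size_t len = strlen(src);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, &len, &status);
    status = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    return clCreateKernel(prog, "my_kernel", &status);
}
```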
- Host-Kernel interaction (Host Buffer, Kernel Buffer, Read/Write Buffer): how to write data to the GPU and how to read it back
- Create a command queue, e.g.
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &status);
- Create kernel-side (device) memory, e.g.
cl_mem in_0 = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &status);
- Map arguments: associate the kernel-side memory with the kernel's parameters, e.g.
status=clSetKernelArg(kernel_0, 0, sizeof(cl_mem), &in_0);
status=clSetKernelArg(kernel_0, 1, sizeof(cl_mem), &out_0);
- Create host memory (the usual way of allocating space in C), e.g.
unsigned int *in_buf_0=(unsigned int*) aligned_alloc(64, n*sizeof(unsigned int));
unsigned int *out_buf_0=(unsigned int*) aligned_alloc(64, n*sizeof(unsigned int));
- Write the host memory into the kernel-side memory, e.g.
clEnqueueWriteBuffer(queue0[0], in_0, CL_TRUE, 0, n*sizeof(unsigned int), in_buf_0, 0, NULL, NULL);
- Kernel execution: run the GPU program, then read back the results and free memory (both launch methods are sketched below)
- Method 1: task execution (a single work-item)
- Method 2: NDRange (multiple work-items)
- Free memory
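A minimal sketch of both launch methods plus readback and cleanup, reusing the names from the snippets above:

```c
// Method 1: task execution (a single work-item)
status = clEnqueueTask(queue0[0], kernel_0, 0, NULL, NULL);

// Method 2: NDRange with n work-items in one dimension
size_t global_size = n;
status = clEnqueueNDRangeKernel(queue0[0], kernel_0, 1, NULL,
                                &global_size, NULL, 0, NULL, NULL);

// Blocking read of the result back into host memory
clEnqueueReadBuffer(queue0[0], out_0, CL_TRUE, 0,
                    n * sizeof(unsigned int), out_buf_0, 0, NULL, NULL);

// Release device buffers and host memory
clReleaseMemObject(in_0);
clReleaseMemObject(out_0);
free(in_buf_0);
free(out_buf_0);
```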
2. OpenCL: High-Level Overview
- OpenCL Components:
- C Host API: basically says which devices I want to use, what I want them to do, which functions I want to call, and where memory should live
- What you call from the host
- Directs devices
- OpenCL C
- Used to program the device
- Based on C99
- Many built-in functions
- Models
- Memory, execution, etc
As shown in the figure above, the host calls the host API to manage the devices on the right. Devices are programmed using OpenCL C. Underneath all of these are models, which are there to guide everything.
- OpenCL Model:
- Device Model: what devices look like inside
- Inside the Device:
The device is broken down into further pieces: each small rectangle is a compute unit (CU). In this picture we have 15 compute units, with 8 processing elements per CU.
- Inside the Compute Unit: PE stands for processing element. Let's take apart one of these blocks of a PE plus its private memory and see what that's about.
- Inside the Processing Element (PE): think of the PE as a very simple processor. In particular, all instructions are executed on the processing element: everything you do in terms of actually making devices do work, the PE is responsible for.
- Execution Model: How work gets done on devices
- Kernel Functions:
- OpenCL executes kernel functions on the device. The kernel functions are just ordinary functions with a special signature
- Kernel calls have two parts:
- Ordinary function argument list
- External execution parameters that control parallelism
- Role of the host in kernel execution
- Coordinates execution (the host tells the device to call this function, but does not participate itself)
- Provides arguments to the kernel (the host tells the device what to do and provides its arguments)
- Provides execution parameters to launch the kernel
- NDRange: execution strategy
- The same kernel function will be invoked many times
- The argument list is identical for all invocations
- Basically we call the same function over and over
- How many times we do this is dictated by the execution parameters.
- Host sets extra execution parameters prior to launch
- NDRange: Identifying the call
- How do kernel functions know what to work on?
- The argument list is identical
- Insight: execution parameters provide an index space
- each function invocation can access its index
- The index space is n-dimensional
- NDRange: Some Definitions
- Work-item: invocation of the kernel for a particular index
- Global ID: globally unique id for a work-item (from index space)
- Global Work Size: the number of work-items (per dimension)
- Work Dimension: dimension of the index space
- Work-groups: partition the global work into smaller pieces. Work-groups execute on compute units; the work-items inside a work-group are mapped to the CU's PEs. All work-items in a work-group share local memory
- work-group size has a physical meaning: it is device specific
- Maximum work-group size is a device characteristic: you can query a device to determine this value
- Maximum work-group size is an integer: n-dimensional work-groups are handled in a special way
- How to determine the best work-group size: this is too advanced for now
- work-items can find out: their work-group id, the size of the work-groups, their global id, and the global work size (see the sketch after this list)
- The work-item perspective: each work-item has its own private memory; all work-items within a work-group (i.e. on one compute unit) can share the local memory; and every work-item on the device can access the constant memory and the global memory
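As an illustration of those query functions (this kernel is not from the original notes; `get_global_size(n)` similarly returns the global work size):

```c
__kernel void where_am_i(__global int *out) {
    size_t gid   = get_global_id(0);   // global id
    size_t group = get_group_id(0);    // work-group id
    size_t lid   = get_local_id(0);    // id inside the work-group
    size_t lsize = get_local_size(0);  // work-group size
    // With no global offset, these are related by gid == group * lsize + lid
    out[gid] = (int)(group * lsize + lid);
}
```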
- Some kernel call points
- Host will provide execution dimensions to the device; this creates an index space
- Parameters can be values or global memory objects
- Global memory is persistent between calls, but constant, local, and private memory are just scratch space; they are reset on every kernel call
- OpenCL implementation has considerable flexibility:
- How to map work-items to PEs?
- How to schedule work?
- Memory Model: How devices and host see data
- Global Memory: where you load data and run functions
- Shared with all processing elements
- Host can access this memory too
- Memory map
- Copy data to/from global memory
- This is OpenCL persistent storage (the memory remains across subsequent executions)
- Other memory regions are scratch space
- Constant Memory:
- Shared with all processing elements
- Read-only memory
- Very different way to share data with all device PEs
- Not persistent (will change over time)
- Local Memory:
- Shared with all PEs in a CU
- Very efficient way to share data with all CU PEs
- Cannot be accessed by other compute units
- Not persistent
- Private Memory:
- Accessible by a single processing element (PE)
- No other PE can access this memory
- Not persistent
- Host API: how the host controls the devices
- Platform
- A platform is an implementation of OpenCL
- Platforms are like drivers for particular devices: platforms expose devices to you
- Example: A system with two GPUs (AMD+nVIDIA) and a Xeon Phi (Intel)
- A platform from AMD for one GPU and the CPU
- A platform from Intel for the Xeon Phi
- A platform from nVIDIA for the other GPU
- Use the platform to discover devices available to you
- Context: when you write an OpenCL program, creating a context is the first thing you do. The flow is: discover the platform -> get a context -> start allocating memory -> start controlling devices
- You create a context for a particular platform (you cannot have multiple platforms in a context)
- A context is a container:
- Contains devices
- Contains memory
- Most operations are related to a context (Implicitly or explicitly)
- Program:
- Programs are just collections of kernels (you extract kernels from your program to call them)
- OpenCL applications have to load kernels
- Compile OpenCL C source code
- Load binary representation
- Programs are device specific
- Asynchronous Device Calls:
The host manages devices asynchronously. You can have multiple devices attached to your host (for example a Xeon Phi, an AMD GPU, and an Nvidia GPU, and you can use a CPU as another device), and you want to manage all of these devices asynchronously for best performance. OpenCL has an asynchronous interface to do this.
- Asynchronous Device Management
- Host issues commands to device
- Commands tell the device to do something
- Devices take commands and do as they say
- Host waits for commands to complete: this means the device has completed that action
- Commands can be dependent on other commands
- OpenCL commands are issued by `clEnqueue*` calls
  - A `cl_event` object returned by `clEnqueue*` calls is used for dependencies
- Command overview:
  - `clEnqueueFoo` enqueues the command "Foo" to run on a particular device
  - `e1` is a handle that represents this command
  - `{deps}`: this is a set of previously issued commands that have to be finished before it runs. Commands take a list of dependencies
- An example: I have two commands I'm going to run called Foo and one command called bar. `e1` and `e2` have no dependencies because their dependency sets are empty, but the command bar cannot run until these two previous calls to Foo have finished. In real life, Foo might be doing memory copies and bar might be a kernel (see the sketch below)
- Where do commands go? We have talked about enqueueing things but haven't really said where they go or what they do
- OpenCL has command-queues
- A command-queue is attached to a single device
- You can create as many command-queues as you want
  - `clEnqueue*` commands have a command-queue parameter
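A hedged sketch of the Foo/bar pattern in real calls (the queue, buffer, and kernel names are illustrative): two non-blocking writes produce events `e1` and `e2`, and the kernel launch waits on both.

```c
cl_event e1, e2, e3;

clEnqueueWriteBuffer(queue, in_a, CL_FALSE, 0, bytes, host_a, 0, NULL, &e1);
clEnqueueWriteBuffer(queue, in_b, CL_FALSE, 0, bytes, host_b, 0, NULL, &e2);

cl_event deps[2] = { e1, e2 };
size_t global = n;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 2, deps, &e3);

clWaitForEvents(1, &e3);  // host blocks until the whole chain has finished
```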
- Host API Summary:
- Host API controls the device (Devices can’t do anything themselves)
- Asynchronous execution model
- Important for speed
- A bit different from traditional asynchronous APIs (because of the command queue system and everything else)
- Mapping NDRange to Devices
- Remember the PE runs instructions
- So work-items should run on PEs
- Assign multiple work-items to each PE
- Need to handle the case where the global work size > the number of PEs
- Partition the global work into smaller pieces (work-groups)
- Work-groups execute on compute units. All work-items in the work-group share local memory and are mapped to the CU's PEs.
- Conceptual Work-Group Launching
- Geometric Visualization:
- 1D:
- 2D:
- 3D
3. OpenCL C
What is OpenCL C:
- OpenCL device programming language: OpenCL C is a modification of the C programming language that actually targets the devices
- The main actor in OpenCL programming
- OpenCL C is like C99
- The other part of the OpenCL specification
OpenCL C != C:
- No function pointers
- No recursion
- Function calls might be inlined
- OpenCL C is not a subset of C: OpenCL C has features not in C
- The specification outlines the full set of differences
- Types:
- OpenCL C vs C
- OpenCL C provides a concrete representation
- Signed integers are represented in two's complement
- types have fixed sizes
- OpenCL C provides vector types and operations
- OpenCL C provides image types: example of an opaque type
- An opaque type is one whose memory representation you do not have direct access to; you use other functions to extract information from it.
- OpenCL C types are mostly C types
- Host and Device Types:
As shown in the figure above, on the device side we know the int type is 32-bit two's complement, while on the host side we do not know its exact representation or size, so values cannot simply be copied across. Be careful with host-device data exchange!
- Types restricted to the device: these cannot be transferred between the host and the device
- Memory regions
- OpenCL C memory pointers, e.g. `__global int* x`
  - `__global`: specifies the memory region we want to point into
  - `__global int*`: a pointer to an integer in global memory
  - If there are two `__global int*` variables x and y, then `x = y` is allowed (it makes x point where y points). If x is `__global int*` and y is `__private int*`, then `x = y` is not allowed, but we can still copy values (i.e. `*x = *y`); see the sketch below
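A small OpenCL C sketch of these assignment rules (the kernel is illustrative, not from the notes):

```c
__kernel void pointer_rules(__global int *g) {
    __global int *x = g;
    __global int *y = g + 1;
    int p = 7;                // variables inside a kernel are private by default
    __private int *q = &p;

    x = y;     // OK: both pointers refer to global memory
    // x = q;  // compile error: __global and __private pointers don't mix
    *x = *q;   // OK: copying the pointed-to value is always allowed
}
```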
- Vector operations: these are vectors, similar to arrays in C++
- OpenCL Vector Types:
- Vector operations:
  - vector-vector: as in the code below, the operation is done component-wise
    `float4 x, y, z; z = x + y;`
  - scalar-vector: when we mix scalars and vectors, the scalar is padded out (broadcast); in the code below, the result is `z = (float4)x + y`
    `float x; float4 y, z; z = x + y;`
- Vector Components: `vec.<component>` (see the sketch below)
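An illustrative snippet of the component syntax (named components, numbered components, and swizzles; the kernel itself is made up):

```c
__kernel void components(__global float4 *out) {
    float4 v = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
    v.x  = 0.5f;      // named components: x, y, z, w
    v.s1 = v.s0;      // numbered components: s0..s3
    out[0] = v.wzyx;  // swizzle: the components in reverse order
}
```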
- Why use OpenCL vector types / OpenCL C vector type advantages:
- Clear communication of vector operations (you and the compiler both know these are vectors, i.e. a bundle of data here)
- Simplifies code
- Excellent performance: the compiler can do a great job of vectorizing when you use vectors in this context
- Structures
- OpenCL C has structures and unions, just like C
- But there are good reasons to not use them (performance)
- Be careful of data exchange
- Binary layout of struct must be same on device and host
- Almost impossible to get right
- Functions
- Overview
- Ordinary C functions: nothing special
- Recursion is forbidden
- Functions might be expanded and inlined (done by the compiler; not something that affects you, but you should still know about it)
- Example
- Kernels: this is really what you're calling to do work on the device. The time spent studying the execution model is going to pay off here
- Introducing Kernels:
  - Kernels are entry points to device execution (like `int main(int argc, char** argv)`, except we can give the kernel any name)
  - Kernels are called by the host
    - Host will set up parameters for the call
    - Host will supply execution parameters for the call
    - Device runs the function
  - Kernel arguments are pointers to `__global` (something in the global space) or just values
- Kernel example: adds two arrays together (a reconstruction is sketched below)
  - The 0 in `get_global_id(0)` is the zeroth dimension of the id
  - The `__kernel` qualifier before the function is always required
- Review of Execution Model: the concepts below are very important for writing kernel functions, and there are relevant functions to access them
- Execution has dimensions
- Global work size
- Global offset
- Work-group size
- Relevant functions
  - `get_global_id(n)`: gives us the work-item id in dimension n
  - `get_global_offset(n)`
  - `get_local_id(n)`: says which work-item am I inside my work-group
- Local memory
- Memory shared by work-items within the work-group (might be implemented in hardware)
- Often key to top performance. So how do we declare something to use local memory? See the two snippets below
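The two snippets from the original figure are not reproduced here; below is a reconstruction of the two usual forms (the size 64 and all names are assumptions):

```c
// 1) Fixed-size __local array declared inside the kernel
__kernel void scratch_static(__global const float *in, __global float *out) {
    __local float tile[64];            // assumes work-group size <= 64
    size_t lid = get_local_id(0);
    tile[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);      // all loads finish before anyone reads
    out[get_global_id(0)] = tile[lid];
}

// 2) Size chosen by the host: a __local pointer argument, allocated with
//    clSetKernelArg(kernel, 1, bytes, NULL) on the host side
__kernel void scratch_dynamic(__global const float *in, __local float *tile) {
    tile[get_local_id(0)] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
}
```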
- Constant memory
- Read-only memory shared by all work-items
- Very fast to read
- But relatively small amount of space
- Useful in some circumstances, e.g. lookup tables (see the sketch below)
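An illustrative `__constant` lookup table (the table values are made up):

```c
__constant float WEIGHTS[4] = { 0.1f, 0.2f, 0.3f, 0.4f };

__kernel void weigh(__global const float *in, __global float *out) {
    size_t i = get_global_id(0);
    out[i] = in[i] * WEIGHTS[i % 4];  // every work-item reads the shared table
}
```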
- Kernel limitations
- Kernels might execute concurrently on the device, but there is no mechanism for them to cooperate
- A single kernel is limited in what it can do, so you might need to launch several kernels to get a job done
- Kernels cannot allocate memory: everything is fixed prior to kernel execution
- Kernel attributes
  - `vec_type_hint`: a hint to the compiler for vectorization
  - `reqd_work_group_size`: forces a work-group size (very useful for performance); the compiler can then perform very particular optimizations and do a very good job at things like register allocation. Both attributes are sketched below
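An illustrative kernel carrying both attributes (the float4 hint and the 64x1x1 size are arbitrary choices):

```c
__attribute__((vec_type_hint(float4)))
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void tuned(__global float *data) {
    data[get_global_id(0)] *= 2.0f;
}
```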
- Quick Topics
- OpenCL supports image operations: Load an image, do something, write an image
- Built-in OpenCL C functions (kind of like a standard library)
- Work-item functions: figure out the kernel launch parameters
- Math functions
- Integer functions
- Geometric functions
- see the documentation for details
- Synchronization: complex topic, need to watch for another video
- Extensions: these are extra features that you can enable with `#pragma`