OpenCL

I. OpenCL Host Development Flow

  1. Set up the platform environment (Platform, Device, Context)
    • Platform: one machine can expose several platforms, e.g. a GPU and an FPGA
      • cl_platform_id X = findPlatform("Intel(R) FPGA");
      • clGetPlatformIDs(1, &myp, NULL);
    • Device: each platform can contain multiple devices; query the platform for the number of devices and their IDs:
      • clGetDeviceIDs(cl_platform_id, CL_DEVICE_TYPE_ALL, cl_uint, cl_device_id*, cl_uint*);
    • Context: a context selects one or more devices as the current targets; it manages command-queues, memory, programs, and kernels, and specifies which device(s) in the context a kernel executes on
      • cl_context = clCreateContext(0, cl_uint, cl_device_id*, callback, NULL, status);
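A minimal host-side sketch of step 1 (error handling mostly omitted; it assumes the first platform and device found are the ones we want, whereas a multi-platform system would match the platform name first, e.g. with a findPlatform helper as above). This only compiles against an OpenCL SDK, so treat it as a sketch:

```c
#include <CL/cl.h>

cl_int setup(cl_platform_id *platform, cl_device_id *device, cl_context *context) {
    cl_int status;

    /* Take the first available platform and its first device */
    status = clGetPlatformIDs(1, platform, NULL);
    if (status != CL_SUCCESS) return status;
    status = clGetDeviceIDs(*platform, CL_DEVICE_TYPE_ALL, 1, device, NULL);
    if (status != CL_SUCCESS) return status;

    /* The context ties the device(s), memory, programs, and kernels together */
    *context = clCreateContext(NULL, 1, device, NULL, NULL, &status);
    return status;
}
```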
  2. Specify the Program and Kernel: define what program runs on the device and what its interface looks like
    • Create the program object
    • Create the kernel object
  3. Host-Kernel interaction (Host Buffer, Kernel Buffer, Read/Write Buffer): how data gets written to and read back from the device
    • Create a command queue, e.g. cl_command_queue clCreateCommandQueue(cl_context, cl_device_id, 0, status);
    • Create kernel-side memory, e.g. cl_mem = clCreateBuffer(context, CL_MEM_READ_WRITE, size, void*, status);
    • Map arguments, i.e. bind kernel-side memory to the kernel's parameters, e.g. status=clSetKernelArg(kernel_0, 0, sizeof(cl_mem), &in_0); status=clSetKernelArg(kernel_0, 1, sizeof(cl_mem), &out_0);
    • Create host memory (ordinary C allocation), e.g. unsigned int *in_buf_0=(unsigned int*) aligned_alloc(64, n*sizeof(unsigned int)); unsigned int *out_buf_0=(unsigned int*) aligned_alloc(64, n*sizeof(unsigned int));
    • Write the host memory into the kernel-side buffer, e.g. clEnqueueWriteBuffer(queue0[0], in_0, CL_TRUE, 0, n*sizeof(unsigned int), in_buf_0, 0, NULL, NULL);
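Putting step 3 together in one sketch (assumes context, device, n, and kernel_0 from the earlier steps; error checks omitted, and the buffer flags are illustrative):

```c
cl_int status;
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &status);

/* Kernel-side buffers */
cl_mem in_0  = clCreateBuffer(context, CL_MEM_READ_ONLY,  n * sizeof(unsigned int), NULL, &status);
cl_mem out_0 = clCreateBuffer(context, CL_MEM_WRITE_ONLY, n * sizeof(unsigned int), NULL, &status);

/* Bind the buffers to the kernel's parameters */
status = clSetKernelArg(kernel_0, 0, sizeof(cl_mem), &in_0);
status = clSetKernelArg(kernel_0, 1, sizeof(cl_mem), &out_0);

/* Host-side buffers (64-byte alignment helps DMA on many platforms) */
unsigned int *in_buf_0  = (unsigned int *)aligned_alloc(64, n * sizeof(unsigned int));
unsigned int *out_buf_0 = (unsigned int *)aligned_alloc(64, n * sizeof(unsigned int));

/* ... fill in_buf_0 ... then copy host -> device (CL_TRUE = blocking) */
clEnqueueWriteBuffer(queue, in_0, CL_TRUE, 0, n * sizeof(unsigned int), in_buf_0, 0, NULL, NULL);
```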
  4. Kernel execution: run the device program, then read the results back and release memory
    • Option 1: task execution (single work-item)
    • Option 2: NDRange (multiple work-items)
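Steps 4 and 5 as a sketch, continuing the names above (the work-group size of 64 is an arbitrary example; in OpenCL 1.x it must divide the global size):

```c
/* Option 1: single work-item */
clEnqueueTask(queue, kernel_0, 0, NULL, NULL);

/* Option 2: NDRange, n work-items in one dimension */
size_t global_size = n;
size_t local_size  = 64;
clEnqueueNDRangeKernel(queue, kernel_0, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

/* Read results back (blocking), then release everything */
clEnqueueReadBuffer(queue, out_0, CL_TRUE, 0, n * sizeof(unsigned int), out_buf_0, 0, NULL, NULL);
clReleaseMemObject(in_0);
clReleaseMemObject(out_0);
clReleaseKernel(kernel_0);
clReleaseCommandQueue(queue);
clReleaseContext(context);
```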
  5. Release memory

II. OpenCL: High-Level Overview

  1. OpenCL Components:
    • C Host API: declares which devices to use, what they should do, which functions to call, and where memory should live
      • What you call from the host
      • Directs devices
    • OpenCL C
      • Used to program the device
      • Based on C99
      • Many built-in functions
    • Models
      • Memory, execution, etc

The host calls the host API to manage the devices, and the devices themselves are programmed in OpenCL C. Underneath both sit the models (memory, execution, etc.), which guide everything.

  2. OpenCL Models:
    • Device Model: what devices look like inside
      • Inside the Device:
        The device is broken down into smaller pieces: each piece is a compute unit (CU), and each CU in turn contains several processing elements (for example, a device with 15 compute units and 8 processing elements per CU)
      • Inside the Compute Unit:
        PE stands for processing element. Each PE is paired with its own private memory; let's take apart one of these PE-plus-private-memory blocks and see what that's about.
      • Inside the Processing Element (PE):
        Think of the PE as a very simple processor. In particular, all instructions are executed on the processing element: everything you do in terms of actually making a device do work, the PE is responsible for.
    • Execution Model: How work gets done on devices
      • Kernel Functions:
        • OpenCL executes kernel functions on the device. The kernel functions are just ordinary functions with a special signature
        • Kernel calls have two parts:
          • Ordinary function argument list
          • External execution parameters that control parallelism
      • Role of the host in kernel execution
        • Coordinates execution (the host tells the device to call this function, but does not participate itself)
        • Provides arguments to the kernel (the host tells the device what to do and provides it arguments)
        • Provides execution parameters to launch the kernel
      • NDRange: execution strategy
        • The same kernel function will be invoked many times
          • The argument list is identical for all invocations
        • Basically we call the same function over and over
          • How many times we do this is dictated by the execution parameters.
        • Host sets extra execution parameters prior to launch
      • NDRange: Identifying the call
        • How do kernel functions know what to work on?
          • The argument list is identical
        • Insight: execution parameters provide an index space
          • each function invocation can access its index
        • The index space is n-dimensional
      • NDRange: Some Definitions
        • Work-item: invocation of the kernel for a particular index
        • Global ID: globally unique id for a work-item (from index space)
        • Global Work Size: the number of work-items (per dimension)
        • Work Dimension: dimension of the index space
        • Work-groups: Partition the global work into smaller pieces. Work-groups execute on compute units, work-items (inside a work-group) mapped to CU PEs. All work-items in a work-group share local memory
          • work-group size has a physical meaning: it is device specific
            • Maximum work-group size is a device characteristic: you can query a device to determine this value
            • Maximum work-group size is an integer: Handle n-dimensional work-groups in a special way
            • How to determine the best work-group size: this is too advanced for now
          • work-items can find out: their work-group id, size of work-groups, global id, global work size
        • The work-item perspective: each work-item has its own private memory; all work-items within a work-group (i.e. on one compute unit) can share local memory; and every work-item on the device can access constant memory and global memory
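The id relations above can be emulated in plain host C. This is a sketch with hypothetical helper names (they are not part of the OpenCL API); it mirrors what get_group_id, get_local_id, and get_global_id return for a 1-D range with a global offset of zero:

```c
#include <stddef.h>

/* group id and local id a work-item would see, given its global id */
size_t group_id_of(size_t global_id, size_t local_size) {
    return global_id / local_size;
}
size_t local_id_of(size_t global_id, size_t local_size) {
    return global_id % local_size;
}

/* the inverse relation:
   get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0) */
size_t global_id_of(size_t group_id, size_t local_id, size_t local_size) {
    return group_id * local_size + local_id;
}
```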
      • Some Kernel Call Points
        • Host will provide execution dimensions to the device, this creates an index space
        • Parameters can be values or global memory objects
        • Global memory is persistent between calls. But constant, local, and private memory are just scratch space; they are reset for each kernel call
        • OpenCL implementation has considerable flexibility:
          • How to map work-items to PEs?
          • How to schedule work?
    • Memory Model: How devices and host see data
      • Global Memory: where you load data and run functions
        • Shared with all processing elements
        • Host can access this memory too
          • Memory map
          • Copy data to/from global memory
        • This is OpenCL persistent storage (the memory remains across subsequent executions)
          • Other memory regions are scratch space
      • Constant Memory:
        • Shared with all processing elements
        • Read-only memory
        • Very different way to share data with all device PEs
        • Not persistent (will change over time)
      • Local Memory:
        • Shared with all PEs in a CU
        • Very efficient way to share data with all CU PEs
        • Cannot be accessed by other compute units
        • Not persistent
      • Private Memory:
        • Accessible by a single processing element (PE)
        • No other PE can access this memory
        • Not persistent
    • Host API: How the host controls the devices
      • Platform

        • A platform is an implementation of OpenCL
        • Platforms are like drivers for particular devices: platforms expose devices to you
        • Example: A system with two GPUs (AMD+nVIDIA) and a Xeon Phi (Intel)
          • A platform from AMD for one GPU and the CPU
          • A platform from Intel for the Xeon Phi
          • A platform from nVIDIA for the other GPU
        • Use the platform to discover devices available to you
      • Context: when you write an OpenCL program, creating a context is the first thing you do. What you’re going to do is: discover the platform -> get a context -> start locating memory -> start controlling devices

        • You create a context for a particular platform (you cannot have multiple platforms in a context)
        • A context is a container:
          • Contains devices
          • Contains memory
        • Most operations are related to a context (Implicitly or explicitly)
      • Program:

        • Programs are just collections of kernels (you extract kernels from your program to call them)
        • OpenCL applications have to load kernels
          • Compile OpenCL C source code
          • Load binary representation
        • Programs are device specific
      • Asynchronous Device Calls:
        The host manages devices asynchronously. You can have multiple devices attached to your host (for example you may have a Xeon Phi, an AMD GPU, an Nvidia GPU and you can use a CPU as another device). Now you want to manage all of these devices asynchronously for best performance. OpenCL has an asynchronous interface to do this.

        • Asynchronous Device Management
          • Host issues commands to device
          • Commands tell the device to do something
          • Devices take commands and do as they say
          • Host waits for commands to complete: this means the device has completed that action
          • Commands can be dependent on other commands
          • OpenCL commands are issued by clEnqueue* calls:
            • A cl_event object returned by clEnqueue* calls is used for dependencies
        • Command overview:
          • clEnqueueFoo enqueues the command "Foo" to run on a particular device
          • e1 is a handle (an event) representing this command
          • {deps}: a set of previously issued commands that must finish first; commands take a list of dependencies
        • An example:
          Suppose two Foo commands and one bar command are issued. e1 and e2 have no dependencies (their dependency sets are empty), but bar cannot start until both calls to Foo have finished. In real life, Foo might be doing memory copies and bar might be a kernel
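The Foo/Foo/bar pattern in host code might look like the following sketch (it assumes queue, buffers, host arrays, and kernel_0 already exist; the writes are non-blocking so the two copies can overlap):

```c
cl_event e1, e2;

/* two independent "Foo" commands: non-blocking memory copies */
clEnqueueWriteBuffer(queue, in_0, CL_FALSE, 0, size, host_a, 0, NULL, &e1);
clEnqueueWriteBuffer(queue, in_1, CL_FALSE, 0, size, host_b, 0, NULL, &e2);

/* "bar": a kernel that may only start after both writes complete */
cl_event deps[2] = { e1, e2 };
clEnqueueNDRangeKernel(queue, kernel_0, 1, NULL, &global_size, NULL, 2, deps, NULL);
```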
        • Where do commands go? We have talked about enqueueing things, but not yet where they go or what they do
          • OpenCL has command-queues
          • A command-queue is attached to a single device
          • You can create as many command-queues as you want
          • clEnqueue* commands have a command-queue parameter

      • Host API Summary:

        • Host API controls the device (Devices can’t do anything themselves)
        • Asynchronous execution model
          • Important for speed
          • A bit different from traditional asynchronous APIs (because of the command queue system and everything else)
  3. Mapping NDRange to Devices
    • Remember the PE runs instructions
      • So work-items should run on PEs
    • Assign multiple work-items to each PE
      • Need to handle the case that global work size > number PEs
    • Partition the global work into smaller pieces (work-groups)
      • Work-groups execute on compute units. All work-items in a work-group share local memory and are mapped to the CU's PEs.
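The partitioning arithmetic can be sketched in plain C (hypothetical helper names, not OpenCL API): how many work-groups cover a global size, and the padded global size a host might pick when the work does not divide evenly:

```c
#include <stddef.h>

/* work-groups needed to cover global_size items, local_size items each */
size_t num_groups(size_t global_size, size_t local_size) {
    return (global_size + local_size - 1) / local_size;  /* ceiling division */
}

/* global size rounded up to a whole number of work-groups; the extra
   work-items would bounds-check and do nothing */
size_t padded_global_size(size_t global_size, size_t local_size) {
    return num_groups(global_size, local_size) * local_size;
}
```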
  4. Conceptual Work-Group Launching (figure omitted; work-groups are dispatched onto the compute units)
  5. Geometric Visualization (figures omitted): the NDRange index space can be 1-, 2-, or 3-dimensional

III. OpenCL C

What is OpenCL C:

  • OpenCL device programming language: OpenCL C is a modified version of the C programming language that targets the devices
  • The main actor in OpenCL programming
  • OpenCL C is like C99
  • The other part of the OpenCL specification

OpenCL C != C:

  • No function pointers
  • No recursion
  • Function calls might be inlined
  • OpenCL C is not a subset of C: OpenCL C has features not in C
  • The specification outlines the full set of differences
  1. Types:
    • OpenCL C vs C
      • OpenCL C provides a concrete representation
        • signed integers are two's complement
        • types have fixed sizes
      • OpenCL C provides vector types and operations
      • OpenCL C provides image types: an example of an opaque type
        • An opaque type is one whose memory representation you cannot access directly; you use other functions to extract information from it.
      • OpenCL C types are mostly C types
    • Host and Device Types:
      On the device we know exactly what an int is (32-bit, two's complement), but on the host its size and representation are not pinned down, so you cannot just copy the bytes across. Be careful with host-device data exchange!
    • Types restricted to device: means that you can’t transfer them between the host and the device
  2. Memory regions
    • OpenCL C memory pointers: __global int* x
      • __global: specifies which memory region the pointer points into
      • __global int*: pointer to an int in global memory
      • If x and y are both __global int*, then x = y is allowed (x now points where y points). If x is __global int* and y is __private int*, then x = y is not allowed, but we can still copy values (i.e. *x = *y)
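A small OpenCL C sketch of the pointer rules above (the kernel name and arguments are made up for illustration):

```c
__kernel void regions(__global int *g) {
    __private int p = 7;

    __global int *x = g;
    __global int *y = g + 1;
    x = y;        /* OK: both pointers live in the global address space */

    __private int *q = &p;
    /* x = q;        error: cannot mix address spaces */
    *x = *q;      /* OK: copying the pointed-to value is fine */
}
```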
  3. Vector operations: built-in vectors, a bit like small fixed-size arrays in C++
    • OpenCL Vector Types: e.g. float4, int8 (element counts of 2, 3, 4, 8, and 16)
    • Vector operations:
      • vector-vector: a component-wise operation, as in:
        float4 x, y, z;
        z = x + y;

      • scalar-vector: when scalars and vectors are mixed, the scalar is broadcast ("padded out") to every component, so here z = (float4)(x) + y:
        float x;
        float4 y, z;
        z = x + y;
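The two cases can be emulated in plain C (a struct stands in for OpenCL C's float4; the helper names are hypothetical):

```c
typedef struct { float s[4]; } float4_t;  /* stand-in for OpenCL C float4 */

/* vector-vector: component-wise add, i.e. z = x + y */
float4_t vv_add(float4_t x, float4_t y) {
    float4_t z;
    for (int i = 0; i < 4; i++) z.s[i] = x.s[i] + y.s[i];
    return z;
}

/* scalar-vector: the scalar is broadcast to every lane, i.e. z = (float4)(x) + y */
float4_t sv_add(float x, float4_t y) {
    float4_t xv;
    for (int i = 0; i < 4; i++) xv.s[i] = x;
    return vv_add(xv, y);
}
```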
        
    • Vector Components: accessed as vec.<component>, e.g. v.x or v.s0
    • Why use OpenCL vector types / OpenCL C vector type advantages:
      • Clear communication of vector operations (you and the compiler both know these are vectors, i.e. a bundle of data here)
      • Simplifies code
      • Excellent performance: the compiler can do a great job of vectorizing when you use vector types in this context
  4. Structures
    • OpenCL C has structures and unions, just like C
    • But there are good reasons to not use them (performance)
    • Be careful of data exchange
      • Binary layout of struct must be same on device and host
      • Almost impossible to get right
  5. Functions
    • Overview
      • Ordinary C functions: nothing special
      • Recursion is forbidden
      • Functions might be expanded and inlined (by the compiler; it doesn't directly affect you, but it's worth knowing)
  6. Kernels: this is really what you call to do work on the device. The time spent studying the execution model pays off here
    • Introducing Kernels:
      • Kernels are entry points to device execution (like int main(int argc, char** argv), except the entry point can have any name)
      • Kernels are called by the host
        • Host will setup parameters for the call
        • Host will supply execution parameters for the call
        • Device runs function
      • Kernel arguments are pointers to __global (something in the global space) or just values
    • Kernel example: adds two arrays together
      • The 0 in get_global_id(0) selects dimension zero of the index space
      • The __kernel qualifier in front of the function is always required
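A plausible version of the array-add kernel described here (the names are assumed, since the original figure is not available):

```c
__kernel void add(__global const int *a,
                  __global const int *b,
                  __global int *c) {
    size_t i = get_global_id(0);   /* 0 = dimension zero of the index space */
    c[i] = a[i] + b[i];            /* one element per work-item */
}
```

The host would launch this with a global work size equal to the array length, so each work-item handles exactly one element.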
    • Review of Execution Model: the following concepts are very important for writing kernel functions, and there are built-in functions to access them
      • Execution has dimensions
      • Global work size
      • Global offset
      • Work-group size
    • Relevant functions
      • get_global_id(n): give us the work-item id in dimension n
      • get_global_offset(n)
      • get_local_id(n): returns the work-item's id within its work-group
    • Local memory
      • Memory shared by the work-items within a work-group (may be backed by dedicated hardware)
      • Often key to top performance. Local memory can be declared either inside the kernel body or as a __local kernel parameter.
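The two declaration styles, sketched in OpenCL C (kernel bodies abbreviated; the names are made up):

```c
/* 1) fixed-size __local array declared inside the kernel */
__kernel void hist(__global const uchar *in, __global uint *out, uint n) {
    __local uint counts[256];   /* shared by the whole work-group */
    /* ... cooperative histogram using counts ... */
}

/* 2) __local pointer parameter; the host picks the size at launch with
   clSetKernelArg(kernel, 1, bytes, NULL) */
__kernel void reduce(__global const float *in, __local float *scratch) {
    scratch[get_local_id(0)] = in[get_global_id(0)];
    /* ... tree reduction in scratch ... */
}
```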
    • Constant memory
      • Read-only memory shared by all work-items
        • Very fast to read
        • But relatively small amount of space
      • Useful in some circumstances: e.g. lookup tables
    • Kernel limitations
      • Kernels might execute concurrently on the device, but there is no mechanism for them to cooperate
      • A single kernel is limited in what it can do, so you might need to launch several kernels to get a job done
      • Kernels cannot allocate memory: everything is fixed prior to kernel execution
    • Kernel attributes
      • vec_type_hint: hint to the compiler for vectorization
      • reqd_work_group_size: forces a work-group size (very useful for performance): knowing the size exactly lets the compiler perform very particular optimizations and do a very good job of things like register allocation
  7. Quick Topics
    • OpenCL supports image operations: Load an image, do something, write an image
    • Built-in OpenCL C functions (kind of like a standard library)
      • Work-item functions: figure out the kernel launch parameters
      • Math functions
      • Integer functions
      • Geometric functions
      • see the documentation for details
    • Synchronization: complex topic, need to watch for another video
    • Extensions: These are extra features that you can enable with #pragma
