Tiled Matrix Multiplication

Let's talk about tiled matrix multiplication today. This is an algorithm performed on GPUs due to the parallel nature of matrix multiplication. We will especially look at a technique called "tiling," which is used to reduce global memory accesses by taking advantage of the shared memory on the GPU. Tiling can be seen as a way to boost the execution efficiency of the kernel. We will then examine CUDA kernel code that does exactly what we see in the visualization, which shows what each thread within a block is doing to compute the output.

Keep in mind that this post is not meant to teach you CUDA coding; rather, it is meant to help you gain some visual intuition about what each thread is doing in a basic tiled matrix multiplication algorithm. I strongly believe that writing the code (launching the kernel, index calculations, and so on) will come easily once you understand and can see what you are trying to code.

Why should you care?

The efficiency of calculating matrix multiplication is the backbone of everything. Everything as in rendering graphics and machine learning. Ever heard of Tensors? Yeah...everything is matrix multiplication I swear.

Some background

The main idea of using GPUs for computation is simple: get more work done in less time. Imagine you have an assignment with 4 math problems to solve, each problem taking 1 hour. You can spend 4 hours and do all 4 problems by yourself. But what if you have 3 other friends with the same assignment? Then you tell your friends to each solve 1 problem and you all share the solutions...because sharing is caring. This means that in 1 hour, your assignment would be finished.

To finish off this analogy, each one of your friends is a worker, or a unit of execution: a thread. When you have a lot of workers (threads) to manage, you might want to organize them somehow. Below is how threads are organized in CUDA terms.

  • Thread: a single unit of execution --- each thread has its own private memory called registers
  • Block: a group of threads --- all threads in a block have access to a common memory called shared memory
  • Grid: a group of blocks --- all threads in a grid have access to global memory and constant memory
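
To make this hierarchy concrete, here is a minimal sketch of a kernel that touches all three levels of memory (the kernel name memory_levels_demo and the block size of 32 threads are made up for illustration; they are not part of the tiled algorithm below):

    __global__ void memory_levels_demo(const float* input, float* output)
    {
        int local_value = threadIdx.x;      // register: private to this one thread
        __shared__ float tile[32];          // shared memory: visible to every thread in this block
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = input[idx];     // global memory: visible to every thread in the grid
        __syncthreads();                    // wait for the whole block before reading shared memory
        output[idx] = tile[threadIdx.x] + local_value;
    }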

Problem setup

Given a 4x4 input matrix A and a 4x4 input matrix B, I want to calculate the 4x4 output matrix C. Since C consists of 16 elements, where each element is computed through a dot product of a row of A and a column of B, let's launch 16 threads, where each thread calculates 1 output element. For the sake of this example, let's say the threads are organized into 2x2 blocks, and there are 4 blocks in the grid.
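
For concreteness, the launch configuration for this setup would look something like the lines below (a minimal sketch, assuming A, B, and C are already device pointers; the kernel itself appears in the Kernel code section):

    dim3 block(2, 2);   // each block is 2x2 = 4 threads
    dim3 grid(2, 2);    // the grid is 2x2 = 4 blocks, so 16 threads for the 16 output elements
    tile_matrix_multiply<<<grid, block>>>(A, B, C, 4);   // the width of the output C is 4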

Visualization

Let's see what each thread within each block is doing. From the visualization below, you can see that each thread is responsible for loading input elements into shared memory. Remember that shared memory is shared within a block. This means that each of the four threads in a block in this example can see what the other three threads loaded into shareA and shareB. You can see that we are essentially doing a mini matrix multiplication using shared memory, storing the partial result somewhere, and then continuing to accumulate the partial results of the next mini matrix multiplication. When all of the mini matrix multiplications are finished, each thread writes its accumulated result to the output element of C that it is mapped to. Keep in mind that we are only looking at the threads in one block; don't forget that all the threads in the other three blocks are also doing their version of the calculation AT THE SAME TIME. Just think about it...

Let's compare global memory accesses with and without tiling. A global memory access here means reading an element of either input A or input B.

Without tiling: In order to calculate one output element, a thread needs to access one entire row of input A and one entire column of input B to compute the dot product. In our example, that is 8 accesses per thread (4 elements from A and 4 from B).

With tiling: Each thread ends up loading only two elements from input A and two elements from input B, which totals 4 accesses per thread.

In general, the reduction in global memory accesses in tiled matrix multiplication is proportional to the dimension of the blocks used. This means that with blocks (tiles) of size NxN, the potential reduction in global memory traffic is a factor of N. So in our example, since we used 2x2 blocks, the number of global memory accesses with tiling is 1/2 of the number without tiling.
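
To make the "without tiling" case concrete, here is a minimal sketch of a naive (untiled) kernel, assuming the same square matrices and one thread per output element. Each thread walks an entire row of A and an entire column of B in global memory, which is where the 8 accesses per thread in our 4x4 example come from:

    __global__ void naive_matrix_multiply(float* A, float* B, float* C, int width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        float temp = 0;
        for (int k = 0; k < width; ++k) {
            temp += A[row*width + k] * B[k*width + col];   // 2 global memory reads per iteration
        }
        C[row*width + col] = temp;
    }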

Kernel code

Many details of the coding are abstracted away in the explanation above, since I want you to build an intuitive, visual understanding of what is going on. However, I will include the CUDA kernel code below if you are curious! Try to match each numbered line with the visualization!

  • In the code below, the width argument is the width of the output C.

  • You can see that the number "2" is used throughout the code; this matches our simple example above, as 2 is the block width, or tile width.

  • threadIdx is specific to a block, and blockIdx is specific to the grid. Since our matrix multiplication example has a two-dimensional output, it is easiest to organize the threads in 2D. So the four threads in a block are actually indexed as thread00, thread01, thread10, thread11, where the first and second numbers correspond to the thread's row index and column index within its block. The same goes for how each block (block00, block01, block10, block11) is indexed within the grid in this example.

  • __syncthreads() is a barrier synchronization call: no thread in the block can continue executing the remaining code until all threads in the block have reached that point. This is super important for the correctness of this algorithm.

    __global__ void tile_matrix_multiply(float* A, float* B, float* C, int width)
    {
        __shared__ float shareA[2][2];                    // 1: tile of A in shared memory
        __shared__ float shareB[2][2];                    // 2: tile of B in shared memory
        int bx = blockIdx.x; int by = blockIdx.y;         // 3: block index within the grid
        int tx = threadIdx.x; int ty = threadIdx.y;       // 4: thread index within the block
        int row = by * 2 + ty;                            // 5: row of C this thread computes
        int col = bx * 2 + tx;                            // 6: column of C this thread computes
        float temp = 0;                                   // 7: running dot-product sum
        for (int i = 0; i < width/2; ++i) {               // 8: loop over the tiles
            shareA[ty][tx] = A[row*width + (i*2 + tx)];   // 9: each thread loads one element of A
            shareB[ty][tx] = B[(i*2 + ty)*width + col];   // 10: each thread loads one element of B
            __syncthreads();                              // 11: wait until the whole tile is loaded
            for (int k = 0; k < 2; ++k) {                 // 12: mini matrix multiplication
                temp += shareA[ty][k] * shareB[k][tx];    // 13: accumulate using shared memory only
                __syncthreads();                          // 14: make sure all threads are done before the tile is reused
            }
        }
        C[row*width + col] = temp;                        // 15: write the result to global memory
    }
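
If you want to actually run the kernel, the host side would look something like the sketch below (error checking omitted; hostA, hostB, and hostC are assumed to be 4x4 float arrays on the CPU):

    // Allocate device memory, copy the inputs over, launch the kernel, copy the result back.
    int width = 4;
    size_t bytes = width * width * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hostA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hostB, bytes, cudaMemcpyHostToDevice);
    dim3 block(2, 2);                  // tile width 2, matching the hard-coded 2 in the kernel
    dim3 grid(width / 2, width / 2);   // enough 2x2 blocks to cover the 4x4 output
    tile_matrix_multiply<<<grid, block>>>(dA, dB, dC, width);
    cudaMemcpy(hostC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

To use a larger tile, you would replace the hard-coded 2 in the kernel (a TILE_WIDTH constant is the usual approach) and size the blocks and grid to match.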

Please do not hesitate to comment down below if you have any questions regarding the visualization or code!


Created by Penny Xu, © 2020
