[Deep Learning] Attention Mechanisms (Part 3)

This article presents implementations of several attention mechanisms: EMHSA, SA, SGE, AFT, and Outlook Attention.

[Deep Learning] Attention Mechanisms (Part 1)

[Deep Learning] Attention Mechanisms (Part 2)

Contents

1. EMHSA (Efficient Multi-Head Self-Attention)

2. SA (Shuffle Attention)

3. SGE (Spatial Group-wise Enhance)

4. AFT (Attention Free Transformer)

5. Outlook Attention


1. EMHSA (Efficient Multi-Head Self-Attention)

Paper: (paper link)

EMHSA reduces the cost of standard multi-head self-attention by downsampling the keys and values with a depth-wise strided convolution before computing attention (when ratio > 1), and can additionally re-normalize the attention map across heads with a 1x1 convolution, softmax, and instance normalization. The structure is shown in the figure below:

Code (source link):

import numpy as np
import torch
from torch import nn
from torch.nn import init



class EMSA(nn.Module):

    def __init__(self, d_model, d_k, d_v, h, dropout=.1, H=7, W=7, ratio=3, apply_transform=True):

        super(EMSA, self).__init__()
        self.H = H
        self.W = W
        self.fc_q = nn.Linear(d_model, h * d_k)
        self.fc_k = nn.Linear(d_model, h * d_k)
        self.fc_v = nn.Linear(d_model, h * d_v)
        self.fc_o = nn.Linear(h * d_v, d_model)
        self.dropout = nn.Dropout(dropout)

        # spatial reduction: a depth-wise strided conv shrinks the key/value feature map by `ratio`
        self.ratio = ratio
        if self.ratio > 1:
            self.sr = nn.Sequential()
            self.sr_conv = nn.Conv2d(d_model, d_model, kernel_size=ratio + 1, stride=ratio,
                                     padding=ratio // 2, groups=d_model)
            self.sr_ln = nn.LayerNorm(d_model)

        # optional transform of the attention map across heads: 1x1 conv + softmax + instance norm
        self.apply_transform = apply_transform and h > 1
        if self.apply_transform:
            self.transform = nn.Sequential()
            self.transform.add_module('conv', nn.Conv2d(h, h, kernel_size=1, stride=1))
            self.transform.add_module('softmax', nn.Softmax(-1))
            self.transform.add_module('in', nn.InstanceNorm2d(h))

        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v
        self.h = h

        self.init_weights()


    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                init.constant_(m.weight, 1)
                init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                init.normal_(m.weight, std=0.001)
                if m.bias is not None:
                    init.constant_(m.bias, 0)

    def forward(self, queries, keys, values, attention_mask=None, attention_weights=None):

        b_s, nq ,c = queries.shape
        nk = keys.shape[1]

        q = self.fc_q(queries).view(b_s, nq, self.h, self.d_k).permute(0, 2, 1, 3)  # (b_s, h, nq, d_k)

        if self.ratio > 1:
            x = queries.permute(0, 2, 1).view(b_s, c, self.H, self.W)  # (b_s, c, H, W)
            x = self.sr_conv(x)  # depth-wise strided conv: (b_s, c, h', w')
            x = x.contiguous().view(b_s, c, -1).permute(0, 2, 1)  # (b_s, n', c)
            x = self.sr_ln(x)
            k = self.fc_k(x).view(b_s, -1, self.h, self.d_k).permute(0, 2, 3, 1)  # (b_s, h, d_k, n')
            v = self.fc_v(x).view(b_s, -1, self.h, self.d_v).permute(0, 2, 1, 3)  # (b_s, h, n', d_v)
        else:
            k = self.fc_k(keys).view(b_s, nk, self.h, self.d_k).permute(0, 2, 3, 1)  # (b_s, h, d_k, nk)
            v = self.fc_v(values).view(b_s, nk, self.h, self.d_v).permute(0, 2, 1, 3)  # (b_s, h, nk, d_v)

        att = torch.matmul(q, k) / np.sqrt(self.d_k)  # (b_s, h, nq, n')
        if self.apply_transform:
            att = self.transform(att)  # (b_s, h, nq, n')
        else:
            att = torch.softmax(att, -1)  # (b_s, h, nq, n')


        if attention_weights is not None:
            att = att * attention_weights
        if attention_mask is not None:
            att = att.masked_fill(attention_mask, -np.inf)

        att = self.dropout(att)

        out = torch.matmul(att, v).permute(0, 2, 1, 3).contiguous().view(b_s, nq, self.h * self.d_v)  # (b_s, nq, h*d_v)
        out = self.fc_o(out)  # (b_s, nq, d_model)
        return out
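
A minimal usage sketch (the shapes here are my own assumption, not from the original post): with the default H=7 and W=7, the number of query tokens should be 49.

if __name__ == '__main__':
    x = torch.randn(2, 49, 512)               # (batch, n_tokens, d_model), 49 = 7 * 7
    emsa = EMSA(d_model=512, d_k=64, d_v=64, h=8, H=7, W=7, ratio=2)
    out = emsa(x, x, x)                        # self-attention: queries = keys = values
    print(out.shape)                           # expected: torch.Size([2, 49, 512])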

2. SA (Shuffle Attention)

Paper: (paper link)

Shuffle Attention divides the channels into G groups, splits each group in half, applies channel attention (global average pooling with a learned scale and bias) to one half and spatial attention (GroupNorm with a learned scale and bias) to the other, then concatenates the two halves and mixes information across groups with a channel shuffle. The structure is shown in the figure below:

The code is as follows (source link):

import numpy as np
import torch
from torch import nn
from torch.nn import init
from torch.nn.parameter import Parameter


class ShuffleAttention(nn.Module):

    def __init__(self, channel=512, reduction=16, G=8):
        super().__init__()
        self.G = G
        self.channel = channel
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.gn = nn.GroupNorm(channel // (2 * G), channel // (2 * G))
        self.cweight = Parameter(torch.zeros(1, channel // (2 * G), 1, 1))
        self.cbias = Parameter(torch.ones(1, channel // (2 * G), 1, 1))
        self.sweight = Parameter(torch.zeros(1, channel // (2 * G), 1, 1))
        self.sbias = Parameter(torch.ones(1, channel // (2 * G), 1, 1))
        self.sigmoid = nn.Sigmoid()


    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                init.constant_(m.weight, 1)
                init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                init.normal_(m.weight, std=0.001)
                if m.bias is not None:
                    init.constant_(m.bias, 0)


    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        x = x.reshape(b, groups, -1, h, w)
        x = x.permute(0, 2, 1, 3, 4)

        # flatten
        x = x.reshape(b, -1, h, w)

        return x

    def forward(self, x):
        b, c, h, w = x.size()
        # group into subfeatures
        x = x.view(b * self.G, -1, h, w)  # (bs*G, c//G, h, w)

        # channel split
        x_0, x_1 = x.chunk(2, dim=1)  # (bs*G, c//(2*G), h, w)

        # channel attention
        x_channel = self.avg_pool(x_0)  # (bs*G, c//(2*G), 1, 1)
        x_channel = self.cweight * x_channel + self.cbias  # (bs*G, c//(2*G), 1, 1)
        x_channel = x_0 * self.sigmoid(x_channel)

        # spatial attention
        x_spatial = self.gn(x_1)  # (bs*G, c//(2*G), h, w)
        x_spatial = self.sweight * x_spatial + self.sbias  # (bs*G, c//(2*G), h, w)
        x_spatial = x_1 * self.sigmoid(x_spatial)  # (bs*G, c//(2*G), h, w)

        # concatenate along the channel axis
        out = torch.cat([x_channel, x_spatial], dim=1)  # (bs*G, c//G, h, w)
        out = out.contiguous().view(b, -1, h, w)

        # channel shuffle
        out = self.channel_shuffle(out, 2)
        return out
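
A minimal usage sketch (input shape assumed by me, not from the original post); channel must be divisible by 2*G:

if __name__ == '__main__':
    x = torch.randn(2, 512, 7, 7)             # (batch, channels, height, width)
    sa = ShuffleAttention(channel=512, G=8)
    out = sa(x)
    print(out.shape)                           # expected: torch.Size([2, 512, 7, 7])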

3. SGE (Spatial Group-wise Enhance)

Paper: Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks

SGE divides the channels into groups and, within each group, computes the similarity between every spatial position and the group's global average-pooled descriptor; after normalization and a learned per-group scale and shift, a sigmoid of this similarity map re-weights each position. The structure is shown in the figure below:

The code is as follows (source link):

import torch
import torch.nn as nn
from torch.nn import Parameter  # needed for the learned per-group weight and bias below


class SpatialGroupEnhance(nn.Module):
    def __init__(self, groups = 64):
        super(SpatialGroupEnhance, self).__init__()
        self.groups   = groups
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.weight   = Parameter(torch.zeros(1, groups, 1, 1))
        self.bias     = Parameter(torch.ones(1, groups, 1, 1))
        self.sig      = nn.Sigmoid()

    def forward(self, x): # (b, c, h, w)
        b, c, h, w = x.size()
        x = x.view(b * self.groups, -1, h, w)   # split channels into groups
        xn = x * self.avg_pool(x)               # similarity with the group's global descriptor
        xn = xn.sum(dim=1, keepdim=True)        # (b*groups, 1, h, w) similarity map
        t = xn.view(b * self.groups, -1)
        t = t - t.mean(dim=1, keepdim=True)     # normalize over spatial positions
        std = t.std(dim=1, keepdim=True) + 1e-5
        t = t / std
        t = t.view(b, self.groups, h, w)
        t = t * self.weight + self.bias         # learned per-group scale and shift
        t = t.view(b * self.groups, 1, h, w)
        x = x * self.sig(t)                     # re-weight each spatial position
        x = x.view(b, c, h, w)
        return x
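
A minimal usage sketch (shapes assumed by me); the channel count must be divisible by groups:

if __name__ == '__main__':
    x = torch.randn(2, 512, 7, 7)             # (batch, channels, height, width)
    sge = SpatialGroupEnhance(groups=64)      # 512 channels -> 8 channels per group
    out = sge(x)
    print(out.shape)                           # expected: torch.Size([2, 512, 7, 7])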

4. AFT (Attention Free Transformer)

Paper: An Attention Free Transformer

AFT removes dot-product attention entirely: each output is the element-wise sigmoid of the query gating a weighted average of the values, with weights given by exp(key + learned pairwise position bias). AFT-Simple drops the position bias, and AFT-Local keeps it only inside a window of size s. The structure is shown in the figure below:

The code is as follows (source link):

import torch, math
from torch import nn, einsum
import torch.nn.functional as F    

class AFTFull(nn.Module):
    def __init__(self, max_seqlen, dim, hidden_dim=64):
        super().__init__()
        '''
        max_seqlen: the maximum number of timesteps (sequence length) to be fed in
        dim: the embedding dimension of the tokens
        hidden_dim: the hidden dimension used inside AFT Full

        Number of heads is 1 as done in the paper
        '''
        self.dim = dim
        self.hidden_dim = hidden_dim
        self.to_q = nn.Linear(dim, hidden_dim)
        self.to_k = nn.Linear(dim, hidden_dim)
        self.to_v = nn.Linear(dim, hidden_dim)
        self.project = nn.Linear(hidden_dim, dim)
        self.wbias = nn.Parameter(torch.Tensor(max_seqlen, max_seqlen))
        nn.init.xavier_uniform_(self.wbias)

    def forward(self, x):
        B, T, _ = x.shape
        Q = self.to_q(x).view(B, T, self.hidden_dim)
        K = self.to_k(x).view(B, T, self.hidden_dim)
        V = self.to_v(x).view(B, T, self.hidden_dim)
        temp_wbias = self.wbias[:T, :T].unsqueeze(0) # sequences can still be variable length

        '''
        From the paper
        '''
        Q_sig = torch.sigmoid(Q)
        temp = torch.exp(temp_wbias) @ torch.mul(torch.exp(K), V)
        weighted = temp / (torch.exp(temp_wbias) @ torch.exp(K))
        Yt = torch.mul(Q_sig, weighted)

        Yt = Yt.view(B, T, self.hidden_dim)
        Yt = self.project(Yt)

        return Yt

class AFTSimple(nn.Module):
    def __init__(self, max_seqlen, dim, hidden_dim=64):
        super().__init__()
        '''
        max_seqlen: the maximum number of timesteps (sequence length) to be fed in
        dim: the embedding dimension of the tokens
        hidden_dim: the hidden dimension used inside AFT Simple
        
        Number of Heads is 1 as done in the paper.
        '''
        self.dim = dim
        self.hidden_dim = hidden_dim
        self.to_q = nn.Linear(dim, hidden_dim)
        self.to_k = nn.Linear(dim, hidden_dim)
        self.to_v = nn.Linear(dim, hidden_dim)
        self.project = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        B, T, _ = x.shape
        Q = self.to_q(x).view(B, T, self.hidden_dim)
        K = self.to_k(x).view(B, T, self.hidden_dim)
        V = self.to_v(x).view(B, T, self.hidden_dim)

        '''
        From the paper
        '''
        weights = torch.mul(torch.softmax(K, 1), V).sum(dim=1, keepdim=True)
        Q_sig = torch.sigmoid(Q)
        Yt = torch.mul(Q_sig, weights)

        Yt = Yt.view(B, T, self.hidden_dim)
        Yt = self.project(Yt)

        return Yt

class AFTLocal(nn.Module):
    def __init__(self, max_seqlen, dim, hidden_dim=64, s=256):
        super().__init__()
        '''
        max_seqlen: the maximum number of timesteps (sequence length) to be fed in
        dim: the embedding dimension of the tokens
        hidden_dim: the hidden dimension used inside AFT Local
        s: the window size used for AFT-Local in the paper

        Number of heads is 1 as done in the paper
        '''
        self.dim = dim
        self.hidden_dim = hidden_dim
        self.to_q = nn.Linear(dim, hidden_dim)
        self.to_k = nn.Linear(dim, hidden_dim)
        self.to_v = nn.Linear(dim, hidden_dim)
        self.project = nn.Linear(hidden_dim, dim)
        self.wbias = nn.Parameter(torch.Tensor(max_seqlen, max_seqlen))
        self.max_seqlen = max_seqlen
        self.s = s
        nn.init.xavier_uniform_(self.wbias)


    def forward(self, x):
        B, T, _ = x.shape
        Q = self.to_q(x).view(B, T, self.hidden_dim)
        K = self.to_k(x).view(B, T, self.hidden_dim)
        V = self.to_v(x).view(B, T, self.hidden_dim)
        # mask out biases outside the local window of size s without re-creating the Parameter
        idx = torch.arange(self.max_seqlen, device=self.wbias.device)
        local_mask = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() < self.s
        wbias = self.wbias * local_mask.to(self.wbias.dtype)
        temp_wbias = wbias[:T, :T].unsqueeze(0)  # sequences can still be variable length

        '''
        From the paper
        '''
        Q_sig = torch.sigmoid(Q)
        temp = torch.exp(temp_wbias) @ torch.mul(torch.exp(K), V)
        weighted = temp / (torch.exp(temp_wbias) @ torch.exp(K))
        Yt = torch.mul(Q_sig, weighted)

        Yt = Yt.view(B, T, self.hidden_dim)
        Yt = self.project(Yt)

        return Yt
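
A minimal usage sketch for the three variants (shapes assumed by me, not from the original post):

if __name__ == '__main__':
    x = torch.randn(2, 49, 512)               # (batch, seq_len, dim)
    for layer in (AFTFull(max_seqlen=49, dim=512, hidden_dim=64),
                  AFTSimple(max_seqlen=49, dim=512, hidden_dim=64),
                  AFTLocal(max_seqlen=49, dim=512, hidden_dim=64, s=7)):
        print(layer(x).shape)                 # expected: torch.Size([2, 49, 512]) for each variant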

5. Outlook Attention

Paper: VOLO: Vision Outlooker for Visual Recognition

Outlook Attention generates the attention weights for each local k×k window directly from the center token with a linear layer (instead of query-key dot products), uses them to aggregate the unfolded value windows, and folds the results back to the original resolution. The structure is shown in the figure below:

The code is as follows (source link):

import math

import torch
import torch.nn as nn
import torch.nn.functional as F  # F.fold is used to stitch the local windows back together


class OutlookAttention(nn.Module):
    """
    Implementation of outlook attention
    --dim: hidden dim
    --num_heads: number of heads
    --kernel_size: kernel size in each window for outlook attention
    return: token features after outlook attention
    """

    def __init__(self, dim, num_heads, kernel_size=3, padding=1, stride=1,
                 qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        head_dim = dim // num_heads
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.padding = padding
        self.stride = stride
        self.scale = qk_scale or head_dim**-0.5

        self.v = nn.Linear(dim, dim, bias=qkv_bias)
        self.attn = nn.Linear(dim, kernel_size**4 * num_heads)

        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        self.unfold = nn.Unfold(kernel_size=kernel_size, padding=padding, stride=stride)
        self.pool = nn.AvgPool2d(kernel_size=stride, stride=stride, ceil_mode=True)

    def forward(self, x):
        B, H, W, C = x.shape

        v = self.v(x).permute(0, 3, 1, 2)  # B, C, H, W

        h, w = math.ceil(H / self.stride), math.ceil(W / self.stride)
        v = self.unfold(v).reshape(B, self.num_heads, C // self.num_heads,
                                   self.kernel_size * self.kernel_size,
                                   h * w).permute(0, 1, 4, 3, 2)  # B,H,N,kxk,C/H

        attn = self.pool(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        attn = self.attn(attn).reshape(
            B, h * w, self.num_heads, self.kernel_size * self.kernel_size,
            self.kernel_size * self.kernel_size).permute(0, 2, 1, 3, 4)  # B,H,N,kxk,kxk
        attn = attn * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).permute(0, 1, 4, 3, 2).reshape(
            B, C * self.kernel_size * self.kernel_size, h * w)
        x = F.fold(x, output_size=(H, W), kernel_size=self.kernel_size,
                   padding=self.padding, stride=self.stride)

        x = self.proj(x.permute(0, 2, 3, 1))
        x = self.proj_drop(x)

        return x
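
A minimal usage sketch (shapes assumed by me); note that the input is channels-last:

if __name__ == '__main__':
    x = torch.randn(2, 14, 14, 512)           # (batch, H, W, dim)
    outlook = OutlookAttention(dim=512, num_heads=8, kernel_size=3, padding=1, stride=1)
    out = outlook(x)
    print(out.shape)                           # expected: torch.Size([2, 14, 14, 512])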
