Table of contents
- Supporting material
- Conclusions from the material
- sd-webui (A1111) speed comparison tests
- SDXL
- xformers, with ControlNet, SDXL
- SDPA (--opt-sdp-no-mem-attention), with ControlNet, SDXL
- SDPA (--opt-sdp-attention), with ControlNet, SDXL
- No xformers or SDPA, with ControlNet, SDXL
- No xformers or SDPA, plain txt2img, SDXL
- SDPA, plain txt2img without ControlNet
- SD1.5
- No xformers or SDPA, SD1.5 + hires fix 2x, plain 512 txt2img
- SDPA, SD1.5 + hires fix 2x, plain 512 txt2img
- No xformers or SDPA, SD1.5, plain 512 txt2img
- SDPA, SD1.5, plain 512 txt2img
- Other timings
- Conclusion
Supporting material
xformers can use FlashAttention v2:
https://github.com/facebookresearch/xformers/issues/795
https://github.com/vllm-project/vllm/issues/485
https://github.com/facebookresearch/xformers/issues/832
PyTorch supports Flash Attention 2.
Flash Attention 2 is an improved version of Flash Attention with higher performance and better parallelism. It was published in July 2023 and was integrated into PyTorch 2.2.
PyTorch 2.2, released in January 2024, includes the following Flash Attention 2 related updates:
- Updated the Flash Attention kernel to v2
- Support for Flash Attention 2 on the aarch64 platform
- Fixes for several known Flash Attention 2 issues
To use Flash Attention 2 you need PyTorch 2.2 or later. It is invoked through torch.nn.functional.scaled_dot_product_attention(), which dispatches to the Flash Attention 2 kernel when the inputs allow it.
Some resources on Flash Attention 2:
- PyTorch forum thread: https://discuss.pytorch.org/t/flash-attention/174955
- Flash Attention 2 paper: https://arxiv.org/abs/2307.08691
- Flash Attention 2 GitHub repository: https://github.com/Dao-AILab/flash-attention
https://github.com/pytorch/pytorch/pull/105602
Release notes: https://pytorch.org/blog/pytorch2-2/
https://pytorch.org/docs/2.2/generated/torch.nn.functional.scaled_dot_product_attention.html
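To make the operation concrete, here is a minimal pure-Python sketch of what scaled dot-product attention computes, namely softmax(QK^T / sqrt(d)) V. This is illustrative only: FlashAttention 2 produces the same result, just without ever materializing the full score matrix.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def sdpa(Q, K, V):
    """Reference scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Q, K, V are lists of rows (seq_len x d)."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        # One row of attention scores, scaled by 1/sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Output row is a convex combination of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Tiny example: 2 queries, 3 keys/values, head dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
O = sdpa(Q, K, V)
```

In PyTorch this is exactly `F.scaled_dot_product_attention(Q, K, V)`; the backend (FlashAttention 2, memory-efficient, or plain math) is selected automatically.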
Triton kernels:
https://pytorch.org/blog/pytorch2-3/
SDPA vs. xformers
https://github.com/huggingface/diffusers/issues/3793
F.scaled_dot_product_attention() is PyTorch's SDPA.
xformers.ops.memory_efficient_attention is the corresponding operator in xformers.
https://github.com/lucidrains/memory-efficient-attention-pytorch/blob/main/memory_efficient_attention_pytorch/memory_efficient_attention.py
https://github.com/facebookresearch/xformers/issues/950
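The key idea behind memory_efficient_attention (and FlashAttention) is to process keys/values chunk by chunk with a running online softmax, so the full attention matrix is never materialized. A small pure-Python sketch of that idea, for a single query vector (illustrative only, not the actual xformers kernel):

```python
import math

def attention_full(q, K, V):
    # Baseline: full attention for one query vector q.
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z
            for j in range(len(V[0]))]

def attention_chunked(q, K, V, chunk=2):
    # Memory-efficient attention: walk over keys/values in chunks,
    # keeping only a running max (m), normalizer (z) and weighted sum (acc).
    # Mathematically identical to the full version.
    d = len(q)
    m, z = float('-inf'), 0.0
    acc = [0.0] * len(V[0])
    for i in range(0, len(K), chunk):
        Kc, Vc = K[i:i + chunk], V[i:i + chunk]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in Kc]
        m_new = max(m, max(scores))
        # Rescale previous partial results to the new running max.
        correction = math.exp(m - m_new)
        z *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, Vc):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

Both functions return the same output; the chunked version only ever holds `chunk` scores at a time, which is where the memory saving comes from.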
sd-webui supports SDP:
https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/8367
https://qq742971636.blog.csdn.net/article/details/139772822
SDP attention is on par with xformers, and sometimes slightly faster:
PyTorch 2.0's attention is Flash Attention 1:
https://pytorch.org/docs/2.0/generated/torch.nn.functional.scaled_dot_product_attention.html
PyTorch 2.2's attention is Flash Attention 2:
https://pytorch.org/docs/2.2/generated/torch.nn.functional.scaled_dot_product_attention.html
Conclusions from the material
In PyTorch 2.2 and later, F.scaled_dot_product_attention() uses Flash Attention 2.
Recent versions of xformers include a comparable implementation.
sd-webui (A1111) speed comparison tests
See here for an explanation of the launch parameters:
https://qq742971636.blog.csdn.net/article/details/139772822
Test setup: IP-Adapter + ControlNet
PyTorch 2.3 + xformers 0.25
25 sampling steps
Prompt: In a snowy mountain range, the young man is dressed in winter attire, facing the camera with a determined gaze. He sports a thick wool coat, knit hat, and gloves to keep warm in the frigid temperatures. His eyes, piercing and resolute, reflect the strength and resolve needed to conquer the elements and the challenging terrain.
Negative prompt: paintings, sketches, worst quality, low quality, normal quality, lowres, blurry, text, logo, monochrome, grayscale, skin spots, acnes, skin blemishes, age spot, strabismus, wrong finger, bad anatomy, bad hands, error, missing fingers, cropped, jpeg artifacts, signature, watermark, username, dark skin, fused girls, fushion, bad feet, ugly, pregnant, vore, duplicate, morbid, mutilated, transexual, hermaphrodite, long neck, mutated hands, poorly drawn face, mutation, deformed, bad proportions, malformed limbs, extra limbs, cloned face, disfigured, gross proportions, missing arms, missing legs, extra arms, extra legs, plump, open mouth, tooth, teeth, nsfw,
SDXL
xformers, with ControlNet, SDXL
xformers:
./webui.sh --enable-insecure-extension-access --skip-python-version-check --skip-torch-cuda-test --listen --port 7860 --no-download-sd-model --api --no-half-vae --xformers
Speed:
Time taken: 11.5 sec.
A: 13.29 GB, R: 16.77 GB, Sys: 18.5/39.3945 GB (47.0%)
SDPA (--opt-sdp-no-mem-attention), with ControlNet, SDXL
SDPA:
./webui.sh --enable-insecure-extension-access --skip-python-version-check --skip-torch-cuda-test --listen --port 7860 --no-download-sd-model --api --no-half-vae --opt-sdp-no-mem-attention
Time taken: 11.1 sec.
A: 13.29 GB, R: 14.81 GB, Sys: 16.6/39.3945 GB (42.1%)
SDPA (--opt-sdp-attention), with ControlNet, SDXL
SDPA:
./webui.sh --enable-insecure-extension-access --skip-python-version-check --skip-torch-cuda-test --listen --port 7860 --no-download-sd-model --api --no-half-vae --opt-sdp-attention
Time taken: 11.4 sec.
A: 13.29 GB, R: 14.81 GB, Sys: 16.6/39.3945 GB (42.1%)
No xformers or SDPA, with ControlNet, SDXL
Time taken: 13.3 sec.
A: 13.28 GB, R: 15.39 GB, Sys: 17.1/39.3945 GB (43.5%)
No xformers or SDPA, plain txt2img, SDXL
Time taken: 10.1 sec.
A: 10.27 GB, R: 12.45 GB, Sys: 13.0/39.3945 GB (33.0%)
SDPA, plain txt2img without ControlNet, generation time
Time taken: 6.7 sec.
A: 10.29 GB, R: 11.89 GB, Sys: 12.5/39.3945 GB (31.7%)
SD1.5
No xformers or SDPA, SD1.5 + hires fix 2x, plain 512 txt2img
Time taken: 10.7 sec.
A: 10.37 GB, R: 10.49 GB, Sys: 11.1/39.3945 GB (28.1%)
SDPA, SD1.5 + hires fix 2x, plain 512 txt2img
Time taken: 6.2 sec.
A: 5.75 GB, R: 7.05 GB, Sys: 7.7/39.3945 GB (19.4%)
No xformers or SDPA, SD1.5, plain 512 txt2img
Time taken: 3.1 sec.
A: 3.11 GB, R: 3.46 GB, Sys: 3.4/39.3945 GB (8.6%)
SDPA, SD1.5, plain 512 txt2img
Time taken: 2.3 sec.
A: 3.13 GB, R: 4.07 GB, Sys: 3.7/39.3945 GB (9.3%)
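Putting the wall-clock times above side by side, the speedup from enabling SDPA can be computed directly. A quick back-of-the-envelope calculation using the numbers reported in the runs above:

```python
# Wall-clock times in seconds from the runs above: (no acceleration, with SDPA)
runs = {
    "SDXL + ControlNet":       (13.3, 11.1),
    "SDXL plain txt2img":      (10.1, 6.7),
    "SD1.5 + hires fix 2x":    (10.7, 6.2),
    "SD1.5 plain 512 txt2img": (3.1, 2.3),
}

def pct_faster(base, accel):
    # Percentage of wall-clock time saved by enabling SDPA.
    return 100.0 * (base - accel) / base

for name, (base, accel) in runs.items():
    print(f"{name}: {base}s -> {accel}s ({pct_faster(base, accel):.0f}% faster)")
```

The saving is largest where attention dominates the workload (plain txt2img, around 34-42% here) and smaller when ControlNet adds fixed overhead (around 17%).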
Other timings
Portrait set, four images, A100: 50.0 seconds
Portrait, A10, 1 image, full generate + face-swap pipeline: 25 seconds
Portrait, A10, 2 images, full generate + face-swap pipeline: 46 seconds
aicy image generation, excluding LLM time: 3.3 seconds
aicy image generation, including LLM time: 5.2 seconds
Conclusion
Recent versions of xformers, Flash Attention 2, and PyTorch's SDPA all run at roughly the same speed. If you install PyTorch 2.2 or later and enable SDPA (--opt-sdp-no-mem-attention), there is no need to install xformers at all.