How do you write a faster memset/memcpy?

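Answer from user ling-29-85:

I wrote the latest memcpy and memset in glibc. Our code also takes prefetching and alignment into account (you can see this clearly with -D after compiling) and reduces branch mispredictions to improve performance. I hope this helps.

The source code is here: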

a. memcpy:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=commitdiff;h=05f3633da4f9df870d04dd77336e793746e57ed4

b. memset:

https://sourceware.org/ml/libc-alpha/2014-06/msg00480.html

c. memcpy in Linux Kernel

[tip:core/locking] x86, mem: Optimize memcpy by avoiding memory false dependece

Linus Torvalds's comment: "The code looks clever and nice"!

More of the glibc function code I wrote earlier:

http://blog.csdn.net/linguranus/archive/2011/02/16/6189676.aspx

The code mainly deals with memory read/write optimization, branching, instruction alignment, and instruction misses.

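Even though we optimize memcpy/memset and other glibc functions, any data movement is a very heavy burden on the system as a whole (RC causes latency). Frankly, if memcpy/memset turns out to be the performance bottleneck in our program, the program is indirectly telling us that its architecture is wrong. Consider referencing the data instead, for example zero-copy. If the content is short (say, within 64 bytes), just assign it directly as needed, which avoids the extra cost of branch mispredictions, rather than putting more effort into optimizing these functions.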
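As a small illustration of the "just assign it" advice for short data, the sketch below copies a 64-byte record with plain struct assignment instead of calling memcpy with a runtime length. The `record64` type and the `copy_record` helper are hypothetical names introduced here, and what the compiler actually emits for the assignment depends on the compiler and its flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-byte record; the layout is an assumption for illustration. */
typedef struct {
    uint64_t words[8];
} record64;

/* Copy by plain assignment. The size is a compile-time constant, so the
 * compiler is free to emit a few straight-line moves with no length
 * dispatch and no data-dependent branch. */
static inline void copy_record(record64 *dst, const record64 *src)
{
    assert(dst != 0 && src != 0);
    *dst = *src;
}
```

The same idea applies to any small type whose size is known at compile time; for variable-length data it does not apply.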

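Perhaps the point of learning is to understand how things work and then avoid creating problems in our work, rather than to optimize ever further.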

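Answer from user skywind3000:

I have written this kind of code. Across different copy lengths, with aligned and unaligned addresses, the final implementation is on average 40% faster than memcpy (gcc 4.9, VC 2012). The main optimizations are: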

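Size-based strategy: up to 64 bytes uses the small-copy path, up to 64K uses the medium-size path, and anything above 64K uses the large-copy path.
Jump table: for small copies of different sizes, jump straight to the matching address, which unrolls the loop.
Destination alignment: for copies above 64 bytes, first copy a few bytes with the jump-table method so that the destination address is aligned, which makes the rest easier.
Vector copy: load N vectors into SSE2 registers in parallel in one go, then write them out in parallel.
Cache prefetch: use prefetchnta to fetch data ahead of time, so that it is already in place when it is actually needed.
Non-temporal writes: use movntdq to write straight to memory and avoid polluting the cache.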
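To make the list above concrete, here is a minimal sketch in C of how those pieces could fit together. It is not the FastMemcpy code. The names `memcpy_sketch`, `copy_tail` and `SKETCH_LARGE` are made up for this example, the 256-byte prefetch distance is arbitrary, and using non-temporal stores only above 64K is an assumption of the sketch. Without SSE2 it falls back to fixed 16-byte block copies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics; also pulls in _mm_prefetch and _mm_sfence */
#define SKETCH_HAVE_SSE2 1
#endif

#define SKETCH_LARGE ((size_t)64 * 1024) /* "large" threshold taken from the list above */

/* Copy n < 16 bytes. The switch has no loop; compilers usually lower it to
 * a jump table, and every case falls through to the ones below it. */
static void copy_tail(unsigned char *d, const unsigned char *s, size_t n)
{
    assert(n < 16);
    switch (n) {
    case 15: d[14] = s[14]; /* fall through */
    case 14: d[13] = s[13]; /* fall through */
    case 13: d[12] = s[12]; /* fall through */
    case 12: d[11] = s[11]; /* fall through */
    case 11: d[10] = s[10]; /* fall through */
    case 10: d[9] = s[9];   /* fall through */
    case 9:  d[8] = s[8];   /* fall through */
    case 8:  d[7] = s[7];   /* fall through */
    case 7:  d[6] = s[6];   /* fall through */
    case 6:  d[5] = s[5];   /* fall through */
    case 5:  d[4] = s[4];   /* fall through */
    case 4:  d[3] = s[3];   /* fall through */
    case 3:  d[2] = s[2];   /* fall through */
    case 2:  d[1] = s[1];   /* fall through */
    case 1:  d[0] = s[0];   /* fall through */
    default: break;
    }
}

static void *memcpy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
#ifdef SKETCH_HAVE_SSE2
    if (n >= 64) {
        /* Destination alignment: copy 0..15 head bytes so d is 16-byte aligned. */
        size_t head = (16 - ((uintptr_t)d & 15)) & 15;
        copy_tail(d, s, head);
        d += head; s += head; n -= head;
        int stream = n > SKETCH_LARGE; /* assumption: bypass the cache only for large copies */
        for (; n >= 64; n -= 64, d += 64, s += 64) {
            if (n > 256) /* keep the prefetch address inside the source buffer */
                _mm_prefetch((const char *)(s + 256), _MM_HINT_NTA);
            /* Vector copy: issue all four loads first, then all four stores. */
            __m128i x0 = _mm_loadu_si128((const __m128i *)(s + 0));
            __m128i x1 = _mm_loadu_si128((const __m128i *)(s + 16));
            __m128i x2 = _mm_loadu_si128((const __m128i *)(s + 32));
            __m128i x3 = _mm_loadu_si128((const __m128i *)(s + 48));
            if (stream) { /* movntdq: non-temporal stores */
                _mm_stream_si128((__m128i *)(d + 0), x0);
                _mm_stream_si128((__m128i *)(d + 16), x1);
                _mm_stream_si128((__m128i *)(d + 32), x2);
                _mm_stream_si128((__m128i *)(d + 48), x3);
            } else {      /* movdqa: ordinary aligned stores */
                _mm_store_si128((__m128i *)(d + 0), x0);
                _mm_store_si128((__m128i *)(d + 16), x1);
                _mm_store_si128((__m128i *)(d + 32), x2);
                _mm_store_si128((__m128i *)(d + 48), x3);
            }
        }
        if (stream)
            _mm_sfence(); /* order the non-temporal stores before returning */
    }
#endif
    for (; n >= 16; n -= 16, d += 16, s += 16)
        memcpy(d, s, 16); /* fixed size: compilers typically inline this */
    copy_tail(d, s, n);
    return dst;
}
```

Like memcpy, the sketch assumes the source and destination do not overlap. A real implementation would tune the thresholds and the prefetch distance per machine, which this sketch does not attempt.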

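Test results

For a range of memory sizes (from 32 bytes to 8MB), each copy is repeated a number of times, with the destination address and the source address each tested both aligned and unaligned. The times are shown next to those of memcpy: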

result: gcc4.9 (msvc 2012 got a similar result):

benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=180ms memcpy=249 ms
result(dst aligned, src unalign): memcpy_fast=170ms memcpy=271 ms
result(dst unalign, src aligned): memcpy_fast=179ms memcpy=269 ms
result(dst unalign, src unalign): memcpy_fast=180ms memcpy=260 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=162ms memcpy=300 ms
result(dst aligned, src unalign): memcpy_fast=199ms memcpy=328 ms
result(dst unalign, src aligned): memcpy_fast=410ms memcpy=339 ms
result(dst unalign, src unalign): memcpy_fast=390ms memcpy=361 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=160ms memcpy=241 ms
result(dst aligned, src unalign): memcpy_fast=200ms memcpy=519 ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=509 ms
result(dst unalign, src unalign): memcpy_fast=311ms memcpy=520 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=145ms memcpy=179 ms
result(dst aligned, src unalign): memcpy_fast=180ms memcpy=430 ms
result(dst unalign, src aligned): memcpy_fast=245ms memcpy=430 ms
result(dst unalign, src unalign): memcpy_fast=230ms memcpy=455 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=80ms memcpy=80 ms
result(dst aligned, src unalign): memcpy_fast=110ms memcpy=205 ms
result(dst unalign, src aligned): memcpy_fast=110ms memcpy=224 ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=200 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=70ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=100ms memcpy=222 ms
result(dst unalign, src aligned): memcpy_fast=100ms memcpy=210 ms
result(dst unalign, src unalign): memcpy_fast=100ms memcpy=230 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=200ms memcpy=201 ms
result(dst aligned, src unalign): memcpy_fast=260ms memcpy=270 ms
result(dst unalign, src aligned): memcpy_fast=263ms memcpy=361 ms
result(dst unalign, src unalign): memcpy_fast=267ms memcpy=321 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=391 ms
result(dst aligned, src unalign): memcpy_fast=265ms memcpy=407 ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=453 ms
result(dst unalign, src unalign): memcpy_fast=282ms memcpy=439 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=266ms memcpy=422 ms
result(dst aligned, src unalign): memcpy_fast=250ms memcpy=407 ms
result(dst unalign, src aligned): memcpy_fast=297ms memcpy=516 ms
result(dst unalign, src unalign): memcpy_fast=281ms memcpy=436 ms

benchmark random access:
memcpy_fast=594ms memcpy=1161ms
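For readers who want to reproduce this kind of measurement, the following is a rough sketch of a timing loop with the same shape as the benchmark described above: a fixed size, a repeat count, and an optional one-byte offset to make the destination or source unaligned. It is not the benchmark from the FastMemcpy repository. `bench_copy_ms`, `align64` and `copy_fn` are hypothetical names, and `clock()` is a coarse timer chosen only for portability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any function with memcpy's signature can be timed. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Round p up to the next 64-byte boundary (p itself if already aligned). */
static unsigned char *align64(unsigned char *p)
{
    return p + ((64 - ((uintptr_t)p & 63)) & 63);
}

/* Time `times` copies of `size` bytes and return the elapsed CPU time in
 * milliseconds, or -1.0 if allocation fails. An offset of 0 gives an
 * aligned address, an offset of 1 gives an unaligned one. */
static double bench_copy_ms(copy_fn fn, size_t size, long times,
                            size_t dst_off, size_t src_off)
{
    assert(size > 0 && dst_off < 64 && src_off < 64);
    unsigned char *raw_d = malloc(size + 128);
    unsigned char *raw_s = malloc(size + 128);
    double ms = -1.0;
    if (raw_d && raw_s) {
        unsigned char *d = align64(raw_d) + dst_off;
        unsigned char *s = align64(raw_s) + src_off;
        volatile unsigned char sink = 0;
        memset(s, 0x5a, size);
        clock_t t0 = clock();
        for (long i = 0; i < times; i++) {
            fn(d, s, size);
            sink ^= d[0]; /* read the result so the copy cannot be optimized away */
        }
        ms = 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;
        (void)sink;
    }
    free(raw_d);
    free(raw_s);
    return ms;
}
```

A call such as `bench_copy_ms(memcpy, 32, 16777216, 0, 1)` would correspond to the "dst aligned, src unalign" row of the 32-byte case. Absolute numbers will differ by machine, compiler and standard library.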

Implementation

skywind3000/FastMemcpy · GitHub

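Note: if you use VC, build in Release with optimization turned on. Testing under Debug is meaningless.

Discussion

All I can say is that this code from 2012 is still that much faster than mainstream memcpy implementations, so it is no wonder that many performance-minded projects rewrite memcpy themselves. The standard libraries on the various platforms may well optimize further in the future. Until then, some projects still choose to do their own optimization so as not to depend too heavily on the standard library.

Finally the author of glibc memcpy, @Ling (sorry, I couldn't @ you), has come to answer. I reached the "40% faster than glibc memcpy" conclusion from tests on only a handful of machines. Ling may well have considered more cases and tested more widely than I did, so please verify on your own machines.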

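References

Cache optimization: "Using Block Prefetch for Optimized Memory Performance"

This article only covers large, aligned memory copies. In practice there are more cases that you have to handle by hand.

Skywind Inside - Memory Copy Optimization (1): Small Memory Copy Optimization
Skywind Inside - Memory Copy Optimization (2): Full-Size Copy Optimization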

---------

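Update, December 20: further optimization

Changed the small-copy path: the limit grew from 64 bytes to 128 bytes and the copy unit changed from int to xmm, which improved small-copy performance by 80%.
Changed the medium-copy path: from copying with 4 xmm registers in parallel to 8 in parallel plus prefetch, an improvement of about 20%.
Removed the branch that checked destination alignment: a single xmm copy now aligns the destination, which improved performance by 10%.
Added a test case: to be closer to real use, added a random-access test that copies random lengths at random positions within a 10MB region (definitely larger than L2). To keep random-number generation from affecting the result, all random numbers are generated in advance.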
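The idea of aligning the destination with a single copy instead of a branch can be sketched as follows. This is an illustration under assumptions, not the code from the update: `copy_align_once` is a hypothetical name, a fixed 16-byte `memcpy` stands in for one unaligned xmm load and store, and finishing with an overlapping 16-byte tail copy is an extra choice made here to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n >= 16 bytes between non-overlapping buffers. The first 16 bytes
 * are always copied with one unaligned block copy, then dst jumps to the
 * next 16-byte boundary. Bytes between that boundary and dst + 16 get
 * written twice with the same values, so no branch on alignment is needed. */
static void *copy_align_once(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d_end = d + n;
    const unsigned char *s_end = s + n;
    assert(n >= 16);

    memcpy(d, s, 16); /* stands in for one unaligned xmm load + store */

    size_t skip = 16 - ((uintptr_t)d & 15); /* always 1..16, computed without a branch */
    d += skip;
    s += skip;

    /* d is now 16-byte aligned; copy whole blocks while at least 16 bytes remain. */
    while ((size_t)(d_end - d) >= 16) {
        memcpy(d, s, 16);
        d += 16;
        s += 16;
    }

    /* Finish with the last 16 bytes of the range; this may overlap bytes
     * already copied, which is harmless because the values are identical. */
    memcpy(d_end - 16, s_end - 16, 16);
    return dst;
}
```

The cost is a few redundant byte writes near the start and end of the range, traded for removing a data-dependent branch.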

Test results for the latest code (compare with the table above to see how the new version's performance changed):

benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=78ms memcpy=260 ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=250 ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=266 ms
result(dst unalign, src unalign): memcpy_fast=78ms memcpy=234 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=281 ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=328 ms
result(dst unalign, src aligned): memcpy_fast=109ms memcpy=343 ms
result(dst unalign, src unalign): memcpy_fast=93ms memcpy=344 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=125ms memcpy=218 ms
result(dst aligned, src unalign): memcpy_fast=156ms memcpy=484 ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=546 ms
result(dst unalign, src unalign): memcpy_fast=172ms memcpy=515 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=172 ms
result(dst aligned, src unalign): memcpy_fast=187ms memcpy=453 ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=437 ms
result(dst unalign, src unalign): memcpy_fast=156ms memcpy=452 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=202 ms
result(dst unalign, src aligned): memcpy_fast=94ms memcpy=203 ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=218 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=202 ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=203 ms
result(dst unalign, src unalign): memcpy_fast=94ms memcpy=203 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=203ms memcpy=191 ms
result(dst aligned, src unalign): memcpy_fast=219ms memcpy=281 ms
result(dst unalign, src aligned): memcpy_fast=218ms memcpy=328 ms
result(dst unalign, src unalign): memcpy_fast=218ms memcpy=312 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=312ms memcpy=406 ms
result(dst aligned, src unalign): memcpy_fast=296ms memcpy=421 ms
result(dst unalign, src aligned): memcpy_fast=312ms memcpy=468 ms
result(dst unalign, src unalign): memcpy_fast=297ms memcpy=452 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=452 ms
result(dst aligned, src unalign): memcpy_fast=280ms memcpy=468 ms
result(dst unalign, src aligned): memcpy_fast=298ms memcpy=514 ms
result(dst unalign, src unalign): memcpy_fast=344ms memcpy=472 ms

benchmark random access:
memcpy_fast=515ms memcpy=1014ms

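Comparison with rte_memcpy

Following Ling's recommendation, I compared against rte_memcpy. gcc was upgraded to 5.1 (rte needs AVX1), while memcpy_fast is still SSE2; I may write an AVX version when I have time. All three copy functions were benchmarked together. To make the results more accurate I added some sizes, such as 37 bytes and 71 bytes, that are not aligned sizes: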

benchmark(size=32 bytes, times=16777216):
(dst aligned, src aligned): memcpy_fast=47ms rte_memcpy=31ms memcpy=250ms
(dst aligned, src unalign): memcpy_fast=46ms rte_memcpy=62ms memcpy=249ms
(dst unalign, src aligned): memcpy_fast=47ms rte_memcpy=58ms memcpy=234ms
(dst unalign, src unalign): memcpy_fast=46ms rte_memcpy=46ms memcpy=234ms

benchmark(size=37 bytes, times=16777216):
(dst aligned, src aligned): memcpy_fast=47ms rte_memcpy=47ms memcpy=266ms
(dst aligned, src unalign): memcpy_fast=47ms rte_memcpy=47ms memcpy=265ms
(dst unalign, src aligned): memcpy_fast=47ms rte_memcpy=47ms memcpy=265ms
(dst unalign, src unalign): memcpy_fast=46ms rte_memcpy=46ms memcpy=272ms

benchmark(size=64 bytes, times=16777216):
(dst aligned, src aligned): memcpy_fast=63ms rte_memcpy=31ms memcpy=312ms
(dst aligned, src unalign): memcpy_fast=47ms rte_memcpy=47ms memcpy=297ms
(dst unalign, src aligned): memcpy_fast=78ms rte_memcpy=62ms memcpy=312ms
(dst unalign, src unalign): memcpy_fast=47ms rte_memcpy=47ms memcpy=281ms

benchmark(size=71 bytes, times=16777216):
(dst aligned, src aligned): memcpy_fast=63ms rte_memcpy=62ms memcpy=343ms
(dst aligned, src unalign): memcpy_fast=63ms rte_memcpy=94ms memcpy=343ms
(dst unalign, src aligned): memcpy_fast=93ms rte_memcpy=63ms memcpy=327ms
(dst unalign, src unalign): memcpy_fast=62ms rte_memcpy=78ms memcpy=328ms

benchmark(size=512 bytes, times=8388608):
(dst aligned, src aligned): memcpy_fast=141ms rte_memcpy=109ms memcpy=220ms
(dst aligned, src unalign): memcpy_fast=156ms rte_memcpy=156ms memcpy=515ms
(dst unalign, src aligned): memcpy_fast=141ms rte_memcpy=156ms memcpy=483ms
(dst unalign, src unalign): memcpy_fast=203ms rte_memcpy=156ms memcpy=501ms

benchmark(size=523 bytes, times=8388608):
(dst aligned, src aligned): memcpy_fast=140ms rte_memcpy=172ms memcpy=530ms
(dst aligned, src unalign): memcpy_fast=172ms rte_memcpy=172ms memcpy=546ms
(dst unalign, src aligned): memcpy_fast=156ms rte_memcpy=218ms memcpy=561ms
(dst unalign, src unalign): memcpy_fast=187ms rte_memcpy=202ms memcpy=577ms

benchmark(size=1024 bytes, times=4194304):
(dst aligned, src aligned): memcpy_fast=125ms rte_memcpy=125ms memcpy=188ms
(dst aligned, src unalign): memcpy_fast=156ms rte_memcpy=125ms memcpy=499ms
(dst unalign, src aligned): memcpy_fast=171ms rte_memcpy=156ms memcpy=484ms
(dst unalign, src unalign): memcpy_fast=156ms rte_memcpy=156ms memcpy=499ms

benchmark(size=4096 bytes, times=524288):
(dst aligned, src aligned): memcpy_fast=62ms rte_memcpy=47ms memcpy=78ms
(dst aligned, src unalign): memcpy_fast=109ms rte_memcpy=78ms memcpy=219ms
(dst unalign, src aligned): memcpy_fast=78ms rte_memcpy=78ms memcpy=203ms
(dst unalign, src unalign): memcpy_fast=78ms rte_memcpy=63ms memcpy=250ms

benchmark(size=8192 bytes, times=262144):
(dst aligned, src aligned): memcpy_fast=78ms rte_memcpy=47ms memcpy=63ms
(dst aligned, src unalign): memcpy_fast=94ms rte_memcpy=63ms memcpy=203ms
(dst unalign, src aligned): memcpy_fast=78ms rte_memcpy=62ms memcpy=202ms
(dst unalign, src unalign): memcpy_fast=78ms rte_memcpy=62ms memcpy=203ms

benchmark(size=1048576 bytes, times=2048):
(dst aligned, src aligned): memcpy_fast=218ms rte_memcpy=219ms memcpy=187ms
(dst aligned, src unalign): memcpy_fast=219ms rte_memcpy=265ms memcpy=296ms
(dst unalign, src aligned): memcpy_fast=218ms rte_memcpy=265ms memcpy=312ms
(dst unalign, src unalign): memcpy_fast=218ms rte_memcpy=249ms memcpy=281ms

benchmark(size=4194304 bytes, times=512):
(dst aligned, src aligned): memcpy_fast=312ms rte_memcpy=437ms memcpy=422ms
(dst aligned, src unalign): memcpy_fast=281ms rte_memcpy=422ms memcpy=440ms
(dst unalign, src aligned): memcpy_fast=327ms rte_memcpy=405ms memcpy=437ms
(dst unalign, src unalign): memcpy_fast=327ms rte_memcpy=422ms memcpy=421ms

benchmark(size=8388608 bytes, times=256):
(dst aligned, src aligned): memcpy_fast=312ms rte_memcpy=406ms memcpy=390ms
(dst aligned, src unalign): memcpy_fast=283ms rte_memcpy=421ms memcpy=439ms
(dst unalign, src aligned): memcpy_fast=312ms rte_memcpy=407ms memcpy=484ms
(dst unalign, src unalign): memcpy_fast=297ms rte_memcpy=390ms memcpy=423ms

benchmark random access:
memcpy_fast=517ms rte_memcpy=592ms memcpy=1014ms

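I won't analyze the numbers here; have a look at the data above yourselves. If you are interested, see the discussion in the comments and the code, or run the code yourself to see how it behaves with different compilers, standard libraries, and machines.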
