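I wrote the latest memcpy and memset in glibc. Our code also takes prefetching and alignment into account (you can see this clearly in the compiled code with -D) and reduces branch mispredictions to improve performance. I hope this helps.
The source code is as follows: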
a. memcpy:
b. memset:
https://sourceware.org/ml/libc-alpha/2014-06/msg00480.html
c. memcpy in Linux Kernel
[tip:core/locking] x86, mem: Optimize memcpy by avoiding memory false dependece
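Linus Torvalds's comment: "The code looks clever and nice"!
More of the glibc function code I wrote earlier:
http://blog.csdn.net/linguranus/archive/2011/02/16/6189676.aspx
The code mainly deals with memory read/write optimization, branching, instruction alignment, and instruction misses.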
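To make the alignment point concrete, here is a minimal sketch in portable C. It is not the glibc code: the function name `copy_aligned_dst` and the 8-byte word size are assumptions chosen for illustration. It only shows the usual split into an unaligned head, an aligned body, and a tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: illustrates the head / aligned body / tail split only. */
void *copy_aligned_dst(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    /* Head: byte copies until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 7u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: one 8-byte word per iteration. The fixed-size memcpy calls are
       the portable way to express a possibly unaligned load and an aligned
       store; optimizing compilers typically turn each into a single move. */
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += 8;
        s += 8;
        n -= 8;
    }

    /* Tail: the remaining 0..7 bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

A real implementation would use wider vector registers and unrolled loops, but the structure of the three phases is the same.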
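Even though glibc functions such as memcpy/memset have been optimized, any data movement is a very heavy burden on the system as a whole (RC causes latency). Frankly, if we find that memcpy/memset has become the performance bottleneck in our program, the program is indirectly telling us that the architecture is wrong. Consider passing data by reference, for example zero-copy. If the content is short (say, within 64 bytes), simply assign the fields you need directly (this avoids the extra cost of branch mispredictions), rather than focusing on optimizing these functions further.
Perhaps the purpose of learning is to understand how things work and then avoid creating the problem in our own work, rather than to optimize one step further.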
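As a small illustration of the advice on short copies and on passing by reference, here is a sketch in C. The `record64` type and both function names are made up for this example; the point is only that a fixed-size assignment has no length-dependent branches, and that a consumer taking a pointer copies nothing at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-byte record, used only for illustration. */
typedef struct {
    uint64_t words[8];
} record64;

/* Plain struct assignment: the size is known at compile time, so the
   compiler can typically emit a few straight-line moves instead of a
   call with size-dependent branching. */
static inline void record64_copy(record64 *dst, const record64 *src)
{
    assert(dst != NULL && src != NULL);
    *dst = *src;
}

/* Passing by reference: the consumer reads the data in place and
   nothing is copied. */
static inline uint64_t record64_sum(const record64 *r)
{
    uint64_t sum = 0;
    assert(r != NULL);
    for (int i = 0; i < 8; i++)
        sum += r->words[i];
    return sum;
}
```

Whether the assignment really beats a library call depends on the compiler and target, so it is worth checking the generated code.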
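I have written code for this. Across different copy lengths, aligned and unaligned, the final implementation is on average 40% faster than memcpy (gcc4.9, vc 2012), mainly thanks to the following optimizations:
Test results
For memory sizes from 32 bytes to 8MB, each copy is repeated a number of times, with the destination and source addresses each tested aligned and unaligned. Timings are given side by side with memcpy: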
result: gcc4.9 (msvc 2012 got a similar result):

benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=180ms memcpy=249ms
result(dst aligned, src unalign): memcpy_fast=170ms memcpy=271ms
result(dst unalign, src aligned): memcpy_fast=179ms memcpy=269ms
result(dst unalign, src unalign): memcpy_fast=180ms memcpy=260ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=162ms memcpy=300ms
result(dst aligned, src unalign): memcpy_fast=199ms memcpy=328ms
result(dst unalign, src aligned): memcpy_fast=410ms memcpy=339ms
result(dst unalign, src unalign): memcpy_fast=390ms memcpy=361ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=160ms memcpy=241ms
result(dst aligned, src unalign): memcpy_fast=200ms memcpy=519ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=509ms
result(dst unalign, src unalign): memcpy_fast=311ms memcpy=520ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=145ms memcpy=179ms
result(dst aligned, src unalign): memcpy_fast=180ms memcpy=430ms
result(dst unalign, src aligned): memcpy_fast=245ms memcpy=430ms
result(dst unalign, src unalign): memcpy_fast=230ms memcpy=455ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=80ms memcpy=80ms
result(dst aligned, src unalign): memcpy_fast=110ms memcpy=205ms
result(dst unalign, src aligned): memcpy_fast=110ms memcpy=224ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=200ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=70ms memcpy=78ms
result(dst aligned, src unalign): memcpy_fast=100ms memcpy=222ms
result(dst unalign, src aligned): memcpy_fast=100ms memcpy=210ms
result(dst unalign, src unalign): memcpy_fast=100ms memcpy=230ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=200ms memcpy=201ms
result(dst aligned, src unalign): memcpy_fast=260ms memcpy=270ms
result(dst unalign, src aligned): memcpy_fast=263ms memcpy=361ms
result(dst unalign, src unalign): memcpy_fast=267ms memcpy=321ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=391ms
result(dst aligned, src unalign): memcpy_fast=265ms memcpy=407ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=453ms
result(dst unalign, src unalign): memcpy_fast=282ms memcpy=439ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=266ms memcpy=422ms
result(dst aligned, src unalign): memcpy_fast=250ms memcpy=407ms
result(dst unalign, src aligned): memcpy_fast=297ms memcpy=516ms
result(dst unalign, src unalign): memcpy_fast=281ms memcpy=436ms

benchmark random access:
memcpy_fast=594ms memcpy=1161ms
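For readers who want to reproduce this kind of measurement, here is a rough sketch of one benchmark cell in C. It is not the benchmark from the repository: `bench_copy`, `copy_fn` and `align64` are names invented here, and the 64-byte base alignment and the use of clock() are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any function with memcpy's signature can be plugged in. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Round p up to the next 64-byte boundary. */
static unsigned char *align64(unsigned char *p)
{
    return (unsigned char *)(((uintptr_t)p + 63u) & ~(uintptr_t)63u);
}

/* Time `times` copies of `size` bytes. An offset of 0 gives an aligned
   pointer, a nonzero offset (e.g. 1) gives an unaligned one.
   Returns elapsed CPU time in ms, or -1.0 if allocation fails. */
double bench_copy(copy_fn fn, size_t size, size_t times,
                  size_t dst_off, size_t src_off)
{
    assert(size > 0 && dst_off < 64 && src_off < 64);

    unsigned char *dst_raw = malloc(size + 128);
    unsigned char *src_raw = malloc(size + 128);
    if (dst_raw == NULL || src_raw == NULL) {
        free(dst_raw);
        free(src_raw);
        return -1.0;
    }

    unsigned char *dst = align64(dst_raw) + dst_off;
    unsigned char *src = align64(src_raw) + src_off;
    memset(src, 0xA5, size);

    volatile unsigned char sink = 0;  /* keeps the copies observable */
    clock_t t0 = clock();
    for (size_t i = 0; i < times; i++) {
        fn(dst, src, size);
        sink ^= dst[size - 1];
    }
    clock_t t1 = clock();

    free(dst_raw);
    free(src_raw);
    return (double)(t1 - t0) * 1000.0 / CLOCKS_PER_SEC;
}
```

Calling `bench_copy(memcpy, 32, 16777216, 0, 1)` would correspond to the "dst aligned, src unalign" row for 32 bytes. The numbers depend heavily on compiler flags, so build with optimizations enabled.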
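Implementation
skywind3000/FastMemcpy · GitHub
Note: if you use VC, build in Release with optimizations turned on. Do not test under Debug; the results are meaningless.
Related discussion
All I can say is that code from 2012 is still this much faster than mainstream memcpy implementations today, so it is no wonder that many performance-focused projects rewrite memcpy themselves. This does not rule out further optimization in each platform's standard library in the future; until then, some projects still choose to optimize it themselves so as not to depend too heavily on the standard library.
I am glad to finally see @Ling, the author of glibc's memcpy, answer here (sorry, the @ mention did not work). I only tested on a handful of machines to reach the conclusion of being 40% faster than glibc memcpy. It may well be that Ling considered more cases than I did and tested more broadly, so please verify on your own machines.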
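References
Cache optimization: "Using Block Prefetch for Optimized Memory Performance"
That article only covers large, aligned memory copies; in practice there are many more cases that you have to handle by hand.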
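For a feel of what prefetching looks like in a large copy, here is a simplified sketch in C. It issues a software prefetch hint ahead of a chunked copy loop, which is simpler than the block-prefetch scheme in the article above and is not the code from FastMemcpy. `copy_with_prefetch`, the 64-byte chunk and the 512-byte prefetch distance are assumptions; `__builtin_prefetch` is the GCC/Clang builtin, and other compilers fall back to a no-op here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Assumed cache-line size and prefetch distance; tune per machine. */
enum { CHUNK = 64, AHEAD = 512 };

/* Hypothetical helper: copy in 64-byte chunks while hinting the source
   line that will be needed AHEAD bytes later. */
void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    while (n >= CHUNK) {
        /* Only hint addresses that are still inside the source buffer. */
        if (n > AHEAD)
            PREFETCH_READ(s + AHEAD);
        memcpy(d, s, CHUNK);
        d += CHUNK;
        s += CHUNK;
        n -= CHUNK;
    }

    /* Remaining 0..63 bytes. */
    memcpy(d, s, n);
    return dst;
}
```

Whether the hint helps at all depends on the CPU's hardware prefetcher and on the copy size, so it should be measured rather than assumed.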
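Skywind Inside - Memory copy optimization (1): small memory copy optimization
---------
Update, December 20: further optimization
Test results for the latest code (compare with the results above to see how the new version's performance has changed):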
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=78ms memcpy=260ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=250ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=266ms
result(dst unalign, src unalign): memcpy_fast=78ms memcpy=234ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=281ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=328ms
result(dst unalign, src aligned): memcpy_fast=109ms memcpy=343ms
result(dst unalign, src unalign): memcpy_fast=93ms memcpy=344ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=125ms memcpy=218ms
result(dst aligned, src unalign): memcpy_fast=156ms memcpy=484ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=546ms
result(dst unalign, src unalign): memcpy_fast=172ms memcpy=515ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=172ms
result(dst aligned, src unalign): memcpy_fast=187ms memcpy=453ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=437ms
result(dst unalign, src unalign): memcpy_fast=156ms memcpy=452ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=202ms
result(dst unalign, src aligned): memcpy_fast=94ms memcpy=203ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=218ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=202ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=203ms
result(dst unalign, src unalign): memcpy_fast=94ms memcpy=203ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=203ms memcpy=191ms
result(dst aligned, src unalign): memcpy_fast=219ms memcpy=281ms
result(dst unalign, src aligned): memcpy_fast=218ms memcpy=328ms
result(dst unalign, src unalign): memcpy_fast=218ms memcpy=312ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=312ms memcpy=406ms
result(dst aligned, src unalign): memcpy_fast=296ms memcpy=421ms
result(dst unalign, src aligned): memcpy_fast=312ms memcpy=468ms
result(dst unalign, src unalign): memcpy_fast=297ms memcpy=452ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=452ms
result(dst aligned, src unalign): memcpy_fast=280ms memcpy=468ms
result(dst unalign, src aligned): memcpy_fast=298ms memcpy=514ms
result(dst unalign, src unalign): memcpy_fast=344ms memcpy=472ms

benchmark random access:
memcpy_fast=515ms memcpy=1014ms
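Comparison with rte_memcpy
Following Ling's recommendation, I compared against rte_memcpy. gcc was upgraded to 5.1 (rte requires AVX1), while memcpy_fast is still SSE2; I may write an AVX version when I have time. All three copy routines are benchmarked together, and to improve accuracy I added some sizes that are not aligned, such as 37 bytes and 71 bytes: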
benchmark(size=32 bytes, times=16777216):
(dst aligned, src aligned): memcpy_fast=47ms rte_memcpy=31ms memcpy=250ms
(dst aligned, src unalign): memcpy_fast=46ms rte_memcpy=62ms memcpy=249ms
(dst unalign, src aligned): memcpy_fast=47ms rte_memcpy=58ms memcpy=234ms
(dst unalign, src unalign): memcpy_fast=46ms rte_memcpy=46ms memcpy=234ms

benchmark(size=37 bytes, times=16777216):
(dst aligned, src aligned): memcpy_fast=47ms rte_memcpy=47ms memcpy=266ms
(dst aligned, src unalign): memcpy_fast=47ms rte_memcpy=47ms memcpy=265ms
(dst unalign, src aligned): memcpy_fast=47ms rte_memcpy=47ms memcpy=265ms
(dst unalign, src unalign): memcpy_fast=46ms rte_memcpy=46ms memcpy=272ms

benchmark(size=64 bytes, times=16777216):
(dst aligned, src aligned): memcpy_fast=63ms rte_memcpy=31ms memcpy=312ms
(dst aligned, src unalign): memcpy_fast=47ms rte_memcpy=47ms memcpy=297ms
(dst unalign, src aligned): memcpy_fast=78ms rte_memcpy=62ms memcpy=312ms
(dst unalign, src unalign): memcpy_fast=47ms rte_memcpy=47ms memcpy=281ms

benchmark(size=71 bytes, times=16777216):
(dst aligned, src aligned): memcpy_fast=63ms rte_memcpy=62ms memcpy=343ms
(dst aligned, src unalign): memcpy_fast=63ms rte_memcpy=94ms memcpy=343ms
(dst unalign, src aligned): memcpy_fast=93ms rte_memcpy=63ms memcpy=327ms
(dst unalign, src unalign): memcpy_fast=62ms rte_memcpy=78ms memcpy=328ms

benchmark(size=512 bytes, times=8388608):
(dst aligned, src aligned): memcpy_fast=141ms rte_memcpy=109ms memcpy=220ms
(dst aligned, src unalign): memcpy_fast=156ms rte_memcpy=156ms memcpy=515ms
(dst unalign, src aligned): memcpy_fast=141ms rte_memcpy=156ms memcpy=483ms
(dst unalign, src unalign): memcpy_fast=203ms rte_memcpy=156ms memcpy=501ms

benchmark(size=523 bytes, times=8388608):
(dst aligned, src aligned): memcpy_fast=140ms rte_memcpy=172ms memcpy=530ms
(dst aligned, src unalign): memcpy_fast=172ms rte_memcpy=172ms memcpy=546ms
(dst unalign, src aligned): memcpy_fast=156ms rte_memcpy=218ms memcpy=561ms
(dst unalign, src unalign): memcpy_fast=187ms rte_memcpy=202ms memcpy=577ms

benchmark(size=1024 bytes, times=4194304):
(dst aligned, src aligned): memcpy_fast=125ms rte_memcpy=125ms memcpy=188ms
(dst aligned, src unalign): memcpy_fast=156ms rte_memcpy=125ms memcpy=499ms
(dst unalign, src aligned): memcpy_fast=171ms rte_memcpy=156ms memcpy=484ms
(dst unalign, src unalign): memcpy_fast=156ms rte_memcpy=156ms memcpy=499ms

benchmark(size=4096 bytes, times=524288):
(dst aligned, src aligned): memcpy_fast=62ms rte_memcpy=47ms memcpy=78ms
(dst aligned, src unalign): memcpy_fast=109ms rte_memcpy=78ms memcpy=219ms
(dst unalign, src aligned): memcpy_fast=78ms rte_memcpy=78ms memcpy=203ms
(dst unalign, src unalign): memcpy_fast=78ms rte_memcpy=63ms memcpy=250ms

benchmark(size=8192 bytes, times=262144):
(dst aligned, src aligned): memcpy_fast=78ms rte_memcpy=47ms memcpy=63ms
(dst aligned, src unalign): memcpy_fast=94ms rte_memcpy=63ms memcpy=203ms
(dst unalign, src aligned): memcpy_fast=78ms rte_memcpy=62ms memcpy=202ms
(dst unalign, src unalign): memcpy_fast=78ms rte_memcpy=62ms memcpy=203ms

benchmark(size=1048576 bytes, times=2048):
(dst aligned, src aligned): memcpy_fast=218ms rte_memcpy=219ms memcpy=187ms
(dst aligned, src unalign): memcpy_fast=219ms rte_memcpy=265ms memcpy=296ms
(dst unalign, src aligned): memcpy_fast=218ms rte_memcpy=265ms memcpy=312ms
(dst unalign, src unalign): memcpy_fast=218ms rte_memcpy=249ms memcpy=281ms

benchmark(size=4194304 bytes, times=512):
(dst aligned, src aligned): memcpy_fast=312ms rte_memcpy=437ms memcpy=422ms
(dst aligned, src unalign): memcpy_fast=281ms rte_memcpy=422ms memcpy=440ms
(dst unalign, src aligned): memcpy_fast=327ms rte_memcpy=405ms memcpy=437ms
(dst unalign, src unalign): memcpy_fast=327ms rte_memcpy=422ms memcpy=421ms

benchmark(size=8388608 bytes, times=256):
(dst aligned, src aligned): memcpy_fast=312ms rte_memcpy=406ms memcpy=390ms
(dst aligned, src unalign): memcpy_fast=283ms rte_memcpy=421ms memcpy=439ms
(dst unalign, src aligned): memcpy_fast=312ms rte_memcpy=407ms memcpy=484ms
(dst unalign, src unalign): memcpy_fast=297ms rte_memcpy=390ms memcpy=423ms

benchmark random access:
memcpy_fast=517ms rte_memcpy=592ms memcpy=1014ms
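I won't analyze the numbers here; please look at the data above yourselves. If you are interested, see the discussion in the comments and the code, or run the code yourself to see how it behaves with different compilers, standard libraries, and machines.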
-----