


How do you write a faster memset/memcpy?

  

ling-29-85's answer:
      

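I wrote the latest memcpy and memset in glibc. Our code also takes prefetching and alignment into account (this is clearly visible in the compiled output when disassembled with -D) and reduces branch mispredictions to improve performance. I hope this helps.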

The source code is here:

a. memcpy:

sourceware.org/git/gitw

b. memset:

sourceware.org/ml/libc-

c. memcpy in Linux Kernel

[tip:core/locking] x86, mem: Optimize memcpy by avoiding memory false dependece

Linus Torvalds's comment: "The code looks clever and nice"!

More of the glibc function code I wrote earlier:

blog.csdn.net/linguranu

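The code mainly addresses memory read/write optimization, branching, instruction alignment, and instruction misses.

Although we optimize glibc functions such as memcpy/memset, any data movement is a heavy burden on the system as a whole (RC causes latency). Frankly, if memcpy/memset turns out to be the performance bottleneck in a program, the program is indirectly telling us that its architecture is wrong. Consider referencing the data instead, for example with zero-copy. If the content is short (say within 64 bytes), just assign the fields you need directly, which avoids the extra cost of branch mispredictions, rather than focusing on optimizing these functions further.

Perhaps the point of learning is to understand how things behave and then avoid creating problems in our work, rather than to keep optimizing.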
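The two suggestions above (reference the data instead of moving it, and use plain assignment for short content) can be illustrated with a minimal sketch. All names here (`msg_header`, `byte_view`, `make_view`, `copy_header`) are hypothetical and exist only for illustration; they are not taken from glibc or from any code linked above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical small record, well under 64 bytes. */
struct msg_header {
    uint32_t id;
    uint32_t length;
    uint64_t timestamp;
};

/* Hypothetical non-owning view: it refers to bytes owned by someone else. */
struct byte_view {
    const unsigned char *data;
    size_t size;
};

/* "Zero-copy" in the simplest sense: hand out a view of part of an
 * existing buffer instead of copying those bytes somewhere else. */
static struct byte_view make_view(const unsigned char *buf, size_t offset, size_t size)
{
    struct byte_view v = { buf + offset, size };
    return v;
}

/* Short, fixed-size content: plain assignment. The size is known at
 * compile time, so no size-dependent branching is needed. */
static void copy_header(struct msg_header *dst, const struct msg_header *src)
{
    *dst = *src;
}
```

The view is only valid while the underlying buffer is alive and unchanged, which is the usual price of avoiding the copy.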


skywind3000's answer:
      

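I have written code for this. The final implementation is on average 40% faster than memcpy (gcc4.9, vc 2012) across different copy lengths, with both aligned and unaligned addresses. The main optimizations are:

  • Strategy by size: a small-copy path for up to 64 bytes, a medium-size path for up to 64 KB, and a large-copy path above 64 KB.
  • Jump table: for small sizes, jump directly to the address that matches the size, so the loop is unrolled away.
  • Destination alignment: for copies above 64 bytes, first copy a few bytes with the jump-table method so that the destination address is aligned, which simplifies everything that follows.
  • Vector copy: load N vectors into sse2 registers in one go, then write them out together.
  • Cache prefetch: use prefetchnta to fetch data ahead of time, so it is already in place when it is actually needed.
  • Non-temporal stores: use movntdq to write straight to memory and avoid cache pollution.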
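The points in the list above can be put together in a short sketch. This is not the FastMemcpy code: it is a simplified illustration that assumes an x86 target with SSE2 and non-overlapping buffers. The names `copy_word`, `copy_small`, `memcpy_sketch` and `SKETCH_LARGE_THRESHOLD` are made up for this example, and the prefetch distance of 256 bytes is an arbitrary assumption.

```c
#include <emmintrin.h>  /* SSE2 loads/stores; also pulls in _mm_prefetch and _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_LARGE_THRESHOLD (64u * 1024u)  /* the 64 KB boundary from the list */

/* Fixed-size memcpy is the portable way to do an unaligned 8-byte move;
 * optimizing compilers turn it into one load and one store. */
static void copy_word(unsigned char *d, const unsigned char *s)
{
    uint64_t w;
    memcpy(&w, s, 8);
    memcpy(d, &w, 8);
}

/* Small sizes (n < 64): two switches with fall-through instead of a loop.
 * Compilers typically lower a dense switch like this to a jump table. */
static void copy_small(unsigned char *d, const unsigned char *s, size_t n)
{
    size_t words = n >> 3;
    switch (words) {            /* case k copies word k-1, then falls through */
    case 7: copy_word(d + 48, s + 48); /* fall through */
    case 6: copy_word(d + 40, s + 40); /* fall through */
    case 5: copy_word(d + 32, s + 32); /* fall through */
    case 4: copy_word(d + 24, s + 24); /* fall through */
    case 3: copy_word(d + 16, s + 16); /* fall through */
    case 2: copy_word(d + 8, s + 8);   /* fall through */
    case 1: copy_word(d, s);           /* fall through */
    default: break;
    }
    d += words << 3;
    s += words << 3;
    switch (n & 7) {            /* remaining 0..7 tail bytes */
    case 7: d[6] = s[6]; /* fall through */
    case 6: d[5] = s[5]; /* fall through */
    case 5: d[4] = s[4]; /* fall through */
    case 4: d[3] = s[3]; /* fall through */
    case 3: d[2] = s[2]; /* fall through */
    case 2: d[1] = s[1]; /* fall through */
    case 1: d[0] = s[0]; /* fall through */
    default: break;
    }
}

static void *memcpy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    if (n < 64) {               /* small path */
        copy_small(d, s, n);
        return dst;
    }

    /* Destination alignment: copy 0..15 head bytes so d is 16-byte aligned. */
    size_t head = (size_t)(-(uintptr_t)d & 15);
    copy_small(d, s, head);
    d += head; s += head; n -= head;

    int large = n > SKETCH_LARGE_THRESHOLD;
    for (; n >= 64; n -= 64, d += 64, s += 64) {
        /* Vector copy: read four vectors first, then write them. */
        __m128i a = _mm_loadu_si128((const __m128i *)(s));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));
        if (large) {
            if (n >= 320)       /* stay inside the source buffer */
                _mm_prefetch((const char *)(s + 256), _MM_HINT_NTA); /* prefetchnta */
            _mm_stream_si128((__m128i *)(d), a);                     /* movntdq */
            _mm_stream_si128((__m128i *)(d + 16), b);
            _mm_stream_si128((__m128i *)(d + 32), c);
            _mm_stream_si128((__m128i *)(d + 48), e);
        } else {
            _mm_store_si128((__m128i *)(d), a);
            _mm_store_si128((__m128i *)(d + 16), b);
            _mm_store_si128((__m128i *)(d + 32), c);
            _mm_store_si128((__m128i *)(d + 48), e);
        }
    }
    if (large)
        _mm_sfence();           /* make the non-temporal stores globally visible in order */
    copy_small(d, s, n);        /* tail, now < 64 bytes */
    return dst;
}
```

For brevity the sketch shares one loop between the medium and large paths and tests `large` inside it; a tuned implementation would more likely use separate loops, and whether any of these choices pays off depends on the machine, so measure before relying on it.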

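Test results

Memory blocks of different sizes (from 32 bytes to 8 MB) are each copied a number of times, with the destination and source addresses each tested aligned and unaligned, and the timings are compared with memcpy: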
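A harness for this kind of measurement might look roughly like the sketch below. It is not the benchmark from the FastMemcpy repository: `copy_fn` and `bench_copy` are hypothetical names, timing uses the standard `clock()` function, and an offset of 1 byte from the `malloc` result is assumed to be enough to make a pointer unaligned.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any function with the memcpy signature can be measured. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Copies `size` bytes `times` times and returns the elapsed CPU time in
 * milliseconds, or -1.0 if allocation fails. An offset of 0 keeps the
 * alignment malloc provides; a nonzero offset shifts the pointer off it. */
static double bench_copy(copy_fn fn, size_t size, size_t times,
                         size_t dst_off, size_t src_off)
{
    unsigned char *dst = (unsigned char *)malloc(size + 64);
    unsigned char *src = (unsigned char *)malloc(size + 64);
    if (dst == NULL || src == NULL) {
        free(dst);
        free(src);
        return -1.0;
    }
    memset(src, 0x5a, size + 64);
    memset(dst, 0, size + 64);

    clock_t t0 = clock();
    for (size_t i = 0; i < times; i++)
        fn(dst + dst_off, src + src_off, size);
    clock_t t1 = clock();

    /* Read the result through a volatile so the copies have a visible use.
     * A serious harness needs more care than this against optimization. */
    volatile unsigned char sink = dst[dst_off];
    (void)sink;

    free(dst);
    free(src);
    return 1000.0 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
}
```

Calling `bench_copy(memcpy, 32, 16777216, 0, 1)` would correspond to the "dst aligned, src unalign" row for 32 bytes; passing a different function with the same signature gives the other column.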

    result: gcc4.9 (msvc 2012 got a similar result):

    benchmark(size=32 bytes, times=16777216):
    result(dst aligned, src aligned): memcpy_fast=180ms memcpy=249 ms
    result(dst aligned, src unalign): memcpy_fast=170ms memcpy=271 ms
    result(dst unalign, src aligned): memcpy_fast=179ms memcpy=269 ms
    result(dst unalign, src unalign): memcpy_fast=180ms memcpy=260 ms

    benchmark(size=64 bytes, times=16777216):
    result(dst aligned, src aligned): memcpy_fast=162ms memcpy=300 ms
    result(dst aligned, src unalign): memcpy_fast=199ms memcpy=328 ms
    result(dst unalign, src aligned): memcpy_fast=410ms memcpy=339 ms
    result(dst unalign, src unalign): memcpy_fast=390ms memcpy=361 ms

    benchmark(size=512 bytes, times=8388608):
    result(dst aligned, src aligned): memcpy_fast=160ms memcpy=241 ms
    result(dst aligned, src unalign): memcpy_fast=200ms memcpy=519 ms
    result(dst unalign, src aligned): memcpy_fast=313ms memcpy=509 ms
    result(dst unalign, src unalign): memcpy_fast=311ms memcpy=520 ms

    benchmark(size=1024 bytes, times=4194304):
    result(dst aligned, src aligned): memcpy_fast=145ms memcpy=179 ms
    result(dst aligned, src unalign): memcpy_fast=180ms memcpy=430 ms
    result(dst unalign, src aligned): memcpy_fast=245ms memcpy=430 ms
    result(dst unalign, src unalign): memcpy_fast=230ms memcpy=455 ms

    benchmark(size=4096 bytes, times=524288):
    result(dst aligned, src aligned): memcpy_fast=80ms memcpy=80 ms
    result(dst aligned, src unalign): memcpy_fast=110ms memcpy=205 ms
    result(dst unalign, src aligned): memcpy_fast=110ms memcpy=224 ms
    result(dst unalign, src unalign): memcpy_fast=110ms memcpy=200 ms

    benchmark(size=8192 bytes, times=262144):
    result(dst aligned, src aligned): memcpy_fast=70ms memcpy=78 ms
    result(dst aligned, src unalign): memcpy_fast=100ms memcpy=222 ms
    result(dst unalign, src aligned): memcpy_fast=100ms memcpy=210 ms
    result(dst unalign, src unalign): memcpy_fast=100ms memcpy=230 ms

    benchmark(size=1048576 bytes, times=2048):
    result(dst aligned, src aligned): memcpy_fast=200ms memcpy=201 ms
    result(dst aligned, src unalign): memcpy_fast=260ms memcpy=270 ms
    result(dst unalign, src aligned): memcpy_fast=263ms memcpy=361 ms
    result(dst unalign, src unalign): memcpy_fast=267ms memcpy=321 ms

    benchmark(size=4194304 bytes, times=512):
    result(dst aligned, src aligned): memcpy_fast=281ms memcpy=391 ms
    result(dst aligned, src unalign): memcpy_fast=265ms memcpy=407 ms
    result(dst unalign, src aligned): memcpy_fast=313ms memcpy=453 ms
    result(dst unalign, src unalign): memcpy_fast=282ms memcpy=439 ms

    benchmark(size=8388608 bytes, times=256):
    result(dst aligned, src aligned): memcpy_fast=266ms memcpy=422 ms
    result(dst aligned, src unalign): memcpy_fast=250ms memcpy=407 ms
    result(dst unalign, src aligned): memcpy_fast=297ms memcpy=516 ms
    result(dst unalign, src unalign): memcpy_fast=281ms memcpy=436 ms

    benchmark random access:
    memcpy_fast=594ms memcpy=1161ms

Implementation

skywind3000/FastMemcpy · GitHub

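Note: if you use VC, build in Release with optimization enabled. Testing under Debug is meaningless.

Discussion

All I can say is that code from 2012 is still that much faster than the mainstream memcpy implementations today, so it is no wonder that many performance-focused projects rewrite memcpy themselves. The standard libraries on the various platforms may well optimize further in the future; until then, some projects choose to do their own optimization rather than depend too heavily on the standard library.

Glad to finally see the author of glibc memcpy, @Ling (sorry, the @ mention does not work), answering here. I reached the "40% faster than glibc memcpy" conclusion from tests on only a handful of machines. Ling may well have considered more cases and tested more broadly than I did, so please verify on your own machines.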

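References

Cache optimization: "Using Block Prefetch for Optimized Memory Performance"

That article only covers copying large, aligned blocks of memory; in practice there are many more cases that you have to handle by hand.

Skywind Inside - Memory copy optimization (1): small-size copy optimization
Skywind Inside - Memory copy optimization (2): full-size copy optimization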

---------

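Update, December 20: further optimization

  1. Changed the small-copy path: widened from 64 bytes to 128 bytes and switched from int to xmm, improving small-copy performance by 80%.
  2. Changed the medium-copy path: from copying with 4 xmm registers in parallel to 8 in parallel plus prefetch, an improvement of about 20%.
  3. Removed the branch that checked destination alignment; a single xmm copy now aligns the destination, improving performance by 10%.
  4. Added a test case: to be closer to real use, added a random-access test that copies random lengths at random positions within a 10 MB region (definitely larger than L2). To keep random number generation from affecting the results, all random numbers are generated in advance.

Test results for the latest code (compare with the table above to see how performance changed in the new version):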
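Items 2 and 3 can be sketched as follows. This is an illustration written for this answer, not the code in the repository: it assumes x86 with SSE2, non-overlapping buffers and a length of at least 16 bytes, and `copy_mid_sketch` and the 256-byte prefetch distance are made-up choices.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_prefetch */
#include <stddef.h>
#include <stdint.h>

/* Requires n >= 16 and non-overlapping buffers. */
static void *copy_mid_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    /* Item 3: one unaligned 16-byte copy covers the head. Then skip ahead to
     * the next 16-byte boundary of d. pad is 1..16 and needs no branch; the
     * bytes between d + pad and d + 16 are simply written again later. */
    _mm_storeu_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
    size_t pad = 16 - ((uintptr_t)d & 15);
    d += pad; s += pad; n -= pad;

    /* Item 2: 8 xmm values per iteration (128 bytes), all loads before all
     * stores, plus a prefetch. Compilers usually unroll these short loops. */
    while (n >= 128) {
        __m128i x[8];
        for (int k = 0; k < 8; k++)
            x[k] = _mm_loadu_si128((const __m128i *)(s + 16 * k));
        if (n >= 384)           /* keep the prefetch address inside the source */
            _mm_prefetch((const char *)(s + 256), _MM_HINT_NTA);
        for (int k = 0; k < 8; k++)
            _mm_store_si128((__m128i *)(d + 16 * k), x[k]);
        d += 128; s += 128; n -= 128;
    }

    /* Remaining whole 16-byte blocks; d is still 16-byte aligned. */
    while (n >= 16) {
        _mm_store_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
        d += 16; s += 16; n -= 16;
    }

    /* Last partial block: one unaligned copy that ends exactly at the end
     * and overlaps bytes that were already copied. */
    if (n > 0) {
        _mm_storeu_si128((__m128i *)(d + n - 16),
                         _mm_loadu_si128((const __m128i *)(s + n - 16)));
    }
    return dst;
}
```

Rewriting a few already-copied bytes is harmless for non-overlapping buffers, and it trades a data-dependent branch for a fixed amount of redundant work.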

    benchmark(size=32 bytes, times=16777216):
    result(dst aligned, src aligned): memcpy_fast=78ms memcpy=260 ms
    result(dst aligned, src unalign): memcpy_fast=78ms memcpy=250 ms
    result(dst unalign, src aligned): memcpy_fast=78ms memcpy=266 ms
    result(dst unalign, src unalign): memcpy_fast=78ms memcpy=234 ms

    benchmark(size=64 bytes, times=16777216):
    result(dst aligned, src aligned): memcpy_fast=109ms memcpy=281 ms
    result(dst aligned, src unalign): memcpy_fast=109ms memcpy=328 ms
    result(dst unalign, src aligned): memcpy_fast=109ms memcpy=343 ms
    result(dst unalign, src unalign): memcpy_fast=93ms memcpy=344 ms

    benchmark(size=512 bytes, times=8388608):
    result(dst aligned, src aligned): memcpy_fast=125ms memcpy=218 ms
    result(dst aligned, src unalign): memcpy_fast=156ms memcpy=484 ms
    result(dst unalign, src aligned): memcpy_fast=172ms memcpy=546 ms
    result(dst unalign, src unalign): memcpy_fast=172ms memcpy=515 ms

    benchmark(size=1024 bytes, times=4194304):
    result(dst aligned, src aligned): memcpy_fast=109ms memcpy=172 ms
    result(dst aligned, src unalign): memcpy_fast=187ms memcpy=453 ms
    result(dst unalign, src aligned): memcpy_fast=172ms memcpy=437 ms
    result(dst unalign, src unalign): memcpy_fast=156ms memcpy=452 ms

    benchmark(size=4096 bytes, times=524288):
    result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
    result(dst aligned, src unalign): memcpy_fast=109ms memcpy=202 ms
    result(dst unalign, src aligned): memcpy_fast=94ms memcpy=203 ms
    result(dst unalign, src unalign): memcpy_fast=110ms memcpy=218 ms

    benchmark(size=8192 bytes, times=262144):
    result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
    result(dst aligned, src unalign): memcpy_fast=78ms memcpy=202 ms
    result(dst unalign, src aligned): memcpy_fast=78ms memcpy=203 ms
    result(dst unalign, src unalign): memcpy_fast=94ms memcpy=203 ms

    benchmark(size=1048576 bytes, times=2048):
    result(dst aligned, src aligned): memcpy_fast=203ms memcpy=191 ms
    result(dst aligned, src unalign): memcpy_fast=219ms memcpy=281 ms
    result(dst unalign, src aligned): memcpy_fast=218ms memcpy=328 ms
    result(dst unalign, src unalign): memcpy_fast=218ms memcpy=312 ms

    benchmark(size=4194304 bytes, times=512):
    result(dst aligned, src aligned): memcpy_fast=312ms memcpy=406 ms
    result(dst aligned, src unalign): memcpy_fast=296ms memcpy=421 ms
    result(dst unalign, src aligned): memcpy_fast=312ms memcpy=468 ms
    result(dst unalign, src unalign): memcpy_fast=297ms memcpy=452 ms

    benchmark(size=8388608 bytes, times=256):
    result(dst aligned, src aligned): memcpy_fast=281ms memcpy=452 ms
    result(dst aligned, src unalign): memcpy_fast=280ms memcpy=468 ms
    result(dst unalign, src aligned): memcpy_fast=298ms memcpy=514 ms
    result(dst unalign, src unalign): memcpy_fast=344ms memcpy=472 ms

    benchmark random access:
    memcpy_fast=515ms memcpy=1014ms

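Comparison with rte_memcpy

Following Ling's recommendation, I compared against rte_memcpy. gcc was upgraded to 5.1 (rte requires avx1), while memcpy_fast is still sse2; I may write an avx version when I have time. All three memory copies were benchmarked together, and to improve accuracy I added some sizes that are not aligned multiples, such as 37 bytes and 71 bytes: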

    benchmark(size=32 bytes, times=16777216):
    (dst aligned, src aligned): memcpy_fast=47ms rte_memcpy=31ms memcpy=250ms
    (dst aligned, src unalign): memcpy_fast=46ms rte_memcpy=62ms memcpy=249ms
    (dst unalign, src aligned): memcpy_fast=47ms rte_memcpy=58ms memcpy=234ms
    (dst unalign, src unalign): memcpy_fast=46ms rte_memcpy=46ms memcpy=234ms

    benchmark(size=37 bytes, times=16777216):
    (dst aligned, src aligned): memcpy_fast=47ms rte_memcpy=47ms memcpy=266ms
    (dst aligned, src unalign): memcpy_fast=47ms rte_memcpy=47ms memcpy=265ms
    (dst unalign, src aligned): memcpy_fast=47ms rte_memcpy=47ms memcpy=265ms
    (dst unalign, src unalign): memcpy_fast=46ms rte_memcpy=46ms memcpy=272ms

    benchmark(size=64 bytes, times=16777216):
    (dst aligned, src aligned): memcpy_fast=63ms rte_memcpy=31ms memcpy=312ms
    (dst aligned, src unalign): memcpy_fast=47ms rte_memcpy=47ms memcpy=297ms
    (dst unalign, src aligned): memcpy_fast=78ms rte_memcpy=62ms memcpy=312ms
    (dst unalign, src unalign): memcpy_fast=47ms rte_memcpy=47ms memcpy=281ms

    benchmark(size=71 bytes, times=16777216):
    (dst aligned, src aligned): memcpy_fast=63ms rte_memcpy=62ms memcpy=343ms
    (dst aligned, src unalign): memcpy_fast=63ms rte_memcpy=94ms memcpy=343ms
    (dst unalign, src aligned): memcpy_fast=93ms rte_memcpy=63ms memcpy=327ms
    (dst unalign, src unalign): memcpy_fast=62ms rte_memcpy=78ms memcpy=328ms

    benchmark(size=512 bytes, times=8388608):
    (dst aligned, src aligned): memcpy_fast=141ms rte_memcpy=109ms memcpy=220ms
    (dst aligned, src unalign): memcpy_fast=156ms rte_memcpy=156ms memcpy=515ms
    (dst unalign, src aligned): memcpy_fast=141ms rte_memcpy=156ms memcpy=483ms
    (dst unalign, src unalign): memcpy_fast=203ms rte_memcpy=156ms memcpy=501ms

    benchmark(size=523 bytes, times=8388608):
    (dst aligned, src aligned): memcpy_fast=140ms rte_memcpy=172ms memcpy=530ms
    (dst aligned, src unalign): memcpy_fast=172ms rte_memcpy=172ms memcpy=546ms
    (dst unalign, src aligned): memcpy_fast=156ms rte_memcpy=218ms memcpy=561ms
    (dst unalign, src unalign): memcpy_fast=187ms rte_memcpy=202ms memcpy=577ms

    benchmark(size=1024 bytes, times=4194304):
    (dst aligned, src aligned): memcpy_fast=125ms rte_memcpy=125ms memcpy=188ms
    (dst aligned, src unalign): memcpy_fast=156ms rte_memcpy=125ms memcpy=499ms
    (dst unalign, src aligned): memcpy_fast=171ms rte_memcpy=156ms memcpy=484ms
    (dst unalign, src unalign): memcpy_fast=156ms rte_memcpy=156ms memcpy=499ms

    benchmark(size=4096 bytes, times=524288):
    (dst aligned, src aligned): memcpy_fast=62ms rte_memcpy=47ms memcpy=78ms
    (dst aligned, src unalign): memcpy_fast=109ms rte_memcpy=78ms memcpy=219ms
    (dst unalign, src aligned): memcpy_fast=78ms rte_memcpy=78ms memcpy=203ms
    (dst unalign, src unalign): memcpy_fast=78ms rte_memcpy=63ms memcpy=250ms

    benchmark(size=8192 bytes, times=262144):
    (dst aligned, src aligned): memcpy_fast=78ms rte_memcpy=47ms memcpy=63ms
    (dst aligned, src unalign): memcpy_fast=94ms rte_memcpy=63ms memcpy=203ms
    (dst unalign, src aligned): memcpy_fast=78ms rte_memcpy=62ms memcpy=202ms
    (dst unalign, src unalign): memcpy_fast=78ms rte_memcpy=62ms memcpy=203ms

    benchmark(size=1048576 bytes, times=2048):
    (dst aligned, src aligned): memcpy_fast=218ms rte_memcpy=219ms memcpy=187ms
    (dst aligned, src unalign): memcpy_fast=219ms rte_memcpy=265ms memcpy=296ms
    (dst unalign, src aligned): memcpy_fast=218ms rte_memcpy=265ms memcpy=312ms
    (dst unalign, src unalign): memcpy_fast=218ms rte_memcpy=249ms memcpy=281ms

    benchmark(size=4194304 bytes, times=512):
    (dst aligned, src aligned): memcpy_fast=312ms rte_memcpy=437ms memcpy=422ms
    (dst aligned, src unalign): memcpy_fast=281ms rte_memcpy=422ms memcpy=440ms
    (dst unalign, src aligned): memcpy_fast=327ms rte_memcpy=405ms memcpy=437ms
    (dst unalign, src unalign): memcpy_fast=327ms rte_memcpy=422ms memcpy=421ms

    benchmark(size=8388608 bytes, times=256):
    (dst aligned, src aligned): memcpy_fast=312ms rte_memcpy=406ms memcpy=390ms
    (dst aligned, src unalign): memcpy_fast=283ms rte_memcpy=421ms memcpy=439ms
    (dst unalign, src aligned): memcpy_fast=312ms rte_memcpy=407ms memcpy=484ms
    (dst unalign, src unalign): memcpy_fast=297ms rte_memcpy=390ms memcpy=423ms

    benchmark random access:
    memcpy_fast=517ms rte_memcpy=592ms memcpy=1014ms

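I won't analyze the numbers here; take a look at the data above yourselves. If you are interested, see the discussion in the comments and the code, or run the code yourself to see how it behaves with different compilers, standard libraries, and machines.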

