System.arraycopy mit konstanter Länge

Ich spiele mit JMH herum (http://openjdk.java.net/projects/code-tools/jmh/) und bin gerade auf ein seltsames Ergebnis gestoßen.System.arraycopy mit konstanter Länge

Ich Wege Benchmarking eine flache Kopie eines Arrays zu machen, und ich kann die erwarteten Ergebnisse beobachten (das durch das Array Looping eine schlechte Idee ist und dass es kein signifikanter Unterschied zwischen #clone(), System#arraycopy() und Arrays#copyOf(), leistungs- weise).

Außer dass System#arraycopy() um ein Viertel langsamer ist, wenn die Länge des Arrays fest codiert ist ... Warte, was? Wie kann das langsamer sein?

Hat jemand eine Idee von was könnte die Ursache sein?

Die Ergebnisse (Durchsatz):

# JMH 1.11 (released 17 days ago) 
# VM version: JDK 1.8.0_05, VM 25.5-b02 
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre/bin/java 
# VM options: -Dfile.encoding=UTF-8 -Duser.country=FR -Duser.language=fr -Duser.variant 
# Warmup: 20 iterations, 1 s each 
# Measurement: 20 iterations, 1 s each 
# Timeout: 10 min per iteration 
# Threads: 1 thread, will synchronize iterations 
# Benchmark mode: Throughput, ops/time 

Benchmark           Mode Cnt   Score   Error Units 
ArrayCopyBenchmark.ArraysCopyOf      thrpt 20 67100500,319 ± 455252,537 ops/s 
ArrayCopyBenchmark.ArraysCopyOf_Class    thrpt 20 65246374,290 ± 976481,330 ops/s 
ArrayCopyBenchmark.ArraysCopyOf_Class_ConstantSize thrpt 20 65068143,162 ± 1597390,531 ops/s 
ArrayCopyBenchmark.ArraysCopyOf_ConstantSize  thrpt 20 64463603,462 ± 953946,811 ops/s 
ArrayCopyBenchmark.Clone       thrpt 20 64837239,393 ± 834353,404 ops/s 
ArrayCopyBenchmark.Loop        thrpt 20 21070422,097 ± 112595,764 ops/s 
ArrayCopyBenchmark.Loop_ConstantSize    thrpt 20 24458867,274 ± 181486,291 ops/s 
ArrayCopyBenchmark.SystemArrayCopy     thrpt 20 66688368,490 ± 582416,954 ops/s 
ArrayCopyBenchmark.SystemArrayCopy_ConstantSize  thrpt 20 48992312,357 ± 298807,039 ops/s

Und die Benchmark-Klasse:

import java.util.Arrays; 
import java.util.concurrent.TimeUnit; 

import org.openjdk.jmh.annotations.Benchmark; 
import org.openjdk.jmh.annotations.BenchmarkMode; 
import org.openjdk.jmh.annotations.Mode; 
import org.openjdk.jmh.annotations.OutputTimeUnit; 
import org.openjdk.jmh.annotations.Scope; 
import org.openjdk.jmh.annotations.Setup; 
import org.openjdk.jmh.annotations.State; 

@State(Scope.Benchmark) 
@BenchmarkMode(Mode.Throughput) 
@OutputTimeUnit(TimeUnit.SECONDS) 
public class ArrayCopyBenchmark { 

    private static final int LENGTH = 32; 

    private Object[] array; 

    @Setup 
    public void before() { 
     array = new Object[LENGTH]; 
     for (int i = 0; i < LENGTH; i++) { 
      array[i] = new Object(); 
     } 
    } 

    @Benchmark 
    public Object[] Clone() { 
     Object[] src = this.array; 
     return src.clone(); 
    } 

    @Benchmark 
    public Object[] ArraysCopyOf() { 
     Object[] src = this.array; 
     return Arrays.copyOf(src, src.length); 
    } 

    @Benchmark 
    public Object[] ArraysCopyOf_ConstantSize() { 
     Object[] src = this.array; 
     return Arrays.copyOf(src, LENGTH); 
    } 

    @Benchmark 
    public Object[] ArraysCopyOf_Class() { 
     Object[] src = this.array; 
     return Arrays.copyOf(src, src.length, Object[].class); 
    } 

    @Benchmark 
    public Object[] ArraysCopyOf_Class_ConstantSize() { 
     Object[] src = this.array; 
     return Arrays.copyOf(src, LENGTH, Object[].class); 
    } 

    @Benchmark 
    public Object[] SystemArrayCopy() { 
     Object[] src = this.array; 
     int length = src.length; 
     Object[] array = new Object[length]; 
     System.arraycopy(src, 0, array, 0, length); 
     return array; 
    } 

    @Benchmark 
    public Object[] SystemArrayCopy_ConstantSize() { 
     Object[] src = this.array; 
     Object[] array = new Object[LENGTH]; 
     System.arraycopy(src, 0, array, 0, LENGTH); 
     return array; 
    } 

    @Benchmark 
    public Object[] Loop() { 
     Object[] src = this.array; 
     int length = src.length; 
     Object[] array = new Object[length]; 
     for (int i = 0; i < length; i++) { 
      array[i] = src[i]; 
     } 
     return array; 
    } 

    @Benchmark 
    public Object[] Loop_ConstantSize() { 
     Object[] src = this.array; 
     Object[] array = new Object[LENGTH]; 
     for (int i = 0; i < LENGTH; i++) { 
      array[i] = src[i]; 
     } 
     return array; 
    } 
}

Quelle

2015-09-29 omiel

seltsam, ich dachte, Arrays.copyof intern System.arraycopy heißt es sollte einen niedrigeren Durchsatz per Definition haben, wie oft haben Sie versucht, diesen Benchmark zu laufen? – MahdeTo

Versuchen Sie, Ihr Warmup zu erhöhen.20 reicht nicht aus, um JIT zu starten. – Andreas

@MahdeTo 5 mal. Es hat die Ergebnisse nicht viel verändert. – omiel

Wie üblich, werden diese Art von Fragen durch das Studium des generierten Codes schnell beantwortet. JMH bietet Ihnen -prof perfasm unter Linux und -prof xperfasm unter Windows. Wenn Sie die Benchmark auf JDK 8u40 laufen, dann werden Sie sehen (man beachte ich -bm avgt -tu ns verwendet, um Partituren verständlicher zu machen):

Benchmark       Mode Cnt Score Error Units 
ACB.SystemArrayCopy    avgt 25 13.294 ± 0.052 ns/op 
ACB.SystemArrayCopy_ConstantSize avgt 25 16.413 ± 0.080 ns/op

Warum werden diese Benchmarks anders verhalten? erste -prof perfnorm Lassen Sie uns tun sezieren (I ließ die Linien, die sind nicht wichtig):

Benchmark          Mode Cnt Score Error Units 
ACB.SAC          avgt 25 13.466 ± 0.070 ns/op 
ACB.SAC:·CPI         avgt 5 0.602 ± 0.025 #/op 
ACB.SAC:·L1-dcache-load-misses    avgt 5 2.346 ± 0.239 #/op 
ACB.SAC:·L1-dcache-loads      avgt 5 24.756 ± 1.438 #/op 
ACB.SAC:·L1-dcache-store-misses    avgt 5 2.404 ± 0.129 #/op 
ACB.SAC:·L1-dcache-stores      avgt 5 14.929 ± 0.230 #/op 
ACB.SAC:·LLC-loads       avgt 5 2.151 ± 0.217 #/op 
ACB.SAC:·branches        avgt 5 17.795 ± 1.003 #/op 
ACB.SAC:·cycles        avgt 5 56.677 ± 3.187 #/op 
ACB.SAC:·instructions       avgt 5 94.145 ± 6.442 #/op 

ACB.SAC_ConstantSize       avgt 25 16.447 ± 0.084 ns/op 
ACB.SAC_ConstantSize:·CPI      avgt 5 0.637 ± 0.016 #/op 
ACB.SAC_ConstantSize:·L1-dcache-load-misses avgt 5 2.357 ± 0.206 #/op 
ACB.SAC_ConstantSize:·L1-dcache-loads   avgt 5 25.611 ± 1.482 #/op 
ACB.SAC_ConstantSize:·L1-dcache-store-misses avgt 5 2.368 ± 0.123 #/op 
ACB.SAC_ConstantSize:·L1-dcache-stores  avgt 5 25.593 ± 1.610 #/op 
ACB.SAC_ConstantSize:·LLC-loads    avgt 5 1.050 ± 0.038 #/op 
ACB.SAC_ConstantSize:·branches    avgt 5 17.853 ± 0.697 #/op 
ACB.SAC_ConstantSize:·cycles     avgt 5 66.680 ± 2.049 #/op 
ACB.SAC_ConstantSize:·instructions   avgt 5 104.759 ± 4.831 #/op

So ConstantSize irgendwie tut mehr L1-dcache-Shops, aber eine weniger LLC-Last. Hm, das ist, was wir suchen, mehr Läden im ständigen Fall. -prof perfasm Highlights bequem die heißen Teile bei der Montage:

default:

4.32% 6.36% 0x00007f7714bda2dc: movq $0x1,(%rax)   ; alloc 
    0.09% 0.04% 0x00007f7714bda2e3: prefetchnta 0x100(%r9) 
    2.95% 1.48% 0x00007f7714bda2eb: movl $0xf80022a9,0x8(%rax) 
    0.38% 0.18% 0x00007f7714bda2f2: mov %r11d,0xc(%rax) 
    1.56% 3.02% 0x00007f7714bda2f6: prefetchnta 0x140(%r9) 
    4.73% 2.71% 0x00007f7714bda2fe: prefetchnta 0x180(%r9)

ConstantSize:

0.58% 1.22% 0x00007facf921132b: movq $0x1,(%r14)   ; alloc 
    0.84% 0.72% 0x00007facf9211332: prefetchnta 0xc0(%r10) 
    0.11% 0.13% 0x00007facf921133a: movl $0xf80022a9,0x8(%r14) 
    0.21% 0.68% 0x00007facf9211342: prefetchnta 0x100(%r10) 
    0.50% 0.87% 0x00007facf921134a: movl $0x20,0xc(%r14) 
    0.53% 0.82% 0x00007facf9211352: mov $0x10,%ecx 
    0.04% 0.14% 0x00007facf9211357: xor %rax,%rax 
    0.34% 0.76% 0x00007facf921135a: shl $0x3,%rcx 
    0.50% 1.17% 0x00007facf921135e: rex.W rep stos %al,%es:(%rdi) ; zeroing 
29.49% 52.09% 0x00007facf9211361: prefetchnta 0x140(%r10) 
    1.03% 0.53% 0x00007facf9211369: prefetchnta 0x180(%r10)

So gibt es, dass nervtötende rex.W rep stos %al,%es:(%rdi) eine erhebliche Zeit in Anspruch. Dadurch wird das neu zugewiesene Array auf Nullen gesetzt. In ConstantSize Test konnte die JVM nicht korrelieren, dass Sie das gesamte Ziel-Array überschreiben, und so musste es vor dem Tauchen in die tatsächliche Array-Kopie vor-Null.

Wenn Sie auf den generierten Code aussehen auf JDK 9b82 (die letzten verfügbaren), dann werden Sie sehen, es beide Muster in nicht-Null gestellte Kopie klappt, wie Sie mit -prof perfasm sehen können, und mit -prof perfnorm auch bestätigen können:

Natürlich sind all diese Nanobremsungen für die Arraycopy anfällig für seltsame, durch die Ausrichtung hervorgerufene Leistungsunterschiede in den vektorisierten Kopierstubs, aber das ist eine andere (Horror-) Geschichte, die ich nicht sagen kann.

Quelle

2015-09-30 14:06:49

Vielen Dank für diese großartige Antwort. Ich vermutete, dass die JVM nicht schließen konnte, dass ich die gesamte Länge des Arrays benutzte, aber ich wusste nicht, wie ich das beweisen sollte. – omiel

Ich habe versucht, Ihre Ergebnisse zu duplizieren, aber ich denke, dass 'perfasm' /' xperfasm' nicht für Mac OSX verfügbar ist. – omiel

Ja, 'perfasm' ist unter Linux verfügbar,' xperfasm' ist unter Windows verfügbar. Wenn es eine Möglichkeit gibt, die Daten zu erhalten, welche Programmadressen unter Mac OS X heiß sind, können wir das dort auch implementieren. Patches willkommen. –

System.arraycopy mit konstanter Länge

Antwort

Verwandte Themen