Warum indirekte Inkrement ist schneller als direkte inc?

Die Frage wurde von einem anderen SO-Mitglied gestellt, wurde aber enttäuschend gelöscht. Die Kommentare sagten, dass die Messungen fehlerhaft sind und keinen Sinn ergeben.Warum indirekte Inkrement ist schneller als direkte inc?

Allerdings war ich in der Lage, das ursprüngliche Problem mit einem kleinen Maßstab unter JMH zu reproduzieren:

package bench; 

import org.openjdk.jmh.annotations.*; 
import org.openjdk.jmh.runner.*; 
import org.openjdk.jmh.runner.options.*; 
import java.util.concurrent.*; 

@State(Scope.Benchmark) 
public class LoopInc { 

    private int getValue() { 
     return ThreadLocalRandom.current().nextInt(2); 
    } 

    @Benchmark 
    public int directInc() { 
     int result = 0; 
     for (int i = 0; i < 1000; i++) { 
      switch (getValue()) { 
       case 0: 
        break; 
       case 1: 
        result++; 
        break; 
      } 
     } 
     return result; 
    } 

    @Benchmark 
    public int indirectInc() { 
     int result = 0; 
     for (int i = 0; i < 1000; i++) { 
      boolean incr = false; 
      switch (getValue()) { 
       case 0: 
        break; 
       case 1: 
        incr = true; 
        break; 
      } 

      if (incr) { 
       result++; 
      } 
     } 
     return result; 
    } 

    public static void main(String[] args) throws RunnerException { 
     Options options = new OptionsBuilder() 
       .include("bench.LoopInc.*") 
       .warmupIterations(5) 
       .measurementIterations(10) 
       .forks(3) 
       .timeUnit(TimeUnit.MILLISECONDS) 
       .build(); 
     new Runner(options).run(); 
    } 
}

Die Benchmarks zeigt indirectInc Werke 3-mal schneller, obwohl die „Optimierung“ überhaupt nicht auf der Hand. Man würde annehmen, indirectInc sollte etwas langsamer arbeiten, da es sich um eine zusätzliche Zwischenoperation handelt.

Benchmark    Mode Cnt Score Error Units 
LoopInc.directInc thrpt 30 127,301 ± 0,202 ops/ms 
LoopInc.indirectInc thrpt 30 378,147 ± 1,144 ops/ms

java version "1.8.0_51" 
Java(TM) SE Runtime Environment (build 1.8.0_51-b16) 
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)

Was JIT verursacht als ähnlich zu kompilieren indirectInc besser directInc?

Quelle

2015-07-16 apangin

Wie bewerten Sie das? –

Irgendwas stimmt mit Ihrem Benchmark-Ansatz nicht. Die Frage ist: Was? – Kon

Könnte ein Aufwärmproblem sein. Versuchen Sie, die Reihenfolge Ihrer Tests umzukehren – ControlAltDel

Ok, so nähern Sie sich diesen Dingen.

Versuchen Sie, es zu reproduzieren. Okay, es wiedergibt:

Benchmark    Mode Cnt Score Error Units 
LoopInc.directInc thrpt 15 175.678 ± 1.118 ops/ms 
LoopInc.indirectInc thrpt 15 641.413 ± 9.722 ops/ms

Versuchen Sie, die generierte Assembly mit -prof perfasm zu sehen. Lange Rede, kurzer Sinn, es produziert viel generierten Code, und deshalb wollen wir wahrscheinlich das Loop-Enrolling begrenzen. Aber es kann die Leistung beeinflussen und kann so ziemlich die Ursache sein. Also, lassen Sie uns erneut mit -XX:LoopUnrollLimit=1 laufen. Okay, ist das Ergebnis niedriger, aber der Unterschied ist immer noch da, ausgezeichnet:
```
Benchmark    Mode Cnt Score Error Units 
LoopInc.directInc thrpt 15 161.147 ± 6.101 ops/ms 
LoopInc.indirectInc thrpt 15 489.430 ± 1.698 ops/ms 
```

noch einen Blick auf den generierten Code, noch nichts, was unser Auge herausspringt. Nun, das scheint interessant zu sein. Lass uns das richtig machen. Können wir die Arbeitsbelastung charakterisieren? Natürlich können wir mit Hilfe von -prof perfnorm die Hardware-Zähler per Benchmark op. 7 normalisieren. Mal sehen:

Benchmark          Mode Cnt  Score  Error Units 
LoopInc.directInc       thrpt 15 161.875 ± 3.038 ops/ms 
LoopInc.directInc:·CPI      thrpt 3  0.967 ± 0.196 #/op 
LoopInc.directInc:·L1-dcache-load-misses  thrpt 3  0.394 ± 3.663 #/op 
LoopInc.directInc:·L1-dcache-loads   thrpt 3 2149.594 ± 228.166 #/op 
LoopInc.directInc:·L1-dcache-store-misses thrpt 3  0.114 ± 1.001 #/op 
LoopInc.directInc:·L1-dcache-stores   thrpt 3 1073.666 ± 96.066 #/op 
LoopInc.directInc:·L1-icache-load-misses  thrpt 3  0.965 ± 22.984 #/op 
LoopInc.directInc:·LLC-loads     thrpt 3  0.204 ± 2.763 #/op 
LoopInc.directInc:·LLC-stores    thrpt 3  0.060 ± 0.633 #/op 
LoopInc.directInc:·branch-misses    thrpt 3 536.068 ± 43.293 #/op 
LoopInc.directInc:·branches     thrpt 3 3728.890 ± 220.539 #/op 
LoopInc.directInc:·cycles     thrpt 3 26219.146 ± 6287.590 #/op 
LoopInc.directInc:·dTLB-load-misses   thrpt 3  0.063 ± 0.124 #/op 
LoopInc.directInc:·dTLB-loads    thrpt 3 2136.942 ± 165.990 #/op 
LoopInc.directInc:·dTLB-store-misses   thrpt 3  0.022 ± 0.029 #/op 
LoopInc.directInc:·dTLB-stores    thrpt 3 1084.787 ± 417.281 #/op 
LoopInc.directInc:·iTLB-load-misses   thrpt 3  0.081 ± 0.333 #/op 
LoopInc.directInc:·iTLB-loads    thrpt 3  3.623 ± 19.955 #/op 
LoopInc.directInc:·instructions    thrpt 3 27114.052 ± 1843.720 #/op 

LoopInc.indirectInc       thrpt 15 489.164 ± 2.692 ops/ms 
LoopInc.indirectInc:·CPI      thrpt 3  0.281 ± 0.015 #/op 
LoopInc.indirectInc:·L1-dcache-load-misses thrpt 3  0.503 ± 9.071 #/op 
LoopInc.indirectInc:·L1-dcache-loads   thrpt 3 2149.806 ± 369.040 #/op 
LoopInc.indirectInc:·L1-dcache-store-misses thrpt 3  0.167 ± 1.370 #/op 
LoopInc.indirectInc:·L1-dcache-stores  thrpt 3 1073.895 ± 186.741 #/op 
LoopInc.indirectInc:·L1-icache-load-misses thrpt 3  0.313 ± 1.275 #/op 
LoopInc.indirectInc:·branch-misses   thrpt 3  1.102 ± 0.375 #/op 
LoopInc.indirectInc:·branches    thrpt 3 2143.670 ± 228.475 #/op 
LoopInc.indirectInc:·cycles     thrpt 3 8701.665 ± 706.183 #/op 
LoopInc.indirectInc:·dTLB-load-misses  thrpt 3  0.020 ± 0.301 #/op 
LoopInc.indirectInc:·dTLB-loads    thrpt 3 2141.965 ± 135.852 #/op 
LoopInc.indirectInc:·dTLB-store-misses  thrpt 3  0.002 ± 0.029 #/op 
LoopInc.indirectInc:·dTLB-stores    thrpt 3 1070.376 ± 81.445 #/op 
LoopInc.indirectInc:·iTLB-load-misses  thrpt 3  0.007 ± 0.135 #/op 
LoopInc.indirectInc:·iTLB-loads    thrpt 3  0.310 ± 5.768 #/op 
LoopInc.indirectInc:·instructions   thrpt 3 30968.207 ± 3627.540 #/op

Oh, beide Benchmarks haben vergleichbare Anzahl von Anweisungen. Der langsamere braucht mehr Zyklen (daher ist CPI auch nicht ideal in directInc; indirectInc erzeugt jedoch einen nahezu idealen CPI). Wenn man sich die möglichen Ursachen genauer ansieht: Es gibt nicht viele Cache-Fehler, nicht viele TLB-Fehler, aber der langsame Benchmark hat viele Verzweigungsfehler. AHA! Jetzt wissen wir, was im generierten Code aussehen soll.

Lassen Sie uns den generierten Code erneut betrachten. -prof perfasm hebt die Sprünge bequem hervor. Und dann sehen Sie, diese ...

directInc:

    ╭│  0x00007fa0a82a50ff: jmp 0x00007fa0a82a5116 
11.39% 16.90% ││ ↗ 0x00007fa0a82a5101: inc %edx    ;*iinc 
        ││ │             ; - org.openjdk.LoopInc::[email protected] (line 18) 
12.52% 23.11% ││ │↗↗ 0x00007fa0a82a5103: mov %r10,0xe8(%r11) ;*invokevirtual putLong 
        ││ │││            ; - java.util.concurrent.ThreadLocalRandom::[email protected] (line 241) 
12.00% 8.14% ││ │││ 0x00007fa0a82a510a: inc %r8d    ;*iinc 
        ││ │││            ; - org.openjdk.LoopInc::[email protected] (line 18) 
    0.03% 0.03% ││ │││ 0x00007fa0a82a510d: cmp $0x3e8,%r8d 
        │╰ │││ 0x00007fa0a82a5114: jge 0x00007fa0a82a50c7 ;*aload_0 
        │ │││            ; - org.openjdk.LoopInc::[email protected] (line 19) 
    0.80% 0.91% ↘ │││ 0x00007fa0a82a5116: mov 0xf0(%r11),%r10d ;*invokevirtual getInt 
        │││            ; - java.util.concurrent.ThreadLocalRandom::[email protected] (line 222) 
    4.28% 1.23%  │││ 0x00007fa0a82a511d: test %r10d,%r10d 
        ╭│││ 0x00007fa0a82a5120: je  0x00007fa0a82a517b ;*ifne 
        ││││            ; - java.util.concurrent.ThreadLocalRandom::[email protected] (line 222) 
    2.11% 0.01% ││││ 0x00007fa0a82a5122: movabs $0x9e3779b97f4a7c15,%r10 
    0.01% 0.07% ││││ 0x00007fa0a82a512c: add 0xe8(%r11),%r10 ;*ladd 
        ││││            ; - java.util.concurrent.ThreadLocalRandom::[email protected] (line 242) 
    7.73% 1.89% ││││ 0x00007fa0a82a5133: mov %r10,%r9 
    1.21% 1.84% ││││ 0x00007fa0a82a5136: shr $0x21,%r9 
    1.90% 0.03% ││││ 0x00007fa0a82a513a: xor %r10,%r9 
    2.02% 0.03% ││││ 0x00007fa0a82a513d: movabs $0xff51afd7ed558ccd,%rcx 
    0.94% 1.82% ││││ 0x00007fa0a82a5147: imul %rcx,%r9   ;*lmul 
        ││││            ; - java.util.concurrent.ThreadLocalRandom::[email protected] (line 182) 
    7.01% 2.40% ││││ 0x00007fa0a82a514b: mov %r9,%rcx 
        ││││ 0x00007fa0a82a514e: shr $0x21,%rcx 
    1.89% 0.70% ││││ 0x00007fa0a82a5152: xor %r9,%rcx 
    3.11% 2.55% ││││ 0x00007fa0a82a5155: movabs $0xc4ceb9fe1a85ec53,%r9 
    0.99% 1.50% ││││ 0x00007fa0a82a515f: imul %r9,%rcx 
    7.66% 2.89% ││││ 0x00007fa0a82a5163: shr $0x20,%rcx 
    3.70% 1.97% ││││ 0x00007fa0a82a5167: mov %ecx,%r9d 
      0.11% ││││ 0x00007fa0a82a516a: and $0x1,%r9d   ;*iand 
        ││││            ; - java.util.concurrent.ThreadLocalRandom::[email protected] (line 356) 
    3.76% 11.13% ││││ 0x00007fa0a82a516e: cmp $0x1,%r9d 
        │╰││ 0x00007fa0a82a5172: je  0x00007fa0a82a5101 
10.48% 16.62% │ ││ 0x00007fa0a82a5174: test %r9d,%r9d 
        │ ╰│ 0x00007fa0a82a5177: je  0x00007fa0a82a5103 ;*lookupswitch 
        │ │            ; - org.openjdk.LoopInc::[email protected] (line 19) 
        │ ╰ 0x00007fa0a82a5179: jmp 0x00007fa0a82a5103 ;*aload_0 
        │             ; - org.openjdk.LoopInc::[email protected] (line 19) 
        ↘  0x00007fa0a82a517b: mov $0xffffff5d,%esi

indirectInc:

0.01% 0.01% ↗ 0x00007f65588d8260: mov %edx,%r9d 
    0.01%   │ 0x00007f65588d8263: nopw 0x0(%rax,%rax,1) 
11.99% 11.38% │ 0x00007f65588d826c: data16 data16 xchg %ax,%ax ;*iconst_0 
        │            ; - org.openjdk.LoopInc::[email protected] (line 34) 
        │ 0x00007f65588d8270: mov 0xf0(%r8),%r10d ;*invokevirtual getInt 
        │            ; - java.util.concurrent.ThreadLocalRandom::[email protected] (line 222) 
        │ 0x00007f65588d8277: test %r10d,%r10d 
        │ 0x00007f65588d827a: je  0x00007f65588d8331 ;*ifne 
        │            ; - java.util.concurrent.ThreadLocalRandom::[email protected] (line 222) 
    0.01%   │ 0x00007f65588d8280: movabs $0x9e3779b97f4a7c15,%r10 
11.80% 11.49% │ 0x00007f65588d828a: add 0xe8(%r8),%r10  ;*ladd 
        │            ; - java.util.concurrent.ThreadLocalRandom::[email protected] (line 242) 
    0.01% 0.01% │ 0x00007f65588d8291: mov %r10,0xe8(%r8)  ;*invokevirtual putLong 
        │            ; - java.util.concurrent.ThreadLocalRandom::[email protected] (line 241) 
        │ 0x00007f65588d8298: mov %r9d,%edx 
    0.01% 0.01% │ 0x00007f65588d829b: inc %edx 
11.12% 12.40% │ 0x00007f65588d829d: mov %r10,%rcx 
      0.01% │ 0x00007f65588d82a0: shr $0x21,%rcx 
    0.03%   │ 0x00007f65588d82a4: xor %r10,%rcx 
    0.06% 0.03% │ 0x00007f65588d82a7: movabs $0xff51afd7ed558ccd,%r10 
12.38% 13.94% │ 0x00007f65588d82b1: imul %r10,%rcx   ;*lmul 
        │            ; - java.util.concurrent.ThreadLocalRandom::[email protected] (line 182) 
    0.03% 0.01% │ 0x00007f65588d82b5: mov %rcx,%r10 
        │ 0x00007f65588d82b8: shr $0x21,%r10 
      0.03% │ 0x00007f65588d82bc: xor %rcx,%r10 
11.43% 12.62% │ 0x00007f65588d82bf: movabs $0xc4ceb9fe1a85ec53,%rcx 
      0.01% │ 0x00007f65588d82c9: imul %rcx,%r10 
    0.34% 0.30% │ 0x00007f65588d82cd: shr $0x20,%r10 
    0.85% 0.76% │ 0x00007f65588d82d1: mov %r10d,%r10d 
11.81% 11.51% │ 0x00007f65588d82d4: and $0x1,%r10d 
    2.16% 1.78% │ 0x00007f65588d82d8: cmp $0x1,%r10d 
    3.45% 3.00% │ 0x00007f65588d82dc: cmovne %r9d,%edx   <----- HERE IT IS 
17.55% 15.86% │ 0x00007f65588d82e0: inc %r11d    ;*iinc 
        │            ; - org.openjdk.LoopInc::[email protected] (line 33) 
        │ 0x00007f65588d82e3: cmp $0x3e8,%r11d 
        ╰ 0x00007f65588d82ea: jl  0x00007f65588d8260 ;*if_icmpge 
                  ; - org.openjdk.LoopInc::[email protected] (line 33)

Beachten Sie die cmovne statt jmp - das ist, warum wir mehr "haben vorhersagbare "Zweige. HotSpot profiliert die Zweige und gibt die bedingte Bewegung aus, wenn der Zweigprofilzweig sehr flach ist.Mit anderen Worten, weiche einer sehr wahrscheinlichen Verzweigungsfehlvorhersage aus, indem du etwas für die zusätzliche Latenz der bedingten Bewegung bezahlst. In diesem Fall ist der Schalter jedoch speziell: Er hat mehr als zwei Alternativen (0, 1 und "nichts"). Deshalb spekuliere ich, dass das result Inkrement nicht in cmov gefaltet wird. (Im Allgemeinen haben HotSpot könnte Null auf result in „default“ gespeichert, aber es blies es, na ja)

diese Hypothese zu bestätigen, lassen Sie uns einen directCompleteInc Fall machen, wo wir noch switch verwenden, aber jetzt decken alle die Fälle:

@Benchmark 
public int directCompleteInc() { 
    int result = 0; 
    for (int i = 0; i < 1000; i++) { 
     switch (getValue()) { 
      case 1: 
       result++; 
       break; 
      default: 
       break; 
     } 
    } 
    return result; 
}

... und messen sie, und diesmal ohne Optionen, wie die OP tat

Benchmark     Mode Cnt Score Error Units 
LoopInc.directCompleteInc thrpt 5 644.414 ± 0.371 ops/ms 
LoopInc.directInc   thrpt 5 174.974 ± 0.103 ops/ms 
LoopInc.indirectInc  thrpt 5 644.015 ± 0.533 ops/ms

eS.

Bestätigen directCompleteInc verwendet cmov mit -prof perfasm. Es tut.
Trinken.

Quelle

2015-07-17 20:00:33

Ja, ich habe auch "cmov" anstelle von "jmp" in der generierten Assembly bemerkt und festgestellt, dass beide Fälle gleichermaßen langsam arbeiten, wenn ich 'cmov's mit' deaktiviere -XX: ConditionalMoveLimit = 0'. Es ist jedoch immer noch unklar, warum HotSpot im ersten Fall 'cmov' nicht erzeugen kann. Ich habe auch überprüft, dass beide Fälle gleich schnell funktionieren, wenn ich eine zufällige Sequenz durch eine vorhersagbare Sequenz ersetze, so dass "perf" eine viel kleinere Anzahl falsch vorhergesagter Zweige anzeigt. Wie auch immer, das ist eine großartige Antwort! – apangin

BTW, gibt es ein Analogon von "-ProfPernorm" für Windows? – apangin

@apangin: Windows hat '-prof xperfasm', aber nicht' perfnorm'. Mir ist 'Perf-Statistik'-Alternative für Windows nicht bekannt, also ... –

Warum indirekte Inkrement ist schneller als direkte inc?

Antwort

Verwandte Themen