Implementierung von SSE 4.2 des CRC32C in Software

So habe ich ein Design, das CRC32C Prüfsummen enthält Daten, um sicherzustellen, nicht beschädigt wurde. Ich entschied mich dafür, CRC32C zu verwenden, weil ich sowohl eine Softwareversion als auch eine hardwarebeschleunigte Version haben kann, wenn der Computer, auf dem die Software läuft, SSE 4.2Implementierung von SSE 4.2 des CRC32C in Software

Ich gehe durch Intels Entwicklerhandbuch (Vol. 2A), das zu bieten scheint der Algorithmus hinter der crc32 Anweisung. Ich habe jedoch wenig Glück. Intel-Entwicklerhandbuch sagt die folgende:

BIT_REFLECT32: DEST[31-0] = SRC[0-31] 
MOD2: Remainder from Polynomial division modulus 2 

TEMP1[31-0] <- BIT_REFLECT(SRC[31-0]) 
TEMP2[31-0] <- BIT_REFLECT(DEST[31-0]) 
TEMP3[63-0] <- TEMP1[31-0] << 32 
TEMP4[63-0] <- TEMP2[31-0] << 32 
TEMP5[63-0] <- TEMP3[63-0] XOR TEMP4[63-0] 
TEMP6[31-0] <- TEMP5[63-0] MOD2 0x11EDC6F41 
DEST[31-0] <- BIT_REFLECT(TEMP6[31-0])

, nun soweit ich sagen kann, ich habe alles getan, bis zur Linie TEMP6 richtig starten, aber ich denke, ich entweder die Polynomdivision Missverständnis sein kann, oder die Umsetzung es falsch. Wenn mein Verständnis korrekt ist, 1/1 mod 2 = 1, 0/1 mod 2 = 0, und beide Divides-by-Null sind nicht definiert.

Was ich nicht verstehe ist, wie binäre Division mit 64-Bit und 33-Bit-Operanden arbeiten. Wenn SRC0x00000000 ist und DEST0xFFFFFFFF ist, werden TEMP5[63-32] alle gesetzten Bits sein, während TEMP5[31-0] alle nicht gesetzten Bits sein werden.

Wenn ich die Bits von TEMP5 als Zähler zu verwenden, würde es 30 Divisionen durch Null sein, da das Polynom 11EDC6F41 nur 33 Bits lang ist (und es so umzuwandeln, um ein 64-Bit Integer ohne Vorzeichen läßt die oberen 30 Bits nicht gesetzt), und der Nenner ist für 30 Bits nicht gesetzt.

Wenn ich jedoch das Polynom als Zähler zu verwenden, war, die unteren 32 Bits TEMP5 sind ungesetzt, da in dividieren durch Null ergibt, und die oberen 30 Bits des Ergebnisses wäre Null, wenn die oberen 30 Bits der Zähler wäre Null, wie 0/1 mod 2 = 0.

Bin ich Missverständnis, wie das funktioniert? Einfach nur etwas fehlt? Oder hat Intel wichtige Schritte in der Dokumentation ausgelassen?

Der Grund, warum ich Intels Entwickler-Guide für den scheinbar verwendeten Algorithmus aufgerufen habe, ist, dass sie ein 33-Bit-Polynom verwendeten, und ich wollte Outputs identisch machen, was nicht passierte, wenn ich den 32- Bitpolynom (siehe unten).

uint32_t poly = 0x1EDC6F41, sres, crcTable[256], data = 0x00000000; 

for (n = 0; n < 256; n++) { 
    sres = n; 
    for (k = 0; k < 8; k++) 
     sres = (sres & 1) == 1 ? poly^(sres >> 1) : (sres >> 1); 
    crcTable[n] = sres; 
} 
sres = 0xFFFFFFFF; 

for (n = 0; n < 4; n++) { 
    sres = crcTable[(sres^data) & 0xFF]^(sres >> 8); 
}

Der obige Code erzeugt 4138093821 als Ausgang und die crc32 Opcode erzeugt 2346497208 unter Verwendung der Eingabe 0x00000000.

Sorry, wenn dies schlecht geschrieben oder unverständlich an Orten, ist es ziemlich spät für mich.

Quelle

2013-07-15 LMS

Für diejenigen, die Delphi verwenden, habe ich [Open Source Code geschrieben] (http://blog.synopse.info/post/2014/05/25/New-crc32c%28%29-function-using-optimized -asm-and-SSE-4.2-Befehl) mit der neuen 'crc32'-Hardware-Anweisung, falls verfügbar, und schnellem x86-Asm oder reinem Pascal-Code (unter Verwendung von vorberechneten Tabellen), wenn SSE 4.2 nicht verfügbar ist. Naive-Version rollt mit 330 MB/s, optimierte x86-Asr-Version mit 1,7 GB/s und SSE 4.2-Hardware mit erstaunlicher Geschwindigkeit von 3,7 GB/s (auf Win32- und Win64-Plattformen). –

Hier sind sowohl Software als auch Hardware-Versionen von CRC-32C. Die Softwareversion ist optimiert, um jeweils acht Bytes zu verarbeiten. Die Hardware-Version ist optimiert auf einem einzigen Kern drei crc32q Anweisungen effektiv parallel zu laufen, da der Durchsatz dieses Befehls ein Zyklus ist, aber die Latenzzeit beträgt drei Zyklen.

/* crc32c.c -- compute CRC-32C using the Intel crc32 instruction 
* Copyright (C) 2013 Mark Adler 
* Version 1.1 1 Aug 2013 Mark Adler 
*/ 

/* 
    This software is provided 'as-is', without any express or implied 
    warranty. In no event will the author be held liable for any damages 
    arising from the use of this software. 

    Permission is granted to anyone to use this software for any purpose, 
    including commercial applications, and to alter it and redistribute it 
    freely, subject to the following restrictions: 

    1. The origin of this software must not be misrepresented; you must not 
    claim that you wrote the original software. If you use this software 
    in a product, an acknowledgment in the product documentation would be 
    appreciated but is not required. 
    2. Altered source versions must be plainly marked as such, and must not be 
    misrepresented as being the original software. 
    3. This notice may not be removed or altered from any source distribution. 

    Mark Adler 
    [email protected] 
*/ 

/* Use hardware CRC instruction on Intel SSE 4.2 processors. This computes a 
    CRC-32C, *not* the CRC-32 used by Ethernet and zip, gzip, etc. A software 
    version is provided as a fall-back, as well as for speed comparisons. */ 

/* Version history: 
    1.0 10 Feb 2013 First version 
    1.1 1 Aug 2013 Correct comments on why three crc instructions in parallel 
*/ 

#include <stdio.h> 
#include <stdlib.h> 
#include <stdint.h> 
#include <unistd.h> 
#include <pthread.h> 

/* CRC-32C (iSCSI) polynomial in reversed bit order. */ 
#define POLY 0x82f63b78 

/* Table for a quadword-at-a-time software crc. */ 
static pthread_once_t crc32c_once_sw = PTHREAD_ONCE_INIT; 
static uint32_t crc32c_table[8][256]; 

/* Construct table for software CRC-32C calculation. */ 
static void crc32c_init_sw(void) 
{ 
    uint32_t n, crc, k; 

    for (n = 0; n < 256; n++) { 
     crc = n; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc = crc & 1 ? (crc >> 1)^POLY : crc >> 1; 
     crc32c_table[0][n] = crc; 
    } 
    for (n = 0; n < 256; n++) { 
     crc = crc32c_table[0][n]; 
     for (k = 1; k < 8; k++) { 
      crc = crc32c_table[0][crc & 0xff]^(crc >> 8); 
      crc32c_table[k][n] = crc; 
     } 
    } 
} 

/* Table-driven software version as a fall-back. This is about 15 times slower 
    than using the hardware instructions. This assumes little-endian integers, 
    as is the case on Intel processors that the assembler code here is for. */ 
static uint32_t crc32c_sw(uint32_t crci, const void *buf, size_t len) 
{ 
    const unsigned char *next = buf; 
    uint64_t crc; 

    pthread_once(&crc32c_once_sw, crc32c_init_sw); 
    crc = crci^0xffffffff; 
    while (len && ((uintptr_t)next & 7) != 0) { 
     crc = crc32c_table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     len--; 
    } 
    while (len >= 8) { 
     crc ^= *(uint64_t *)next; 
     crc = crc32c_table[7][crc & 0xff]^
       crc32c_table[6][(crc >> 8) & 0xff]^
       crc32c_table[5][(crc >> 16) & 0xff]^
       crc32c_table[4][(crc >> 24) & 0xff]^
       crc32c_table[3][(crc >> 32) & 0xff]^
       crc32c_table[2][(crc >> 40) & 0xff]^
       crc32c_table[1][(crc >> 48) & 0xff]^
       crc32c_table[0][crc >> 56]; 
     next += 8; 
     len -= 8; 
    } 
    while (len) { 
     crc = crc32c_table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     len--; 
    } 
    return (uint32_t)crc^0xffffffff; 
} 

/* Multiply a matrix times a vector over the Galois field of two elements, 
    GF(2). Each element is a bit in an unsigned integer. mat must have at 
    least as many entries as the power of two for most significant one bit in 
    vec. */ 
static inline uint32_t gf2_matrix_times(uint32_t *mat, uint32_t vec) 
{ 
    uint32_t sum; 

    sum = 0; 
    while (vec) { 
     if (vec & 1) 
      sum ^= *mat; 
     vec >>= 1; 
     mat++; 
    } 
    return sum; 
} 

/* Multiply a matrix by itself over GF(2). Both mat and square must have 32 
    rows. */ 
static inline void gf2_matrix_square(uint32_t *square, uint32_t *mat) 
{ 
    int n; 

    for (n = 0; n < 32; n++) 
     square[n] = gf2_matrix_times(mat, mat[n]); 
} 

/* Construct an operator to apply len zeros to a crc. len must be a power of 
    two. If len is not a power of two, then the result is the same as for the 
    largest power of two less than len. The result for len == 0 is the same as 
    for len == 1. A version of this routine could be easily written for any 
    len, but that is not needed for this application. */ 
static void crc32c_zeros_op(uint32_t *even, size_t len) 
{ 
    int n; 
    uint32_t row; 
    uint32_t odd[32];  /* odd-power-of-two zeros operator */ 

    /* put operator for one zero bit in odd */ 
    odd[0] = POLY;    /* CRC-32C polynomial */ 
    row = 1; 
    for (n = 1; n < 32; n++) { 
     odd[n] = row; 
     row <<= 1; 
    } 

    /* put operator for two zero bits in even */ 
    gf2_matrix_square(even, odd); 

    /* put operator for four zero bits in odd */ 
    gf2_matrix_square(odd, even); 

    /* first square will put the operator for one zero byte (eight zero bits), 
     in even -- next square puts operator for two zero bytes in odd, and so 
     on, until len has been rotated down to zero */ 
    do { 
     gf2_matrix_square(even, odd); 
     len >>= 1; 
     if (len == 0) 
      return; 
     gf2_matrix_square(odd, even); 
     len >>= 1; 
    } while (len); 

    /* answer ended up in odd -- copy to even */ 
    for (n = 0; n < 32; n++) 
     even[n] = odd[n]; 
} 

/* Take a length and build four lookup tables for applying the zeros operator 
    for that length, byte-by-byte on the operand. */ 
static void crc32c_zeros(uint32_t zeros[][256], size_t len) 
{ 
    uint32_t n; 
    uint32_t op[32]; 

    crc32c_zeros_op(op, len); 
    for (n = 0; n < 256; n++) { 
     zeros[0][n] = gf2_matrix_times(op, n); 
     zeros[1][n] = gf2_matrix_times(op, n << 8); 
     zeros[2][n] = gf2_matrix_times(op, n << 16); 
     zeros[3][n] = gf2_matrix_times(op, n << 24); 
    } 
} 

/* Apply the zeros operator table to crc. */ 
static inline uint32_t crc32c_shift(uint32_t zeros[][256], uint32_t crc) 
{ 
    return zeros[0][crc & 0xff]^zeros[1][(crc >> 8) & 0xff]^
      zeros[2][(crc >> 16) & 0xff]^zeros[3][crc >> 24]; 
} 

/* Block sizes for three-way parallel crc computation. LONG and SHORT must 
    both be powers of two. The associated string constants must be set 
    accordingly, for use in constructing the assembler instructions. */ 
#define LONG 8192 
#define LONGx1 "8192" 
#define LONGx2 "16384" 
#define SHORT 256 
#define SHORTx1 "256" 
#define SHORTx2 "512" 

/* Tables for hardware crc that shift a crc by LONG and SHORT zeros. */ 
static pthread_once_t crc32c_once_hw = PTHREAD_ONCE_INIT; 
static uint32_t crc32c_long[4][256]; 
static uint32_t crc32c_short[4][256]; 

/* Initialize tables for shifting crcs. */ 
static void crc32c_init_hw(void) 
{ 
    crc32c_zeros(crc32c_long, LONG); 
    crc32c_zeros(crc32c_short, SHORT); 
} 

/* Compute CRC-32C using the Intel hardware instruction. */ 
static uint32_t crc32c_hw(uint32_t crc, const void *buf, size_t len) 
{ 
    const unsigned char *next = buf; 
    const unsigned char *end; 
    uint64_t crc0, crc1, crc2;  /* need to be 64 bits for crc32q */ 

    /* populate shift tables the first time through */ 
    pthread_once(&crc32c_once_hw, crc32c_init_hw); 

    /* pre-process the crc */ 
    crc0 = crc^0xffffffff; 

    /* compute the crc for up to seven leading bytes to bring the data pointer 
     to an eight-byte boundary */ 
    while (len && ((uintptr_t)next & 7) != 0) { 
     __asm__("crc32b\t" "(%1), %0" 
       : "=r"(crc0) 
       : "r"(next), "0"(crc0)); 
     next++; 
     len--; 
    } 

    /* compute the crc on sets of LONG*3 bytes, executing three independent crc 
     instructions, each on LONG bytes -- this is optimized for the Nehalem, 
     Westmere, Sandy Bridge, and Ivy Bridge architectures, which have a 
     throughput of one crc per cycle, but a latency of three cycles */ 
    while (len >= LONG*3) { 
     crc1 = 0; 
     crc2 = 0; 
     end = next + LONG; 
     do { 
      __asm__("crc32q\t" "(%3), %0\n\t" 
        "crc32q\t" LONGx1 "(%3), %1\n\t" 
        "crc32q\t" LONGx2 "(%3), %2" 
        : "=r"(crc0), "=r"(crc1), "=r"(crc2) 
        : "r"(next), "0"(crc0), "1"(crc1), "2"(crc2)); 
      next += 8; 
     } while (next < end); 
     crc0 = crc32c_shift(crc32c_long, crc0)^crc1; 
     crc0 = crc32c_shift(crc32c_long, crc0)^crc2; 
     next += LONG*2; 
     len -= LONG*3; 
    } 

    /* do the same thing, but now on SHORT*3 blocks for the remaining data less 
     than a LONG*3 block */ 
    while (len >= SHORT*3) { 
     crc1 = 0; 
     crc2 = 0; 
     end = next + SHORT; 
     do { 
      __asm__("crc32q\t" "(%3), %0\n\t" 
        "crc32q\t" SHORTx1 "(%3), %1\n\t" 
        "crc32q\t" SHORTx2 "(%3), %2" 
        : "=r"(crc0), "=r"(crc1), "=r"(crc2) 
        : "r"(next), "0"(crc0), "1"(crc1), "2"(crc2)); 
      next += 8; 
     } while (next < end); 
     crc0 = crc32c_shift(crc32c_short, crc0)^crc1; 
     crc0 = crc32c_shift(crc32c_short, crc0)^crc2; 
     next += SHORT*2; 
     len -= SHORT*3; 
    } 

    /* compute the crc on the remaining eight-byte units less than a SHORT*3 
     block */ 
    end = next + (len - (len & 7)); 
    while (next < end) { 
     __asm__("crc32q\t" "(%1), %0" 
       : "=r"(crc0) 
       : "r"(next), "0"(crc0)); 
     next += 8; 
    } 
    len &= 7; 

    /* compute the crc for up to seven trailing bytes */ 
    while (len) { 
     __asm__("crc32b\t" "(%1), %0" 
       : "=r"(crc0) 
       : "r"(next), "0"(crc0)); 
     next++; 
     len--; 
    } 

    /* return a post-processed crc */ 
    return (uint32_t)crc0^0xffffffff; 
} 

/* Check for SSE 4.2. SSE 4.2 was first supported in Nehalem processors 
    introduced in November, 2008. This does not check for the existence of the 
    cpuid instruction itself, which was introduced on the 486SL in 1992, so this 
    will fail on earlier x86 processors. cpuid works on all Pentium and later 
    processors. */ 
#define SSE42(have) \ 
    do { \ 
     uint32_t eax, ecx; \ 
     eax = 1; \ 
     __asm__("cpuid" \ 
       : "=c"(ecx) \ 
       : "a"(eax) \ 
       : "%ebx", "%edx"); \ 
     (have) = (ecx >> 20) & 1; \ 
    } while (0) 

/* Compute a CRC-32C. If the crc32 instruction is available, use the hardware 
    version. Otherwise, use the software version. */ 
uint32_t crc32c(uint32_t crc, const void *buf, size_t len) 
{ 
    int sse42; 

    SSE42(sse42); 
    return sse42 ? crc32c_hw(crc, buf, len) : crc32c_sw(crc, buf, len); 
} 

#ifdef TEST 

#define SIZE (262144*3) 
#define CHUNK SIZE 

int main(int argc, char **argv) 
{ 
    char *buf; 
    ssize_t got; 
    size_t off, n; 
    uint32_t crc; 

    (void)argv; 
    crc = 0; 
    buf = malloc(SIZE); 
    if (buf == NULL) { 
     fputs("out of memory", stderr); 
     return 1; 
    } 
    while ((got = read(0, buf, SIZE)) > 0) { 
     off = 0; 
     do { 
      n = (size_t)got - off; 
      if (n > CHUNK) 
       n = CHUNK; 
      crc = argc > 1 ? crc32c_sw(crc, buf + off, n) : 
          crc32c(crc, buf + off, n); 
      off += n; 
     } while (off < (size_t)got); 
    } 
    free(buf); 
    if (got == -1) { 
     fputs("read error\n", stderr); 
     return 1; 
    } 
    printf("%08x\n", crc); 
    return 0; 
} 

#endif /* TEST */

Quelle

2013-07-15 04:28:34

Ich kann nirgendwo eine 'crc32q'-Anweisung finden (nur Referenzen zur Luftfahrt), noch eine 'crc32b'-Anweisung, die Sie in Ihrer Inline-Assembly verwendet haben, und MSVC akzeptiert sie nicht inline Versammlung. – LMS

Das wurde für den GNU-Compiler (GCC) geschrieben, der im Gegensatz zur Intel-Syntax die AT & T-Syntax für Assembler-Anweisungen verwendet. Die AT & T-Syntax ist viel klarer darüber, welche Anweisung erzeugt wird, da es nicht davon abhängt, ob eine Argumenttypisierung dafür verwendet wird (z. B. dword ptr, etc.). Ihr Assembler verwendet wahrscheinlich die Intel-Syntax, wobei der Befehl "crc32" tatsächlich eine von sechs verschiedenen Anweisungen generieren kann. Welche muss vom Assembler bestimmt werden, ebenso wie von einem Menschen, der versucht, den Code basierend auf der Art der Argumente zu lesen. –

'crc32q' arbeitet auf einem" Vierwort ", 64 Bits gleichzeitig. 'crc32b' operiert jeweils auf einem Byte. –

Antwort Mark Adlers ist korrekt und vollständig sind, aber diejenigen, die eine schnelle und einfache Art und Weise CRC-32C in ihre Anwendung zu integrieren könnte es ein wenig schwierig zu finden, den Code anzupassen, vor allem, wenn sie von Windows und .NET verwenden .

Ich habe eine library that implements CRC-32C mit Hardware-oder Software-Methode je nach verfügbarer Hardware erstellt. Es ist als NuGet-Paket für C++ und .NET verfügbar.Es ist natürlich Open Source.

Neben dem oben genannten Code von Mark Adler habe ich einen einfachen Weg gefunden, den Durchsatz des Software-Fallbacks um 50% zu verbessern. Auf meinem Computer erreicht die Bibliothek jetzt 2 GB/s in Software und über 20 GB/s in Hardware. Für diejenigen neugierig, hier ist die optimierte Software-Implementierung:

static uint32_t append_table(uint32_t crci, buffer input, size_t length) 
{ 
    buffer next = input; 
#ifdef _M_X64 
    uint64_t crc; 
#else 
    uint32_t crc; 
#endif 

    crc = crci^0xffffffff; 
#ifdef _M_X64 
    while (length && ((uintptr_t)next & 7) != 0) 
    { 
     crc = table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     --length; 
    } 
    while (length >= 16) 
    { 
     crc ^= *(uint64_t *)next; 
     uint64_t high = *(uint64_t *)(next + 8); 
     crc = table[15][crc & 0xff] 
      ^table[14][(crc >> 8) & 0xff] 
      ^table[13][(crc >> 16) & 0xff] 
      ^table[12][(crc >> 24) & 0xff] 
      ^table[11][(crc >> 32) & 0xff] 
      ^table[10][(crc >> 40) & 0xff] 
      ^table[9][(crc >> 48) & 0xff] 
      ^table[8][crc >> 56] 
      ^table[7][high & 0xff] 
      ^table[6][(high >> 8) & 0xff] 
      ^table[5][(high >> 16) & 0xff] 
      ^table[4][(high >> 24) & 0xff] 
      ^table[3][(high >> 32) & 0xff] 
      ^table[2][(high >> 40) & 0xff] 
      ^table[1][(high >> 48) & 0xff] 
      ^table[0][high >> 56]; 
     next += 16; 
     length -= 16; 
    } 
#else 
    while (length && ((uintptr_t)next & 3) != 0) 
    { 
     crc = table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     --length; 
    } 
    while (length >= 12) 
    { 
     crc ^= *(uint32_t *)next; 
     uint32_t high = *(uint32_t *)(next + 4); 
     uint32_t high2 = *(uint32_t *)(next + 8); 
     crc = table[11][crc & 0xff] 
      ^table[10][(crc >> 8) & 0xff] 
      ^table[9][(crc >> 16) & 0xff] 
      ^table[8][crc >> 24] 
      ^table[7][high & 0xff] 
      ^table[6][(high >> 8) & 0xff] 
      ^table[5][(high >> 16) & 0xff] 
      ^table[4][high >> 24] 
      ^table[3][high2 & 0xff] 
      ^table[2][(high2 >> 8) & 0xff] 
      ^table[1][(high2 >> 16) & 0xff] 
      ^table[0][high2 >> 24]; 
     next += 12; 
     length -= 12; 
    } 
#endif 
    while (length) 
    { 
     crc = table[0][(crc^*next++) & 0xff]^(crc >> 8); 
     --length; 
    } 
    return (uint32_t)crc^0xffffffff; 
}

Wie Sie sehen können, es knirscht nur größeren Block zu einem Zeitpunkt. Es benötigt eine größere Nachschlagetabelle, ist aber weiterhin Cache-freundlich. Die Tabelle wird auf die gleiche Weise generiert, nur mit mehr Zeilen.

Eine zusätzliche Sache, die ich erforschte, ist die Verwendung der PCLMULQDQ-Anweisung, um Hardwarebeschleunigung auf AMD-Prozessoren zu erhalten. Ich habe es geschafft, Intel's CRC patch for zlib (auch available on GitHub) zu CRC-32C Polynom ~~mit Ausnahme von the magic constant 0x9db42487 zu portieren. Wenn jemand in der Lage ist, diesen zu entschlüsseln, lassen Sie es mich bitte~~ wissen. Nach supersaw7's excellent explanation on reddit habe ich auch die schwer fassbare 0x9db42487 Konstante portiert und ich muss nur etwas Zeit finden, um es zu polieren und zu testen.

Quelle

2014-02-20 17:28:15

+1 Vielen Dank für die Freigabe Ihres Codes. Es hilft mir sehr, wenn ich es nach Delphi portiere. –

Ich habe den Link zum Patch korrigiert und einige zusätzliche Links hinzugefügt. Sind Sie in dieser Angelegenheit vorangekommen, Robert? –

es scheint, dass cloudflares zlib mit PCLMULQDQ-Unterstützung nicht die Konstante verwendet ... vielleicht ist das nützlich für Sie? –

Ich vergleiche verschiedene Algorithmen hier: https://github.com/htot/crc32c

Der schnellste Algorithmus von Intels crc_iscsi_v_pcl.asm Assembler-Code genommen wurde und mit einem C-Wrapper (crcintelasm.cc (die in modifizierter Form in dem Linux-Kernel zur Verfügung steht)) in dieses Projekt einbezogen.

Um diesen Code auf 32-Bit-Plattformen ausführen zu können, wurde er zunächst auf C portiert (crc32intelc). Wo möglich, ist eine kleine Menge Inline-Assembly erforderlich. Bestimmte Teile des Codes hängen von der Bitheit ab, crc32q ist bei 32 Bits nicht verfügbar und beide sind movq, diese werden in Makros (crc32intel.h) mit alternativem Code für 32-Bit-Plattformen geschrieben.

Quelle

2016-10-04 22:35:16

Implementierung von SSE 4.2 des CRC32C in Software

Antwort

Verwandte Themen