Why is CUDA shared memory slower than global memory in tiled matrix multiplication?

I have written matrix multiplication code with and without shared-memory tiling. Below is the matrix multiplication using global memory:

__global__
void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    // Each thread computes one element of P.
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // Dot product of row Row of M and column Col of N, read from global memory.
        for (int k = 0; k < Width; ++k)
        {
            Pvalue += M[Row * Width + k] * N[k * Width + Col];
        }

        P[Row * Width + Col] = Pvalue;
    }
}

Below is the matrix multiplication using shared memory:

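__global__
void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
    // blockWidth must be a compile-time constant (16 in the launch configuration).
    __shared__ float Mds[blockWidth][blockWidth];
    __shared__ float Nds[blockWidth][blockWidth];
    int tx = threadIdx.x; int ty = threadIdx.y;
    int bx = blockIdx.x; int by = blockIdx.y;

    int row = by * blockWidth + ty;
    int col = bx * blockWidth + tx;
    float pvalue = 0;

    // Walk over the tiles; this loop assumes Width is a multiple of blockWidth.
    for (int m = 0; m < Width / blockWidth; ++m)
    {
        // Each thread loads one element of the current M tile and N tile.
        Mds[ty][tx] = d_M[row * Width + m * blockWidth + tx];
        Nds[ty][tx] = d_N[(m * blockWidth + ty) * Width + col];
        __syncthreads(); // wait until both tiles are fully loaded
        for (int k = 0; k < blockWidth; ++k)
        {
            pvalue += Mds[ty][k] * Nds[k][tx];
        }
        __syncthreads(); // wait before the tiles are overwritten in the next iteration
    }
    d_P[row * Width + col] = pvalue;
}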
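
Both kernels compute the same product and differ only in how the operands are fetched. As a rough host-side sketch (plain C++ that nvcc should also accept), the functions below replay the loop structure of each kernel sequentially on the CPU so the tiling scheme can be checked without a GPU. The names matmul_naive, matmul_tiled and kTile, and the tile size of 4, are assumptions made for illustration and do not come from the question or the book.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for blockWidth; kept small so a check runs instantly.
constexpr int kTile = 4;

// Mirrors the global-memory kernel: one full dot product per output element.
std::vector<float> matmul_naive(const std::vector<float>& M,
                                const std::vector<float>& N, int width) {
    std::vector<float> P(width * width, 0.0f);
    for (int row = 0; row < width; ++row) {
        for (int col = 0; col < width; ++col) {
            float acc = 0.0f;
            for (int k = 0; k < width; ++k) {
                acc += M[row * width + k] * N[k * width + col];
            }
            P[row * width + col] = acc;
        }
    }
    return P;
}

// Mirrors the shared-memory kernel: (by, bx) plays the block index and
// (ty, tx) the thread index; Mds/Nds stand in for the shared tiles.
std::vector<float> matmul_tiled(const std::vector<float>& M,
                                const std::vector<float>& N, int width) {
    assert(width % kTile == 0);  // same assumption the tiled kernel makes
    std::vector<float> P(width * width, 0.0f);
    float Mds[kTile][kTile];
    float Nds[kTile][kTile];
    const int tiles = width / kTile;
    for (int by = 0; by < tiles; ++by) {
        for (int bx = 0; bx < tiles; ++bx) {
            for (int m = 0; m < tiles; ++m) {
                // Load phase: every "thread" copies one element of each tile.
                for (int ty = 0; ty < kTile; ++ty) {
                    for (int tx = 0; tx < kTile; ++tx) {
                        const int row = by * kTile + ty;
                        const int col = bx * kTile + tx;
                        Mds[ty][tx] = M[row * width + m * kTile + tx];
                        Nds[ty][tx] = N[(m * kTile + ty) * width + col];
                    }
                }
                // Compute phase: starts only after the whole tile is loaded,
                // which is what the first __syncthreads() ensures on the GPU.
                for (int ty = 0; ty < kTile; ++ty) {
                    for (int tx = 0; tx < kTile; ++tx) {
                        const int row = by * kTile + ty;
                        const int col = bx * kTile + tx;
                        for (int k = 0; k < kTile; ++k) {
                            P[row * width + col] += Mds[ty][k] * Nds[k][tx];
                        }
                    }
                }
            }
        }
    }
    return P;
}
```

Calling both functions on the same inputs (with a width that is a multiple of kTile) and comparing the results element by element is one way to confirm that a tiled variant is at least functionally equivalent before looking at its speed.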

As far as I know, shared memory should be faster, but when I compared these two kernels, the version without shared memory ran about 2 seconds faster on 1600 * 1600 matrices. Is there an explanation for this speed difference, or is something wrong with my code?

My teacher uses the book "Programming Massively Parallel Processors" as the main text, and both kernels come from it.
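
The expectation that the tiled kernel should win can be made concrete by counting global-memory load instructions in the two kernels. The sketch below is host-side C++ with hypothetical helper names (global_loads_naive, global_loads_tiled); it assumes a square problem whose width is a multiple of the block width, and it counts loads as written in the source, not actual DRAM traffic, since caches may serve part of the untiled kernel's reads.

```cpp
#include <cassert>
#include <cstdint>

// Untiled kernel: each of width*width threads reads one element of M and
// one element of N in every iteration of the k loop (width iterations).
std::uint64_t global_loads_naive(std::uint64_t width) {
    return width * width * (2 * width);
}

// Tiled kernel: each thread reads one element of M and one of N per tile
// phase, and there are width / block_width phases.
std::uint64_t global_loads_tiled(std::uint64_t width, std::uint64_t block_width) {
    assert(block_width > 0 && width % block_width == 0);  // assumption of this sketch
    return width * width * (2 * (width / block_width));
}
```

For Width = 1600 and blockWidth = 16 this gives 8,192,000,000 loads for the untiled kernel against 512,000,000 for the tiled one, a factor of blockWidth. The count says nothing about run time by itself, but it is the reason a slower tiled kernel is a surprising result.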

EDIT:

Kernel configuration:

int NumBlocks = (Width + blockWidth - 1) / blockWidth; // blockWidth = 16; integer round-up
dim3 dimGrid(NumBlocks, NumBlocks, 1); // Width = 1600
dim3 dimBlock(blockWidth, blockWidth, 1);
clock_t startGpuCalculation = clock();
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
cudaDeviceSynchronize(); // wait for the kernel to finish before stopping the clock
clock_t endGpuCalculation = clock();
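
A common pitfall when computing a grid size such as NumBlocks is wrapping an integer division in ceil(): the quotient is truncated before ceil() ever sees it, so nothing is rounded up. The host-side sketch below contrasts the two forms; the helper names num_blocks and num_blocks_truncating are hypothetical and used only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Integer round-up: number of blocks needed to cover `width` elements.
int num_blocks(int width, int block_width) {
    assert(width >= 0 && block_width > 0);
    return (width + block_width - 1) / block_width;
}

// For contrast: width / block_width is evaluated in integer arithmetic,
// so ceil() receives an already-truncated value and returns it unchanged.
int num_blocks_truncating(int width, int block_width) {
    assert(width >= 0 && block_width > 0);
    return static_cast<int>(std::ceil(width / block_width));
}
```

With Width = 1600 and blockWidth = 16 both forms give 100, so the measurements in this question are unaffected. For a width such as 1610 the truncating form yields 100 blocks while 101 are needed. Note that rounding up alone would not make the tiled kernel safe for such sizes, because as written it has no bounds checks and its tile loop assumes Width is a multiple of blockWidth.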


2017-12-04 Nasser Kurd

Could you also add your kernel configuration (block & grid size)? – Angew

I suggest providing a [mcve] that compares both cases. –

When I run the code and profile the kernels with 'nvprof' on a Tesla K20Xm (CentOS 7, CUDA 8, driver 375.66), I get roughly 36 ms for the shared-memory kernel and roughly 92 ms for the non-shared-memory kernel. [Here](https://pastebin.com/Ymn08BfC) is a complete transcript. So I cannot reproduce the observation that the shared-memory kernel is slower. –

I was running the project in Debug mode (VS 2017 & CUDA 9). When I run the code in Release mode, the shared-memory kernel is much faster than the global-memory one. My mistake.


2017-12-05 15:10:51
