Streu-Matrixblöcke unterschiedlicher Größe mit MPI

(Angenommen, alle Matrizen werden in Reihen-Haupt-Reihenfolge gespeichert.) Ein Beispiel für das Problem ist die Verteilung einer 10x10-Matrix über ein 3x3-Raster, so dass die Größe der Sub- Matrizen in jedem Knoten sieht aus wieStreu-Matrixblöcke unterschiedlicher Größe mit MPI

|-----+-----+-----| 
| 3x3 | 3x3 | 3x4 | 
|-----+-----+-----| 
| 3x3 | 3x3 | 3x4 | 
|-----+-----+-----| 
| 4x3 | 4x3 | 4x4 | 
|-----+-----+-----|

ich viele Beiträge auf Stackoverflow (wie sending blocks of 2D array in C using MPI und MPI partition matrix into blocks) gesehen habe. Aber sie befassen sich nur mit Blöcken gleicher Größe (in diesem Fall können wir einfach MPI_Type_vector oder MPI_Type_create_subarray und nur einen MPI_Scatterv Aufruf verwenden).

Also frage ich mich, was der effizienteste Weg in MPI ist, eine Matrix auf ein Raster von Prozessoren zu verteilen, wo jeder Prozessor einen Block mit einer bestimmten Größe hat.

P.S. Ich habe auch MPI_Type_create_darray betrachtet, aber es scheint nicht zulassen, dass Sie die Blockgröße für jeden Prozessor angeben.

Quelle

2015-03-29 Roun

@Patrick Vielen Dank für Ihre Kommentare. Ich denke, 'MPI_Type_indexed' funktioniert nicht, da ein einzelner Typ immer noch nur einem Block einer bestimmten Größe entsprechen kann. – Roun

Sie haben um mindestens einen zusätzlichen Schritt in MPI zu machen, um dies zu tun.

Das Problem ist, dass die allgemeinste der Sammel-/Streu Routinen, MPI_Scatterv und MPI_Gatherv, können Sie einen „Vektor“ zu übergeben (v) der Zählungen/Verschiebungen, anstatt nur eine Zählung für Scatter und Gather, aber die Typen werden alle als gleich angenommen. Hier gibt es keinen Weg darum; Die Speicherlayouts jedes Blocks sind unterschiedlich und müssen daher von einem anderen Typ behandelt werden. Wenn es nur einen Unterschied zwischen den Blöcken – gab, hatten einige unterschiedliche Anzahlen von Spalten, oder einige hatten unterschiedliche Anzahl von Zeilen –, dann würde nur die Verwendung unterschiedlicher Zählungen ausreichen. Aber mit verschiedenen Spalten und Zeilen, Zählungen werden es nicht tun; Sie müssen wirklich verschiedene Typen angeben können.

Also was Sie wirklich wollen, ist eine oft diskutierte, aber nie implementiert MPI_Scatterw (wo w bedeutet, vv; z. B. beide Zählungen und Typen sind Vektoren) Routine. Aber so etwas gibt es nicht. Der nächste, den Sie erreichen können, ist der viel allgemeinere MPI_Alltoallw Anruf, der das allgemeine Senden und Empfangen von Daten ermöglicht. wie die Spezifikation angibt, "The MPI_ALLTOALLW function generalizes several MPI functions by carefully selecting the input arguments. For example, by making all but one process have sendcounts(i) = 0, this achieves an MPI_SCATTERW function.".

Also können Sie dies mit MPI_Alltoallw tun, indem Sie alle Prozesse außer dem, der ursprünglich alle Daten hat (wir nehmen an, dass es hier 0 ist), alle ihre Sendezählungen auf Null gesendet haben. Alle Aufgaben haben auch alle ihre Empfangszahlen auf Null, mit Ausnahme der ersten - die Menge der Daten, die sie von Rang 0 erhalten.

Für Sendezahlen von Prozess 0 müssen wir zuerst vier verschiedene Arten von Typen definieren (die 4 verschiedenen Größen von Subarrays), und dann sind die Sendezählwerte alle 1, und der einzige verbleibende Teil ist herauszufinden die Sende Verschiebungen (die im Gegensatz zu scatterv, ist hier in Einheiten von Bytes, weil es keine einzige Art ist eine als Einheit verwenden könnte):

 /* 4 types of blocks - 
     * blocksize*blocksize, blocksize+1*blocksize, blocksize*blocksize+1, blocksize+1*blocksize+1 
     */ 

     MPI_Datatype blocktypes[4]; 
     int subsizes[2]; 
     int starts[2] = {0,0}; 
     for (int i=0; i<2; i++) { 
      subsizes[0] = blocksize+i; 
      for (int j=0; j<2; j++) { 
       subsizes[1] = blocksize+j; 
       MPI_Type_create_subarray(2, globalsizes, subsizes, starts, MPI_ORDER_C, MPI_CHAR, &blocktypes[2*i+j]); 
       MPI_Type_commit(&blocktypes[2*i+j]); 
      } 
     } 

     /* now figure out the displacement and type of each processor's data */ 
     for (int proc=0; proc<size; proc++) { 
      int row, col; 
      rowcol(proc, blocks, &row, &col); 

      sendcounts[proc] = 1; 
      senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize)*sizeof(char); 

      int idx = typeIdx(row, col, blocks); 
      sendtypes[proc] = blocktypes[idx]; 
     } 
    } 

    MPI_Alltoallw(globalptr, sendcounts, senddispls, sendtypes, 
        &(localdata[0][0]), recvcounts, recvdispls, recvtypes, 
        MPI_COMM_WORLD);

Und dies funktionieren wird.

Aber das Problem ist, dass die Alltoallw-Funktion so allgemein ist, dass es für Implementierungen schwierig ist, viel in der Linie der Optimierung zu tun; also wäre ich überrascht, wenn das genauso gut wäre wie eine Streuung von gleich großen Blöcken.

So ein anderer Ansatz ist so etwas wie zwei Phasen der Kommunikation zu tun.

Die einfachste solcher Ansatz folgt nach der Feststellung, dass man fast erhalten alle Daten, wo es mit einem einzigen MPI_Scatterv() Anruf gehen muss: in Ihrem Beispiel, wenn wir in Einheiten einer einzelnen Spaltenvektor mit column = 1 betrieben werden und Zeilen = 3 (die Anzahl der Zeilen in den meisten Blöcken der Domäne), können Sie fast alle globalen Daten auf die anderen Prozessoren verteilen. Die Prozessoren erhalten jeweils 3 oder 4 dieser Vektoren, die alle Daten mit Ausnahme der allerletzten Zeile des globalen Arrays verteilen, was mit einem einfachen zweiten scatterv behandelt werden kann. Das sieht so aus;

/* We're going to be operating mostly in units of a single column of a "normal" sized block. 
* There will need to be two vectors describing these columns; one in the context of the 
* global array, and one in the local results. 
*/ 
MPI_Datatype vec, localvec; 
MPI_Type_vector(blocksize, 1, localsizes[1], MPI_CHAR, &localvec); 
MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec); 
MPI_Type_commit(&localvec); 

MPI_Type_vector(blocksize, 1, globalsizes[1], MPI_CHAR, &vec); 
MPI_Type_create_resized(vec, 0, sizeof(char), &vec); 
MPI_Type_commit(&vec); 

/* The originating process needs to allocate and fill the source array, 
* and then define types defining the array chunks to send, and 
* fill out senddispls, sendcounts (1) and sendtypes. 
*/ 
if (rank == 0) { 
    /* create the vector type which will send one column of a "normal" sized-block */ 
    /* then all processors except those in the last row need to get blocksize*vec or (blocksize+1)*vec */ 
    /* will still have to do something to tidy up the last row of values */ 
    /* we need to make the type have extent of 1 char for scattering */ 
    for (int proc=0; proc<size; proc++) { 
     int row, col; 
     rowcol(proc, blocks, &row, &col); 

     sendcounts[proc] = isLastCol(col, blocks) ? blocksize+1 : blocksize; 
     senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize); 
    } 
} 

recvcounts = localsizes[1]; 
MPI_Scatterv(globalptr, sendcounts, senddispls, vec, 
       &(localdata[0][0]), recvcounts, localvec, 0, MPI_COMM_WORLD); 

MPI_Type_free(&localvec); 
if (rank == 0) 
    MPI_Type_free(&vec); 

/* now we need to do one more scatter, scattering just the last row of data 
* just to the processors on the last row. 
* Here we recompute the send counts 
*/ 
if (rank == 0) { 
    for (int proc=0; proc<size; proc++) { 
     int row, col; 
     rowcol(proc, blocks, &row, &col); 
     sendcounts[proc] = 0; 
     senddispls[proc] = 0; 

     if (isLastRow(row,blocks)) { 
      sendcounts[proc] = blocksize; 
      senddispls[proc] = (globalsizes[0]-1)*globalsizes[1]+col*blocksize; 
      if (isLastCol(col,blocks)) 
       sendcounts[proc] += 1; 
     } 
    } 
} 

recvcounts = 0; 
if (isLastRow(myrow, blocks)) { 
    recvcounts = blocksize; 
    if (isLastCol(mycol, blocks)) 
     recvcounts++; 
} 
MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR, 
       &(localdata[blocksize][0]), recvcounts, MPI_CHAR, 0, MPI_COMM_WORLD);

So weit so gut. Aber es ist eine Schande, wenn die meisten Prozessoren herum sitzen und während des letzten "Cleanup" -Streusel nichts tun.

Ein besserer Ansatz besteht also darin, alle Zeilen in einer ersten Phase zu streuen und diese Daten in einer zweiten Phase auf die Spalten zu verteilen. Hier erstellen wir neue Kommunikatoren, wobei jeder Prozessor zu zwei neuen Kommunikatoren gehört - einer für andere Prozessoren in derselben Blockzeile und der andere für dieselbe Blockspalte. Im ersten Schritt verteilt der Ursprungsprozessor alle Zeilen des globalen Arrays auf die anderen Prozessoren im selben Spaltenkommunikator - was in einem einzigen scatterv erfolgen kann. Dann streuen diese Prozessoren, die einen einzelnen scatterv und den gleichen Spalten-Datentyp wie im vorherigen Beispiel verwenden, die Spalten auf jeden Prozessor in der gleichen Blockzeile wie sie. Das Ergebnis sind zwei ziemlich einfache scatterv die alle Daten verteilen:

/* create communicators which have processors with the same row or column in them*/ 
MPI_Comm colComm, rowComm; 
MPI_Comm_split(MPI_COMM_WORLD, myrow, rank, &rowComm); 
MPI_Comm_split(MPI_COMM_WORLD, mycol, rank, &colComm); 

/* first, scatter the array by rows, with the processor in column 0 corresponding to each row 
* receiving the data */ 
if (mycol == 0) { 
    int sendcounts[ blocks[0] ]; 
    int senddispls[ blocks[0] ]; 
    senddispls[0] = 0; 

    for (int row=0; row<blocks[0]; row++) { 
     /* each processor gets blocksize rows, each of size globalsizes[1]... */ 
     sendcounts[row] = blocksize*globalsizes[1]; 
     if (row > 0) 
      senddispls[row] = senddispls[row-1] + sendcounts[row-1]; 
    } 
    /* the last processor gets one more */ 
    sendcounts[blocks[0]-1] += globalsizes[1]; 

    /* allocate my rowdata */ 
    rowdata = allocchar2darray(sendcounts[myrow], globalsizes[1]); 

    /* perform the scatter of rows */ 
    MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR, 
        &(rowdata[0][0]), sendcounts[myrow], MPI_CHAR, 0, colComm); 

} 

/* Now, within each row of processors, we can scatter the columns. 
* We can do this as we did in the previous example; create a vector 
* (and localvector) type and scatter accordingly */ 
int locnrows = blocksize; 
if (isLastRow(myrow, blocks)) 
    locnrows++; 
MPI_Datatype vec, localvec; 
MPI_Type_vector(locnrows, 1, globalsizes[1], MPI_CHAR, &vec); 
MPI_Type_create_resized(vec, 0, sizeof(char), &vec); 
MPI_Type_commit(&vec); 

MPI_Type_vector(locnrows, 1, localsizes[1], MPI_CHAR, &localvec); 
MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec); 
MPI_Type_commit(&localvec); 

int sendcounts[ blocks[1] ]; 
int senddispls[ blocks[1] ]; 
if (mycol == 0) { 
    for (int col=0; col<blocks[1]; col++) { 
     sendcounts[col] = isLastCol(col, blocks) ? blocksize+1 : blocksize; 
     senddispls[col] = col*blocksize; 
    } 
} 
char *rowptr = (mycol == 0) ? &(rowdata[0][0]) : NULL; 

MPI_Scatterv(rowptr, sendcounts, senddispls, vec, 
       &(localdata[0][0]), sendcounts[mycol], localvec, 0, rowComm);

, die einfacher ist und sollte eine relativ gute Balance zwischen Leistung und Robustheit sein.

all diese drei Methoden Laufende Arbeiten:

bash-3.2$ mpirun -np 6 ./allmethods alltoall 
Global array: 
abcdefg 
hijklmn 
opqrstu 
vwxyzab 
cdefghi 
jklmnop 
qrstuvw 
xyzabcd 
efghijk 
lmnopqr 
Method - alltoall 

Rank 0: 
abc 
hij 
opq 

Rank 1: 
defg 
klmn 
rstu 

Rank 2: 
vwx 
cde 
jkl 

Rank 3: 
yzab 
fghi 
mnop 

Rank 4: 
qrs 
xyz 
efg 
lmn 

Rank 5: 
tuvw 
abcd 
hijk 
opqr 

bash-3.2$ mpirun -np 6 ./allmethods twophasevecs 
Global array: 
abcdefg 
hijklmn 
opqrstu 
vwxyzab 
cdefghi 
jklmnop 
qrstuvw 
xyzabcd 
efghijk 
lmnopqr 
Method - two phase, vectors, then cleanup 

Rank 0: 
abc 
hij 
opq 

Rank 1: 
defg 
klmn 
rstu 

Rank 2: 
vwx 
cde 
jkl 

Rank 3: 
yzab 
fghi 
mnop 

Rank 4: 
qrs 
xyz 
efg 
lmn 

Rank 5: 
tuvw 
abcd 
hijk 
opqr 
bash-3.2$ mpirun -np 6 ./allmethods twophaserowcol 
Global array: 
abcdefg 
hijklmn 
opqrstu 
vwxyzab 
cdefghi 
jklmnop 
qrstuvw 
xyzabcd 
efghijk 
lmnopqr 
Method - two phase - row, cols 

Rank 0: 
abc 
hij 
opq 

Rank 1: 
defg 
klmn 
rstu 

Rank 2: 
vwx 
cde 
jkl 

Rank 3: 
yzab 
fghi 
mnop 

Rank 4: 
qrs 
xyz 
efg 
lmn 

Rank 5: 
tuvw 
abcd 
hijk 
opqr

Der Code Durchführung dieses Verfahrens folgt; Sie können Blockgrößen auf typische Größen für Ihr Problem einstellen und auf einer realistischen Anzahl von Prozessoren laufen, um ein Gefühl dafür zu bekommen, welches für Ihre Anwendung am besten ist.

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include "mpi.h" 

/* auxiliary routines, found at end of program */ 

char **allocchar2darray(int n, int m); 
void freechar2darray(char **a); 
void printarray(char **data, int n, int m); 
void rowcol(int rank, const int blocks[2], int *row, int *col); 
int isLastRow(int row, const int blocks[2]); 
int isLastCol(int col, const int blocks[2]); 
int typeIdx(int row, int col, const int blocks[2]); 

/* first method - alltoallw */ 
void alltoall(const int myrow, const int mycol, const int rank, const int size, 
        const int blocks[2], const int blocksize, const int globalsizes[2], const int localsizes[2], 
        const char *const globalptr, char **localdata) { 
    /* 
    * get send and recieve counts ready for alltoallw call. 
    * everyone will be recieving just one block from proc 0; 
    * most procs will be sending nothing to anyone. 
    */ 
    int sendcounts[ size ]; 
    int senddispls[ size ]; 
    MPI_Datatype sendtypes[size]; 
    int recvcounts[ size ]; 
    int recvdispls[ size ]; 
    MPI_Datatype recvtypes[size]; 

    for (int proc=0; proc<size; proc++) { 
     recvcounts[proc] = 0; 
     recvdispls[proc] = 0; 
     recvtypes[proc] = MPI_CHAR; 

     sendcounts[proc] = 0; 
     senddispls[proc] = 0; 
     sendtypes[proc] = MPI_CHAR; 
    } 
    recvcounts[0] = localsizes[0]*localsizes[1]; 
    recvdispls[0] = 0; 


    /* The originating process needs to allocate and fill the source array, 
    * and then define types defining the array chunks to send, and 
    * fill out senddispls, sendcounts (1) and sendtypes. 
    */ 
    if (rank == 0) { 
     /* 4 types of blocks - 
     * blocksize*blocksize, blocksize+1*blocksize, blocksize*blocksize+1, blocksize+1*blocksize+1 
     */ 
     MPI_Datatype blocktypes[4]; 
     int subsizes[2]; 
     int starts[2] = {0,0}; 
     for (int i=0; i<2; i++) { 
      subsizes[0] = blocksize+i; 
      for (int j=0; j<2; j++) { 
       subsizes[1] = blocksize+j; 
       MPI_Type_create_subarray(2, globalsizes, subsizes, starts, MPI_ORDER_C, MPI_CHAR, &blocktypes[2*i+j]); 
       MPI_Type_commit(&blocktypes[2*i+j]); 
      } 
     } 

     /* now figure out the displacement and type of each processor's data */ 
     for (int proc=0; proc<size; proc++) { 
      int row, col; 
      rowcol(proc, blocks, &row, &col); 

      sendcounts[proc] = 1; 
      senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize)*sizeof(char); 

      int idx = typeIdx(row, col, blocks); 
      sendtypes[proc] = blocktypes[idx]; 
     } 
    } 

    MPI_Alltoallw(globalptr, sendcounts, senddispls, sendtypes, 
        &(localdata[0][0]), recvcounts, recvdispls, recvtypes, 
        MPI_COMM_WORLD); 
} 


/* second method: distribute almost all data using colums of size blocksize, 
* then clean up the last row with another scatterv */ 

void twophasevecs(const int myrow, const int mycol, const int rank, const int size, 
        const int blocks[2], const int blocksize, const int globalsizes[2], const int localsizes[2], 
        const char *const globalptr, char **localdata) { 
    int sendcounts[ size ]; 
    int senddispls[ size ]; 
    int recvcounts; 

    for (int proc=0; proc<size; proc++) { 
     sendcounts[proc] = 0; 
     senddispls[proc] = 0; 
    } 

    /* We're going to be operating mostly in units of a single column of a "normal" sized block. 
    * There will need to be two vectors describing these columns; one in the context of the 
    * global array, and one in the local results. 
    */ 
    MPI_Datatype vec, localvec; 
    MPI_Type_vector(blocksize, 1, localsizes[1], MPI_CHAR, &localvec); 
    MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec); 
    MPI_Type_commit(&localvec); 

    MPI_Type_vector(blocksize, 1, globalsizes[1], MPI_CHAR, &vec); 
    MPI_Type_create_resized(vec, 0, sizeof(char), &vec); 
    MPI_Type_commit(&vec); 

    /* The originating process needs to allocate and fill the source array, 
    * and then define types defining the array chunks to send, and 
    * fill out senddispls, sendcounts (1) and sendtypes. 
    */ 
    if (rank == 0) { 
     /* create the vector type which will send one column of a "normal" sized-block */ 
     /* then all processors except those in the last row need to get blocksize*vec or (blocksize+1)*vec */ 
     /* will still have to do something to tidy up the last row of values */ 
     /* we need to make the type have extent of 1 char for scattering */ 
     for (int proc=0; proc<size; proc++) { 
      int row, col; 
      rowcol(proc, blocks, &row, &col); 

      sendcounts[proc] = isLastCol(col, blocks) ? blocksize+1 : blocksize; 
      senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize); 
     } 
    } 

    recvcounts = localsizes[1]; 
    MPI_Scatterv(globalptr, sendcounts, senddispls, vec, 
        &(localdata[0][0]), recvcounts, localvec, 0, MPI_COMM_WORLD); 

    MPI_Type_free(&localvec); 
    if (rank == 0) 
     MPI_Type_free(&vec); 

    /* now we need to do one more scatter, scattering just the last row of data 
    * just to the processors on the last row. 
    * Here we recompute the sendcounts 
    */ 
    if (rank == 0) { 
     for (int proc=0; proc<size; proc++) { 
      int row, col; 
      rowcol(proc, blocks, &row, &col); 
      sendcounts[proc] = 0; 
      senddispls[proc] = 0; 

      if (isLastRow(row,blocks)) { 
       sendcounts[proc] = blocksize; 
       senddispls[proc] = (globalsizes[0]-1)*globalsizes[1]+col*blocksize; 
       if (isLastCol(col,blocks)) 
        sendcounts[proc] += 1; 
      } 
     } 
    } 

    recvcounts = 0; 
    if (isLastRow(myrow, blocks)) { 
     recvcounts = blocksize; 
     if (isLastCol(mycol, blocks)) 
      recvcounts++; 
    } 
    MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR, 
        &(localdata[blocksize][0]), recvcounts, MPI_CHAR, 0, MPI_COMM_WORLD); 
} 
/* third method: first distribute rows, then columns, each with a single scatterv */ 

void twophaseRowCol(const int myrow, const int mycol, const int rank, const int size, 
        const int blocks[2], const int blocksize, const int globalsizes[2], const int localsizes[2], 
        const char *const globalptr, char **localdata) { 
    char **rowdata ; 

    /* create communicators which have processors with the same row or column in them*/ 
    MPI_Comm colComm, rowComm; 
    MPI_Comm_split(MPI_COMM_WORLD, myrow, rank, &rowComm); 
    MPI_Comm_split(MPI_COMM_WORLD, mycol, rank, &colComm); 

    /* first, scatter the array by rows, with the processor in column 0 corresponding to each row 
    * receiving the data */ 
    if (mycol == 0) { 
     int sendcounts[ blocks[0] ]; 
     int senddispls[ blocks[0] ]; 
     senddispls[0] = 0; 

     for (int row=0; row<blocks[0]; row++) { 
      /* each processor gets blocksize rows, each of size globalsizes[1]... */ 
      sendcounts[row] = blocksize*globalsizes[1]; 
      if (row > 0) 
       senddispls[row] = senddispls[row-1] + sendcounts[row-1]; 
     } 
     /* the last processor gets one more */ 
     sendcounts[blocks[0]-1] += globalsizes[1]; 

     /* allocate my rowdata */ 
     rowdata = allocchar2darray(sendcounts[myrow], globalsizes[1]); 

     /* perform the scatter of rows */ 
     MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR, 
         &(rowdata[0][0]), sendcounts[myrow], MPI_CHAR, 0, colComm); 

    } 

    /* Now, within each row of processors, we can scatter the columns. 
    * We can do this as we did in the previous example; create a vector 
    * (and localvector) type and scatter accordingly */ 
    int locnrows = blocksize; 
    if (isLastRow(myrow, blocks)) 
     locnrows++; 

    MPI_Datatype vec, localvec; 
    MPI_Type_vector(locnrows, 1, globalsizes[1], MPI_CHAR, &vec); 
    MPI_Type_create_resized(vec, 0, sizeof(char), &vec); 
    MPI_Type_commit(&vec); 

    MPI_Type_vector(locnrows, 1, localsizes[1], MPI_CHAR, &localvec); 
    MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec); 
    MPI_Type_commit(&localvec); 

    int sendcounts[ blocks[1] ]; 
    int senddispls[ blocks[1] ]; 
    if (mycol == 0) { 
     for (int col=0; col<blocks[1]; col++) { 
      sendcounts[col] = isLastCol(col, blocks) ? blocksize+1 : blocksize; 
      senddispls[col] = col*blocksize; 
     } 
    } 
    char *rowptr = (mycol == 0) ? &(rowdata[0][0]) : NULL; 

    MPI_Scatterv(rowptr, sendcounts, senddispls, vec, 
        &(localdata[0][0]), sendcounts[mycol], localvec, 0, rowComm); 

    MPI_Type_free(&localvec); 
    MPI_Type_free(&vec); 

    if (mycol == 0) 
     freechar2darray(rowdata); 

    MPI_Comm_free(&rowComm); 
    MPI_Comm_free(&colComm); 
} 

int main(int argc, char **argv) { 

    int rank, size; 
    int blocks[2] = {0,0}; 
    const int blocksize=3; 
    int globalsizes[2], localsizes[2]; 
    char **globaldata; 
    char *globalptr = NULL; 

    MPI_Init(&argc, &argv); 
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); 
    MPI_Comm_size(MPI_COMM_WORLD, &size); 

    if (rank == 0 && argc < 2) { 
     fprintf(stderr,"Usage: %s method\n Where method is one of: alltoall, twophasevecs, twophaserowcol\n", argv[0]); 
     MPI_Abort(MPI_COMM_WORLD,1); 
    } 

    /* calculate sizes for a 2d grid of processors */ 
    MPI_Dims_create(size, 2, blocks); 

    int myrow, mycol; 
    rowcol(rank, blocks, &myrow, &mycol); 

    /* create array sizes so that last block has 1 too many rows/cols */ 
    globalsizes[0] = blocks[0]*blocksize+1; 
    globalsizes[1] = blocks[1]*blocksize+1; 
    if (rank == 0) { 
     globaldata = allocchar2darray(globalsizes[0], globalsizes[1]); 
     globalptr = &(globaldata[0][0]); 
     for (int i=0; i<globalsizes[0]; i++) 
      for (int j=0; j<globalsizes[1]; j++) 
       globaldata[i][j] = 'a'+(i*globalsizes[1] + j)%26; 

     printf("Global array: \n"); 
     printarray(globaldata, globalsizes[0], globalsizes[1]); 
    } 

    /* the local chunk we'll be receiving */ 
    localsizes[0] = blocksize; localsizes[1] = blocksize; 
    if (isLastRow(myrow,blocks)) localsizes[0]++; 
    if (isLastCol(mycol,blocks)) localsizes[1]++; 
    char **localdata = allocchar2darray(localsizes[0],localsizes[1]); 

    if (!strcasecmp(argv[1], "alltoall")) { 
     if (rank == 0) printf("Method - alltoall\n"); 
     alltoall(myrow, mycol, rank, size, blocks, blocksize, globalsizes, localsizes, globalptr, localdata); 
    } else if (!strcasecmp(argv[1],"twophasevecs")) { 
     if (rank == 0) printf("Method - two phase, vectors, then cleanup\n"); 
     twophasevecs(myrow, mycol, rank, size, blocks, blocksize, globalsizes, localsizes, globalptr, localdata); 
    } else { 
     if (rank == 0) printf("Method - two phase - row, cols\n"); 
     twophaseRowCol(myrow, mycol, rank, size, blocks, blocksize, globalsizes, localsizes, globalptr, localdata); 
    } 

    for (int proc=0; proc<size; proc++) { 
     if (proc == rank) { 
      printf("\nRank %d:\n", proc); 
      printarray(localdata, localsizes[0], localsizes[1]); 
     } 
     MPI_Barrier(MPI_COMM_WORLD);    
    } 

    freechar2darray(localdata); 
    if (rank == 0) 
     freechar2darray(globaldata); 

    MPI_Finalize(); 

    return 0; 
} 

char **allocchar2darray(int n, int m) { 
    char **ptrs = malloc(n*sizeof(char *)); 
    ptrs[0] = malloc(n*m*sizeof(char)); 
    for (int i=0; i<n*m; i++) 
     ptrs[0][i]='.'; 

    for (int i=1; i<n; i++) 
     ptrs[i] = ptrs[i-1] + m; 

    return ptrs; 
} 

void freechar2darray(char **a) { 
    free(a[0]); 
    free(a); 
} 

void printarray(char **data, int n, int m) { 
    for (int i=0; i<n; i++) { 
     for (int j=0; j<m; j++) 
      putchar(data[i][j]); 
     putchar('\n'); 
    } 
} 

void rowcol(int rank, const int blocks[2], int *row, int *col) { 
    *row = rank/blocks[1]; 
    *col = rank % blocks[1]; 
} 

int isLastRow(int row, const int blocks[2]) { 
    return (row == blocks[0]-1); 
} 

int isLastCol(int col, const int blocks[2]) { 
    return (col == blocks[1]-1); 
} 

int typeIdx(int row, int col, const int blocks[2]) { 
    int lastrow = (row == blocks[0]-1); 
    int lastcol = (col == blocks[1]-1); 

    return lastrow*2 + lastcol; 
}

Quelle

2015-04-06 18:10:00

Danke, das ist großartig. Ich habe einige Tests durchgeführt, um eine 4000 * 4000 Matrix über 32 Prozesse zu verteilen. Die dritte Methode (die ich mir als die beste vorstelle) dauert etwa 5 mal länger als eine einzelne Streuung. Irgendeine Idee warum? – Flash

Ich vermute, dass die dritte Methode im Hinblick auf die Skalierung am besten im Vergleich zu der ersten wäre, aber es müsste wahrscheinlich über viele Prozessoren laufen, um die Tatsache zu übertreffen, dass es zwei ziemlich große Operationen gibt.(Und ich würde erwarten, dass diese beiden Operationen immer teurer sind als eine einzelne Streuung mit einheitlicher Größe). Wenn das stimmt, hätte die zweite Methode wahrscheinlich eine ähnliche Leistung, und die erste wäre vielleicht schneller? Es ist ein wenig schwierig zu erraten, und ich gestehe, dass ich nicht dazu kam, einen richtigen Skalierungstest durchzuführen. –

Ok, ich habe die anderen beiden nicht implementiert, also könnten sie es besser machen. Ich hätte gedacht, die dritte Methode könnte nicht schlechter als ein Faktor 2 sein, da jeder Schritt nicht schlechter als die globale Streuung ist. Vielleicht gibt es Overhead beim Erstellen der Typen usw.? – Flash

Nicht sicher, ob das für Sie gilt, aber es hat mir in der Vergangenheit geholfen, so dass es für andere nützlich sein könnte.

Meine Antwort gilt im Zusammenhang mit parallel IO. Die Sache ist die, dass, wenn Sie Ihren Zugang wissen nicht überlappen, können Sie erfolgreich auch mit variablen Größen Schreiben/Lesen von MPI_COMM_SELF

Ein Stück Code ich jeden Tag benutzen enthält:

MPI_File fh; 
MPI_File_open(MPI_COMM_SELF, path.c_str(), MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh); 

// Lot of computation to get the size right 

MPI_Datatype filetype; 
MPI_Type_create_subarray(gsizes.size(), &gsizes[0], &lsizes[0], &offset[0], MPI_ORDER_C, MPI_FLOAT, &filetype); 
MPI_Type_commit(&filetype); 

MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL); 
MPI_File_write(fh, &block->field[0], block->field.size(), MPI_FLOAT, MPI_STATUS_IGNORE); 
MPI_File_close(&fh);

Quelle

2015-03-29 04:24:09 Amxx

Streu-Matrixblöcke unterschiedlicher Größe mit MPI

Antwort

Verwandte Themen