2017-03-04 7 views
0

Variable substring match between two columns

I have a dataset with 20,000 rows that, in its simplest form, looks like this:

  v1               v2
1 Case 1 (A v. B)  A v. B
2 Case 2 (A v. C)  A v. B
3 Case 2 (A v. C)  C v. B
4 Case 4 (X v. Z)  X v. Z
5 Case 5 (B v. A)  A v. B
6 Case 6 (X v. A)  X v. A
7 Case 6 (X v. A)  A v. X
...

... except that there are many more variations of v1 and v2 (around 150 in practice, which is still too many to list).

I would like to return a third column, v3, holding a logical indicator of whether v1 contains the string in v2 as a substring.

  v1               v2      v3
1 Case 1 (A v. B)  A v. B  TRUE
2 Case 2 (A v. C)  A v. B  FALSE
3 Case 2 (A v. C)  C v. B  FALSE
4 Case 4 (X v. Z)  X v. Z  TRUE
5 Case 5 (B v. A)  A v. B  FALSE
6 Case 6 (X v. A)  X v. A  TRUE
7 Case 6 (X v. A)  A v. X  FALSE

I have been playing around with something like this, which I think is on the right track:

library(stringr) 
x$v3 <- with(x, str_detect(v1, v2)) 

I would be very grateful if someone could point me in the right direction toward a solution or workaround.

MWE showing that my str_detect() technique does not work:

x <- structure(list(v1 = c("Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation", 
          "Application of the International Convention on the Elimination of All Forms of Racial Discrimination Georgia v Russian Federation" 
), v2 = c("Georgia v Russian Federation", " Ethiopia v South Africa Liberia v South Africa", 
      " Cameroon v United Kingdom", " New Zealand v France", " Australia v France", 
      " Nicaragua v United States of America", " Nicaragua v Honduras", 
      " Nauru v Anustralia", " Nnew Zealand v France", " Islamic Republic of Iran v United States of America", 
      " Bosnia and Herzegovina v Serbia and Montenegro", " Spain v Cananda", 
      " Libyan Arab Jamahiriya v United States of America", " Libyan Arab Jamahiriya v United Kingdom", 
      " Democratic Republic of the Congo v Burundi", " Germany v United States of America", 
      " Democratic Republic of the Congo v Belgium", " Liechtenstein v Germany", 
      " Democratic Republic of the Congo v Ugandan", " Democratic Republic of the Congo v Rwandan", 
      " Nicaragua v Colombia", " Djibouti v France", " Georgia v Russian Federation", 
      " Croatia v Serbia", " Mexico v United States of American", " Democratic Republic of the Congo v Rwanda", 
      " Spain v Canada", " Australia v France", " New Zealand v France", 
      " New Zealand v France")), .Names = c("v1", "v2" 
      ), row.names = c(NA, 30L), class = "data.frame") 

Answer

1

grepl can be used to check a single value of v2 against a possible part of v1.

You have to apply it to each row separately, so a quick solution could be:

apply(data.frame(v1,v2),MARGIN=1, FUN=function(x) {grepl(x[2],x[1])})

If you want to ignore differences in the number of spaces (like those in row #1), you can use gsub to turn the value in x[2] into a suitable regex: each " " is replaced by " *", so that any number of spaces is accepted at that position.

Applied to this case, the following works:
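As a rough illustration of the row-wise idea, here is a minimal sketch on the toy data from the question. It uses mapply() instead of apply() and passes fixed = TRUE so that v2 is treated as a literal string rather than a regular expression; both choices are assumptions of this sketch, not part of the original answer.

```r
# Toy data copied from the question's first table
x <- data.frame(
  v1 = c("Case 1 (A v. B)", "Case 2 (A v. C)", "Case 2 (A v. C)",
         "Case 4 (X v. Z)", "Case 5 (B v. A)", "Case 6 (X v. A)",
         "Case 6 (X v. A)"),
  v2 = c("A v. B", "A v. B", "C v. B", "X v. Z", "A v. B", "X v. A", "A v. X"),
  stringsAsFactors = FALSE
)

# Pair each v2 with the v1 in the same row and test for a literal substring.
# fixed = TRUE keeps characters such as "." from acting as regex wildcards.
x$v3 <- unname(mapply(
  function(pattern, text) grepl(pattern, text, fixed = TRUE),
  x$v2, x$v1
))

print(x)
```

Unlike apply(), mapply() walks over the two columns directly, so the data frame is not coerced to a character matrix first.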

apply(x,MARGIN=1, FUN=function(x) {grepl(gsub(" "," *",x[2]),x[1])})
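To make the whitespace handling concrete, the sketch below compares the " *" substitution with an alternative that normalizes whitespace on both sides before a literal match. The shortened case names, the double spaces in v1, and the helper squish() are hypothetical and only serve the illustration.

```r
# Hypothetical, shortened strings: v1 has double spaces around "v",
# v2 has a leading space and single spaces.
v1 <- c("Racial Discrimination Georgia  v  Russian Federation",
        "Racial Discrimination Georgia  v  Russian Federation")
v2 <- c(" Georgia v Russian Federation", " Spain v Canada")

# Approach from the answer: every space in v2 becomes " *",
# i.e. "zero or more spaces" at that position.
flexible <- gsub(" ", " *", v2)
match_regex <- unname(mapply(grepl, flexible, v1))

# Alternative (an assumption of this sketch): collapse runs of whitespace
# and trim both columns, then do a literal substring match.
squish <- function(s) trimws(gsub("\\s+", " ", s))
match_norm <- unname(mapply(
  function(pattern, text) grepl(pattern, text, fixed = TRUE),
  squish(v2), squish(v1)
))

print(data.frame(v2 = v2, match_regex, match_norm))
```

Because " *" also accepts zero spaces, the regex variant would match run-together text such as "Georgiav"; the normalizing variant still requires a space between words.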

+1

I don't think you are right. v1 in both the 1st and the 23rd row contains two spaces after Georgia and after the "v", whereas v2 does not contain the double space. I will add an explanation of the spaces, and how to handle them, to the answer –

+0

Can you post the function you used here? And maybe check the data you posted? I created the data frame you posted in the question, applied the same function, and got TRUE for rows 1 and 23; everything else is FALSE –

+0

I reset my memory and it works - thank you! You saved me a lot of time. Glad the answer was so simple. I was also able to implement fuzzy string matching with the agrepl() function: apply(x, MARGIN = 1, FUN = function(x) {agrepl(gsub(" ", " *", x[2]), x[1], max.distance = .25)}) – beddotcom
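For readers who want to try the fuzzy variant, here is a small sketch under a few assumptions. agrepl() treats its pattern as a literal string by default (fixed = TRUE), so instead of the " *" regex this sketch normalizes whitespace first with a hypothetical helper squish(). The case names are shortened, made-up examples, and the max.distance value is taken from the comment above.

```r
# Hypothetical, shortened case names; the first v2 contains a typo ("Cananda").
v1 <- c("Fisheries Jurisdiction Spain  v  Canada",
        "Fisheries Jurisdiction Spain  v  Canada")
v2 <- c(" Spain v Cananda", " Liechtenstein v Germany")

# Collapse runs of whitespace and trim, so spacing differences do not
# count toward the edit distance.
squish <- function(s) trimws(gsub("\\s+", " ", s))

# Approximate substring match, row by row. max.distance = 0.25 allows
# edits up to a quarter of the pattern length.
fuzzy <- unname(mapply(
  function(pattern, text) agrepl(pattern, text, max.distance = 0.25),
  squish(v2), squish(v1)
))

print(data.frame(v2 = v2, fuzzy))
```

A looser max.distance catches more typos but also raises the chance of false positives between similar party names, so the threshold is worth checking against a sample of the real data.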
