2013-04-07 17 views
10

Python difflib: comparing files

I am trying to use difflib to produce a diff of two text files containing tweets. Here is the code:

#!/usr/bin/env python

# difflib_test

import difflib

# Open both files and diff them line by line (readlines keeps line endings).
with open('/home/saad/Code/test/new_tweets', 'r') as file1, \
     open('/home/saad/PTITVProgs', 'r') as file2:
    diff = difflib.context_diff(file1.readlines(), file2.readlines())
    delta = ''.join(diff)

print(delta)

Here is the PTITVProgs text file:

Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI 
CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar at 8PM on ARY News. Rgds #PTI 
Watch PTI on April 6th (5) @Asad_Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI 
Watch PTI on April 6th (5) Asad Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI 
Watch PTI on April 6th (5) Waleed Iqbal at 8PM on Channel 5. Rgds #PTI 
Watch PTI on April 6th (3) Dr Israr Shah at 10PM on PTV News. Rgds #PTI 
Watch PTI on April 6th (4) Javed hashmi at 1PM on PTV News. Rgds #PTI 
Watch PTI on April 6th (3) Imran Alvi at 1PM on AAJ News. Rgds #PTI 
Watch PTI on April 6th (1) Dr @ArifAlvi, Andleeb Abbas and Ehtisham Ameer at 11PM on ARY News (2) Hamid Khan at 10PM on ATV. Rgds #PTI 
Watch PTI on April 5th (1) Farooq Amjad Meer at 10:45PM on Dunya News. Rgds #PTI 
Watch PTI on April 4th (4) Faisal Khan at 8PM on PTV News. Rgds #PTI 
@FaisalJavedKhan 
Watch PTI on April 4th (3) Faisal Khan at 11PM on ATV. Rgds #PTI 
@FaisalJavedKhan 
Watch PTI on April 4th (1) Dr Israr Shah at 8PM on Waqt News (2) Dr Arif Alvi at 9PM on PTV World. Rgds #PTI 
@ArifAlvi 
Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI 
Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI 
Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI 
Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI 
Watch PTI on April 3rd (8) Mehmood Rasheed at 8PM on ARY News. Rgds #PTI 
Watch PTI on April 3rd (7) Israr Abbasi (Repeat on Arp 4th) at 1:20AM and 1PM on Vibe TV. Rgds #PTI 
Watch PTI on April 3rd (5) Rao Fahad at 9PM on Express News (6) Dr Seems Zia at 10:30PM on Health TV. Rgds #PTI 

Here is the new_tweets text file:

Watch PTI on April 7th (3) Malaika Reza at 8PM on AAJ News (4) Shah Mehmood Qureshi at 8PM on Geo News. Rgds #PTI 
Watch PTI on April 7th (2) Chairman IMRAN KHAN at 10PM on PTV News in News Night with Sadia Afzal, Rpt: 2AM, 2PM. Rgds #PTI 
@ImranKhanPTI 
Watch PTI on April 7th (1) Dr Waseem Shahzad NOW at 6PM on PTV News. Rgds #PTI 
Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI 
CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar at 8PM on ARY News. Rgds #PTI 
Watch PTI on April 6th (5) @Asad_Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI 
Watch PTI on April 6th (5) Asad Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI 
Watch PTI on April 6th (5) Waleed Iqbal at 8PM on Channel 5. Rgds #PTI 
Watch PTI on April 6th (3) Dr Israr Shah at 10PM on PTV News. Rgds #PTI 
Watch PTI on April 6th (4) Javed hashmi at 1PM on PTV News. Rgds #PTI 
Watch PTI on April 6th (3) Imran Alvi at 1PM on AAJ News. Rgds #PTI 
Watch PTI on April 6th (1) Dr @ArifAlvi, Andleeb Abbas and Ehtisham Ameer at 11PM on ARY News (2) Hamid Khan at 10PM on ATV. Rgds #PTI 
Watch PTI on April 5th (1) Farooq Amjad Meer at 10:45PM on Dunya News. Rgds #PTI 
Watch PTI on April 4th (4) Faisal Khan at 8PM on PTV News. Rgds #PTI 
@FaisalJavedKhan 
Watch PTI on April 4th (3) Faisal Khan at 11PM on ATV. Rgds #PTI 
@FaisalJavedKhan 
Watch PTI on April 4th (1) Dr Israr Shah at 8PM on Waqt News (2) Dr Arif Alvi at 9PM on PTV World. Rgds #PTI 
@ArifAlvi 
Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI 
Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI 
Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI 
Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI 

Here is the diff output from the program:

*** 
--- 
*************** 
*** 1,7 **** 
- Watch PTI on April 7th (3) Malaika Reza at 8PM on AAJ News (4) Shah Mehmood Qureshi at 8PM on Geo News. Rgds #PTI 
- Watch PTI on April 7th (2) Chairman IMRAN KHAN at 10PM on PTV News in News Night with Sadia Afzal, Rpt: 2AM, 2PM. Rgds #PTI 
- @ImranKhanPTI 
- Watch PTI on April 7th (1) Dr Waseem Shahzad NOW at 6PM on PTV News. Rgds #PTI 
    Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI 
    CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar at 8PM on ARY News. Rgds #PTI 
    Watch PTI on April 6th (5) @Asad_Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI 
--- 1,3 ---- 
*************** 
*** 21,24 **** 
    Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI 
    Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI 
    Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI 
! Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI--- 17,23 ---- 
    Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI 
    Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI 
    Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI 
! Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI 
! Watch PTI on April 3rd (8) Mehmood Rasheed at 8PM on ARY News. Rgds #PTI 
! Watch PTI on April 3rd (7) Israr Abbasi (Repeat on Arp 4th) at 1:20AM and 1PM on Vibe TV. Rgds #PTI 
! Watch PTI on April 3rd (5) Rao Fahad at 9PM on Express News (6) Dr Seems Zia at 10:30PM on Health TV. Rgds #PTI 

As you can see by quickly comparing the two source files (PTITVProgs and new_tweets), the differences between them are the 3 tweets from April 7th and 3 tweets from April 3rd.

I only want the lines in new_tweets that are not in PTITVProgs to appear in the diff.

Instead, it outputs a bunch of text I don't want to see. I don't know what *** 1,7 **** and --- 1,3 ---- in the diff output stand for. What is the right way to get only the changed lines?

+1

I'm not sure difflib is the right tool for the job you're trying to do at all. Generating a diff is much more work (algorithmically speaking) than just doing a set comparison: 'print(set(fileA.readlines()).difference(fileB.readlines()))'
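The set comparison suggested in this comment can be sketched as follows. This is only an illustration: the helper name `lines_only_in_first` and the sample tweets are made up, and it assumes that neither duplicate lines nor line positions matter. Unlike a bare `set.difference`, it keeps the original order and ignores a missing newline on the last line of a file.

```python
def lines_only_in_first(first_lines, second_lines):
    """Return the lines of first_lines that never occur in second_lines.

    Hypothetical helper: order is preserved, and trailing newlines are
    ignored so that a final line without '\n' still compares equal.
    """
    seen = {line.rstrip("\n") for line in second_lines}
    return [line for line in first_lines if line.rstrip("\n") not in seen]


# Made-up sample data standing in for the two tweet files.
new_tweets = ["tweet A\n", "tweet B\n", "tweet C"]   # last line has no newline
old_tweets = ["tweet B\n", "tweet C\n"]

print(lines_only_in_first(new_tweets, old_tweets))
```

With real files, the two lists would come from `readlines()` as in the question.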

Answer

20

Just parse the output of the diff like this (change '- ' to '+ ' if needed):

#!/usr/bin/env python

# difflib_test

import difflib

with open('/home/saad/Code/test/new_tweets', 'r') as file1, \
     open('/home/saad/PTITVProgs', 'r') as file2:
    diff = difflib.ndiff(file1.readlines(), file2.readlines())
    # Keep only lines unique to the first file and drop the 2-character '- ' prefix.
    delta = ''.join(x[2:] for x in diff if x.startswith('- '))

print(delta)
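The same filtering can be wrapped in a small function so it is easy to try on in-memory data. This is a sketch, not part of the original answer: the name `split_changes` and the shortened sample lines are assumptions. It collects both directions at once, using the '- ' prefix for lines only in the first sequence and '+ ' for lines only in the second.

```python
import difflib


def split_changes(first_lines, second_lines):
    """Return (only_in_first, only_in_second) based on ndiff prefixes.

    Hypothetical helper: '- ' marks lines unique to first_lines and
    '+ ' marks lines unique to second_lines; '  ' context lines and
    '? ' hint lines are skipped.
    """
    only_in_first, only_in_second = [], []
    for entry in difflib.ndiff(first_lines, second_lines):
        if entry.startswith("- "):
            only_in_first.append(entry[2:])
        elif entry.startswith("+ "):
            only_in_second.append(entry[2:])
    return only_in_first, only_in_second


# Shortened, made-up stand-ins for the two tweet files.
new_tweets = ["April 7th (1)\n", "April 6th (7)\n", "April 6th (5)\n"]
old_tweets = ["April 6th (7)\n", "April 6th (5)\n", "April 3rd (8)\n"]

removed, added = split_changes(new_tweets, old_tweets)
print(removed, added)
```

Note that ndiff compares lines including their line endings, so a last line without a trailing newline is reported as changed even if its text matches.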
14

The difflib library offers several diff styles, each with its own function: unified_diff, ndiff and context_diff.

If you don't want the line-number summaries, the ndiff function produces a Differ-style delta:

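import difflib

f1 = '''1
2
3
4
5'''
f2 = '''1
3
4
5
6'''

# ndiff expects sequences of lines; passing the raw strings would
# compare them character by character. list() lets us reuse the result.
diff = list(difflib.ndiff(f1.splitlines(), f2.splitlines()))

for l in diff:
    print(l)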

Output:

  1
- 2
  3
  4
  5
+ 6

EDIT:

You could also parse the diff to extract only the changes, if that is what you want:

>>> changes = [l for l in difflib.ndiff(f1.splitlines(), f2.splitlines())
...            if l.startswith('+ ') or l.startswith('- ')]
>>> for c in changes:
...     print(c)
...
- 2
+ 6
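The unified_diff function mentioned above takes a context-size parameter `n`, so another way to avoid context lines is to request none in the first place. The following is a minimal sketch under that assumption, reusing the same numbers as sample data; the variable names are illustrative. It skips the two file header lines and keeps only lines whose first character is '+' or '-', which also drops the '@@' hunk headers.

```python
import difflib
from itertools import islice

old_lines = ["1", "2", "3", "4", "5"]
new_lines = ["1", "3", "4", "5", "6"]

# n=0 asks for zero context lines; lineterm="" because the inputs
# carry no trailing newlines.
diff = difflib.unified_diff(old_lines, new_lines, lineterm="", n=0)

# The first two yielded lines are the '---' / '+++' file headers.
body = islice(diff, 2, None)

# Hunk headers start with '@@', so only real changes remain.
changes = [line for line in body if line.startswith(("+", "-"))]
print(changes)
```

Here the marker is a single character with no following space, so stripping it would be `line[1:]` rather than `line[2:]` as with ndiff.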
+1

As I read the question, the OP doesn't want the context lines either. The answer above by @gatto is closer to the format the OP asked for.