2017-04-04 2 views

Python - Formatting the output

For the following binary file (it can be downloaded here):

*NEWRECORD 
RECTYPE = D 
MH = Calcimycin 
AQ = AA AD AE AG AI AN BI BL CF CH CL CS CT EC HI IM IP ME PD PK PO RE SD ST TO TU UR 
ENTRY = A-23187|T109|T195|LAB|NRW|NLM (1991)|900308|abbcdef 
ENTRY = A23187|T109|T195|LAB|NRW|UNK (19XX)|741111|abbcdef 
ENTRY = Antibiotic A23187|T109|T195|NON|NRW|NLM (1991)|900308|abbcdef 
ENTRY = A 23187 
ENTRY = A23187, Antibiotic 
MN = D03.633.100.221.173 
PA = Anti-Bacterial Agents 
PA = Calcium Ionophores 
MH_TH = FDA SRS (2014) 
MH_TH = NLM (1975) 
ST = T109 
ST = T195 
N1 = 4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrrol-2-yl)ethyl)-1,7-dioxaspiro(5.5)undec-2-yl)methyl)-, (6S-(6alpha(2S*,3S*),8beta(R*),9beta,11alpha))- 
RN = 37H9VM9WZL 
RR = 52665-69-7 (Calcimycin) 
PI = Antibiotics (1973-1974) 
PI = Carboxylic Acids (1973-1974) 
MS = An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports CALCIUM and other divalent cations across membranes and uncouples oxidative phosphorylation while inhibiting ATPase of rat liver mitochondria. The substance is used mostly as a biochemical tool to study the role of divalent cations in various biological systems. 
OL = use CALCIMYCIN to search A 23187 1975-90 
PM = 91; was A 23187 1975-90 (see under ANTIBIOTICS 1975-83) 
HN = 91(75); was A 23187 1975-90 (see under ANTIBIOTICS 1975-83) 
MR = 20160527 
DA = 19741119 
DC = 1 
DX = 19840101 
UI = D000001 

*NEWRECORD 
RECTYPE = D 
MH = Temefos 
AQ = AA AD AE AG AI AN BL CF CH CL CS CT EC HI IM IP ME PD PK RE SD ST TO TU UR 
ENTRY = Abate|T109|T131|TRD|NRW|NLM (1996)|941114|abbcdef 
ENTRY = Difos|T109|T131|TRD|NRW|UNK (19XX)|861007|abbcdef 
ENTRY = Temephos|T109|T131|TRD|EQV|NLM (1996)|941201|abbcdef 
MN = D02.705.400.625.800 
MN = D02.705.539.345.800 
MN = D02.886.300.692.800 
PA = Insecticides 
MH_TH = FDA SRS (2014) 
MH_TH = INN (19XX) 
MH_TH = USAN (1974) 
ST = T109 
ST = T131 
N1 = Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester 
RN = ONP3ME32DL 
RR = 3383-96-8 (Temefos) 
AN = for use to kill or control insects, use no qualifiers on the insecticide or the insect; appropriate qualifiers may be used when other aspects of the insecticide are discussed such as the effect on a physiologic process or behavioral aspect of the insect; for poisoning, coordinate with ORGANOPHOSPHATE POISONING 
PI = Insecticides (1966-1971) 
MS = An organothiophosphate insecticide. 
PM = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90) 
HN = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90) 
MR = 20130708 
DA = 19990101 
DC = 1 
DX = 19910101 
UI = D000002 

I have the following Python code:

import re 

terms = {} 
numbers = {} 

meshFile = 'd2017.bin' 
with open(meshFile, mode='rb') as file: 
    mesh = file.readlines() 

outputFile = open('mesh.txt', 'w') 

for line in mesh:
    meshTerm = re.search(b'MH = (.+)$', line)
    if meshTerm:
        term = meshTerm.group(1)
    meshNumber = re.search(b'MN = (.+)$', line)
    if meshNumber:
        number = meshNumber.group(1)
        numbers[str(number)] = term
        if term in terms:
            terms[term] = terms[term] + ' ' + str(number)
        else:
            terms[term] = str(number)

cumlist = [] 
keylist = terms.keys() 
for key in keylist:
    #print('THE ORIGIN FOR ', key, file=outputFile)

    item_list = terms[key].split(" ")
    for phrase in item_list:
        cumlist.append(phrase)

print(cumlist) 

for item in cumlist: 
    print(numbers[str(item)], '\n', item, file=outputFile) 

The output looks like this:

b'Calcimycin\r' 
b'D03.633.100.221.173\r' 
b'Temefos\r' 
b'D02.705.400.625.800\r' 
b'Temefos\r' 
b'D02.705.539.345.800\r' 
b'Temefos\r' 
b'D02.886.300.692.800\r' 

How can I reformat the output so that it looks like this:

Calcimycin 
D03.633.100.221.173 
Temefos 
D02.705.400.625.800 
D02.705.539.345.800 
D02.886.300.692.800 

Thanks.

Is there a reason why you only use binary strings? – TidB

str.decode('utf-8').strip() – RaminNietzsche
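
The suggestion above can be sketched as follows. A value captured from a bytes line is still a bytes object, and calling `str()` on it returns its repr (the `b'...'` wrapper plus an escaped `\r`), which is what shows up in the question's output. Decoding and stripping gives plain text. UTF-8 is an assumption here, and the `\r\n` line ending is inferred from the `\r` visible in the output; the sample line is copied from the question.

```python
import re

# One line as file.readlines() would return it in 'rb' mode.
# The \r\n ending is an assumption based on the '\r' in the question's output.
line = b"MH = Calcimycin\r\n"

match = re.search(rb"MH = (.+)$", line)
raw = match.group(1)                    # bytes, still carrying the trailing \r

as_repr = str(raw)                      # repr of the bytes object, not its text
as_text = raw.decode("utf-8").strip()   # plain str without the trailing \r

print(as_repr)
print(as_text)
```

Applied to the question's loop, this would mean decoding `term` and `number` right after `group(1)` instead of wrapping them in `str()`.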

@TidB If you are referring to the regular expression and the use of "b" instead of "r": that is because I am reading a binary file, which is a MeSH file. The regex did not work when I used "r". Did I answer your question? – Simplicity
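
On the "b" versus "r" point: the two prefixes do different things. `b` makes the pattern a bytes object, which `re` requires when the searched data is bytes, while `r` only switches off backslash escapes in the literal. A str pattern works once the file is opened in text mode. A minimal sketch, assuming the file decodes as UTF-8 (the source does not confirm this) and using a temporary file as a stand-in for `d2017.bin`:

```python
import os
import re
import tempfile

# Stand-in for d2017.bin: two lines copied from the question, with \r\n endings.
sample = b"MH = Calcimycin\r\nMN = D03.633.100.221.173\r\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(sample)
    path = tmp.name

# A str pattern cannot be applied to bytes; re raises TypeError.
try:
    re.search(r"MH = (.+)$", sample)
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Text mode yields str lines; default newline handling turns \r\n into \n.
with open(path, encoding="utf-8") as fh:
    lines = fh.readlines()
os.remove(path)

searches = (re.search(r"^(?:MH|MN) = (.+)$", ln) for ln in lines)
found = [m.group(1) for m in searches if m]

print(mixed_ok)
print(found)
```

With text mode there is no `b'...'` wrapper and no stray `\r` to clean up afterwards.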

Answer

UPDATE: I simplified the source a bit 

You can try this regex:

MH\s*=\s*(\w+)\s*|MN\s*= \s*([^\s]*) 
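
One caveat with this pattern: `(\w+)` stops at the first character that is not a letter, digit, or underscore, so a heading consisting of several words would be cut short. A line-anchored variant that captures the rest of the line avoids this. The sketch below shows an alternative, not the regex from this answer; `anchored_regex` is a name chosen for illustration, and the two-word `MH` line is constructed for the example (its text is borrowed from a `PA` value in the sample record).

```python
import re

answer_regex = r"MH\s*=\s*(\w+)\s*|MN\s*= \s*([^\s]*)"
# Alternative: anchor at the line start and capture up to the line end.
anchored_regex = r"^(MH|MN) = (.+?)\s*$"

# Constructed two-word heading, for illustration only.
text = "MH = Calcium Ionophores\nMN = D03.633.100.221.173\nMH_TH = NLM (1975)\n"

# Each match of the answer's regex fills only one of its two groups.
short = [g for m in re.finditer(answer_regex, text) for g in m.groups() if g]
full = [m.group(2) for m in re.finditer(anchored_regex, text, re.MULTILINE)]

print(short)  # ['Calcium', 'D03.633.100.221.173']
print(full)   # ['Calcium Ionophores', 'D03.633.100.221.173']
```

Both patterns skip the `MH_TH` line, since neither allows an underscore directly after `MH`.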

Demo

Sample code: (Run it here)

import re 

regex = r"MH\s*=\s*(\w+)\s*|MN\s*= \s*([^\s]*)" 

test_str = ("*NEWRECORD\n" 
    "RECTYPE = D\n" 
    "MH = Calcimycin\n" 
    "AQ = AA AD AE AG AI AN BI BL CF CH CL CS CT EC HI IM IP ME PD PK PO RE SD ST TO TU UR\n" 
    "ENTRY = A-23187|T109|T195|LAB|NRW|NLM (1991)|900308|abbcdef\n" 
    "ENTRY = A23187|T109|T195|LAB|NRW|UNK (19XX)|741111|abbcdef\n" 
    "ENTRY = Antibiotic A23187|T109|T195|NON|NRW|NLM (1991)|900308|abbcdef\n" 
    "ENTRY = A 23187\n" 
    "ENTRY = A23187, Antibiotic\n" 
    "MN = D03.633.100.221.173\n" 
    "PA = Anti-Bacterial Agents\n" 
    "PA = Calcium Ionophores\n" 
    "MH_TH = FDA SRS (2014)\n" 
    "MH_TH = NLM (1975)\n" 
    "ST = T109\n" 
    "ST = T195\n" 
    "N1 = 4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrrol-2-yl)ethyl)-1,7-dioxaspiro(5.5)undec-2-yl)methyl)-, (6S-(6alpha(2S*,3S*),8beta(R*),9beta,11alpha))-\n" 
    "RN = 37H9VM9WZL\n" 
    "RR = 52665-69-7 (Calcimycin)\n" 
    "PI = Antibiotics (1973-1974)\n" 
    "PI = Carboxylic Acids (1973-1974)\n" 
    "MS = An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports CALCIUM and other divalent cations across membranes and uncouples oxidative phosphorylation while inhibiting ATPase of rat liver mitochondria. The substance is used mostly as a biochemical tool to study the role of divalent cations in various biological systems.\n" 
    "OL = use CALCIMYCIN to search A 23187 1975-90\n" 
    "PM = 91; was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n" 
    "HN = 91(75); was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n" 
    "MR = 20160527\n" 
    "DA = 19741119\n" 
    "DC = 1\n" 
    "DX = 19840101\n" 
    "UI = D000001\n\n" 
    "*NEWRECORD\n" 
    "RECTYPE = D\n" 
    "MH = Temefos\n" 
    "AQ = AA AD AE AG AI AN BL CF CH CL CS CT EC HI IM IP ME PD PK RE SD ST TO TU UR\n" 
    "ENTRY = Abate|T109|T131|TRD|NRW|NLM (1996)|941114|abbcdef\n" 
    "ENTRY = Difos|T109|T131|TRD|NRW|UNK (19XX)|861007|abbcdef\n" 
    "ENTRY = Temephos|T109|T131|TRD|EQV|NLM (1996)|941201|abbcdef\n" 
    "MN = D02.705.400.625.800\n" 
    "MN = D02.705.539.345.800\n" 
    "MN = D02.886.300.692.800\n" 
    "PA = Insecticides\n" 
    "MH_TH = FDA SRS (2014)\n" 
    "MH_TH = INN (19XX)\n" 
    "MH_TH = USAN (1974)\n" 
    "ST = T109\n" 
    "ST = T131\n" 
    "N1 = Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester\n" 
    "RN = ONP3ME32DL\n" 
    "RR = 3383-96-8 (Temefos)\n" 
    "AN = for use to kill or control insects, use no qualifiers on the insecticide or the insect; appropriate qualifiers may be used when other aspects of the insecticide are discussed such as the effect on a physiologic process or behavioral aspect of the insect; for poisoning, coordinate with ORGANOPHOSPHATE POISONING\n" 
    "PI = Insecticides (1966-1971)\n" 
    "MS = An organothiophosphate insecticide.\n" 
    "PM = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90)\n" 
    "HN = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90)\n" 
    "MR = 20130708\n" 
    "DA = 19990101\n" 
    "DC = 1\n" 
    "DX = 19910101\n" 
    "UI = D000002\n\n\n\n\n\n\n" 
    "Calcimycin \n" 
    "D03.633.100.221.173\n" 
    "Temefos \n" 
    "D02.705.400.625.800\n" 
    "D02.705.539.345.800\n" 
    "D02.886.300.692.800") 

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE) 

for match in matches:
    # Each match fills exactly one of the two groups (MH or MN); print that one.
    for group in match.groups():
        if group is not None:
            print(group)

Sample output:

Calcimycin 
D03.633.100.221.173 
Temefos 
D02.705.400.625.800 
D02.705.539.345.800 
D02.886.300.692.800 

How can I use this as Python code? – Simplicity

@Simplicity the code above gives you everything you want with just a single regex ... you can decide from the output how you want to process it. Updated it a bit; you can test it now, it is better formatted. –
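
One possible way to get from the regex matches to the layout asked for in the question is to walk the file line by line, remember the current `MH` heading, and collect every `MN` under it. A minimal sketch, assuming UTF-8 text and using an abbreviated in-memory sample in place of `d2017.bin`; `group_tree_numbers` and `format_groups` are hypothetical helper names, not part of any library:

```python
import io
import re

LINE_RE = re.compile(r"^(MH|MN) = (.+?)\s*$")

def group_tree_numbers(lines):
    """Map each MH heading to the list of MN tree numbers that follow it."""
    groups = {}          # dicts keep insertion order (Python 3.7+)
    current = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        field, value = match.groups()
        if field == "MH":
            current = value
            groups.setdefault(current, [])
        elif current is not None:
            groups[current].append(value)
    return groups

def format_groups(groups):
    """Print each heading once, followed by its tree numbers, one per line."""
    out = []
    for term, numbers in groups.items():
        if numbers:      # skip headings that have no MN line
            out.append(term)
            out.extend(numbers)
    return "\n".join(out)

# Abbreviated stand-in for the file. With the real file this could be
# open('d2017.bin', encoding='utf-8') instead (the encoding is an assumption).
sample = io.StringIO(
    "*NEWRECORD\nMH = Calcimycin\nMN = D03.633.100.221.173\nMH_TH = NLM (1975)\n\n"
    "*NEWRECORD\nMH = Temefos\nMN = D02.705.400.625.800\n"
    "MN = D02.705.539.345.800\nMN = D02.886.300.692.800\n"
)
result = format_groups(group_tree_numbers(sample))
print(result)
```

To write the result to `mesh.txt` as in the question, `print(result, file=outputFile)` inside a `with open('mesh.txt', 'w') as outputFile:` block would do; the `with` form also closes the file, which the original script never does.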