2016-06-02 16 views

Correcting the URL

I wrote a simple script that reads JSON from the Wikipedia API, using keywords from a file to build the request URL.

Below is the script I wrote:

import urllib2 
import json 

f1 = open('CatList.text', 'r') 
f2 = open('SubList.text', 'w') 
lines = f1.read().splitlines() 


for line in lines: 

    url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100' 
    json_obj = urllib2.urlopen(url) 
    data = json.load(json_obj) 
    for item in data['query']: 
      for i in data['query']['categorymembers']: 
       print i['title'] 
       print '-----------------------------------------' 
       f2.write((i['title']).encode('utf8')+"\n") 

In this script, the program first reads CatList.text, which holds the list of keywords used to build the URL.

Here is an example of what CatList.text contains:

Category:Branches of geography 
Category:Geography by place 
Category:Geography awards and competitions 
Category:Geography conferences 
Category:Geography education 
Category:Environmental studies 
Category:Exploration 
Category:Geocodes 
Category:Geographers 
Category:Geographical zones 
Category:Geopolitical corridors 
Category:History of geography 
Category:Land systems 
Category:Landscape 
Category:Geography-related lists 
Category:Lists of countries by geography 
Category:Navigation 
Category:Geography organizations 
Category:Places 
Category:Geographical regions 
Category:Surveying 
Category:Geographical technology 
Category:Geography terminology 
Category:Works about geography 
Category:Geographic images 
Category:Geography stubs 

My program takes each keyword and places it in the URL.

However, I am not able to get the result. I checked the code by printing the URL:

import urllib2 
import json 

f1 = open('CatList.text', 'r') 
f2 = open('SubList2.text', 'w') 
lines = f1.read().splitlines() 


for line in lines: 

    url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100' 
    json_obj = urllib2.urlopen(url) 
    data = json.load(json_obj) 


    f2.write(url+'\n') 

The result I get in SubList2.text is as follows:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches of geography&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography by place&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography awards and competitions&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography conferences&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography education&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Environmental studies&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Exploration&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geocodes&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographers&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical zones&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geopolitical corridors&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:History of geography&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Land systems&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Landscape&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Lists of countries by geography&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Navigation&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography organizations&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Places&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical regions&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Surveying&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical technology&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography terminology&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Works about geography&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographic images&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography stubs&cmlimit=100 

This shows that the keyword is placed in the URL correctly.

But when I run the full code, it does not get the right result.

One thing I noticed is that when I paste one of these links into the browser's address bar, for example:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches of geography&cmlimit=100 

it returns the correct result, because the address bar automatically corrects it to:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches%20of%20geography&cmlimit=100 

I believe that if %20 is put in place of each space in a title such as "Category:Branches of geography", my script will be able to get the correct JSON items.

Problem: I am not sure how to change the statement in the code above so that the spaces in the CatList.text entries are replaced with %20.

Please forgive the poor formatting and the long post; I am still learning Python.

Thank you for your help.

Edit:

Thank you, Tim. Your solution works:

url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+urllib2.quote(line)+'&cmlimit=100' 

It was able to print the correct result:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ABranches%20of%20geography&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20by%20place&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20awards%20and%20competitions&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20conferences&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20education&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AEnvironmental%20studies&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AExploration&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeocodes&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographers&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographical%20zones&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeopolitical%20corridors&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AHistory%20of%20geography&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ALand%20systems&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ALandscape&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography-related%20lists&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ALists%20of%20countries%20by%20geography&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ANavigation&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20organizations&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3APlaces&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographical%20regions&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ASurveying&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographical%20technology&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20terminology&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWorks%20about%20geography&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographic%20images&cmlimit=100 
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20stubs&cmlimit=100 
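
For readers on Python 3, the whole script with this fix might look roughly like the sketch below. It is not the original code: the helper names `build_url`, `extract_titles`, `fetch_titles` and `main` are made up for illustration, and it builds the query string with `urllib.parse.urlencode` instead of string concatenation. Passing `quote_via=quote` makes spaces come out as `%20`, matching the URLs printed above.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API = 'https://en.wikipedia.org/w/api.php'


def build_url(title, limit=100):
    """Build the categorymembers request URL for one category title."""
    params = {
        'action': 'query',
        'format': 'json',
        'list': 'categorymembers',
        'cmtitle': title,
        'cmlimit': limit,
    }
    # quote_via=quote encodes spaces as %20 (the default would emit '+').
    return API + '?' + urlencode(params, quote_via=quote)


def extract_titles(data):
    """Return the member titles from a decoded API response."""
    # One loop over the member list is enough; no outer loop over data['query'].
    return [member['title'] for member in data['query']['categorymembers']]


def fetch_titles(title):
    """Request one category and return its member titles (needs network)."""
    with urlopen(build_url(title)) as response:
        return extract_titles(json.load(response))


def main(in_path='CatList.text', out_path='SubList.text'):
    """Read category titles from in_path and write member titles to out_path."""
    with open(in_path, encoding='utf8') as src, \
            open(out_path, 'w', encoding='utf8') as dst:
        for line in src.read().splitlines():
            if not line.strip():
                continue  # skip blank lines in the input file
            for member_title in fetch_titles(line.strip()):
                dst.write(member_title + '\n')
```

Calling `main()` would run the whole pipeline. The network part is untested here; some servers reject requests that lack a descriptive User-Agent header, in which case a `urllib.request.Request` with explicit headers would be needed.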

Answer

Use urllib.quote() to escape special characters in a URL:

Python 2:

import urllib 
line = 'Category:Branches of geography' 
url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=' + urllib.quote(line) + '&cmlimit=100' 

https://docs.python.org/2/library/urllib.html#urllib.quote

Python 3:

import urllib.parse 
line = 'Category:Branches of geography' 
url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=' + urllib.parse.quote(line) + '&cmlimit=100' 

https://docs.python.org/3.5/library/urllib.parse.html#urllib.parse.quote
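
As a small illustration of what `quote` does to one of the category titles, here is a Python 3 sketch. The helper `build_url` and the constant `BASE` are hypothetical names introduced only for this example. By default `quote` escapes both the colon and the spaces; passing `safe=':'` keeps the colon, which gives the same form the browser's address bar produced in the question.

```python
from urllib.parse import quote

BASE = ('https://en.wikipedia.org/w/api.php'
        '?action=query&format=json&list=categorymembers')


def build_url(title, limit=100):
    """Return the request URL for one title (hypothetical helper)."""
    # Quote only the title; the rest of the URL is already valid.
    return BASE + '&cmtitle=' + quote(title) + '&cmlimit=' + str(limit)


title = 'Category:Branches of geography'

default_quoted = quote(title)        # escapes ':' and ' '
colon_kept = quote(title, safe=':')  # escapes ' ' only

print(default_quoted)    # Category%3ABranches%20of%20geography
print(colon_kept)        # Category:Branches%20of%20geography
print(build_url(title))
```

The important point is that only the parameter value is passed to `quote`, not the complete URL.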

Thanks for the quick reply. Could I have an example of how to use it, please? – windboy

I tried import urllib. I then set "url_add = wiki url", followed by "url = urllib.quote(url_add)", followed by "json_obj = urllib.urlopen(url)" and finally "data = json.load(json_obj)", but it did not work. – windboy

The error message I get is "No such file or directory": 'https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches of geography&cmlimit=100' – windboy
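
A likely explanation for the attempt described in these comments is that `quote` was applied to the whole URL rather than to the title alone. The Python 3 sketch below shows what that does; the variable name `url_add` is taken from the comment, the rest is illustrative. Once the `:` after `https` and the `?`, `=` and `&` separators are escaped, the string no longer parses as a URL with a scheme. As far as I can tell, Python 2's `urllib.urlopen` treats such a string as a local file path, which would be consistent with a "No such file or directory" error.

```python
from urllib.parse import quote, urlsplit

url_add = ('https://en.wikipedia.org/w/api.php?action=query&format=json'
           '&list=categorymembers&cmtitle=Category:Branches of geography'
           '&cmlimit=100')

# Quoting the complete URL escapes its structural characters as well.
whole = quote(url_add)

print(whole)
print(repr(urlsplit(whole).scheme))  # '' : no scheme is recognized any more
print(urlsplit(url_add).scheme)      # https
```

Quoting only the `cmtitle` value, as in the answer, leaves the scheme and the query separators intact.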
