2017-04-13 6 views
-1

Username web scraping from a forum with Python

I am hoping to get some help from you. I want to scrape usernames from a forum with Python, but I could not figure out how to do it. The following is part of the page's HTML:

Part 1

<td class="alt2" title="reply: 11,view: 1,097"> 
    <div class="smallfont" style="text-align:right; white-space:nowrap"> 
    2017-03-28 <span class="time">23:44</span><br> 

    <a href="member.php?find=lastposter&amp;t=1907777" rel="nofollow">username</a> <a href="showthread.php?p=9575713#post9575713"><img class="inlineimg" src="http://s.bbkz.net/forum/images/buttons_style/tc_2/lastpost.gif" alt="last" title="last" border="0"></a> 
    </div> 
</td> 

Part 2

<div class="smallfont"> 
    <span style="cursor:pointer" onclick="window.open('member.php?u=353562', '_self')">username</span> 
</div> 

Also, the forum links have the following format: http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page=3

I want to scrape the 'username' from this HTML across several pages with Python. Could I have your help?

Thank you very much!

[Edit - added time.sleep] Should it be like this?

import requests 
from bs4 import BeautifulSoup 
import time 

url = 'http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page=3' 

html_source = requests.get(url).text 

soup = BeautifulSoup(html_source, 'html.parser') 

a_tags = soup.find_all('a') 

for a in a_tags: 
    if 'member.php?' in a['href']: 
        print(a.text) 

time.sleep(10) 

These are the error messages:

Traceback (most recent call last): 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 138, in _new_conn 
(self.host, self.port), self.timeout, **extra_kw) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\connection.py", line 98, in create_connection 
raise err 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\connection.py", line 88, in create_connection 
sock.connect(sa) 
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 594, in urlopen 
chunked=chunked) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 361, in _make_request 
conn.request(method, url, **httplib_request_kw) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1106, in request 
self._send_request(method, url, body, headers) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1151, in _send_request 
self.endheaders(body) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1102, in endheaders 
self._send_output(message_body) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 934, in _send_output 
self.send(msg) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 877, in send 
self.connect() 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 163, in connect 
conn = self._new_conn() 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 147, in _new_conn 
self, "Failed to establish a new connection: %s" % e) 
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x029131F0>:  Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\adapters.py", line 423, in send 
timeout=timeout 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 643, in urlopen 
_stacktrace=sys.exc_info()[2]) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\retry.py", line 363, in increment 
raise MaxRetryError(_pool, url, error or ResponseError(cause)) 
requests.packages.urllib3.exceptions.MaxRetryError: 
HTTPConnectionPool(host='www.example.com', port=80): Max retries exceeded with url: /forum/forumdisplay.php?f=148&order=desc&page=3 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x029131F0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',)) 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
File "C:/Users/user/PycharmProjects/untitled/backpackertw_v1.py", line 6, in <module> 
html_source = requests.get(url).text 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 70, in get 
return request('get', url, params=params, **kwargs) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 56, in request 
return session.request(method=method, url=url, **kwargs) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 488, in request 
resp = self.send(prep, **send_kwargs) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 609, in send 
r = adapter.send(request, **kwargs) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\adapters.py", line 487, in send 
raise ConnectionError(e, request=request) 
requests.exceptions.ConnectionError: 
HTTPConnectionPool(host='www.example.com', port=80): Max retries exceeded with url: /forum/forumdisplay.php?f=148&order=desc&page=3 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x029131F0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',)) 
+2

You can use BeautifulSoup, and Google is always your friend. – anonyXmous

+0

'requests', 'beautifulsoup', Google .. –

Answer

0

Your code will look something like this:

import requests 
from bs4 import BeautifulSoup 

url = 'http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page=3' 

html_source = requests.get(url).text 

soup = BeautifulSoup(html_source, 'html.parser') 

a_tags = soup.find_all('a') 

for a in a_tags: 
    # .get() avoids a KeyError for <a> tags that have no href attribute
    if 'member.php?' in a.get('href', ''): 
        print(a.text) 
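
The loop above only looks at <a> tags, so it would miss the username in Part 2 of the question, which sits in a <span> whose 'onclick' attribute points at 'member.php'. A minimal sketch that handles both fragments, assuming the real pages look like the two snippets in the question. The function name 'extract_usernames' and the sample names 'alice' and 'bob' are made up for illustration:

```python
from bs4 import BeautifulSoup


def extract_usernames(html_source):
    """Collect usernames from links or spans that point at member.php."""
    soup = BeautifulSoup(html_source, 'html.parser')
    names = []
    for tag in soup.find_all(['a', 'span']):
        # Part 1 keeps the target in href, Part 2 keeps it in onclick.
        target = tag.get('href') or tag.get('onclick') or ''
        name = tag.get_text(strip=True)
        if 'member.php?' in target and name:
            names.append(name)
    return names


# Made-up sample markup modelled on Part 1 and Part 2 of the question.
sample = '''
<a href="member.php?find=lastposter&amp;t=1907777" rel="nofollow">alice</a>
<a href="showthread.php?p=9575713#post9575713"><img alt="last"></a>
<span style="cursor:pointer" onclick="window.open('member.php?u=353562', '_self')">bob</span>
'''
print(extract_usernames(sample))
```

The 'showthread.php' link is skipped because its target does not contain 'member.php?'. If the real forum uses other tags or attributes for usernames, the tag list and the attribute lookup would need adjusting.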

Then you will need to apply it to several more pages, using a loop to build each URL:

i.e.

for i in range(10): 
    url = 'http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page={}'.format(i) 
    ### 
    #insert the rest of your code here 
    ### 
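
Filled in, the loop might look like the sketch below. It assumes that page numbers start at 1 and that the query parameters are the ones from the URL in the question. 'page_urls' and 'crawl' are hypothetical helper names, and 'fetch' stands for whatever function downloads one page (for example a small wrapper around 'requests.get'). Note that the pause goes inside the loop, between requests, rather than once at the end of the script:

```python
import time
from urllib.parse import urlencode

BASE_URL = 'http://www.example.com/forum/forumdisplay.php'  # placeholder host


def page_urls(forum_id, first_page, last_page):
    """Yield one listing URL per page number, inclusive of both ends."""
    for page in range(first_page, last_page + 1):
        query = urlencode({'f': forum_id, 'order': 'desc', 'page': page})
        yield '{}?{}'.format(BASE_URL, query)


def crawl(urls, fetch, delay=10):
    """Call fetch(url) for every URL, sleeping `delay` seconds in between."""
    pages = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the very first request
            time.sleep(delay)
        pages.append(fetch(url))
    return pages


for url in page_urls(148, 1, 3):
    print(url)
```

Passing 'fetch' in as an argument keeps the pacing logic separate from the download code, so it can be tried out with a dummy function before hitting the real forum.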
+0

Thank you for your help. However, I got the error messages shown above. – Jasonm4432

+0

I saw your edit ... take a look at this part: 'TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'. You must have either called the wrong URL, or called the right URL too quickly, since the code I provided has no sleep time. –
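
One way to make the script survive a slow or unresponsive host is to set a timeout and retry a few times with a pause in between. This is only a sketch: 'fetch_html' is a hypothetical helper, and the retry count, delay, and timeout values are arbitrary assumptions, not recommendations from the forum or from 'requests':

```python
import time

import requests


def fetch_html(url, get=requests.get, retries=3, delay=10, timeout=15):
    """Return the page HTML, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()  # treat 4xx/5xx answers as failures
            return response.text
        except requests.exceptions.RequestException as exc:
            # ConnectionError and Timeout are both subclasses of this.
            print('attempt {} failed: {}'.format(attempt, exc))
            if attempt < retries:
                time.sleep(delay)
    return None
```

A caller would then write 'html_source = fetch_html(url)' and skip the page when the result is None. Retrying does not help if the URL itself is wrong, so it is worth checking the address first.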

+0

If you look at the last part, you see this message: 'requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.example.com', port=80):', so maybe you should change the host to the correct one instead of 'host='www.example.com''. –
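
A quick way to see which host a URL really targets is to parse it before sending the request. A small sketch using only the standard library. 'is_placeholder_url' and the 'PLACEHOLDER_HOSTS' set are made up for illustration, on the assumption that 'www.example.com' was left in as a stand-in for the real forum address:

```python
from urllib.parse import urlsplit

# Hosts assumed to be stand-ins rather than the real forum.
PLACEHOLDER_HOSTS = {'example.com', 'www.example.com'}


def is_placeholder_url(url):
    """Return True if the URL still points at an example.com placeholder."""
    host = urlsplit(url).hostname or ''
    return host.lower() in PLACEHOLDER_HOSTS


url = 'http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page=3'
print(urlsplit(url).hostname)   # www.example.com
print(is_placeholder_url(url))  # True
```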