2017-04-13

Scraping usernames from a forum with Python

Hoping to get some help from you! I want to scrape the usernames from a forum with Python, but I could not figure out how. The following is part of the page's HTML:


<td class="alt2" title="reply: 11,view: 1,097"> 
    <div class="smallfont" style="text-align:right; white-space:nowrap"> 
    2017-03-28 <span class="time">23:44</span><br> 

    <a href="member.php?find=lastposter&amp;t=1907777" rel="nofollow">username</a> <a href="showthread.php?p=9575713#post9575713"><img class="inlineimg" src="http://s.bbkz.net/forum/images/buttons_style/tc_2/lastpost.gif" alt="last" title="last" border="0"></a> 

Part 2

<div class="smallfont"> 
    <span style="cursor:pointer" onclick="window.open('member.php?u=353562', '_self')">username</span> 

Also, the forum links have the following format: http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page=3

I want to scrape the 'username' from this HTML across several pages with Python. May I have your help?

Thank you very much!

[Edit - added time sleep] Should it be like this?

import requests 
from bs4 import BeautifulSoup 
import time 

url = 'http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page=3' 

html_source = requests.get(url).text 

soup = BeautifulSoup(html_source, 'html.parser') 

a_tags = soup.find_all('a') 

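for a in a_tags:
    # Some <a> tags have no href, so use .get() to avoid a KeyError.
    if 'member.php?' in a.get('href', ''):
        print(a.get_text(strip=True))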


These are the error messages:

Traceback (most recent call last): 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 138, in _new_conn 
(self.host, self.port), self.timeout, **extra_kw) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\connection.py", line 98, in create_connection 
raise err 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\connection.py", line 88, in create_connection 
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 594, in urlopen 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 361, in _make_request 
conn.request(method, url, **httplib_request_kw) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1106, in request 
self._send_request(method, url, body, headers) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1151, in _send_request 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1102, in endheaders 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 934, in _send_output 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 877, in send 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 163, in connect 
conn = self._new_conn() 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connection.py", line 147, in _new_conn 
self, "Failed to establish a new connection: %s" % e) 
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x029131F0>:  Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\adapters.py", line 423, in send 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 643, in urlopen 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\packages\urllib3\util\retry.py", line 363, in increment 
raise MaxRetryError(_pool, url, error or ResponseError(cause)) 
HTTPConnectionPool(host='www.example.com', port=80): Max retries exceeded with url: /forum/forumdisplay.php?f=148&order=desc&page=3 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x029131F0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',)) 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
File "C:/Users/user/PycharmProjects/untitled/backpackertw_v1.py", line 6, in <module> 
html_source = requests.get(url).text 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 70, in get 
return request('get', url, params=params, **kwargs) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\api.py", line 56, in request 
return session.request(method=method, url=url, **kwargs) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 488, in request 
resp = self.send(prep, **send_kwargs) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\sessions.py", line 609, in send 
r = adapter.send(request, **kwargs) 
File "C:\Users\user\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\adapters.py", line 487, in send 
raise ConnectionError(e, request=request) 
HTTPConnectionPool(host='www.example.com', port=80): Max retries exceeded with url: /forum/forumdisplay.php?f=148&order=desc&page=3 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x029131F0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',)) 

You can use beautifulsoup, and Google is always your friend. – anonyXmous


'requests', 'beautifulsoup', Google ... –



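Your code will look roughly like this:

import requests
from bs4 import BeautifulSoup

url = 'http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page=3'

# Download the listing page; the timeout keeps an unreachable host from hanging.
html_source = requests.get(url, timeout=10).text

soup = BeautifulSoup(html_source, 'html.parser')

# Member links look like <a href="member.php?...">username</a>.
# href=True skips <a> tags without an href attribute.
for a in soup.find_all('a', href=True):
    if 'member.php?' in a['href']:
        print(a.get_text(strip=True))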
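
The `member.php?` filter can be tried offline against the two fragments from the question before touching the live site. The following is only a sketch: `SAMPLE_HTML` is a trimmed copy of those fragments, the names `alice` and `bob` are placeholders for real usernames, and `extract_usernames` is a hypothetical helper. It also covers the "Part 2" case, where the username sits in a `<span>` whose `onclick` attribute mentions `member.php`.

```python
from bs4 import BeautifulSoup

# Trimmed copies of the two fragments from the question.
# 'alice' and 'bob' are placeholder usernames.
SAMPLE_HTML = '''
<td class="alt2">
  <div class="smallfont" style="text-align:right; white-space:nowrap">
    <a href="member.php?find=lastposter&amp;t=1907777" rel="nofollow">alice</a>
    <a href="showthread.php?p=9575713#post9575713">last</a>
  </div>
</td>
<td>
  <div class="smallfont">
    <span style="cursor:pointer" onclick="window.open('member.php?u=353562', '_self')">bob</span>
  </div>
</td>
'''


def extract_usernames(html):
    """Return usernames found in member links and member onclick spans."""
    soup = BeautifulSoup(html, 'html.parser')
    names = []
    # Case 1: <a href="member.php?...">username</a>
    for a in soup.find_all('a', href=True):
        if 'member.php?' in a['href']:
            names.append(a.get_text(strip=True))
    # Case 2: <span onclick="window.open('member.php?u=...')">username</span>
    for span in soup.find_all('span', onclick=True):
        if 'member.php?' in span['onclick']:
            names.append(span.get_text(strip=True))
    return names


print(extract_usernames(SAMPLE_HTML))
```

The `showthread.php` link is skipped because its `href` does not contain `member.php?`. On the real page the same function could be given `requests.get(url).text` instead of `SAMPLE_HTML`.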

Then you will need to apply it to a few more pages, using a loop to build each URL:


for i in range(1, 11):  # pages 1 to 10, assuming page numbers start at 1
    url = 'http://www.example.com/forum/forumdisplay.php?f=148&order=desc&page={}'.format(i)
    # insert the rest of your code here
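
As an alternative to formatting the string by hand, the query string can be built with the standard library. This is a minimal sketch: `page_urls` and `BASE_URL` are names made up for illustration, and the host is the `www.example.com` placeholder from the question, not the real forum.

```python
from urllib.parse import urlencode

# Placeholder host taken from the question; replace it with the real forum.
BASE_URL = 'http://www.example.com/forum/forumdisplay.php'


def page_urls(forum_id, first_page, last_page):
    """Yield one listing URL per page, from first_page to last_page inclusive."""
    for page in range(first_page, last_page + 1):
        # A list of pairs keeps the parameters in a fixed order.
        query = urlencode([('f', forum_id), ('order', 'desc'), ('page', page)])
        yield '{}?{}'.format(BASE_URL, query)


urls = list(page_urls(148, 1, 3))
for url in urls:
    print(url)
```

Keeping URL construction in its own function makes it easy to change the page range without touching the scraping code.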

Thank you for your help. However, I get the error messages shown above. – Jasonm4432


I saw your edit ... take a look at this part: 'TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'. You must have either called the wrong URL, or called the right URL too quickly, since the code I provided has no sleep time. –


If you look at the last part, you will see this message: 'requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.example.com', port=80):', so maybe you should change the host to the correct one instead of 'host='www.example.com''. –
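
One way to add the missing sleep time is a small retry wrapper that waits between attempts. This is only a sketch: `fetch_with_retry` is a hypothetical helper, and the default attempt count and delay are arbitrary. It catches `OSError` because the exceptions raised by `requests` (including `requests.exceptions.ConnectionError`) derive from it, as does the built-in `TimeoutError`.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=2.0, sleep=time.sleep):
    """Call fetch(url); after a connection-level error, wait and try again.

    fetch is any callable that takes a URL and returns the page text.
    sleep is injectable so the waiting can be replaced in tests.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as exc:
            last_error = exc
            # Pause before the next attempt, but not after the final one.
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed: re-raise the most recent error.
    raise last_error
```

With `requests` installed, it could be called as `fetch_with_retry(lambda u: requests.get(u, timeout=10).text, url)`. Retrying will not help if the host name itself is wrong, so it is worth checking the URL in a browser first.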