2017-08-02 2 views

Python: Selenium & PhantomJS

I am trying to scrape the following website: https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0

The text I want to get is:

Showing 114,877 results 

The HTML code:

<div class="jobs-search-results__count-sort pt3"> 
      <div class="jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4"> 
       Showing 114,877 results 
      </div> 

My Python code is:

from bs4 import BeautifulSoup
from selenium import webdriver

index_url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0'

# JavaScript snippet taken from the page (despite the variable name, this is not Java)
java = '!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n.liVisibilityChangeListener))}(document,window);'
browser = webdriver.PhantomJS()
browser.get(index_url)
browser.execute_script(java)
soup = BeautifulSoup(browser.page_source, "html.parser")
# Full class string of the div that holds the result count
link = "jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4"
div = soup.find('div', {"class":link})
text = div.text

So far my code does not seem to work. I think it has something to do with the execution of the JavaScript.

I get the following error:


AttributeError       Traceback (most recent call last) 
<ipython-input-33-7cdc1c4e0894> in <module>() 
     6 link = "jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4" 
     7 div = soup.find('div', {"class":link}) 
----> 8 text = div.text 

AttributeError: 'NoneType' object has no attribute 'text' 

Output of soup:

<html><head>\n<script type="text/javascript">\nwindow.onload = function() {\n // Parse the tracking code from cookies.\n var trk = "bf";\n var trkInfo = "bf";\n var cookies = document.cookie.split("; ");\n for (var i = 0; i < cookies.length; ++i) {\n if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {\n  trk = cookies[i].substring(8);\n }\n else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {\n  trkInfo = cookies[i].substring(8);\n }\n }\n\n if (window.location.protocol == "http:") {\n // If "sl" cookie is set, redirect to https.\n for (var i = 0; i < cookies.length; ++i) {\n  if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {\n  window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);\n  return;\n  }\n }\n }\n\n // Get the new domain. For international domains such as\n // fr.linkedin.com, we convert it to www.linkedin.com\n var domain = "www.linkedin.com";\n if (domain != location.host) {\n var subdomainIndex = location.host.indexOf(".linkedin");\n if (subdomainIndex != -1) {\n  domain = "www" + location.host.substring(subdomainIndex);\n }\n }\n\n window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +\n  "&originalReferer=" + document.referrer.substr(0, 200) +\n  "&sessionRedirect=" + encodeURIComponent(window.location.href);\n}\n</script>\n</head><body></body></html> 

Curiously enough, when logged in and accessing the page with the 'Chrome' WebDriver, the text is in the context inside div = soup.find('div', {"class": "results-context"}). With 'PhantomJS', this might lead to a modal dialog.

Answer


My solution uses webdriver.Chrome, because I have never used PhantomJS. There are two cases when you want to get the result text. One is that you are logged in to LinkedIn from the driver instance, and the other is that you are not logged in.
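To tell the two cases apart, one option is to check whether the driver received the login redirect stub instead of the results page. The sketch below is only a heuristic: it assumes the stub always contains the "/authwall" marker seen in the soup output of the question, and the function name `looks_like_authwall` is made up for illustration.

```python
def looks_like_authwall(page_source: str) -> bool:
    """Return True if the HTML appears to be the login redirect stub.

    Assumption: the stub redirects to a URL containing "/authwall",
    as in the soup output shown in the question.
    """
    return "/authwall" in page_source


# Hypothetical usage with a Selenium driver:
#   if looks_like_authwall(driver.page_source):
#       print("Not logged in: the results page was not served")
sample = '<script>window.location.href = "https://www.linkedin.com/authwall?trk=bf";</script>'
print(looks_like_authwall(sample))
```

If the check returns True, the div with the result count is not in the page at all, which would explain why soup.find returns None.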

Suppose you are not logged in. Then the following code will get your work done:

from selenium import webdriver 
from bs4 import BeautifulSoup 
driver = webdriver.Chrome() 
url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0' 
driver.get(url) 
soup = BeautifulSoup(driver.page_source, 'html.parser') 
text = soup.find('div',{'class':'results-context'}).text 
print(text) 

Suppose you are logged in:

from selenium import webdriver 
from bs4 import BeautifulSoup 
driver = webdriver.Chrome() 
url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0' 
driver.get(url) 
soup = BeautifulSoup(driver.page_source, 'html.parser') 

# `class` is a reserved word in Python, so the variable needs a different name
class_name = 'jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4'
text = soup.find('div', {'class': class_name}).text.split('\n')[1].lstrip()
print(text) 
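
Once you have the text, you may want the number itself rather than the whole sentence. Here is a minimal sketch, assuming the text has the form "Showing 114,877 results" as in the question; the helper name `parse_result_count` is hypothetical, and other locales or wordings would need a different pattern.

```python
import re
from typing import Optional


def parse_result_count(text: str) -> Optional[int]:
    """Extract the integer count from text like 'Showing 114,877 results'."""
    # Digits with optional thousands separators, followed by the word "results"
    match = re.search(r"(\d[\d,]*)\s+results", text)
    if match is None:
        return None  # the text did not have the expected form
    return int(match.group(1).replace(",", ""))


print(parse_result_count("  Showing 114,877 results "))
```

Returning None instead of raising keeps the caller in control when the page layout changes.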