Convert relative paths to absolute paths using Python
I rewrote the whole code so that it extracts the href and src links with BeautifulSoup instead of regex this time, as many SO users requested. Here is the code:
import os
from bs4 import BeautifulSoup
from urllib.parse import urlparse
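# Parse the URL of the page the HTML below was fetched from.
path = urlparse("http://www.example.com/dynamic/search.aspx?searchtype=cat&class_id=2566&city_id=55")
# Directory part of the URL path, here "/dynamic".
lpath = os.path.dirname(path.path)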
html = u"<html class=\"\"><head id=\"pageHead\"><title>\n Beauty Salons | Best Beauty Care & Treatments | Listings @ Phonebook Online\n</title>\n <!--\n <meta http-equiv=\"Cache-Control\" content=\"no-cache, no-store, must-revalidate\" /><meta http-equiv=\"Pragma\" content=\"no-cache\" /><meta http-equiv=\"Expires\" content=\"0\" />\n -->\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\"><link rel=\"stylesheet\" href=\"../css_responsive/category.css\" type=\"text/css\" media=\"screen\">\n <script async=\"\" src=\"//www.google-analytics.com/analytics.js\"></script><script async=\"\" src=\"//www.google.com/adsense/search/async-ads.js\"></script><script type=\"text/javascript\" src=\"../styles/scripts/jquery-1.9.1.min.js\"></script>\n <link rel=\"shortcut icon\" type=\"image/png\" href=\"/PhoneBook.ico\">\n <!-- #Begin Css Plugin -->\n <link rel=\"stylesheet\" href=\"../css_responsive/fontsss.css\"><link rel=\"stylesheet\" href=\"../css_responsive/bootstrap-3.3.4-dist/css/bootstrap.css\" type=\"text/css\" media=\"screen\"><link rel=\"stylesheet\" href=\"../styles/scripts/fancybox/jquery.fancybox.css\" type=\"text/css\" media=\"screen\"><link rel=\"stylesheet\" href=\"../css_responsive/icon-detail.css\" type=\"text/css\" media=\"screen\">\n <!-- #Finish Css Plugin-->\n <!--<script src=\"http://www.google.com/adsense/search/ads.js\" type=\"text/javascript\"></script> -->\n <script type=\"text/javascript\" charset=\"utf-8\">\n (function (G, o, O, g, L, e) {\n G[g] = G[g] || function() {\n (G[g]['q'] = G[g]['q'] || []).push(\n arguments)\n }, G[g]['t'] = 1 * new Date; L = o.createElement(O), e = o.getElementsByTagName(\n O)[0]; L.async = 1; L.src = '//www.google.com/adsense/search/async-ads.js';\n e.parentNode.insertBefore(L, e)\n })(window, document, 'script', '_googCsa');\n </script>\n <!-- Script For Mobile Base Banner-->\n <script async=\"\" src=\"//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js\"></script>\n <script>\n (adsbygoogle = 
window.adsbygoogle || []).push({\n google_ad_client: \"ca-pub-6517686434458516\",\n enable_page_level_ads: true\n });\n </script>\n <!-- Script For Mobile Base Banner END-->\n\n\n <script type=\"text/javascript\">\n function AddClass(Class, Element, HasPriority) {\n if (HasPriority == 0) {\n this.className = 'container ' + Class;\n }\n }\n </script>\n \n<meta name=\"description\" content=\"Best Beauty Salons in Abbottabad for quality beauty care and treatments. \"><meta name=\"keywords\" content=\"beauty salons,beauty care,beauty treatments\"><style type=\"text/css\">.fancybox-margin{margin-right:17px;}</style></head>\n<body style=\"text-shadow: rgba(255, 255, 255, 0.4) 0px 1px 1px; background-color: rgb(240, 240, 240);\">\n<div class=\"wapper\">\n <div class=\"pagecontent search_width c-no-t-margin\">\n <div class=\"cblock ele-margin-t-b-15 m-on-mob-hide\"><a href=\"../../default.aspx\">Home</a> > <a href=\"../../dynamic/categories.aspx\">Search by category</a> > <a href=\"../../dynamic/categories.aspx?class_id=12\">Personal Care</a> > <a href=\"../../dynamic/categories.aspx?class_id=134\">Barbers, Beauty Salons & Spas</a> > Beauty Salons in Abbottabad</div>\n <div class=\"refine\">\n <span>Refine Result</span>\n <span>Show Result With</span>\n <ul>\n <li>\n <input class=\"csortType csortTypeAll \" type=\"checkbox\" value=\"100\" name=\"\" checked=\"checked\" disabled=\"disabled\">\n <span class=\"\">All</span>\n </li>\n <li>\n <input class=\"csortType css-checkbox\" type=\"checkbox\" value=\"1\" name=\"\">\n <i class=\"icon-star-full c-icon-starfull-stroke\"></i>\n <span>Reviews</span>\n </li>\n <li>\n <input class=\"csortType\" type=\"checkbox\" value=\"2\" name=\"\">\n <i class=\"icon-price-tag cColor-Red\"></i>\n <span>Deals & Coupons</span>\n </li>\n <li>\n <input class=\"csortType\" type=\"checkbox\" value=\"5\" name=\"\">\n <i class=\"icon-bullhorn\"></i>\n <span>Announcements</span>\n </li>\n <li>\n <input class=\"csortType\" type=\"checkbox\" value=\"3\" 
name=\"\">\n <i class=\"icon-location\"></i>\n <span>Map</span>\n </li>\n <li>\n <input class=\"csortType\" type=\"checkbox\" value=\"4\" name=\"\">\n <i class=\"icon-film\"></i>\n <span>Video</span>\n </li>\n </ul>\n \n <div class=\"tab\" onclick=\"SlideTogle('Location')\">\n Search by location\n </div>\n \n <ul id=\"Location\" style=\"display: none;\">\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=1\">Karachi</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=2\">Lahore</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=56\">Islamabad</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=79\">Rawalpindi</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=49\">Faisalabad</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=81\">Gujranwala</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=78\">Peshawar</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=82\">Sialkot</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=53\">Sargodha</a></li>\n \n </ul>\n \n <div class=\"tab\" onclick=\"SlideTogle('Category')\">\n Search by category\n </div>\n \n <ul id=\"Category\" style=\"display: none;\">\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2571\">Hairstylists</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2575\">Hair Removal, Wax, Threading Body & Face</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2584\">Manicuring</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2574\">Nail Salons & Services</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2572\">Spas-Beauty, Health And Destination</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2564\">Beauty Institutes</a></li>\n \n <li><a href=\"search.aspx?searchtype=cat&class_id=2569\">Estheticians</a></li>\n \n </ul>\n 
</div>\n <div id=\"cResultMainControl\">\n <div class=\"result_hldr\" id=\"cResultContainer\">\n <div class=\"h1\"><h1>Beauty Salons in Abbottabad.</h1></div>\n <div class=\"h1 page_desc cfont-12 cNo-Margin ele-pad-r-l-20 m-on-mob-hide\"><p class=\"cNo-Margin margin-t m-ele-top-no-margin \" style=\"line-height:18px;\">Best Beauty Salons in Abbottabad for quality beauty care and treatments, <a href=\"http://www.phonebook.com.pk/dynamic/search.aspx?SearchType=kl&k=bridal+makeup\" title=\"Bridal Makeup\" target=\"_blank\">bridal makeup</a>, <a href=\"http://www.phonebook.com.pk/dynamic/search.aspx?SearchType=kl&k=body+massage\" title=\"Body Massage\" target=\"_blank\">body massage</a>.</p></div>\n <div class=\"cMobileHidden col-md-12 col-xs-12 text-center overflow-visible cheight-25 margin-t\" style=\"background-color: rgb(240, 240, 240);\">\n <script async=\"\" src=\"//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js\"></script>\n <!-- New Line Link Ad -->\n <ins class=\"adsbygoogle\" style=\"display:inline-block;width:468px;height:15px;background-color: rgb(240, 240, 240);\" data-ad-client=\"ca-pub-6517686434458516\" data-ad-slot=\"4522680219\"></ins>\n <script>\n (adsbygoogle = window.adsbygoogle || []).push({});\n </script>\n </div>\n <div id=\"cAlpNav\" class=\"margin-t-10 cAlpNav m-on-mob-hide\">\n <div class=\"text-center\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55\">all</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=a\">a</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=b\">b</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=c\">c</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=d\">d</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=e\">e</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=f\">f</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=g\">g</a><a 
href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=h\">h</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=i\">i</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=j\">j</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=k\">k</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=l\">l</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=m\">m</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=n\">n</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=o\">o</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=p\">p</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=q\">q</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=r\">r</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=s\">s</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=t\">t</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=u\">u</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=v\">v</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=w\">w</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=x\">x</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=y\">y</a><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=55&alp=z\">z</a></div></div>\n <div>\n <div id=\"cListingHldr\" class=\"listing\">\n \n<div class=\"container\">\n <div class=\"comp_info\">\n <h2><a href=\"../../company/51529-Beena-Beauty-Parlour\">Beena's Beauty Parlour</a></h2>\n <!--<img class=\"margin-t\" alt=\"Comapny Rating\" src=\"../../images/Stars>.png\" />-->\n <i class=\"cfont-12 cnoPad left icon-zero-star\"></i>\n \n <span class=\"blue margin-t\">(No Review)</span>\n \n <span class=\"cfontBold margin-t cColor-Black cColor-SilverDark\">\n Main Mansehra Road, Near Radio Pakistan, Abbottabad.\n </span>\n \n <div 
class=\"inline-block cMobile-Right\">\n <ul class=\"margin-t cMobile-Text-Align-Right\">\n <li>\n <a data-fancybox-type=\"iframe\" href=\"../../dynamic/emailtocustomer.aspx?Request_ID=26207&comp_name=Beena-Beauty-Parlour&isAdvertizer=0\" class=\"other_links fancybox\">Email</a>\n </li>\n <li>\n <a title=\"Call Now\" href=\"tel:+92-992-335556\" class=\"c_circle cMobileShow\"></a>\n </li>\n <li>\n <a class=\"other_links\" href=\"../../company/51529-Beena-Beauty-Parlour\" title=\"Company Detail\">Detail</a>\n </li>\n \n </ul>\n </div>\n </div>\n <div class=\"comp_info contact_info\">\n <strong><a class=\"tel\" href=\"tel:+92-992-335556\">+92-992-335556</a></strong>\n \n </div>\n</div>\n<div class=\"container\">\n <div class=\"comp_info\">\n <h2><a href=\"../../company/86977-Unique-Beauty-Salon\">Unique Beauty Salon</a></h2>\n <!--<img class=\"margin-t\" alt=\"Comapny Rating\" src=\"../../images/Stars>.png\" />-->\n <i class=\"cfont-12 cnoPad left icon-zero-star\"></i>\n \n <span class=\"blue margin-t\">(No Review)</span>\n \n <span class=\"cfontBold margin-t cColor-Black cColor-SilverDark\">\n Palki Wedding Hall, Mandian , Abbottabad.\n </span>\n \n <div class=\"inline-block cMobile-Right\">\n <ul class=\"margin-t cMobile-Text-Align-Right\">\n <li>\n <a data-fancybox-type=\"iframe\" href=\"../../dynamic/emailtocustomer.aspx?Request_ID=61717&comp_name=Unique-Beauty-Salon&isAdvertizer=0\" class=\"other_links fancybox\">Email</a>\n </li>\n <li>\n <a title=\"Call Now\" href=\"tel:+92-313-5856739\" class=\"c_circle cMobileShow\"></a>\n </li>\n <li>\n <a class=\"other_links\" href=\"../../company/86977-Unique-Beauty-Salon\" title=\"Company Detail\">Detail</a>\n </li>\n \n </ul>\n </div>\n </div>\n <div class=\"comp_info contact_info\">\n <strong><a class=\"tel\" href=\"tel:+92-313-5856739\">+92-313-5856739</a></strong>\n \n </div>\n</div></div>\n <div id=\"cRecoredInfo\" class=\"listing dotted\">Displaying listings from 1 to 10 of 10</div>\n <div class=\"text-center 
m-pad-l-r-10\">\n <div id=\"related-suggestions\" class=\"listing inline-block text-center cPad-b-t-10\"><span class=\"left cfont-14\"><b>Related Searches:</b></span> <div class=\"newsssss left inline\" style=\"font-style: italic;font-weight:bold;\"><a href=\"search.aspx?searchtype=cat&class_id=2584\" class=\"left ele-pad-r-l-20 text-underline cfont-14\">Manicuring</a></div><div class=\"newsssss left inline\" style=\"font-style: italic;font-weight:bold;\"><a href=\"search.aspx?searchtype=cat&class_id=2575\" class=\"left ele-pad-r-l-20 text-underline cfont-14\">Hair Removal, Wax, Threading Body & Face</a></div><div class=\"newsssss left inline\" style=\"font-style: italic;font-weight:bold;\"><a href=\"search.aspx?searchtype=cat&class_id=2571\" class=\"left ele-pad-r-l-20 text-underline cfont-14\">Hairstylists</a></div>\n <div class=\"text-left ele-margin-t-b-15 left inline\"><b>Need help with your search?</b> Browse by:<a class=\"text-left ele-pad-r-l-20 text-underline\" onclick=\"hide_show('#related-locations',this);$('#related-categories').addClass('hide');\" href=\"javascript:void(0)\">other locations <img alt=\"\" class=\"margin-l\" width=\"18\" src=\"../../images/plus.png\"></a><a class=\"text-left ele-pad-r-l-20 text-underline\" onclick=\"hide_show('#related-categories',this);$('#related-locations').addClass('hide');\" href=\"javascript:void(0)\">similar categories <img alt=\"\" class=\"margin-l\" width=\"18\" src=\"../../images/plus.png\"></a></div><ul id=\"related-locations\" class=\"col-xs-12 col-sm-12 sugesstion-box hide\">\n <li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=1\" class=\"left\">Karachi</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=2\" class=\"left\">Lahore</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=56\" 
class=\"left\">Islamabad</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=79\" class=\"left\">Rawalpindi</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=49\" class=\"left\">Faisalabad</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=81\" class=\"left\">Gujranwala</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=78\" class=\"left\">Peshawar</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=82\" class=\"left\">Sialkot</a></li><li class=\"left cblock margin-l col-xs-3 col-sm-2\"><a href=\"search.aspx?searchtype=cat&class_id=2566&city_id=53\" class=\"left\">Sargodha</a></li></ul>\n <ul id=\"related-categories\" class=\"col-xs-12 col-sm-12 sugesstion-box hide\">\n <li class=\"left cblock margin-l col-xs-4 col-sm-4 text-left\"><a href=\"search.aspx?searchtype=cat&class_id=2574\" class=\"left\">Nail Salons & Services</a></li><li class=\"left cblock margin-l col-xs-4 col-sm-4 text-left\"><a href=\"search.aspx?searchtype=cat&class_id=2572\" class=\"left\">Spas-Beauty, Health And Destination</a></li><li class=\"left cblock margin-l col-xs-4 col-sm-4 text-left\"><a href=\"search.aspx?searchtype=cat&class_id=2564\" class=\"left\">Beauty Institutes</a></li><li class=\"left cblock margin-l col-xs-4 col-sm-4 text-left\"><a href=\"search.aspx?searchtype=cat&class_id=2569\" class=\"left\">Estheticians</a></li></ul>\n </div>\n </div>\n <div class=\"text-center\">\n </div>\n </div>\n </div>\n </div>\n </div>\n </div>\n \n<div class=\"container-fluid bg-silver m-on-mob-hide\">\n <div class=\"row cPad-b-t-10\" style=\"border-bottom:1px solid #ECECEC;\">\n \n </div>\n</div>\n<script>\n (function (i, s, o, g, r, a, m) {\n 
i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function() {\n (i[r].q = i[r].q || []).push(arguments)\n }, i[r].l = 1 * new Date(); a = s.createElement(o),\n m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)\n })(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');\n\n ga('create', 'UA-2028280-1', 'auto');\n ga('send', 'pageview');\n</script>\n<script type=\"text/javascript\" src=\"../css_responsive/script/global_functions.js\"></script>\n<script type=\"text/javascript\" src=\"../styles/scripts/fancybox/jquery.fancybox.js?v=2.1.5\"></script>\n<script type=\"text/javascript\" src=\"../css_responsive/bootstrap-3.3.4-dist/js/bootstrap.js\"></script>\n</body></html>"
# Parse the document; the "lxml" parser requires the lxml package.
soup = BeautifulSoup(html, "lxml")
# Print every href that is not already absolute and not a javascript: link.
for allLinks in soup.find_all(href=True):
    if allLinks['href'] and not allLinks['href'].startswith("http") and not allLinks['href'].startswith("jav"):
        print(allLinks['href'])
# Do the same for every src attribute.
for allLinks in soup.find_all(src=True):
    if allLinks['src'] and not allLinks['src'].startswith("http") and not allLinks['src'].startswith("jav"):
        print(allLinks['src'])
This code prints all the links to the console, and I can successfully convert them to absolute paths by using if-elif-else to distinguish between "../../", "../", "/" and "//". The problem is that when I try to replace them using "re", the whole HTML gets messed up again. I am now using BS4 instead of regex, but the problem remains. Because of the character limit I can't post the output here, but for the record, it can also mess up "" or other HTML tags. Please suggest a way to change these links and put them back where they belong.
NOTE: The code has been minimized as much as possible, as advised by akashkarothiya.
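For the link rewriting described above, one possible approach is to let `urljoin` from the standard library resolve "../../", "../", "/" and "//" against the page URL instead of distinguishing them by hand. The sketch below is only an illustration under assumptions: the helper names `make_absolute` and `absolutize_attrs` and the `SKIP_SCHEMES` set are hypothetical and not part of the original code, and the base URL is the one from the question.

```python
from urllib.parse import urljoin, urlparse

# URL of the page the HTML was fetched from (taken from the question).
BASE_URL = "http://www.example.com/dynamic/search.aspx?searchtype=cat&class_id=2566&city_id=55"

# Hypothetical list of schemes that should be left untouched.
SKIP_SCHEMES = {"javascript", "tel", "mailto", "data"}


def make_absolute(link, base_url=BASE_URL):
    """Resolve link against base_url; leave empty and skipped links unchanged."""
    if not link or urlparse(link).scheme.lower() in SKIP_SCHEMES:
        return link
    # urljoin handles "../", "/", "//" and already absolute URLs.
    return urljoin(base_url, link)


def absolutize_attrs(attrs, base_url=BASE_URL):
    """Rewrite 'href' and 'src' in a dict of attributes, in place."""
    for name in ("href", "src"):
        if name in attrs:
            attrs[name] = make_absolute(attrs[name], base_url)
    return attrs
```

Since the `attrs` property of a BeautifulSoup tag is a plain dict, calling `absolutize_attrs(tag.attrs)` for each tag returned by `soup.find_all(True)` should change the links inside the parsed tree, and `str(soup)` then gives the modified HTML without any string replacement on the raw markup.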
I think I will have to answer my own question. :)
Can you post a sample of the messed-up output so that I can figure it out?
@akashkarothiya For now I followed your answer to my previous question and it solved my problem. I will post the answer in a few minutes. Thanks