Published at 22:33 on 08/24/2012



Using Python to get all the external links from a webpage

Based on Mark Pilgrim's book Dive Into Python

Define the URL lister

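# sgmllib is part of the Python 2 standard library (it was removed in Python 3)
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        # called by SGMLParser.__init__, so self.urls always starts empty
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # called for every <a> tag; attrs is a list of (name, value) pairs
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)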

Now the function that receives a URL, reads that page and lists all of its href attributes

import urllib

def get_urls_from(url):
    usock = urllib.urlopen(url)
    parser = URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()
    # keep only the links that look absolute
    return [item for item in parser.urls
            if item.startswith(('http', 'ftp', 'www'))]
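
The code above only runs on Python 2, because sgmllib and urllib.urlopen do not exist in Python 3. Here is a rough sketch of how the same idea might look on Python 3, assuming html.parser.HTMLParser as the replacement parser and assuming UTF-8 when the server does not declare a charset. It is my own port, not something taken from the book:

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class URLLister(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def get_urls_from(url):
    """Fetch url and return the hrefs that look like absolute links."""
    with urlopen(url) as usock:
        # Assumption: fall back to UTF-8 when no charset is declared
        charset = usock.headers.get_content_charset() or "utf-8"
        html = usock.read().decode(charset, errors="replace")
    parser = URLLister()
    parser.feed(html)
    parser.close()
    # Same prefix filter as the Python 2 version
    return [u for u in parser.urls if u.startswith(("http", "ftp", "www"))]
```

The prefix filter is kept exactly as in the original, so relative links such as /about are still dropped.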

OK, now you can call it:

from pprint import pprint
pprint(get_urls_from("http://www.rochacbruno.com.br"))

and you get:

['http://feeds.feedburner.com/rochacbruno',
 'http://www.python.org',
 'http://www.web2py.com',
 'http://www.djangoproject.com',
 'http://www.jquery.com',
 'http://www.postgresql.org',
 'http://www.linux.org',
 'http://www.DrBusca.com',
 'http://www.cursodepython.com.br',
 'http://facebook.com/rochacbruno',
 'http://twitter.com/rochacbruno',
 'http://linkedin.com/in/rochacbruno',
 'http://angel.co/rochacbruno',
 'http://foursquare.com/rochacbruno',
 'https://plus.google.com/u/0/116110204708544946953/posts',
 'https://kippt.com/rochacbruno',
 'http://about.me/rochacbruno',
 'http://www.movu.ca',
 'http://www.menuvegano.com.br',
 'http://www.web2pyslices.com',
 'http://github.com/rochacbruno',
 'http://www.web2py.com',
 'http://associacao.python.org.br',
 'http://www.python.org/psf/members/#nominated-members',
 'http://amazon.com/author/rochacbruno',
 'https://snipt.net/rochacbruno/blog-sidebar-2/',
 'https://snipt.net/',
 'http://rochacbruno.com.br/web-apps-that-worth-a-try/',
 'http://snipt.net',
 'http://snipt.net',
 'https://github.com/nicksergeant/snipt-old',
 'http://snipt.net/pro',
 'https://kippt.com/',
 'http://kippt.com',
 'http://kippt.com',
 'http://coolendar.com',
 'http://Coolendar.com',
 'http://pythonanywhere.com',
 'http://pythonanywhere.com',
 'http://rochacbruno.com.br/web-apps-that-worth-a-try/#disqus_thread',
 'http://rochacbruno.com.br/i-am-now-a-member-of-python-software-foundation/',
 'http://linkedin.com/in/rochacbruno',
 'http://www.cursodepython.com.br',
 'http://www.blouweb.com',
 'http://www.amazon.com/Bruno-Cezar-Rocha/e/B007KZBV4M',
 'http://pyfound.blogspot.com.br/2012/08/welcome-new-psf-members.html',
 'http://www.python.org/psf/members/',
 'http://rochacbruno.com.br/i-am-now-a-member-of-python-software-foundation/#disqus_thread',
 'http://rochacbruno.com.br/web2py-manage-users-and-membership-in-the-same-form/',
 'http://stackoverflow.com/questions/11992749/web2py-how-edit-user-profile-and-membership-in-one-view',
 'http://rochacbruno.com.br/web2py-manage-users-and-membership-in-the-same-form/#disqus_thread',
 'http://rochacbruno.com.br/lazy-dal-beta-working/',
 'http://rochacbruno.com.br/lazy-dal-beta-working/#disqus_thread',
 'http://rochacbruno.com.br/lazy-dal-attempt-3-pbreit/',
 'http://rochacbruno.com.br/lazy-dal-attempt-3-pbreit/#disqus_thread',
 'http://rochacbruno.com.br/open-links-which-points-outside-your-own-site-in-a-new-window/',
 'http://rochacbruno.com.br/open-links-which-points-outside-your-own-site-in-a-new-window/#disqus_thread',
 'http://rochacbruno.com.br/websockets-com-tornado-web2py-python-jquery/',
 'http://rochacbruno.com.br/websockets-com-tornado-web2py-python-jquery/#disqus_thread',
 'http://rochacbruno.com.br/loading-html-elements-dynamically-with-web2py-and-ajax/',
 'http://rochacbruno.com.br/loading-html-elements-dynamically-with-web2py-and-ajax/#disqus_thread',
 'http://rochacbruno.com.br/breaking-a-simple-captcha-with-26-lines-of-code/',
 'http://rochacbruno.com.br/breaking-a-simple-captcha-with-26-lines-of-code/#disqus_thread',
 'http://rochacbruno.com.br/sending-emails-with-python-and-gmail/',
 'http://rochacbruno.com.br/sending-emails-with-python-and-gmail/#disqus_thread']
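
As the output shows, the list has repeated entries and also links that point back to rochacbruno.com.br itself. If you want only the really external ones, something like the sketch below might help. It is written for Python 3, external_only and _host are hypothetical helper names of mine, and treating "www.example.com" and "example.com" as the same site is an assumption:

```python
from urllib.parse import urlparse


def _host(url):
    # Lowercased host name without a leading "www."; empty string if there is none
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def external_only(urls, page_url):
    """Return urls whose host differs from page_url's host, without duplicates."""
    own = _host(page_url)
    seen = set()
    result = []
    for u in urls:
        host = _host(u)
        # Skip links without a host, links to the page's own site, and repeats
        if not host or host == own or u in seen:
            continue
        seen.add(u)
        result.append(u)
    return result
```

You would call it as external_only(get_urls_from(url), url). Note that links written without a scheme (starting with just "www") have no host for urlparse, so this sketch drops them.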

This was based on some examples from the book Dive Into Python.
