License: Python Software Foundation License (Python 2.x)

Get a List of URLs from a Web Page

In Brief: Use SGMLParser (from the Python 2 sgmllib module) to retrieve a list of URLs from a web page.
"""Extract list of URLs in a web page

This program is part of "Dive Into Python", a free Python book for
experienced programmers. Visit http://diveintopython.org/ for the
latest version.
"""

__author__ = "Mark Pilgrim (mark@diveintopython.org)"
__version__ = "$Revision: 1.2 $"
__date__ = "$Date: 2004/05/05 21:57:19 $"
__copyright__ = "Copyright (c) 2001 Mark Pilgrim"
__license__ = "Python"

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen("http://diveintopython.org/")
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url
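
The listing above targets Python 2; the sgmllib module was removed in Python 3. As a rough sketch of the same idea on Python 3, the standard html.parser.HTMLParser class can stand in for SGMLParser. This is an adaptation, not the original author's code, and the list_urls helper and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased,
        # and value is None for attributes written without a value.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def list_urls(html):
    """Hypothetical helper: return all link targets found in an HTML string."""
    parser = URLLister()
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    # Sample markup (an assumption) so the sketch runs without network access.
    sample = '<p><a href="http://diveintopython.org/">book</a> <a name="top">x</a></p>'
    for url in list_urls(sample):
        print(url)
```

To run this against a live page, one option is to read the bytes with urllib.request.urlopen(url).read() and decode them to a string before calling list_urls.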


Comments

over 5 years ago (07 Mar 2009 at 02:50 AM) by sadani
Hi,

I have tried this code for getting URLs from a web page, and it works fine.

Now I want to make some changes: I want to get a list of only "image" URLs, e.g. JPEG, GIF, etc.

Can anyone help me or give me a hint on how to do it?

Thanks

over 5 years ago (07 Mar 2009 at 03:14 AM) by David Isaacson
Do you mean you want to get the images on the page, or you want to get the <a> links that point to images?
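
Either reading of the question can be handled with a small parser subclass. The following is only a sketch, written for Python 3 with html.parser rather than the Python 2 sgmllib used in the snippet; the class name ImageURLLister and the IMAGE_EXTENSIONS tuple are assumptions, not part of the original code. It keeps two lists: images embedded in the page through <img src="...">, and <a> links whose target path ends in an image extension.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed set of extensions; extend it to match the formats you care about.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png")


class ImageURLLister(HTMLParser):
    """Collect embedded image URLs and links that point at image files."""

    def __init__(self):
        super().__init__()
        self.embedded = []  # values of <img src="...">
        self.linked = []    # values of <a href="..."> ending in an image extension

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.embedded.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            # Compare only the path, so a query string such as ?size=big
            # does not hide the file extension.
            path = urlparse(attrs["href"]).path.lower()
            if path.endswith(IMAGE_EXTENSIONS):
                self.linked.append(attrs["href"])


if __name__ == "__main__":
    parser = ImageURLLister()
    parser.feed('<img src="logo.gif"><a href="photo.jpg">photo</a><a href="page.html">text</a>')
    parser.close()
    print(parser.embedded)
    print(parser.linked)
```

Matching on the file extension is a heuristic: a link can serve an image without an image-like path, so checking the response's Content-Type would be more reliable when that matters.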