
Get a List of URLs from a Web Page

In Brief: Use SGMLParser to retrieve a list of URLs from a web page.
"""Extract list of URLs in a web page

This program is part of "Dive Into Python", a free Python book for
experienced programmers.  Visit http://diveintopython.org/ for the
latest version.
"""

__author__ = "Mark Pilgrim (mark@diveintopython.org)"
__version__ = "$Revision: 1.2 $"
__date__ = "$Date: 2004/05/05 21:57:19 $"
__copyright__ = "Copyright (c) 2001 Mark Pilgrim"
__license__ = "Python"

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen("http://diveintopython.org/")
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls: print url
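
The listing above targets Python 2; the sgmllib module was removed in Python 3. As a rough sketch of the same idea on Python 3, the standard library's html.parser.HTMLParser can stand in for SGMLParser. This is an adaptation, not part of the original program: the list_urls helper and the sample HTML string are hypothetical names and data added for illustration, and the sketch parses a string instead of fetching a page over the network.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def reset(self):
        # HTMLParser.__init__ calls reset(), so self.urls exists from the start.
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has no per-tag start_a hook, so filter on the tag name.
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def list_urls(html):
    """Hypothetical helper: return all <a href> values found in an HTML string."""
    parser = URLLister()
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    # Made-up sample input; a real page could be read with urllib.request.urlopen.
    sample = '<p><a href="http://diveintopython.org/">book</a> <a name="top">x</a></p>'
    for url in list_urls(sample):
        print(url)
```

Anchors without an href attribute, such as the named anchor in the sample, contribute nothing to the list, which matches the behavior of the `if href:` check in the original.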


Comments

over 7 years ago (07 Mar 2009 at 02:50 AM) by sadani
Hi,

I have tried this code for getting URLs from a web page.
It works fine.

Now I want to make some changes:
I want to get a list of only "image" URLs,
e.g. JPEG, GIF, etc.

Can anyone help me or give me a hint on how to do it?

Thanks
over 7 years ago (07 Mar 2009 at 03:14 AM) by David Isaacson
Do you mean you want to get the images on the page, or you want to get the <a> links that point to images?