
Get a List of URLs from a Web Page
In Brief | Use the SGMLParser to retrieve a list of URLs from a web page. |
Language | Python |
"""Extract list of URLs in a web page

This program is part of "Dive Into Python", a free Python book for
experienced programmers. Visit http://diveintopython.org/ for the
latest version.
"""

__author__ = "Mark Pilgrim (mark@diveintopython.org)"
__version__ = "$Revision: 1.2 $"
__date__ = "$Date: 2004/05/05 21:57:19 $"
__copyright__ = "Copyright (c) 2001 Mark Pilgrim"
__license__ = "Python"

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen("http://diveintopython.org/")
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url
Comments
I have tried this code for getting URLs from a web page, and it works fine.
Now I want to make a change: I want to get a list of only image URLs,
e.g. .jpeg, .gif, etc.
Can anyone help me or give me a hint on how to do it?
Thanks
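In reply to the comment above: one approach is to keep only those href values whose file extension looks like an image. The sketch below is an illustration, not part of the original recipe; since sgmllib was removed in Python 3, it uses the standard-library html.parser instead, and the class name ImageURLLister and its extension list are my own assumptions.

    from html.parser import HTMLParser

    class ImageURLLister(HTMLParser):
        """Collect href values from <a> tags that point at image files."""
        IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.gif', '.png')

        def __init__(self):
            super().__init__()
            self.urls = []

        def handle_starttag(self, tag, attrs):
            if tag != 'a':
                return
            for name, value in attrs:
                # Keep the link only if its extension matches, ignoring case
                if name == 'href' and value.lower().endswith(self.IMAGE_EXTENSIONS):
                    self.urls.append(value)

    parser = ImageURLLister()
    parser.feed('<a href="photo.JPG">photo</a> <a href="page.html">page</a>')
    parser.close()
    print(parser.urls)  # only the image link survives the filter

Under Python 2 the same idea works by adding the endswith() check inside URLLister.start_a. To also pick up inline images rather than just links to them, additionally handle the img tag and collect its src attribute.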