Building a Web Crawler in 50 Lines of Python
[Source] http://theanti9.wordpress.com/2009/02/14/python-web-crawler-in-less-than-50-lines/
Python Web Crawler in Less Than 50 Lines

I got kind of bored today and wrote a pretty simple web crawler in Python, and it turned out to be less than 50 lines. It doesn't store output; I'll leave that up to anyone who wants to use the code because, well, there are just too many ways to choose from. Right now you pass it a starting link as a parameter and it will crawl forever until it runs out of links. But that is not a likely condition. So here ya go. Have fun. Feel free to ask questions.

    # Python 2 script; pass the starting URL as the first argument.
    import sys
    import re
    import urllib2
    import urlparse

    tocrawl = set([sys.argv[1]])
    crawled = set([])
    keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
    linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

    while 1:
        try:
            crawling = tocrawl.pop()
            print crawling
        except KeyError:
            break  # nothing left to crawl
        url = urlparse.urlparse(crawling)
        try:
            response = urllib2.urlopen(crawling)
        except Exception:
            continue  # skip pages that fail to load
        msg = response.read()
        # Print the page title, if there is one.
        startPos = msg.find('<title>')
        if startPos != -1:
            endPos = msg.find('</title>', startPos+7)
            if endPos != -1:
                title = msg[startPos+7:endPos]
                print title
        # Print the meta keywords, if there are any.
        keywordlist = keywordregex.findall(msg)
        if len(keywordlist) > 0:
            keywordlist = keywordlist[0]
            keywordlist = keywordlist.split(", ")
            print keywordlist
        links = linkregex.findall(msg)
        crawled.add(crawling)
        # Turn relative links into absolute ones and queue the unseen ones.
        for link in links:
            if link.startswith('/'):
                link = 'http://' + url[1] + link
            elif link.startswith('#'):
                link = 'http://' + url[1] + url[2] + link
            elif not link.startswith('http'):
                link = 'http://' + url[1] + '/' + link
            if link not in crawled:
                tocrawl.add(link)

** EDIT ** This was a very early draft of this program. As it turns out, I revisited this project a few months later and it evolved much more.
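For anyone trying this on Python 3, where `urllib2` and `urlparse` were folded into `urllib.request` and `urllib.parse`, here is a rough sketch of the same idea. It is not the author's later version. It swaps a few things in: `urljoin` replaces the hand-written `startswith` branches, a `deque` gives breadth-first order instead of `set.pop()`, and a page limit stops the crawl from running forever. The names `extract_links`, `fetch_http`, `crawl` and `max_pages` are made up for illustration.

```python
# Python 3 sketch; the original listing targets Python 2.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin, urldefrag

# Same idea as the post's linkregex (an assumption: href is the first attribute).
LINK_RE = re.compile(r'<a\s+href=[\'"](.*?)[\'"]', re.IGNORECASE)


def extract_links(base_url, html):
    """Return the set of absolute http(s) URLs linked from html (hypothetical helper)."""
    found = set()
    for href in LINK_RE.findall(html):
        absolute = urljoin(base_url, href)          # handles '/x', 'x', '../x', '#frag'
        absolute, _fragment = urldefrag(absolute)   # drop '#...' so a page is queued once
        if absolute.startswith(('http://', 'https://')):
            found.add(absolute)
    return found


def fetch_http(url):
    """Download a page as text, or return None if it cannot be loaded."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read().decode('utf-8', errors='replace')
    except (OSError, ValueError):
        return None


def crawl(start_url, fetch=fetch_http, max_pages=50):
    """Breadth-first crawl; returns the URLs fetched, at most max_pages of them."""
    to_crawl = deque([start_url])
    seen = {start_url}
    visited = []
    while to_crawl and len(visited) < max_pages:
        url = to_crawl.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that fail to load
        visited.append(url)
        for link in sorted(extract_links(url, html)):
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return visited


if __name__ == '__main__':
    # Offline demo: a tiny in-memory "site" stands in for real HTTP requests.
    site = {
        'http://example.com/': '<a href="/a">a</a> <a href="/b">b</a>',
        'http://example.com/a': '<a href="/">home</a> <a href="/c">c</a>',
        'http://example.com/b': '',
    }
    for page in crawl('http://example.com/', fetch=site.get):
        print(page)
```

One difference to be aware of: `urljoin` resolves a bare link like `post.html` against the current page's directory, while the original always prepends the site root. Passing `fetch` in as a parameter also leaves the "store the output" question open, since the caller can wrap it to save whatever it wants.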