Building a Web Crawler in 50 Lines of Python Code

2012. 10. 5. 16:02

[Source] http://theanti9.wordpress.com/2009/02/14/python-web-crawler-in-less-than-50-lines/

Python Web Crawler in Less Than 50 Lines

February 14, 2009, The.Anti.9

I got kind of bored today and wrote a pretty simple web crawler in Python, and it turned out to be less than 50 lines. It doesn’t store output; I’ll leave that up to anyone who wants to use the code because, well, there are just too many ways to choose from. Right now you pass it a starting link as a command-line parameter and it will keep crawling until it runs out of links, which is not a likely condition. So here ya go. Have fun. Feel free to ask questions.

import sys
import re
import urllib2
import urlparse
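# Python 2 script: the first command-line argument is the starting URL
tocrawl = set([sys.argv[1]])  # URLs waiting to be fetched
crawled = set([])  # URLs already fetched
# Capture the content of <meta name="keywords" content="..." />
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
# Capture the href of <a href="...">; the stray '|' is dropped from the quote class
linkregex = re.compile(r'<a\s+href=[\'"](.*?)[\'"].*?>')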

while True:
	try:
		crawling = tocrawl.pop()
		print crawling
	except KeyError:
		# The queue is empty: every discovered link has been visited
		break
	url = urlparse.urlparse(crawling)
	try:
		response = urllib2.urlopen(crawling)
	except Exception:
		# Skip pages that cannot be fetched; Ctrl-C still stops the crawler
		continue
	msg = response.read()
	startPos = msg.find('<title>')
	if startPos != -1:
		endPos = msg.find('</title>', startPos+7)
		if endPos != -1:
			title = msg[startPos+7:endPos]
			print title
	keywordlist = keywordregex.findall(msg)
	if len(keywordlist) > 0:
		keywordlist = keywordlist[0]
		keywordlist = keywordlist.split(", ")
		print keywordlist
	links = linkregex.findall(msg)
	crawled.add(crawling)
	for link in links:
		if link.startswith('/'):
			link = 'http://' + url[1] + link
		elif link.startswith('#'):
			link = 'http://' + url[1] + url[2] + link
		elif not link.startswith('http'):
			link = 'http://' + url[1] + '/' + link
		if link not in crawled:
			tocrawl.add(link)
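
The loop above builds absolute links by checking prefixes and always prepends http://, so links found on an https page, or relative links on a page inside a subdirectory, can come out wrong, and a '#' link queues the same page again under a different string. Below is a minimal Python 3 sketch of that one step using the standard library's urljoin. The helper name resolve_link is hypothetical and not part of the original script, and dropping fragments and non-HTTP schemes is an assumption about what a crawler wants.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def resolve_link(page_url, href):
    """Return an absolute http(s) URL for href, or None if it should be skipped."""
    # urljoin handles '/path', 'page.html', '../up', '//host/x' and '#frag'
    absolute = urljoin(page_url, href.strip())
    # Drop the fragment so 'page#a' and 'page#b' count as the same page
    absolute, _fragment = urldefrag(absolute)
    # Ignore schemes a crawler cannot fetch over HTTP (mailto:, javascript:, ...)
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


if __name__ == "__main__":
    page = "https://example.com/blog/post.html"
    for href in ["/about", "next.html", "#top", "mailto:me@example.com"]:
        print(href, "->", resolve_link(page, href))
```

In the original loop this would stand in for the if/elif chain: resolve each link against the page being crawled, skip it when the result is None, then add it to tocrawl if it is not already in crawled.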

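Since the script keeps going until it runs out of links, a page cap is a handy safety net when experimenting. Here is a rough Python 3 sketch of the same queue-and-visited-set loop with such a cap. The names crawl, fetch and max_pages are assumptions made for illustration, and the sketch uses a deque for breadth-first order where the original pops from a set in arbitrary order. The fetch argument is a hypothetical callable that returns a page's HTML as a string, so the loop can be tried without network access.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Same idea as the post's linkregex: capture the href of each <a> tag
LINK_RE = re.compile(r'<a\s+[^>]*?href=[\'"](.*?)[\'"]', re.IGNORECASE)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url, visiting at most max_pages pages.

    fetch is a hypothetical callable: it takes a URL and returns the page
    HTML as a string, or raises an exception if the page cannot be loaded.
    """
    to_crawl = deque([start_url])
    seen = {start_url}  # every URL ever queued, to avoid duplicates
    visited = []        # URLs fetched successfully, in visit order
    while to_crawl and len(visited) < max_pages:
        url = to_crawl.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue    # skip unreachable pages, as the post does
        visited.append(url)
        for href in LINK_RE.findall(html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return visited


if __name__ == "__main__":
    # A tiny in-memory "web" so the sketch runs without network access
    pages = {
        "http://site.test/": '<a href="/a">A</a> <a href="b">B</a>',
        "http://site.test/a": '<a href="/">home</a>',
        "http://site.test/b": '<a href="/a">A</a> <a href="/missing">x</a>',
    }
    print(crawl("http://site.test/", pages.__getitem__))
```

For real pages, fetch could be a small wrapper around urllib.request.urlopen that reads and decodes the response body; storing output is still left to whoever uses the code.
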
** EDIT **

This was a very early draft of this program. As it turns out, I revisited this project a few months later and it evolved quite a bit.
If you would like to check out the more evolved form, feel free to have a look at my GitHub.

졸리운 곰의 정보기술 여행 [김성준]