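Default category

Successful People Do Something Different! Management Know-How from Planning to Execution 2008.03.06
Game Planning Specialist (게임기획전문가) Certification Practical Exam - A Rough Approach to Writing a Design Document 2008.03.02
Game Planning Specialist Practical Exam: Review and Information 2008.03.02 (2)
Recover MySQL root password 2008.02.28
Beautiful Soup 2008.02.26
Python and HTML Processing 2008.02.26
Implementing a Web Spider on Linux (Korean) 2008.02.16
[Tip] Useful tips for the vi editor [reposted] 2008.02.14
Handy Windows commands to know 2008.02.14
Linux programs equivalent to Windows programs 4/4 2008.02.13
Installing Python on Linux 2008.02.13
Learning SQL 2008.02.13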

Successful People Do Something Different! Management Know-How from Planning to Execution

2008. 3. 6. 13:19

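Successful People Do Something Different! Management Know-How from Planning to Execution (1)
March 6, 2008 / Samsung

They say that well begun is half done, but half a success is not success. Seen from the other side, even when half the result is within reach, you have really only just begun. The road to success is that steep and that tiring. Just as the charm of baseball is that you can never relax or let your guard down, even with two outs in the bottom of the ninth, isn't life the same? For any goal and any plan, the management that sustains it matters as much as the start, and you need the grit to see it through to the end without easing off.

How, then, do you manage a plan successfully?

1) A plan is a promise to yourself, not a noisy public pledge
It varies from person to person, but most people live with an eye on others. Surprisingly many people are said to think first about what others might think, even in private matters that concern only themselves. Perhaps this is a habit we picked up unawares when we began writing diaries and letters with a teacher, a parent, or a friend in mind. We know that keeping a diary for show is not desirable, yet as adults, at the turn of the year or on some special occasion, we draw up grand plans we will never achieve, or flaunt goals we do not really hold because we hate falling behind others. But the truth is always close at hand. A plan is not for display; it is for me, mine alone, and a promise to myself. A very sacred promise at that.

2) Evaluate and restructure the plan
A plan for achieving a personal goal is certainly private, but managing it requires an objective viewpoint. Looking back over the past two months gives you hints for predicting how the remaining ten months will go, and you can evaluate the plan on that basis. Assess yourself coolly. Then put the plan through a restructuring that removes the froth and makes it realistic. Slimming down an initial plan that was brimming with resolve is nothing to be ashamed of. You may even consider it the true completion of the plan.

3) Establish your own patterned life cycle
One trait of people who fail to manage plans well is that they let the plan run separately from their lives. For them, 'the plan is the plan, and life is life.' Of course, people who have so much work in front of them that they can barely handle it will want to offer all sorts of excuses. But a plan is not something in a dream, and however busy life gets, you must keep dreaming up new plans. We also do not live solely for success; we have to succeed in the course of living. So how can you live fully while keeping a new plan from drifting away from daily life? The secret lies in life-cycle management. Life-cycle management comes down to efficient time management, and that requires thinking in patterns. A properly made plan will contain action-oriented elements, and you must allocate fixed time to them. If you cannot allocate regular time to a plan, it is better to postpone it or scale it down. A plan disconnected from daily life is no source of energy; instead it becomes an obstacle that causes stress and torments us.

4) Compete with 'multi-planning', not multitasking
Since computers appeared, the concurrent processing known as multitasking has been in the spotlight. How could the ability to do several things at once not look attractive? But we are not computers, and strictly speaking a computer does not do several things at the same time either; through precise calculation and tremendous speed it processes several tasks one after another. A computer with two CPUs may be an exception, but since no person has two heads, that is not yet possible for us. What really matters for us, rather than coveting multitasking, is how we manage plans. I would like to call this the 'multi-planning' ability: execution cannot be done in parallel, but planning can.
Through multi-planning we can unfold numerous tasks into a chain of actions ordered by importance and relatedness. It is impossible without tools, though. A sharp mind and an exceptional memory alone will not easily do it. In other words, it is a job that calls for planning tools.

5) Notebook or diary, become a master of record keeping
For people who need to multi-plan, keeping records is essential. The recording tool is, in effect, the planning tool. We already live surrounded by notebooks large and small, calendars, and diaries in many formats. Is there a right answer or a royal road for choosing and using a recording tool? A simple guide follows.
First, choose a simple recording tool and upgrade gradually. A complex system diary does not suit everyone. It is not too late to consider a feature-rich diary carefully after you have become used to a simple notebook (diary).
Second, do not become a slave to your diary; become a note-taker with convictions. Think of a car: the functions you mainly use are only a few. Do not be bound too tightly by the diary's format or manual; record in a way that suits your work and personality. Remember that the purpose of recording is 'a planned life', not 'writing a good diary'.
Third, if you cannot integrate completely, at least pursue integration by purpose. It is best to carry your recording tool at all times and use it in an integrated way. In some cases, however, it is also reasonable to try distributed integration by purpose. For example, even if you use separate tools for work, self-development, and home, keep your schedule management linked so that there is no confusion.
Even the great Napoleon said that records matter more than memory. Rather than trusting your own head, trust the power of records.

- Author: Kim Il-hee / CEO of 대북공작소. He is the author of <작심후3일-바쁠수록 계획하라>.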


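Game Planning Specialist (게임기획전문가) Certification Practical Exam - A Rough Approach to Writing a Design Document

2008. 3. 2. 17:55

Game Planning Specialist Certification Practical Exam - A Rough Approach to Writing a Design Document [Game Planning]

2004-09-08 3:13:32 PM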

http://blog.hanafos.com/ssw711/20

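This post explains a rough approach to writing the design document for the practical exam of the game planning specialist certification.

Learning the document format itself is hard, and grasping the concepts of each part is not easy, so the best route is to go to a specialized institute or get private tutoring from an expert, which can also help with the practical matter of finding a job.

Planners often find it hard to get a feel for what workflow to follow and what content to write when drafting a game design document. In particular, if you work haphazardly without a clear workflow, you may fall into confusion midway, or end up with a document that does not reflect what you intended. The following summarizes the know-how I have accumulated while writing design documents; I hope it serves as a reference.

The first thing to do before writing a design document is to organize the pre-development preparation. The planner sometimes does this alone, but it can also be done together with the in-house marketing team. The things to prepare before development are as follows.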

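1. Game genre
2. Game target
3. Game production cost
4. Development staff required
5. Development schedule
6. Features of the game

Of course there are other factors besides these, such as the distribution strategy and the scale of the game,
but broadly speaking you can think of this step as organizing and preparing what is needed before development begins.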

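Once this work is done, the actual game planning can begin. The first task in planning is to establish the game's basic settings. You could call it drawing the blueprint needed before planning the game's concrete content.

That work consists of the following.

1. Game story
2. Game setting (world view, concept of time, concept of space)
3. Drawing up a concept list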

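Once you have written up these lists, the next step is to organize the resources the game needs. Resources, too, should be organized according to criteria rather than as they come to mind. The criteria below are the ones I set for myself.

1. Character resources
These describe the characters. In character-centered games such as first-person adventures, raising simulations, or fighting games, planning often starts by organizing the character resource data.

2. Object resources
Object resources means organizing the objects the game needs and defining their settings: for a house, the house's settings; for an item, the item's settings. However, I have hardly ever seen a design document that takes objects as its starting point.

3. System resources
System resources means using the systems that go into the game as the basis for getting the planning started. Many RPGs and simulations take this approach. System resources cannot be discussed on their own, however; they are usually discussed together with world, character, and object resources, so you cannot organize resources from the system alone. That may sound abstract, so here is an example. Suppose you add a political system and set the rule that leaders are chosen through elections held by the nobles. Then the world resources must define the geography of how electoral districts are divided, the character parameters must include rank and occupation, and object settings such as a polling station are needed as well.

4. World resources
These are the resources most often used as the basis in RPGs and adventure games. This means organizing things such as maps and towns; design documents often start by defining these first.

5. Story resources
This approach is commonly used in adult games and adventure games. It would be impossible to build an online game this way, probably because it falls far short in interactivity. For a packaged game, however, building on a solid story can produce a fine work. An example is Hideo Kojima's Metal Gear Solid.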


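Game Planning Specialist Practical Exam: Review and Information

2008. 3. 2. 17:50

Game Planning Specialist Practical Exam: Review and Information

Game Planning / Becoming a Planner 2007/10/06 15:52

I have just come back from taking the game planning specialist practical exam (October 6, 2007). While preparing on my own I never found proper information anywhere in the wide sea of the Internet, so I am posting this review. This year's exam is over, so I hope it helps planners and aspiring planners who are preparing for next year.

● Venue: I took the exam at the eastern branch office of 산업인력공단 near 뚝섬유원지 station.

● Exam time: The admission ticket says to be in the room by 8:30. I was 10 minutes late, and someone arrived 10 minutes after me. They seem to let you in until 9:00, so even if you oversleep, don't give up; run over and take the exam. It started at 9:05, and although I was not among the slower ones, I finished everything including the sketches at around 1:00. A few people who came as a group from what looked like a game academy left early. Apparently academies teach for the certification separately.

● Exam format: There are 14 questions. You write the answers in a word processor, print them, sketch in the blank areas reserved on the printout, and hand it in together with the diskette. There is no complicated step such as scanning, so go and take it at ease. When I finished and printed, it came to 5 pages. Those who wrote a lot seemed to have as many as 7. I wasn't cheating, just glancing around.

● What to bring: The 산업인력공단 site lists felt-tip pens, sketching pencils, Scotch tape, and so on, but most of it is unnecessary. Because the exam is done in a word processor, the only thing I needed was something to sketch with. You print on A4 paper and then sketch in the blanks you set up yourself, so I just used a mechanical pencil.

● Exam questions: This is probably what you are most curious about.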

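Q1: The difference between a Game and a Story
Q2: The difference between a Game and a Simulation
Q3: The difference between a Game and a Puzzle
Q4: The difference between a Game and a Toy
Commentary -> Look in an introductory game design text or a book such as Raph Koster's 재미이론 (A Theory of Fun). You will find it quickly.

Q5: The following is the situation in a hypothetical country. Describe a marketing strategy for an MMORPG planned to fit these conditions.
1) Video game console penetration: present in more than 90% of households.
2) Telecommunications infrastructure: high-speed Internet is available in every household, but many people view downloading negatively.
3) Consumer purchasing tendency: game CD sales are rising.
4) Consumer perception of item trading: buying items and trading them to others is viewed very negatively.
Commentary -> Even without formal marketing training you can infer this, right? Given console penetration, the purchasing tendency, and the negative view of downloads, it should obviously ship on CD for consoles; since it is for consoles it should support both single-player and multiplayer; and since it is an MMORPG, the CD should be sold cheaply with a low flat-rate fee for multiplayer users. Since item trading is viewed negatively, partial monetization through item sales is a poor fit. This is of course not the official answer. If you have a different answer, please leave a reply.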

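Q6-Q9: A combined set of questions that takes the most time. The story of 'Romeo and Juliet' is given as the game scenario, and you have to solve the questions below.

Q6: Sketch the two main characters and explain the concept to the designer.
Commentary -> Well... is there a right answer? You draw the Romeo and Juliet you imagine and explain the concept. I just followed the original Romeo and Juliet. I thought about a fusion take like the DiCaprio film, but I was not confident I could sketch it.

Q7: Sketch Juliet's house, centered on the view of Juliet's room, and explain the background concept to the designer.
Commentary -> No right answer here either. I think I focused my sketch too much on the room. Probably close to zero points. I should have drawn the shape of the house and the garden in more detail... This is the one I regret.

Q8: If the fight scene between Tybalt and Romeo is made into a versus fighting game like Tekken, sketch the interface and explain its components.
Commentary -> I drew a fighting-game interface and then explained each part.

Q9: Friar Laurence must deliver to Romeo a letter describing what he plotted with Juliet, but a mix-up causes the delivery to fail. Using this as the motif, you are to make a racing game like Need for Speed. (Constraint: it must be done within a set time.)
1) Set the goal of the game.
Commentary -> To overcome the given constraint and deliver the letter to Romeo.
2) Define three obstacles and describe the role of each.
Commentary -> The constraint is time, so that is a constant, which leaves speed and distance as possible variables. Distance could also be fixed as a constant, but then there would be only one variable and the game would be too dull. I chose three obstacles: curves, carriages, and slingshots. Curves affect distance, while carriages and slingshots affect speed. I added that how often they appear varies with difficulty.
3) Divide the course into five parts (start, area A, area B, area C, finish) and set the difficulty level of each. There must be one point where difficulty peaks, and you must justify your choices.
Commentary -> You have to draw this graph inside the word processor. Follow the general theory of plot in fiction: treat the start as the exposition, make area C the climax with maximum difficulty, and treat the finish as the resolution.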

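Q10: State the genre of each of the following games and discuss its characteristics and fun factors.
1) StarCraft
2) Tekken
3) Falcon 4.0
Commentary -> You need to be able to explain games at this level on your own to be a planner. I suppose that is why I am not one yet. ㅠㅠ I could not remember whether Tekken is 2D or 3D, so I quietly left that part out. For Falcon I briefly wrote the general characteristics of its genre and slipped in some guesses; I figured that at version 4.0 it must have plenty of features. As for StarCraft, well, I have played it so much.

Q11-Q14: Discuss, from a planner's perspective, services suited to MMORPG users with each of the following tendencies.
Q11: Killer
Q12: Achiever
Q13: Socializer
Q14: Explorer
Commentary -> This is covered in the book '온라인 게임기획, 이렇게 한다' published by 제우미디어.

How do the questions look? Hard? Apart from Q6-Q9, if you take an interest in games and have read an introductory game planning book a few times, you can write the answers. And for Q6-Q9, if you have planned even a few simple games, you can write them. The practical exam is not one you can pass with a week of preparation. I expect your score will reflect how much interest you take day to day. Oh, and... to those preparing to become planners... please don't. The competition gets tougher, and I haven't even found a job yet... haha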


Recover MySQL root password

2008. 2. 28. 10:52

Recover MySQL root password

You can recover the MySQL database server root password with the following five easy steps.

Step # 1: Stop the MySQL server process.

Step # 2: Start the MySQL (mysqld) server/daemon process with the --skip-grant-tables option so that it will not prompt for a password.

Step # 3: Connect to mysql server as the root user

Step # 4: Setup new root password

Step # 5: Exit and restart MySQL server

Here are the commands you need to type for each step (log in as the root user). The final restart is split into two commands, shown as Step # 5 and Step # 6 below:

Step # 1 : Stop mysql service

# /etc/init.d/mysql stop
Output:

Stopping MySQL database server: mysqld.

Step # 2: Start the MySQL server without a password:

# mysqld_safe --skip-grant-tables &
Output:

[1] 5988
Starting mysqld daemon with databases from /var/lib/mysql
mysqld_safe[6025]: started

Step # 3: Connect to mysql server using mysql client:

# mysql -u root
Output:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1 to server version: 4.1.15-Debian_1-log
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql>

Step # 4: Setup new MySQL root user password

mysql> use mysql;
mysql> update user set password=PASSWORD("NEW-ROOT-PASSWORD") where User='root';
mysql> flush privileges;
mysql> quit

Step # 5: Stop MySQL Server:

# /etc/init.d/mysql stop
Output:

Stopping MySQL database server: mysqld
STOPPING server from pid file /var/run/mysqld/mysqld.pid
mysqld_safe[6186]: ended
[1]+  Done                    mysqld_safe --skip-grant-tables

Step # 6: Start the MySQL server and test it:

# /etc/init.d/mysql start
# mysql -u root -p
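
To keep the order of the steps straight, the whole procedure can be written down as a dry-run plan. The sketch below only builds the list of commands and SQL statements taken from the steps above; it runs nothing. The function name `recovery_plan` is made up for illustration, and refusing passwords that contain quotes or backslashes is an assumption made here to avoid guessing at escaping rules.

```python
def recovery_plan(new_password):
    """Return the commands from the steps above, in order, without running them."""
    # Assumption: reject characters that would need SQL escaping.
    if any(ch in new_password for ch in "\"'\\"):
        raise ValueError("password contains a quote or backslash")
    return [
        ("shell", "/etc/init.d/mysql stop"),
        ("shell", "mysqld_safe --skip-grant-tables &"),
        ("shell", "mysql -u root"),
        ("sql", "use mysql;"),
        ("sql", 'update user set password=PASSWORD("%s") where User=\'root\';'
                % new_password),
        ("sql", "flush privileges;"),
        ("sql", "quit"),
        ("shell", "/etc/init.d/mysql stop"),
        ("shell", "/etc/init.d/mysql start"),
        ("shell", "mysql -u root -p"),
    ]


if __name__ == "__main__":
    for kind, text in recovery_plan("NEW-ROOT-PASSWORD"):
        print("%-5s %s" % (kind, text))
```

Printing the plan before typing anything makes it harder to forget the second stop, which ends the server that was started without grant tables.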


Beautiful Soup

2008. 2. 26. 13:17



You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser.

Beautiful Soup

"A tremendous boon." -- Python411 Podcast

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match foo.com", or "Find the table heading that's got bold text, then give me that text."
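
Those four requests might look roughly like the sketch below. It is written under assumptions: the page describes version 3.0.5, whose import was `from BeautifulSoup import BeautifulSoup` on Python 2, while this sketch uses the later `bs4` package on Python 3, and the sample HTML is invented for the example. The import is guarded so the file still loads where `bs4` is not installed.

```python
import re

try:
    from bs4 import BeautifulSoup  # assumption: the bs4 package is installed
except ImportError:
    BeautifulSoup = None

# Invented sample page, not taken from the original text.
SAMPLE = """<html><body>
<p><a class="externalLink" href="http://foo.com/a">Foo</a> <a href="/local">Local</a></p>
<table><tr><th><b>Price</b></th><th>Name</th></tr></table>
</body></html>"""


def query(html):
    """Run the four example queries and return the results as a dict."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        # "Find all the links"
        "all": [a["href"] for a in soup.find_all("a", href=True)],
        # "Find all the links of class externalLink"
        "external": [a["href"] for a in soup.find_all("a", class_="externalLink")],
        # "Find all the links whose urls match foo.com"
        "foo": [a["href"] for a in soup.find_all("a", href=re.compile(r"foo\.com"))],
        # "Find the table heading that's got bold text, then give me that text"
        "bold_headings": [th.b.get_text() for th in soup.find_all("th") if th.b],
    }


if BeautifulSoup is not None:
    print(query(SAMPLE))
```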

Valuable data that was once locked up in poorly-designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.

Download Beautiful Soup

The latest version is Beautiful Soup version 3.0.5, released December 12, 2007. You can download it as a single, self-contained file, or as a tarball with installer script and unit tests. Beautiful Soup is licensed under the same terms as Python itself, so you can drop it into almost any Python application (or into your library path) and start using it immediately.

Beautiful Soup works with Python versions 2.3 and up. It works best with Python versions 2.4 and up. If you don't have Python 2.4, you should install the cjkcodecs, iconvcodec, and chardet libraries. If you don't do this, Beautiful Soup will still work, but it won't be very good at parsing documents in Asian encodings.

Older versions are still available: the 1.x series works with Python 1.5, and the 2.x series has a fairly large installed base.

This document (source) is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Friday, December 21 2007, 18:57:10 Nowhere Standard Time and last built on Monday, February 25 2008, 23:00:01 Nowhere Standard Time.

Source: http://www.crummy.com/software/BeautifulSoup/


Python and HTML Processing

2008. 2. 26. 13:05

Python and HTML Processing


Abstract

Various Web surfing tasks that I regularly perform could be made much easier, and less tedious, if I could only use Python to fetch the HTML pages and to process them, yielding the information I really need. In this document I attempt to describe HTML processing in Python using readily available tools and libraries.

NOTE: This document is not quite finished. I aim to include sections on using mxTidy to deal with broken HTML as well as some tips on cleaning up text retrieved from HTML resources.

Prerequisites

Depending on the methods you wish to follow in this tutorial, you need the following things:

For the "SGML parser" method, a recent release of Python is probably enough. You can find one at the Python download page.
For the "XML parser" method, a recent release of Python is required, along with a capable XML processing library. I recommend using libxml2dom, since it can handle badly-formed HTML documents as well as well-formed XML or XHTML documents. However, PyXML also provides support for such documents.
For fetching Web pages over secure connections, it is important that SSL support is enabled either when building Python from source, or in any packaged distribution of Python that you might acquire. Information about this is given in the source distribution of Python, but you can download replacement socket libraries with SSL support for older versions of Python for Windows from Robin Dunn's site.

Activities

Accessing sites, downloading content, and processing such content, either to extract useful information for archiving or to use such content to navigate further into the site, require combinations of the following activities. Some activities can be chosen according to preference: whether the SGML parser or the XML parser (or parsing framework) is used depends on which style of programming seems nicer to a given developer (although one parser may seem to work better in some situations). However, technical restrictions usually dictate whether certain libraries are to be used instead of others: when handling HTTP redirects, it appears that certain Python modules are easier to use, or even more suited to handling such situations.

Fetching Web Pages

Fetching standard Web pages over HTTP is very easy with Python:

import urllib
# Get a file-like object for the Python Web site's home page.
f = urllib.urlopen("http://www.python.org")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()

Supplying Data

Sometimes, it is necessary to pass information to the Web server, such as information which would come from an HTML form. Of course, you need to know which fields are available in a form, but assuming that you already know this, you can supply such data in the urlopen function call:

import urllib
# Search the Vaults of Parnassus for "XMLForms".
# First, encode the data.
data = urllib.urlencode({"find" : "XMLForms", "findtype" : "t"})
# Now get that file-like object again, remembering to mention the data.
f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py", data)
# Read the results back.
s = f.read()
# Close the file-like object (not the string 's').
f.close()

The above example passed data to the server as an HTTP POST request. Fortunately, the Vaults of Parnassus is happy about such requests, but this is not always the case with Web services. We can instead choose to use a different kind of request, however:

# We have the encoded data. Now get the file-like object...
f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py?" + data)
# And the rest...

The only difference is the use of a ? (question mark) character and the adding of data onto the end of the Vaults of Parnassus URL, but this constitutes an HTTP GET request, where the query (our additional data) is included in the URL itself.
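
The difference between the two request types can be checked without contacting the server. The sketch below is an assumption-laden Python 3 restatement (the examples above are Python 2): it uses `urllib.parse.urlencode` and `urllib.request.Request`, builds both requests for the URL used above, and never sends them. The helper name `build_requests` is made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# URL taken from the examples above; nothing is sent to it here.
BASE_URL = "http://www.vex.net/parnassus/apyllo.py"


def build_requests(fields):
    """Return (post_request, get_request) for the same form fields."""
    encoded = urlencode(fields)
    # Supplying a body makes the request a POST.
    post = Request(BASE_URL, data=encoded.encode("ascii"))
    # Appending the query after "?" keeps it a GET.
    get = Request(BASE_URL + "?" + encoded)
    return post, get


post, get = build_requests({"find": "XMLForms", "findtype": "t"})
print(post.get_method(), post.full_url, post.data)
print(get.get_method(), get.full_url)
```

Passing either object to `urllib.request.urlopen` would perform the actual fetch.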

Fetching Secure Web Pages

Fetching secure Web pages using HTTPS is also very easy, provided that your Python installation supports SSL:

import urllib
# Get a file-like object for a site.
f = urllib.urlopen("https://www.somesecuresite.com")
# NOTE: At the interactive Python prompt, you may be prompted for a username
# NOTE: and password here.
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()

Including data which forms the basis of a query, as illustrated above, is also possible with URLs starting with https.

Handling Redirects

Many Web services use HTTP redirects for various straightforward or even bizarre purposes. For example, a fairly common technique employed on "high traffic" Web sites is the HTTP redirection load balancing strategy where the initial request to the publicised Web site (eg. http://www.somesite.com) is redirected to another server (eg. http://www1.somesite.com) where a user's session is handled.

Fortunately, urlopen handles redirects, at least in Python 2.1, and therefore any such redirection should be handled transparently by urlopen without your program needing to be aware that it is happening. It is possible to write code to deal with redirection yourself, and this can be done using the httplib module; however, the interfaces provided by that module are more complicated than those provided above, if somewhat more powerful.

Using the SGML Parser

Given a character string from a Web service, such as the value held by s in the above examples, how can one understand the content provided by the service in such a way that an "intelligent" response can be made? One method is by using an SGML parser, since HTML is a relation of SGML, and HTML is probably the content type most likely to be experienced when interacting with a Web service.

In the standard Python library, the sgmllib module contains an appropriate parser class called SGMLParser. Unfortunately, it is of limited use to us unless we customise its activities somehow. Fortunately, Python's object-oriented features, combined with the design of the SGMLParser class, provide a means of customising it fairly easily.

Defining a Parser Class

First of all, let us define a new class inheriting from SGMLParser with a convenience method that I find very convenient indeed:

import sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    # More to come...

What the parse method does is provide an easy way of passing some text (as a string) to the parser object. I find this nicer than having to remember calling the feed method, and since I always tend to have the entire document ready for parsing, I do not need to use feed many times - passing many pieces of text which comprise an entire document is an interesting feature of SGMLParser (and its derivatives) which could be used in other situations.

Deciding What to Remember

Of course, implementing our own customised parser is only of interest if we are looking to find things in a document. Therefore, we should aim to declare these things before we start parsing. We can do this in the __init__ method of our class:

    # Continuing from above...

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []

    # More to come...

Here, we initialise new objects by passing information to the __init__ method of the superclass (SGMLParser); this makes sure that the underlying parser is set up properly. We also initialise an attribute called hyperlinks which will be used to record the hyperlinks found in the document that any given object will parse.

Care should be taken when choosing attribute names, since use of names defined in the superclass could potentially cause problems when our parser object is used, because a badly chosen name would cause one of our attributes to override an attribute in the superclass and result in our attributes being manipulated for internal parsing purposes by the superclass. We might hope that the SGMLParser class uses attribute names with leading double underscores (__) since this isolates such attributes from access by subclasses such as our own MyParser class.
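
The effect of leading double underscores can be seen in a small experiment. The classes below are hypothetical stand-ins written in Python 3, not the real SGMLParser: `Base` plays the superclass and `Child` plays our parser. An attribute named with two leading underscores is stored under a class-specific name, so the two classes do not collide, whereas a plainly named attribute is shared and gets overwritten.

```python
class Base:
    """Hypothetical superclass standing in for a library parser."""

    def __init__(self):
        self.__state = "base"   # stored on the object as _Base__state
        self.shared = "base"    # plain name: visible to subclasses

    def base_view(self):
        return self.__state, self.shared


class Child(Base):
    """Hypothetical subclass standing in for MyParser."""

    def __init__(self):
        Base.__init__(self)
        self.__state = "child"  # stored as _Child__state: no clash
        self.shared = "child"   # overwrites the superclass attribute

    def child_view(self):
        return self.__state


c = Child()
print(c.base_view())   # the superclass still sees its own __state
print(c.child_view())
```

Here `c.base_view()` returns `("base", "child")`: the private state survived, but the plainly named attribute was replaced by the subclass.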

Remembering Document Details

We now need to define a way of extracting data from the document, but SGMLParser provides a mechanism which notifies us when an interesting part of the document has been read. SGML and HTML are textual formats which are structured by the presence of so-called tags, and in HTML, hyperlinks may be represented in the following way:

<a href="http://www.python.org">The Python Web site</a>

How SGMLParser Operates

An SGMLParser object which is parsing a document recognises starting and ending tags for things such as hyperlinks, and it issues a method call on itself based on the name of the tag found and whether the tag is a starting or ending tag. So, as the above text is recognised by an SGMLParser object (or an object derived from SGMLParser, like MyParser), the following method calls are made internally:

self.start_a([("href", "http://www.python.org")])
self.handle_data("The Python Web site")
self.end_a()

Note that the text between the tags is considered as data, and that the ending tag does not provide any information. The starting tag, however, does provide information in the form of a sequence of attribute names and values, where each name/value pair is placed in a 2-tuple:

# The form of attributes supplied to start tag methods:
# (name, value)
# Examples:
# ("href", "http://www.python.org")
# ("target", "python")

Why SGMLParser Works

Why does SGMLParser issue a method call on itself, effectively telling itself that a tag has been encountered? The basic SGMLParser class surely does not know what to do with such information. Well, if another class inherits from SGMLParser, then such calls are no longer confined to SGMLParser and instead act on methods in the subclass, such as MyParser, where such methods exist. Thus, a customised parser class (eg. MyParser) once instantiated (made into an object) acts like a stack of components, with the lowest level of the stack doing the hard parsing work and passing items of interest to the upper layers - it is a bit like a factory with components being made on the ground floor and inspection of those components taking place in the laboratories in the upper floors!

Class	Activity
...	Listens to reports, records other interesting things
`MyParser`	Listens to reports, records interesting things
`SGMLParser`	Parses documents, issuing reports at each step
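
The layering described above rests on looking up a method by name at run time. The following Python 3 sketch shows that mechanism in isolation; `TinyParser` and `LinkRecorder` are hypothetical classes invented for the illustration, and no real parsing happens in them.

```python
class TinyParser:
    """Hypothetical lowest layer: reports tags by calling methods on itself."""

    def report_start(self, tag, attributes):
        # Look for a method named start_<tag>; a subclass may define it.
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(attributes)

    def report_end(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()


class LinkRecorder(TinyParser):
    """Upper layer: listens to the reports and records what it cares about."""

    def __init__(self):
        self.hyperlinks = []

    def start_a(self, attributes):
        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)


recorder = LinkRecorder()
recorder.report_start("a", [("href", "http://www.python.org")])
recorder.report_start("b", [])   # no start_b method exists, so this is ignored
recorder.report_end("a")         # no end_a method either
print(recorder.hyperlinks)
```

Because the lookup goes through `self`, the base class never needs to know which tags a subclass is interested in.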

Introducing Our Customisations

Now, if we want to record the hyperlinks in the document, all we need to do is to define a method called start_a which extracts the hyperlink from the attributes which are provided in the starting a tag. This can be defined as follows:

    # Continuing from above...

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    # More to come...

All we need to do is traverse the attributes list, find appropriately named attributes, and record the value of those attributes.

Retrieving the Details

A nice way of providing access to the retrieved details is to define a method, although Python 2.2 provides additional features to make this more convenient. We shall use the old approach:

    # Continuing from above...

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks

Trying it Out

Now that we have defined our class, we can instantiate it, making a new MyParser object. After that, it is just a matter of giving it a document to work with:

import urllib, sgmllib

# Get something to work with.
f = urllib.urlopen("http://www.python.org")
s = f.read()

# Try and process the page.
# The class should have been defined first, remember.
myparser = MyParser()
myparser.parse(s)

# Get the hyperlinks.
print myparser.get_hyperlinks()

The print statement should cause a list to be displayed, containing various hyperlinks to locations on the Python home page and other sites.
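
The sgmllib module is not part of Python 3, so readers on a current interpreter may want an equivalent. The sketch below is one possible restatement, assuming `html.parser.HTMLParser` from the standard library as the replacement; it receives every tag through `handle_starttag` rather than per-tag `start_a` methods, and it parses an inline sample string (invented here) instead of fetching a page.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect href values, mirroring MyParser from the text."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, as with sgmllib.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hyperlinks.append(value)

    def get_hyperlinks(self):
        return self.hyperlinks


# Invented sample document; the second anchor has no href and is skipped.
SAMPLE = ('<p><a href="http://www.python.org">The Python Web site</a> and '
          '<a name="top">an anchor</a></p>')

parser = LinkParser()
parser.parse(SAMPLE)
print(parser.get_hyperlinks())
```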

The Example File

The above example code can be downloaded and executed to see the results.

Finding More Specific Content

Of course, if it is sufficient for you to extract information from a document without worrying about where in the document it came from, then the above level of complexity should suit you perfectly. However, one might want to extract information which only appears in certain places or constructs - a good example of this is the text between starting and ending tags of hyperlinks which we saw above. If we just acquired every piece of text using a handle_data method which recorded everything it saw, then we would not know which piece of text described a hyperlink and which piece of text appeared in any other place in a document.

    # An extension of the above class.
    # This is not very useful.

    def handle_data(self, data):
        "Handle the textual 'data'."

        self.descriptions.append(data)

Here, the descriptions attribute (which we would need to initialise in the __init__ method) would be filled with lots of meaningless textual data. So how can we be more specific? The best approach is to remember not only the content that SGMLParser discovers, but also to remember what kind of content we have seen already.

Remembering Our Position

Let us add some new attributes to the __init__ method.

        # At the end of the __init__ method...

        self.descriptions = []
        self.inside_a_element = 0

The descriptions attribute is defined as we anticipated, but the inside_a_element attribute is used for something different: it will indicate whether or not SGMLParser is currently investigating the contents of an a element - that is, whether SGMLParser is between the starting a tag and the ending a tag.

Let us now add some "logic" to the start_a method, redefining it as follows:

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1

Now, we should know when a starting a tag has been seen, but to avoid confusion, we should also change the value of the new attribute when the parser sees an ending a tag. We do this by defining a new method for this case:

    def end_a(self):
        "Record the end of a hyperlink."

        self.inside_a_element = 0

Fortunately, it is not permitted to "nest" hyperlinks, so it is not relevant to wonder what might happen if an ending tag were to be seen after more than one starting tag had been seen in succession.

Recording Relevant Data

Now, given that we can be sure of our position in a document and whether we should record the data that is being presented, we can define the "real" handle_data method as follows:

    def handle_data(self, data):
        "Handle the textual 'data'."

        if self.inside_a_element:
            self.descriptions.append(data)

This method is not perfect, as we shall see, but it does at least avoid recording every last piece of text in the document.

We can now define a method to retrieve the description data:

    def get_descriptions(self):
        "Return a list of descriptions."

        return self.descriptions

And we can add the following line to our test program in order to display the descriptions:

print myparser.get_descriptions()

The Example File

The example code with these modifications can be downloaded and executed to see the results.

Problems with Text

Upon running the modified example, one thing is apparent: there are a few descriptions which do not make sense. Moreover, the number of descriptions does not match the number of hyperlinks. The reason for this is the way that text is found and presented to us by the parser - we may be presented with more than one fragment of text for a particular region of text, so that more than one fragment of text may be signalled between a starting a tag and an ending a tag, even though it is logically one block of text.

We may modify our example by adding another attribute to indicate whether we are just beginning to process a region of text. If this new attribute is set, then we add a description to the list; if not, then we add any text found to the most recent description recorded.

The __init__ method is modified still further:

        # At the end of the __init__ method...

        self.starting_description = 0

Since we can only be sure that a description is being started immediately after a starting a tag has been seen, we redefine the start_a method as follows:

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1
                self.starting_description = 1

Now, the handle_data method needs redefining as follows:

    def handle_data(self, data):
        "Handle the textual 'data'."

        if self.inside_a_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

Clearly, the method becomes more complicated. We need to detect whether the description is being started and act in the manner discussed above.
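The fragment-merging logic can be tried without the original test page. What follows is only a minimal sketch under some assumptions: it targets Python 3, where sgmllib is no longer available, so it uses html.parser.HTMLParser in place of SGMLParser; the class name LinkDescriptionParser and the sample markup are made up for illustration and are not part of the original example.

```python
from html.parser import HTMLParser


class LinkDescriptionParser(HTMLParser):
    """Collect hyperlinks and their descriptions, merging text fragments."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = False
        self.starting_description = False

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = True
                self.starting_description = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.inside_a_element = False

    def handle_data(self, data):
        if not self.inside_a_element:
            return
        if self.starting_description:
            # First fragment after the starting tag: begin a new description.
            self.descriptions.append(data)
            self.starting_description = False
        else:
            # Later fragment: extend the most recent description.
            self.descriptions[-1] += data


parser = LinkDescriptionParser()
parser.feed(
    '<p>See <a href="http://www.python.org">A <em>really</em> '
    'important page.</a></p>'
)
parser.close()
print(parser.hyperlinks)
print(parser.descriptions)
```

The nested em element makes the parser report three separate pieces of text inside the hyperlink, which is exactly the situation the starting_description flag is meant to handle.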

The Example File

The example code with these modifications can be downloaded and executed to see the results.

Conclusions

Although the final example file produces some reasonable results, there are still some strange descriptions, and we have not taken images used within hyperlinks into consideration. The modifications that were required illustrate that the more attention is paid to the structure of the document, the more effort is required to monitor the origins of information. As a result, we need to maintain state information within the MyParser object in a not-too-elegant way.

For application purposes, the SGMLParser class, its derivatives, and related approaches (such as SAX) are useful for casual access to information, but for certain kinds of querying, they can become more complicated to use than one would initially believe. However, these approaches can be used for another purpose: that of building structures which can be accessed in a more methodical fashion, as we shall see below.

Using XML Parsers

Given a character string s, containing an HTML document which may have been retrieved from a Web service (using an approach described in an earlier section of this document), let us now consider an alternative method of interpreting the contents of this document so that we do not have to manage the complexity of remembering explicitly the structure of the document that we have seen so far. One of the problems with SGMLParser was that access to information in a document happened "serially" - that is, information was presented to us in the order in which it was found - but it may have been more appropriate to access the document information according to the structure of the document, so that we could request all parts of the document corresponding to the hyperlink elements present in that document, before examining each document portion for the text within each hyperlink element.

In the XML world, a standard called the Document Object Model (DOM) has been devised to provide a means of access to document information which permits us to navigate the structure of a document, requesting different sections of that document, and giving us the ability to revisit such sections at any time; the use of Python with XML and the DOM is described in another document. If all Web pages were well-formed XML - that is, they all complied with the expectations and standards set out by the XML specifications - then any XML parser would be sufficient to process any HTML document found on the Web. Unfortunately, many Web pages use less formal variants of HTML which are rejected by XML parsers. Thus, we need to employ particular tools and additional techniques to convert such pages to DOM representations.

Below, we describe how Web pages may be processed using the PyXML toolkit and with the libxml2dom package to obtain a top-level document object. Since both approaches yield an object which is broadly compatible with the DOM standard, the subsequent description of how we then inspect such documents applies regardless of whichever toolkit or package we have chosen.

Using PyXML

It is possible to use Python's XML framework with the kind of HTML found on the Web by employing a special "reader" class which builds a DOM representation from an HTML document, and the consequences of this are described below.

Creating the Reader

An appropriate class for reading HTML documents is found deep in the xml package, and we shall instantiate this class for subsequent use:

from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()

Of course, there are many different ways of accessing the Reader class concerned, but I have chosen not to import Reader into the common namespace. One good reason for deciding this is that I may wish to import other Reader classes from other packages or modules, and we clearly need a way to distinguish between them. Therefore, I import the HtmlLib name and access the Reader class from within that module.

Loading a Document

Unlike SGMLParser, we do not need to customise any class before we load a document. Therefore, we can "postpone" any consideration of the contents of the document until after the document has been loaded, although it is very likely that you will have some idea of the nature of the contents in advance and will have written classes or functions to work on the DOM representation once it is available. After all, real programs extracting particular information from a certain kind of document do need to know something about the structure of the documents they process, whether that knowledge is put in a subclass of a parser (as in SGMLParser) or whether it is "encoded" in classes and functions which manipulate the DOM representation.

Anyway, let us load the document and obtain a Document object:

doc = reader.fromString(s)

Note that the "top level" of a DOM representation is always a Document node object, and this is what doc refers to immediately after the document is loaded.

Using libxml2dom

Obtaining documents using libxml2dom is slightly more straightforward:

import libxml2dom
doc = libxml2dom.parseString(s, html=1)

If the document text is well-formed XML, we could omit the html parameter or set it to have a false value. However, if we are not sure whether the text is well-formed, no significant issues will arise from setting the parameter in the above fashion.

Deciding What to Extract

Now, it is appropriate to decide which information is to be found and retrieved from the document, and this is where some tasks appear easier than with SGMLParser (and related frameworks). Let us consider the task of extracting all the hyperlinks from the document; we can certainly find all the hyperlink elements as follows:

a_elements = doc.getElementsByTagName("a")

Since hyperlink elements comprise the starting a tag, the ending a tag, and all data between them, the value of the a_elements variable should be a list of objects representing regions in the document which would appear like this:

<a href="http://www.python.org">The Python Web site</a>

Querying Elements

To make the elements easier to deal with, each object in the list is not the textual representation of the element as given above. Instead, an object is created for each element which provides a more convenient level of access to the details. We can therefore obtain a reference to such an object and find out more about the element it represents:

# Get the first element in the list. We don't need to use a separate variable,
# but it makes it clearer.
first = a_elements[0]
# Now display the value of the "href" attribute.
print first.getAttribute("href")

What is happening here is that the first object (being the first a element in the list of those found) is being asked to return the value of the attribute whose name is href, and if such an attribute exists, a string is returned containing the contents of the attribute: in the case of the above example, this would be...

http://www.python.org

If the href attribute had not existed, such as in the following example element, then a value of None would have been returned.

<a name="Example">This is not a hyperlink. It is a target.</a>
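For readers without PyXML or libxml2dom installed, the element queries above can be tried with the standard library. This is a hedged sketch, assuming Python 3 and xml.dom.minidom, which is a different DOM implementation from the two toolkits discussed here: it only accepts well-formed XML, and its getAttribute returns an empty string, rather than None, when the attribute is absent. The sample document is invented for illustration.

```python
from xml.dom import minidom

# A small, well-formed sample document (minidom rejects sloppy HTML).
s = """<html><body>
<p><a href="http://www.python.org">The Python Web site</a></p>
<p><a name="Example">This is not a hyperlink. It is a target.</a></p>
</body></html>"""

doc = minidom.parseString(s)
a_elements = doc.getElementsByTagName("a")

hrefs = []
for element in a_elements:
    # hasAttribute avoids depending on what getAttribute returns when the
    # attribute is missing, which differs between DOM implementations.
    if element.hasAttribute("href"):
        hrefs.append(element.getAttribute("href"))

print(len(a_elements))
print(hrefs)
```

Testing with hasAttribute keeps the code independent of whether a particular toolkit signals a missing attribute with None or with an empty string.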

Namespaces

Previously, this document recommended the usage of namespaces and the getAttributeNS method, rather than the getAttribute method. Whilst XML processing may involve extensive use of namespaces, some HTML parsers do not appear to expose them quite as one would expect: for example, not associating the XHTML namespace with XHTML elements in a document. Thus, it can be advisable to ignore namespaces unless their usage is unavoidable in order to distinguish between elements in mixed-content documents (XHTML combined with SVG, for example).

Finding More Specific Content

We are already being fairly specific, in a sense, in the way that we have chosen to access the a elements within the document, since we start from a particular point in the document's structure and search for elements from there. In the SGMLParser examples, we decided to look for descriptions of hyperlinks in the text which is enclosed between the starting and ending tags associated with hyperlinks, and we were largely successful with that, although there were some issues that could have been handled better. Here, we shall attempt to find everything that is descriptive within hyperlink elements.

Elements, Nodes and Child Nodes

Each hyperlink element is represented by an object whose attributes can be queried, as we did above in order to get the href attribute's value. However, elements can also be queried about their contents, and such contents take the form of objects which represent "nodes" within the document. (The nature of XML documents is described in another introductory document which discusses the DOM.) In this case, it is interesting for us to inspect the nodes which reside within (or under) each hyperlink element, and since these nodes are known generally as "child nodes", we access them through the childNodes attribute on each so-called Node object.

# Get the child nodes of the first "a" element.
nodes = first.childNodes

Node Types

Nodes are the basis of any particular piece of information found in an XML document, so any element found in a document is based on a node and can be explicitly identified as an element by checking its "node type":

print first.nodeType
# A number is returned which corresponds to one of the special values listed in
# the xml.dom.Node class. Since elements inherit from that class, we can access
# these values on 'first' itself!
print first.nodeType == first.ELEMENT_NODE
# If first is an element (it should be) then display the value 1.

One might wonder how this is useful, since the list of hyperlink elements, for example, is clearly a list of elements - that is, after all, what we asked for. However, if we ask an element for a list of "child nodes", we cannot immediately be sure which of these nodes are elements and which are, for example, pieces of textual data. Let us therefore examine the "child nodes" of first to see which of them are textual:

for node in first.childNodes:
    if node.nodeType == node.TEXT_NODE:
        print "Found a text node:", node.nodeValue

Navigating the Document Structure

If we wanted only to get the descriptive text within each hyperlink element, then we would need to visit all nodes within each element (the "child nodes") and record the value of the textual elements. However, this would not quite be enough - consider the following document region:

<a href="http://www.python.org">A <em>really</em> important page.</a>

Within the a element, there are text nodes and an em element - the text within that element is not directly available as a "child node" of the a element. If we did not consider textual child nodes of each child node, then we would miss important information. Consequently, it becomes essential to recursively descend inside the a element collecting child node values. This is not as hard as it sounds, however:

def collect_text(node):
    "A function which collects text inside 'node', returning that text."

    s = ""
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            s += child_node.nodeValue
        else:
            s += collect_text(child_node)
    return s

# Call 'collect_text' on 'first', displaying the text found.
print collect_text(first)

To contrast this with the SGMLParser approach, we see that much of the work done in that example to extract textual information is distributed throughout the MyParser class, whereas the above function gathers the necessary operations into a single place, which is the main reason it can look complicated at first glance.
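The recursive descent can be checked against the example region shown earlier. The following is a sketch under the assumption that Python 3 and xml.dom.minidom are used instead of PyXML or libxml2dom; the node interface it relies on (childNodes, nodeType, TEXT_NODE, nodeValue) is the same one described above.

```python
from xml.dom import minidom


def collect_text(node):
    """Return all text found inside 'node', descending into child elements."""
    parts = []
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            parts.append(child_node.nodeValue)
        else:
            # Not text: look for text among this child's own child nodes.
            parts.append(collect_text(child_node))
    return "".join(parts)


doc = minidom.parseString(
    '<p><a href="http://www.python.org">A <em>really</em> important page.</a></p>'
)
first = doc.getElementsByTagName("a")[0]
text = collect_text(first)
print(text)
```

Collecting the pieces in a list and joining them once is a small stylistic change from the string concatenation used above; the traversal itself is the same.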

Getting Document Regions as Text

Interestingly, it is easier to retrieve whole sections of the original document as text for each of the child nodes, thus collecting the complete contents of the a element as text. For this, we just need to make use of a function provided in the xml.dom.ext package:

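from xml.dom.ext import PrettyPrint
from StringIO import StringIO
# PrettyPrint writes to a stream rather than returning a string, so collect
# the output in a StringIO object. In order to avoid getting the "a" starting
# and ending tags, prettyprint the child nodes.
out = StringIO()
for child_node in a_elements[0].childNodes:
    PrettyPrint(child_node, out)
s = out.getvalue()
# Display the region of the original document between the tags.
print s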

Unfortunately, documents produced by libxml2dom do not work with PrettyPrint. However, we can use a method on each node object instead:

# In order to avoid getting the "a" starting and ending tags, prettyprint the
# child nodes.
s = ""
for child_node in a_elements[0].childNodes:
    s += child_node.toString(prettyprint=1)
# Display the region of the original document between the tags.
print s

It is envisaged that libxml2dom will eventually work better with such functions and tools.
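The same idea, serialising only the child nodes so that the enclosing tags are left out, can be sketched with the standard library. This assumes Python 3 and xml.dom.minidom, whose toxml method plays the role that PrettyPrint and toString play above; it does not reformat or indent the output.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<p><a href="http://www.python.org">A <em>really</em> important page.</a></p>'
)
first = doc.getElementsByTagName("a")[0]

# Serialise each child node so that the enclosing "a" tags are left out.
region = "".join(child_node.toxml() for child_node in first.childNodes)
print(region)
```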


Build a Web spider on Linux (Korean edition)

2008. 2. 16. 22:01

Collect Internet content with simple spiders and scrapers

Level: Intermediate

M. Tim Jones, Consultant Engineer, Emulex

17 April 2007

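A Web spider is a software agent that crawls the Internet, collecting and filtering information and gathering it together for a user. With common scripting languages and their Web modules, a Web spider is easy to build. This article shows how to build spiders and scrapers for Linux that crawl a Web site and gather information.

A spider is a program that crawls the Internet in a specific way for a specific purpose. That purpose may be to gather information or to understand the structure and validity of a Web site. Spiders are the basis of modern search engines such as Google and AltaVista. These spiders automatically retrieve data from the Web and pass it on to other applications that index the content of Web sites for the best match to a set of search terms.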

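Web spiders as agents

Web spiders and scrapers are another form of software robot, or agent (a term coined by Alan Kay in the early 1980s). Kay's idea of an agent was a proxy for the user in the computer world. An agent could be given a goal and would then work toward that goal. If it got stuck, it could ask the user for advice and then continue pursuing its goal.

Today, agents are classified by attributes such as autonomy, adaptiveness, communication, and collaboration with other agents. Other attributes, such as agent mobility and personality, are goals of current agent research. The Web spiders described in this article fall under Task-Specific Agents in the agent taxonomy.

Closely related to the spider is the Web scraper. A scraper is a type of spider that targets specific content on the Web, such as the cost of products or services. One use is price comparison: finding the price of a given product so that you can price your own product accordingly and advertise on that basis. A scraper gathers data from many Web sources and presents that information to the user.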

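Biological motivation

When you think about a spider in nature, you think of it in terms of its interaction with its environment, not in isolation. A spider sees and senses its way, moving from one place to another in a meaningful manner. Web spiders operate in a similar way. A Web spider is a program written in a high-level language that interacts with its environment through networking protocols such as the Hypertext Transfer Protocol (HTTP). If your spider needs to communicate with you, it can send an e-mail message using the Simple Mail Transfer Protocol (SMTP).

Spiders are not limited to HTTP or SMTP. Some spiders use Web services, such as SOAP or the Extensible Markup Language Remote Procedure Call (XML-RPC) protocol. Others interact with newsgroups through the Network News Transfer Protocol (NNTP) or look for interesting news items in Really Simple Syndication (RSS) feeds. While most spiders in nature can see only changes in light-dark intensity and movement, Web spiders can see and sense using many types of protocols.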

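Applications of spiders and scrapers

The eyes and legs of a spider (sidebar)

A Web spider's primary means of seeing and moving around the Internet is HTTP. HTTP is a message-oriented protocol in which a client connects to a server and issues a request, and the server provides a response. Each request and response is made up of a header and a body; the header provides status information and a description of the body's contents.

HTTP provides three basic types of requests. The first is HEAD, which requests information about an asset on the server. The second is GET, which requests the asset itself, such as a file or an image. Finally, a POST request lets the client interact with the server through a Web page (usually through a Web form).

Web spiders and scrapers are useful applications, so there are many different uses for them, both good and bad. Let's look at a few applications that use these technologies.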

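Search engine Web crawlers

Web spiders make searching the Internet easy and efficient. A search engine uses many Web spiders to crawl Web pages on the Internet, return their content, and index it. Once this is done, the search engine can quickly search its local index to find the most reasonable results for a query. Google also uses the PageRank algorithm, in which the rank of a Web page in the search results reflects how many other pages link to it. Each link acts as a vote, and the pages with the most votes get the highest rank.

Searching the Internet this way is expensive, both in the bandwidth needed to bring Web content to the indexer and in the computational cost of indexing the results. It also requires a great deal of storage, but that hardly seems a problem when you consider that Google offers Gmail users 1,000 megabytes of storage.

Web spiders use a set of policies to minimize their traffic on the Internet. Consider that Google indexes more than 8 billion Web pages. Behavior policies define which pages the crawler brings to the indexer and how often it returns to a Web site to check it again, along with a politeness policy. Web servers can exclude crawlers using a file called robots.txt.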
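As a rough illustration of how a crawler might honor robots.txt, here is a minimal Python 3 sketch using the standard urllib.robotparser module. The rules, the user-agent name "minispider", and the example.com URLs are hypothetical; a real crawler would download the file from the site it is about to crawl rather than parse a hard-coded list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rules = RobotFileParser()
rules.parse(robots_lines)

# Ask whether a crawler calling itself "minispider" may fetch each URL.
allowed_public = rules.can_fetch(
    "minispider", "http://www.example.com/index.html")
allowed_private = rules.can_fetch(
    "minispider", "http://www.example.com/private/data.html")
print(allowed_public, allowed_private)
```

Checking can_fetch before every request is one simple way to make the exclusion file part of a politeness policy.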

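Corporate Web crawlers

Like a standard search engine spider, a corporate Web spider indexes content, but content that is not available to the general public. For example, companies commonly have internal Web sites used by their employees. This type of spider is restricted to the local environment. Because its search is restricted, more computing power is usually available, and specialized and more complete indexes are possible. Google has taken this a step further by providing a desktop search engine that indexes the content of your personal computer.

Specialized crawlers

There are also specialized crawlers, for example for archiving content or generating statistics. An archiving crawler crawls a Web site and pulls the content down locally to be kept on a long-term storage medium. This can be used for backup or, on a larger scale, to take a snapshot of content on the Internet. Statistics help in understanding Internet content and what is missing from it. Crawlers can be used to determine how many Web servers are running, how many Web servers of a given type there are, how many Web pages are available, and how many links are broken (those that return HTTP 404, page not found).

Other specialized crawlers include Web site checkers. These crawlers look for missing content, validate all links, and check that your Hypertext Markup Language (HTML) is valid.

E-mail harvesting crawlers

Now let's turn to the dark side. Unfortunately, a few bad apples spoil the Internet for the rest of us. E-mail harvesting crawlers search Web sites for e-mail addresses, which are then used to generate mass spam. According to a Postini report (August 2005), 70 percent of all e-mail messages received by Postini users were spam.

E-mail harvesting is one of the most common crawler behaviors, and it is the last type of crawler described here.

So far, we have covered Web spiders and scrapers. The next four examples show how to build spiders and scrapers for Linux using modern scripting languages such as Ruby and Python.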

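Example 1: A generic scraper

This example shows how to determine what kind of Web server is running for a given Web site. This can be quite interesting, and it can also reveal which kinds of Web servers government, academia, and industry use.

Listing 1 is a Ruby script that scrapes a Web site to identify its HTTP server. The Net::HTTP class implements an HTTP client and the GET, HEAD, and POST HTTP methods. Whenever you make a request to an HTTP server, part of the HTTP response identifies the server from which the content is served. Rather than download a page from the site, the script uses the HEAD method to get information about the root page ('/'). As long as the HTTP server returns a successful response (indicated by the "200" response code), the script iterates over each line of the response looking for the server key and, when it is found, prints the value. The value of this key is a string that identifies the HTTP server.

Listing 1. Ruby script for simple metadata scraping (srvinfo.rb)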

#!/usr/local/bin/ruby
require 'net/http'

# Get the first argument from the command-line (the URL)
url = ARGV[0]

begin
  # Create a new HTTP connection
  httpCon = Net::HTTP.new( url, 80 )

  # Perform a HEAD request
  resp, data = httpCon.head( "/", nil )

  # If it succeeded (200 is success)
  if resp.code == "200" then

    # Iterate through the response hash
    resp.each {|key,val|

      # If the key is the server, print the value
      if key == "server" then
        print "  The server at "+url+" is "+val+"\n"
      end

    }

  end

end
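The same HEAD-request idea can be sketched in Python 3 with the standard http.client module. This is an illustrative sketch, not a port of the article's script: the function name server_banner is made up, and so that the example runs without Internet access it queries a tiny stand-in server (DemoHandler) started on the local machine instead of a public Web site.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny stand-in server so the sketch runs without Internet access."""

    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def server_banner(host, port=80):
    """Send HEAD / and return the Server header, or None if unavailable."""
    connection = http.client.HTTPConnection(host, port, timeout=5)
    try:
        connection.request("HEAD", "/")
        response = connection.getresponse()
        if response.status == 200:
            return response.getheader("Server")
        return None
    finally:
        connection.close()


# Port 0 asks the operating system for any free port.
httpd = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=httpd.serve_forever, daemon=True).start()
try:
    banner = server_banner("127.0.0.1", httpd.server_port)
finally:
    httpd.shutdown()
    httpd.server_close()
print("The server at 127.0.0.1 is", banner)
```

Pointing server_banner at a real host name with the default port 80 would reproduce what the Ruby script does, subject to that site's terms of use.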

Besides showing how to use the srvinfo script, Listing 2 shows the results for a number of government, academic, and business Web sites. The servers vary from Apache (68%) to Sun and Microsoft Internet Information Services (IIS). In one case, no server is reported. It is interesting that the Federated States of Micronesia runs an old version of Apache (time for an update), while Apache.org is on the cutting edge.

Listing 2. Sample usage of the server scraper

[mtj@camus]$ ./srvrinfo.rb www.whitehouse.gov
  The server at www.whitehouse.gov is Apache
[mtj@camus]$ ./srvrinfo.rb www.cisco.com
  The server at www.cisco.com is Apache/2.0 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.gov.ru
  The server at www.gov.ru is Apache/1.3.29 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.gov.cn
[mtj@camus]$ ./srvrinfo.rb www.kantei.go.jp
  The server at www.kantei.go.jp is Apache
[mtj@camus]$ ./srvrinfo.rb www.pmo.gov.to
  The server at www.pmo.gov.to is Apache/2.0.46 (Red Hat Linux)
[mtj@camus]$ ./srvrinfo.rb www.mozambique.mz
  The server at www.mozambique.mz is Apache/1.3.27
    (Unix) PHP/3.0.18 PHP/4.2.3
[mtj@camus]$ ./srvrinfo.rb www.cisco.com
  The server at www.cisco.com is Apache/1.0 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.mit.edu
  The server at www.mit.edu is MIT Web Server Apache/1.3.26 Mark/1.5
    (Unix) mod_ssl/2.8.9 OpenSSL/0.9.7c
[mtj@camus]$ ./srvrinfo.rb www.stanford.edu
  The server at www.stanford.edu is Apache/2.0.54 (Debian GNU/Linux)
    mod_fastcgi/2.4.2 mod_ssl/2.0.54 OpenSSL/0.9.7e WebAuth/3.2.8
[mtj@camus]$ ./srvrinfo.rb www.fsmgov.org
  The server at www.fsmgov.org is Apache/1.3.27 (Unix) PHP/4.3.1
[mtj@camus]$ ./srvrinfo.rb www.csuchico.edu
  The server at www.csuchico.edu is Sun-ONE-Web-Server/6.1
[mtj@camus]$ ./srvrinfo.rb www.sun.com
  The server at www.sun.com is Sun Java System Web Server 6.1
[mtj@camus]$ ./srvrinfo.rb www.microsoft.com
  The server at www.microsoft.com is Microsoft-IIS/6.0
[mtj@camus]$ ./srvrinfo.rb www.apache.org
  The server at www.apache.org is Apache/2.2.3 (Unix)
    mod_ssl/2.2.3 OpenSSL/0.9.7g

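This is useful data, and it is interesting to see what governments and universities use for their Web servers. The next example shows something a little less useful: a stock quote scraper.

Example 2: A stock quote scraper

In this example, I build a simple Web scraper (also called a screen scraper) that gathers stock quote information. It works by exploiting a pattern in the response Web page, as follows.

Listing 3. A simple Web scraper for stock quotes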

#!/usr/local/bin/ruby
require 'net/http'

host = "www.smartmoney.com"
link = "/eqsnaps/index.cfm?story=snapshot&symbol="+ARGV[0]

begin

  # Create a new HTTP connection
  httpCon = Net::HTTP.new( host, 80 )

  # Perform a GET request
  resp = httpCon.get( link, nil )

  stroffset = resp.body =~ /class="price">/
  subset = resp.body.slice(stroffset+14, 10)
  limit = subset.index('<')

  print ARGV[0] + " current stock price " + subset[0..limit-1] +
          " (from stockmoney.com)\n"

end

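In this Ruby script, I open an HTTP client connection to the server (in this case, www.smartmoney.com) and build a link that requests a quote for the stock symbol passed in by the user (through &symbol=<symbol>). I request this link with the HTTP GET method (to retrieve the full response page) and then search for class="price">, which is immediately followed by the current price of the stock. That price is sliced out of the Web page and displayed to the user.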
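The pattern-matching step can be isolated and tried on a canned page. The following Python 3 sketch assumes the same marker, class="price">, but uses a regular expression instead of a fixed 14-character offset; the function name extract_price and the sample HTML are hypothetical. Unlike the Ruby listing, it returns None when the marker is missing rather than failing.

```python
import re

# The price is whatever text sits between class="price"> and the next tag.
PRICE_PATTERN = re.compile(r'class="price">\s*([^<]+?)\s*<')


def extract_price(page):
    """Return the price that follows class="price"> as a float, or None."""
    match = PRICE_PATTERN.search(page)
    if match is None:
        return None
    try:
        return float(match.group(1).replace(",", ""))
    except ValueError:
        return None


sample_page = '<td><span class="price">79.28</span></td>'
print(extract_price(sample_page))
print(extract_price("<p>no quote here</p>"))
```

Screen scraping of this kind is fragile by nature: any change to the page layout breaks the pattern, so handling the no-match case explicitly is worthwhile.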

To use the stock quote scraper, invoke the script with the stock symbol of interest, as shown in Listing 4.

Listing 4. Sample usage of the stock quote scraper

[mtj@camus]$ ./stockprice.rb ibm
ibm current stock price 79.28 (from stockmoney.com)
[mtj@camus]$ ./stockprice.rb intl
intl current stock price 21.69 (from stockmoney.com)
[mtj@camus]$ ./stockprice.rb nt
nt current stock price 2.07 (from stockmoney.com)
[mtj@camus]$

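Example 3: Communicating with the stock quote scraper

The stock quote scraper in Example 2 is neat, but it would be far more useful if it monitored the stock price continually and notified you when a stock of interest rose or fell. The wait is over. In Listing 5, I've updated the Web scraper to monitor a stock continuously and send an e-mail message when the price moves outside a defined band.

Listing 5. Stock scraper that can send e-mail alerts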

#!/usr/local/bin/ruby
require 'net/http'
require 'net/smtp'

#
# Given a web-site and link, return the stock price
#
def getStockQuote(host, link)

    # Create a new HTTP connection
    httpCon = Net::HTTP.new( host, 80 )

    # Perform a GET request
    resp = httpCon.get( link, nil )

    stroffset = resp.body =~ /class="price">/
    subset = resp.body.slice(stroffset+14, 10)
    limit = subset.index('<')

    return subset[0..limit-1].to_f

end

#
# Send a message (msg) to a user.
# Note: assumes the SMTP server is on the same host.
#
def sendStockAlert( user, msg )

    lmsg = [ "Subject: Stock Alert\n", "\n", msg ]
    Net::SMTP.start('localhost') do |smtp|
      smtp.sendmail( lmsg, "rubystockmonitor@localhost.localdomain", [user] )
    end

end

#
# Our main program, checks the stock within the price band every two
# minutes, emails and exits if the stock price strays from the band.
#
# Usage: ./monitor_sp.rb <symbol> <high> <low> <email_address>
#
begin

  host = "www.smartmoney.com"
  link = "/eqsnaps/index.cfm?story=snapshot&symbol="+ARGV[0]
  user = ARGV[3]

  high = ARGV[1].to_f
  low = ARGV[2].to_f

  while 1

    price = getStockQuote(host, link)

    print "current price ", price, "\n"

    if (price > high) || (price < low) then

      if (price > high) then
        msg = "Stock "+ARGV[0]+" has exceeded the price of "+high.to_s+
               "\n"+host+link+"\n"
      end

      if (price < low) then
        msg = "Stock "+ARGV[0]+" has fallen below the price of "+low.to_s+
               "\n"+host+link+"\n"
      end

      sendStockAlert( user, msg )

      exit

    end

    sleep 120

  end

end

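This Ruby script is somewhat lengthy, but it builds on the stock scraping script in Listing 3. A new function, getStockQuote, encapsulates the stock scraping. Another function, sendStockAlert, sends a message to an e-mail address (both are user-defined). The main program is simply a loop that gets the current stock price, checks whether it has strayed from the band, and if so sends an e-mail alert to the user. I added a delay between stock price checks because I didn't want to overload the server.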
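The control flow of the monitor (poll, compare against the band, alert once, stop) can be separated from the network and mail details. Here is a hedged Python 3 sketch of just that logic; the names band_alert and monitor are invented, and the quote source, mail sender, and delay are passed in as plain functions so the demo can use canned prices instead of a live site or an SMTP server.

```python
def band_alert(symbol, price, high, low):
    """Return an alert message if 'price' is outside [low, high], else None."""
    if price > high:
        return "Stock %s has exceeded the price of %s" % (symbol, high)
    if price < low:
        return "Stock %s has fallen below the price of %s" % (symbol, low)
    return None


def monitor(symbol, high, low, get_quote, send_alert, wait):
    """Poll until the price leaves the band, then send one alert and stop."""
    while True:
        price = get_quote(symbol)
        message = band_alert(symbol, price, high, low)
        if message is not None:
            send_alert(message)
            return price
        wait()  # a real monitor would sleep here to spare the server


# Demo with canned prices instead of a live quote source or SMTP server.
prices = iter([82.06, 82.32, 82.75, 83.36])
sent = []
last_price = monitor(
    "ibm", 83.0, 75.0,
    get_quote=lambda symbol: next(prices),
    send_alert=sent.append,
    wait=lambda: None,
)
print(last_price, sent)
```

In a real deployment, wait could be something like a function that calls time.sleep(120), matching the two-minute delay used in the Ruby script.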

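Listing 6 shows a sample run of the stock monitor. The stock is checked and printed every two minutes. When the price exceeds the high limit, the e-mail alert is sent and the script exits.

Listing 6. Stock monitor script demonstration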

[mtj@camus]$ ./monitor_sp.rb ibm 83.00 75.00 mtj@mtjones.com
current price 82.06
current price 82.32
current price 82.75
current price 83.36

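The resulting e-mail is shown in Figure 1. It includes a link to the source of the scraped data.

Figure 1. E-mail alert sent by the Ruby script in Listing 5

Let's now leave scrapers and look at the structure of a Web spider.

Example 4: A Web site crawler

This final example describes a Web spider that crawls a Web site. For security, I avoid straying outside the site and instead simply explore a single Web page.

To crawl a Web site and follow the links it provides, you must parse the HTML pages. If you can successfully parse a Web page, you can identify links to other resources. Some specify local resources (files), while others represent non-local resources (such as links to other Web pages).

To crawl the Web, you start with a given Web page, identify all the links on that page, place them in a to-visit queue, and then repeat the process using the first item from the to-visit queue. This yields a breadth-first traversal (as opposed to immediately following the first link found, which would be depth-first behavior).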
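The to-visit queue described above can be sketched without any network access. In this Python 3 sketch, the dictionary LINKS is a made-up site map standing in for pages fetched and parsed on the fly, and crawl_breadth_first is a hypothetical name; the point is only the queue discipline that makes the traversal breadth-first.

```python
from collections import deque

# Hypothetical site map: each page lists the pages it links to.
LINKS = {
    "/": ["/a.html", "/b.html"],
    "/a.html": ["/c.html", "/"],
    "/b.html": ["/c.html", "/d.html"],
    "/c.html": [],
    "/d.html": ["/a.html"],
}


def crawl_breadth_first(start, links):
    """Visit pages level by level, never queueing the same page twice."""
    to_visit = deque([start])
    seen = {start}
    order = []
    while to_visit:
        page = to_visit.popleft()  # take the oldest queued page first
        order.append(page)
        for link in links.get(page, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return order


order = crawl_breadth_first("/", LINKS)
print(order)
```

Replacing popleft with pop (taking the newest queued page instead of the oldest) would turn the same loop into a depth-first style traversal.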

If you avoid non-local links and visit only local Web pages, you get a Web crawler for a single Web site (see Listing 7). In this case, I switch from Ruby to Python to take advantage of Python's useful HTMLParser class.

Listing 7. Simple Python Web site crawler (minispider.py)

#!/usr/local/bin/python

import httplib
import sys
import re
from HTMLParser import HTMLParser


class miniHTMLParser( HTMLParser ):

  viewedQueue = []
  instQueue = []

  def get_next_link( self ):
    if self.instQueue == []:
      return ''
    else:
      return self.instQueue.pop(0)

  def gethtmlfile( self, site, page ):
    try:
      httpconn = httplib.HTTPConnection(site)
      httpconn.request("GET", page)
      resp = httpconn.getresponse()
      resppage = resp.read()
    except:
      resppage = ""

    return resppage

  def handle_starttag( self, tag, attrs ):
    if tag == 'a':
      newstr = str(attrs[0][1])
      if re.search('http', newstr) == None:
        if re.search('mailto', newstr) == None:
          if re.search('htm', newstr) != None:
            if (newstr in self.viewedQueue) == False:
              print "  adding", newstr
              self.instQueue.append( newstr )
              self.viewedQueue.append( newstr )
          else:
            print "  ignoring", newstr
        else:
          print "  ignoring", newstr
      else:
        print "  ignoring", newstr


def main():

  if sys.argv[1] == '':
    print "usage is ./minispider.py site link"
    sys.exit(2)

  mySpider = miniHTMLParser()

  link = sys.argv[2]

  while link != '':

    print "\nChecking link ", link

    # Get the file from the site and link
    retfile = mySpider.gethtmlfile( sys.argv[1], link )

    # Feed the file into the HTML parser
    mySpider.feed(retfile)

    # Search the retfile here

    # Get the next link in level traversal order
    link = mySpider.get_next_link()

  mySpider.close()

  print "\ndone\n"

if __name__ == "__main__":
  main()

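The basic design of this crawler is to load the first link to check onto a queue. This queue acts as the next-to-interrogate queue. As a link is checked, any new links that are found are loaded onto the same queue.

First, I derive a new class, called miniHTMLParser, from Python's HTMLParser class. This class does a few things. First, it is my HTML parser, with a callback method (handle_starttag) that is invoked whenever a start HTML tag is encountered. I also use the class to access the links found in the crawl (get_next_link) and to retrieve the file a link refers to (in this case, an HTML file).

Two variables are defined in this class (at class level in Listing 7, so they are shared by all instances): viewedQueue holds the links investigated so far, and instQueue holds the links that are yet to be interrogated.

As you can see, the class methods are simple. The get_next_link method checks whether instQueue is empty and, if so, returns an empty string. Otherwise, the next item is returned through the pop method. The gethtmlfile method uses HTTPConnection to connect to a site and return the contents of the given page. Finally, handle_starttag is called for every start tag in a Web page (as fed to the HTML parser through the feed method). In this function, I check whether the link is a non-local link (if it contains http), whether it is an e-mail address (mailto), and whether the link contains 'htm', indicating that it is a Web page. I also check that I have not visited it before; if I have not, the link is loaded onto both the interrogate queue and the viewed queue.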
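The link-filtering step can be sketched in Python 3 as follows. Note that this is a deliberately different design from Listing 7, not a port of it: it looks up the href attribute by name (Listing 7 takes the first attribute, whatever it is), and it resolves each link against a base URL with urllib.parse so that absolute links to the same host count as local, whereas Listing 7 ignores anything containing "http". The class name LocalLinkParser and the example.org URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LocalLinkParser(HTMLParser):
    """Collect links that stay on the same host as 'base_url'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.host = urlparse(base_url).netloc
        self.local_links = []
        self.ignored = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")  # look up by name, not by position
        if not href:
            return
        target = urlparse(urljoin(self.base_url, href))
        if target.scheme in ("http", "https") and target.netloc == self.host:
            # Drop any #fragment so the same page is not queued twice.
            link = target._replace(fragment="").geturl()
            if link not in self.local_links:
                self.local_links.append(link)
        else:
            self.ignored.append(href)


parser = LocalLinkParser("http://www.example.org/index.html")
parser.feed(
    '<a class="nav" href="news/today.html">News</a>'
    '<a href="http://www.example.org/about.html#team">About</a>'
    '<a href="mailto:webmaster@example.org">Mail</a>'
    '<a href="http://other.example.com/">Elsewhere</a>'
    '<a name="top">Anchor only</a>'
)
parser.close()
print(parser.local_links)
print(parser.ignored)
```

Comparing host names is stricter than substring tests such as searching for 'http' or 'htm', at the cost of needing a base URL for every page that is parsed.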

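The main method is simple. I create a new miniHTMLParser instance and start with the user-defined site (argv[1]) and link (argv[2]). I fetch the contents of the link, feed it to the HTML parser, and get the next link to visit, if one exists. The loop continues as long as there are links to visit.

To invoke the Web spider, provide a Web site address and a link:

./minispider.py www.fsf.org /

In this case, I am requesting the root file from the Free Software Foundation. The result of this command is shown in Listing 8. You can see the new links added to the interrogate queue and those that are ignored, such as non-local links. At the bottom of the listing, you can see the interrogation of the links found in the root.

Listing 8. Output of the minispider script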

[mtj@camus]$ ./minispider.py www.fsf.org /

Checking link  /
  ignoring hiddenStructure
  ignoring http://www.fsf.org
  ignoring http://www.fsf.org
  ignoring http://www.fsf.org/news
  ignoring http://www.fsf.org/events
  ignoring http://www.fsf.org/campaigns
  ignoring http://www.fsf.org/resources
  ignoring http://www.fsf.org/donate
  ignoring http://www.fsf.org/associate
  ignoring http://www.fsf.org/licensing
  ignoring http://www.fsf.org/blogs
  ignoring http://www.fsf.org/about
  ignoring https://www.fsf.org/login_form
  ignoring http://www.fsf.org/join_form
  ignoring http://www.fsf.org/news/fs-award-2005.html
  ignoring http://www.fsf.org/news/fsfsysadmin.html
  ignoring http://www.fsf.org/news/digital-communities.html
  ignoring http://www.fsf.org/news/patents-defeated.html
  ignoring /news/RSS
  ignoring http://www.fsf.org/news
  ignoring http://www.fsf.org/blogs/rms/entry-20050802.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050712.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050601.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050526.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050513.html
  ignoring http://www.fsf.org/index_html/SimpleBlogFullSearch
  ignoring documentContent
  ignoring http://www.fsf.org/index_html/sendto_form
  ignoring javascript:this.print();
  adding licensing/essays/free-sw.html
  ignoring /licensing/essays
  ignoring http://www.gnu.org/philosophy
  ignoring http://www.freesoftwaremagazine.com
  ignoring donate
  ignoring join_form
  adding associate/index_html
  ignoring http://order.fsf.org
  adding donate/patron/index_html
  adding campaigns/priority.html
  ignoring http://r300.sf.net/
  ignoring http://developer.classpath.org/mediation/OpenOffice2GCJ4
  ignoring http://gcc.gnu.org/java/index.html
  ignoring http://www.gnu.org/software/classpath/
  ignoring http://gplflash.sourceforge.net/
  ignoring campaigns
  adding campaigns/broadcast-flag.html
  ignoring http://www.gnu.org
  ignoring /fsf/licensing
  ignoring http://directory.fsf.org
  ignoring http://savannah.gnu.org
  ignoring mailto:webmaster@fsf.org
  ignoring http://www.fsf.org/Members/root
  ignoring http://www.plonesolutions.com
  ignoring http://www.enfoldtechnology.com
  ignoring http://blacktar.com
  ignoring http://plone.org
  ignoring http://www.section508.gov
  ignoring http://www.w3.org/WAI/WCAG1AA-Conformance
  ignoring http://validator.w3.org/check/referer
  ignoring http://jigsaw.w3.org/css-validator/check/referer
  ignoring http://plone.org/browsersupport

Checking link  licensing/essays/free-sw.html
  ignoring mailto:webmaster

Checking link  associate/index_html
  ignoring mailto:webmaster

Checking link  donate/patron/index_html
  ignoring mailto:webmaster

Checking link  campaigns/priority.html
  ignoring mailto:webmaster

Checking link  campaigns/broadcast-flag.html
  ignoring mailto:webmaster

done

[mtj@camus]$

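This example demonstrates the crawl phase of a Web spider. After a file has been read by the client, the content of the page can be examined.

Linux spidering tools

You've now seen how to implement two scrapers and a spider. There are also Linux tools that provide this functionality.

The wget command, which stands for Web get, is a useful command for recursively working through a Web site and grabbing content of interest. You can specify a Web site, the content you're interested in, and a few other administrative options. The command then pulls the files down to your local host. For example, the following command connects to the URL you define and recurses only three levels deep, retrieving any files with the extension mp3, mpg, mpeg, or avi.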

wget -A mp3,mpg,mpeg,avi -r -l 3 http://<some URL>

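The curl command operates in a similar fashion. It continues to be actively developed. Other similar commands include snarf, fget, and fetch.

Legal issues

There have been lawsuits over data mining on the Internet using Web spiders, and they have not gone well. Farechase, Inc. was recently sued by American Airlines over screen scraping. The suit initially alleged that collecting the data violated American Airlines' user agreement. When that claim failed, American Airlines argued trespass, and that claim succeeded. Other suits have concerned spiders and scrapers taking bandwidth away from legitimate users. All of these are valid claims, and they make establishing politeness policies all the more important. (See Resources.)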

Conclusion

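Crawling and scraping the Web can be both fun and profitable. But, as noted above, there are legal issues. When spidering or scraping, obey the robots.txt file that the server makes available, and include it in your politeness policies. Newer protocols such as SOAP make spidering much easier and have less effect on normal Web operations. As efforts like the Semantic Web simplify spidering even further, spidering solutions and methods should continue to grow.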
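
Honoring robots.txt can be sketched with Python's standard urllib.robotparser module. This is a minimal sketch, not part of the original spider: the robots.txt text, the example.org URLs, the agent name minispider, and the helper make_checker are all assumptions chosen for illustration. A real spider would download the file from the target server instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would fetch it from the server.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_checker(robots_text, agent="minispider"):
    """Return a function that reports whether the given agent may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())

    def allowed(url):
        return parser.can_fetch(agent, url)

    return allowed

if __name__ == "__main__":
    allowed = make_checker(ROBOTS_TXT)
    for url in ("http://example.org/index.html",
                "http://example.org/private/data.html"):
        print(url, "->", "fetch" if allowed(url) else "skip")
```

Calling the checker before every request is one simple way to make the rule part of a politeness policy.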

Original article

Build a Web spider on Linux

Resources

Learn

Wikipedia: Web crawler
Email Spiders
The Web Robots Pages
"Scrapers, Robots and Spiders: The Battle Over Internet Data Mining" (Gesmer Updegrove LLP, 2006)
developerWorks Korea Linux zone: technical resources for Linux developers
developerWorks technical events and webcasts

Get products and technologies

Source Code for Web Robot Spiders at Searchtools.com
Order the SEK for Linux: the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®
Download IBM trial software (developerWorks Korea)

Discuss

Take part in developerWorks blogs and the developerWorks Korea community

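About the author

M. Tim Jones is an embedded software architect and the author of GNU/Linux Application Programming, AI Application Programming, and BSD Sockets Programming from a Multilanguage Perspective. His development experience ranges from kernels for geosynchronous spacecraft to embedded systems architecture and networking protocol development. He is currently a Consultant Engineer at Emulex Corp. in Longmont, Colorado.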

[Tip] Useful tips for the vi editor [repost]

2008. 2. 14. 17:35

Posted by: 예진맘

Handy Windows commands to know

2008. 2. 14. 12:35

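When you call your ISP about an Internet connection problem, the first thing they ask is that you open [Start]-[Run] and type ping. Ping is a simple way to check whether you can communicate with a particular network host. Let's take a closer look at this and other useful commands.

Windows uses a GUI (Graphical User Interface) to provide an environment in which anyone can easily use a computer through pictures. Its predecessor, however, was an operating system called MS-DOS, which, unlike today's graphical environment, took commands as text and returned the results as text. These commands still exist inside today's Windows, and this environment is called the console. Because commands are typed as text, they are direct and simple to use. The more of these commands you know, the more they help you in using your computer.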

PING
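Ping is a command used mainly to check whether a particular host on the network is reachable. It sends an echo request to the host, and if the host can communicate over the network it sends back a reply, so reachability is easy to confirm. Conversely, by sending an echo request to your ISP's domain name server (a system that converts text Internet addresses into IP addresses), which can be assumed to be reachable at all times, you can check whether your own computer has Internet access.

The usage is "ping [options] target-host". As a practical example, to test against home.ahnlab.com, AhnLab's Korean home page, enter the following command. If you know the IP address, you can enter it directly instead.

ping home.ahnlab.com
ping 211.233.80.22

By default, however, ping sends only four requests, so to keep checking continuously use the [-t] option as shown below. Press [Ctrl] + [C] to stop.

ping -t home.ahnlab.com
ping -t 211.233.80.22

[Figure 1] ping home.ahnlab.com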

SHUTDOWN
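Shutdown, as the name suggests, is the command for shutting down Windows. You can shut down the computer through [Start] -> [Turn Off Computer], but the shutdown command offers a wider range of functions. It can also abort a shutdown in progress, which is useful because it buys you time to do additional work when Windows suddenly starts shutting down because of an RPC attack. The main functions available through shutdown are as follows.

(1) Shutting down Windows after a set time
Enter [shutdown -s -t time]. The time is given in seconds, so to shut down after 10 minutes, enter [shutdown -s -t 600].

[Figure 2] shutdown -s -t 600 -c "Windows scheduled shutdown test"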
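
Because the -t value is in seconds, the conversion from minutes is easy to get wrong by hand. The following Python sketch only builds the argument list and executes nothing; the function name shutdown_command is a hypothetical helper, and the -s, -t, and -c options are the ones shown in the text and figure caption above.

```python
def shutdown_command(minutes, comment=None):
    """Build the argument list for a delayed Windows shutdown.

    Nothing is executed here; the caller decides whether to run it.
    """
    if minutes < 0:
        raise ValueError("minutes must not be negative")
    seconds = int(minutes * 60)  # shutdown -t expects seconds
    args = ["shutdown", "-s", "-t", str(seconds)]
    if comment:
        args += ["-c", comment]  # optional message shown to the user
    return args

if __name__ == "__main__":
    print(" ".join(shutdown_command(10)))
```

Passing the resulting list to subprocess.run on a Windows machine would schedule the shutdown, so try it only where an unexpected shutdown is acceptable.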

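(2) Aborting a Windows shutdown
You may have scheduled a shutdown as in (1) and then need to cancel it. Also, if Windows security patches have not been applied, an RPC attack can trigger a shutdown message like the one above. In either case, enter [shutdown -a] to abort the shutdown.

(3) Running in a GUI (Graphical User Interface)
If typing the commands in (1) and (2) is inconvenient, enter [shutdown -i] to schedule the shutdown through a familiar GUI. In this case, you must enter the reason for the shutdown in the [Comment] field before the [OK] button becomes active. This is a good alternative for users who find typing commands unfamiliar or difficult.

[Figure 3] shutdown -i

Shutdown can also shut down a remote computer, among other things, but these functions are rarely used in ordinary environments.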

IPCONFIG
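Ipconfig is a command for viewing and renewing TCP/IP network settings. Because it reconfigures the network, you can renew the network settings without rebooting Windows. Since you need to see the network settings for yourself, it is best to run it from the command prompt: go to [Start] -> [Run], type CMD, and click [OK]. The black text window that opens is the command prompt. The frequently used commands are as follows.

(1) Checking the IP address configuration
You can view the IP addresses assigned to the network devices installed in your computer with the command [ipconfig /all].
The output lists, in detail, the network card description, the physical address (the card's MAC address), whether DHCP (Dynamic Host Configuration Protocol) is enabled, the assigned IP address, the subnet mask, the default gateway, and the DNS servers.

[Figure 4] ipconfig /all

(2) Releasing and renewing the network connection
Network settings occasionally end up misconfigured for various reasons. Most people then shut Windows down and reboot, but with ipconfig you can simply release or renew the network configuration alone.
To release, enter [ipconfig /release], which drops the current network connection. Networking is unavailable at this point, so the computer cannot communicate with the outside. If you are running an unpatched Windows and suspect an outside attack, releasing the network this way and then applying the security patches lets you work more safely.
To start using the network again, enter [ipconfig /renew]. When the renewal completes, you see the same screen as when you enter [ipconfig /all].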

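Registry Editor
The registry is where Windows stores its important system settings and application settings. The Registry Editor is the program that lets you view and modify the registry directly. Windows ships with two registry editors: [regedit], which works in a graphical environment, and [reg], which carries out text commands.

To run the graphical Registry Editor, go to [Start] -> [Run], type [regedit], and click [OK].

[Figure 5] regedit

When the Registry Editor starts, you see the following screen.

[Figure 6] Registry Editor in the GUI environment

Be careful, however: changes made in the Registry Editor take effect immediately, with no separate save step.

The text-command registry editor is run by typing reg at the command prompt. It is somewhat demanding because every command must be typed by hand, but it is a great help when the GUI cannot be started. Enter [reg /?] to see the available commands.

[Figure 7] Command-line registry editor (output of reg /?)

Note that the registry holds important Windows system information and the settings of application software. If this data is edited incorrectly, Windows may no longer work properly, so the registry should be edited only by users with expert knowledge of Windows, and only when a specific change is required.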

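Windows Task Manager
Windows runs many programs at the same time. If one of them monopolizes the CPU, all of Windows slows down. There are also times when you want to see how much of the computer's resources are in use. The program for these situations is Windows Task Manager. It is usually started by pressing [CTRL] + [ALT] + [DEL] together, but there is an easier way: go to [Start] -> [Run], type [taskmgr], and click [OK].

[Figure 8] taskmgr (Windows Task Manager)

Windows Task Manager has five tabs: [Applications], [Processes], [Performance], [Networking], and [Users]. Each shows the following.
(1) Applications lists the programs that are running visibly on the Windows desktop. An entry whose [Status] is "Not Responding" is a program that is not currently running normally. Such an entry is closed automatically after a while, but it gets in the way of your work until then, so select it with the mouse, right-click, and choose [End Task] to close it immediately.
(2) Processes shows the programs that are currently running. Unlike Applications, it includes programs that run without being visible to the user, so you will see quite a few entries. Here you can see which user started each process and how much CPU and memory it is using. If the computer has slowed down for no obvious reason, check this tab; if a particular program is using an excessive amount of CPU, ending it can solve the problem. Take care, however: ending an important Windows process such as explorer.exe prevents you from using Windows normally.
(3) Performance shows at a glance how much CPU and memory are currently in use.
(4) Networking shows how heavily your computer's network connection is being used.
(5) Users shows the users currently logged on to Windows. If a user left over from user switching or the like is consuming Windows resources unnecessarily, you can [Log Off] that user. If a user who is not permitted to connect is logged on, you can block the connection with [Disconnect].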

Telnet
A telnet client, once widely used but seldom used today, is built into Windows. It comes in handy when you need to connect to a UNIX or Linux server in a hurry.
Enter telnet [host] [port] to connect.
For example, to connect to the server telnet://server.home.com, enter the following.

Connection example) telnet server.home.com
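
Before starting a telnet session, it can help to check whether the host accepts TCP connections on the port at all. The following Python sketch does that with the standard socket module; the function name port_is_open is a hypothetical helper and is not part of the telnet client described above.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

# Example call (server.home.com is the placeholder host from the text;
# 23 is the standard telnet port):
#     port_is_open("server.home.com", 23)
```

A True result only means that something is listening on the port; it does not confirm that the service is telnet or that a login will succeed.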

FTP(File Transfer Protocol)
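FTP, too, is used less now that the Web is widespread, but it is still used from time to time because, for users who have a server, it is the most effective way to send and receive files. It is used the same way as telnet.
For example, to connect to ftp://fileserver.home.com, enter the following.

Connection example) ftp fileserver.home.com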

NETSTAT
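Netstat is a command that shows statistics for network protocols and the current state of TCP/IP network connections. It has many functions; the frequently used ones are as follows.

(1) Viewing connections and listening ports
Shows the connections that are established to your computer or waiting for a connection.
Enter [netstat -a] to see the list.
(2) Viewing the programs that are using the network
Shows the programs on your computer that are using the network. It is mainly used to look for spyware waiting for an attacker's commands.
Enter [netstat -b] to see the list.
(3) Viewing network statistics
Shows statistics on the data sent and received since the network card became available.
Enter [netstat -e] to see the statistics.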

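Checking information about your computer
Most users do not know their computer's hardware configuration or other important settings. The [systeminfo] command shows them all in one place.
Go to [Start] -> [Run], type [cmd], and enter [systeminfo] at the command prompt to see a wide range of information about your computer. There are so many items, however, that much of the output scrolls past before you can read it. To page through it one screen at a time, enter the following.

[Figure 9] systeminfo | more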
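
A small part of this information can also be read from a script. The following Python sketch uses the standard platform module; it is not a replacement for systeminfo, covers far fewer items, and the function name basic_system_info is a hypothetical helper.

```python
import platform

def basic_system_info():
    """Collect a few basic facts about the machine as a dictionary of strings."""
    return {
        "os": platform.system(),            # e.g. "Windows" or "Linux"
        "release": platform.release(),
        "version": platform.version(),
        "machine": platform.machine(),      # CPU architecture name
        "hostname": platform.node(),
        "python": platform.python_version(),
    }

if __name__ == "__main__":
    for key, value in basic_system_info().items():
        print(f"{key:<10}{value}")
```

Unlike systeminfo, this runs on any operating system that has Python installed.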

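Other commands for running important Windows policy tools
Windows has many built-in tools for viewing and changing important policies, but they are tucked away and not easy to find. The following commands open them directly: go to [Start] -> [Run], type the command, and click [OK]. Because an incorrect policy change can make Windows unusable, it is best not to modify items you do not fully understand.

COMPMGMT.MSC : Computer Management. Lets you change the computer's detailed settings directly.
DEVMGMT.MSC : Device Manager. Lets you manage the devices connected to the computer.
DFRG.MSC : Disk Defragmenter. Joins files that are stored in fragments on the disk into contiguous pieces, which helps improve disk read performance and lifespan.
EVENTVWR.MSC : Event Viewer. Shows the various Windows event logs, which can provide clues when a problem occurs in Windows.
FSMGMT.MSC : Shared Folders. Shows shared folders and files at a glance, so you can spot unnecessary shares and remove them to improve security.
GPEDIT.MSC : Local Computer Policy. Lets you change the various Windows policies, including security policies.
LUSRMGR.MSC : Local Users and Groups. Lets you add, delete, and manage Windows users, set up groups, and assign security permissions.
PERFMON.MSC : Performance. Shows the results of monitoring system performance.
RSOP.MSC : Resultant Set of Policy. Shows the policies that have been or will be applied to the user logged on to Windows.
SECPOL.MSC : Local Security Policy. Lets you view and change the computer's current security policy.
SERVICES.MSC : Services. Shows the services installed in Windows and their status, and lets you change them. Stopping unneeded services saves resources.

Source: AhnLab (안철수연구소), 2008-02-12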

Linux equivalents of Windows programs 4/4

2008. 2. 13. 21:58

Posted by: 화산폭발 Date: 06-11-11 15:20 Views: 2404

9) Scientific and special programs
10) Emulators
11) Other

9) Scientific and special programs.
Task	Windows	Linux
Useful links:	-	Scientific Applications on Linux - many links to both OSS and proprietary applications.
Math system in MathCad style	Mathcad	Gap.
Math system in Matlab style	Matlab	1) Matlab for Linux. [FTP] 2) Octave. (+ Gnuplot) 3) Scilab. 4) R. 5) Yorick. 6) rlab. 7) Yacas. 8) Euler.
Math system in Mathematica style	Mathematica	1) Mathematica for Linux. [Prop] 2) Maxima. 3) MuPad. 4) NumExp.
Math system in Maple style	Maple	1) Maple for Linux. [Prop] 2) Maxima. 3) MuPad.
Equation / math editor	Mathtype, MS Equation Editor	1) OpenOffice Math. 2) MathMLed. 3) Kformula (Koffice). 4) LyX. 5) Texmacs.
Programs for three-dimensional modeling	SolidWorks, ...	ProEngineer Linux. [Prop]
Programs for three-dimensional modeling	CATIA for Windows	CATIA. It was designed under Unix, and from version 4 (2000) it was ported under Windows (not too successfully).
Programs for three-dimensional modeling	SolidEdge for Windows	SolidEdge (part of more powerful package Unigraphics).
Engineering	ANSYS for Windows	ANSYS.
CAD/CAM/CAE	Autocad, Microstation	1) Varkon. 2) Linuxcad. [Prop, ~100$] 3) Varicad. [Prop] 4) Cycas. [Prop] 5) Tomcad. 6) Thancad. 7) Fandango. 8) Lignumcad. 9) Giram. 10) Jcad. 11) QSCad. 12) FreeEngineer. 13) Ocadis. 14) PythonCAD.
CAD/CAM/CAE, simplified	ArchiCAD	Qcad.
Desktop Publishing Systems	Adobe PageMaker, QuarkXPress	Adobe Framemaker. [Proprietary, cancelled]
Small desktop publishing systems	MS Publisher	1) Scribus - Desktop Publishing for Linux. 2) KWord.
Diagram and chart designer	Microsoft Visio	1) Kivio (Koffice). 2) Dia. 3) KChart. 4) xfig. 5) Tgif + dotty. 6) Tulip.
Geographic image processing software	Erdas Imagine, ER Mapper, ENVI	ENVI for Linux.
GIS (Geographical information system)	ArcView	1) Grass. 2) Quantum GIS. 3) PostGIS.
Vectorization of bitmaps	MapEdit, Easy Trace	1) Autotrace.
Software CNC, controlling machine tools	OpenCNC [Prop]	EMC.
Advanced text processing system in TeX style	MikTex, emTeX (DOS)	1) TeX. 2) TeTeX / LaTeX 3) LyX (WYSIWYM). 4) Kile.
Convenient, functional and user-friendly TeX-files / dvi-files editor.	WinEdt	1) Kile (KDE Integrated LaTeX Environment). 2) Ktexmaker2. 3) Tk LaTeX Editor.
Statistical Computing Language and Environment	S-PLUS	R.
Statistical analysis	SPSS, Statistica	Many links - here. 1) PSPP. 2) OpenStat2. 3) "Probability and Statistics Utilities for Linux users"
Econometrics Software	Eviews, Gretl	1) Gretl.
Emulation of the circuit	Electronic Workbench	1) Geda. 2) Oregano. 3) Xcircuit. 4) Gnome Assisted Electronics. 5) SPICE. 6) SPICE OPUS. 7) NG-SPICE.
Program to draw chemical structures	Chemdraw, Isisdraw	Xdrawchem.
Downloader and player for Olympus dictaphone	Olympus DSS Player	???
Market analysis	MetaStock	???
Electronics scheme design	PCAD	1) Eagle. 2) Geda.
The oscilloscope emulation	Winoscillo	Xoscope.
Measurement of the temperature and voltages on motherboard	MBMonitor, PCAlert	KHealthCare (KDE).
S.M.A.R.T-attributes and temperature of the hard disk	Come on CD with mainboard, Active SMART	1) smartctl. 2) Hddtemp-0.3. 3) IDEload-0.2. 4) Smartsuite-2.1. 5) Smartmontools. 6) Ide-smart. 7) Smartsuite.
Memory testing	SiSoft SANDRA	Memtest86.
Program for watching temperatures, fanspeeds, etc	SiSoft SANDRA, SiSoft SAMANTHA	1) Ksensors. 2) Lm_sensors. 3) xsensors. 4) wmsensormon and other applets for AfterStep / WindowMaker / FluxBox.
HDD testing / benchmarking	SiSoft SANDRA, SiSoft SAMANTHA	1) hdparm. 2) Bonnie++. 3) IOzone. 4) Dbench. 5) Bonnie. 6) IO Bench. 7) Nhfsstone. 8) SPEC SFS. [Prop]
Video testing / benchmarking	Final Reality	1) X11perf. 2) Viewperf.
Realtime Control	SHA Sybera Hardware Access	DIAPM RTAI - Realtime Application Interface.
Simulator of nets	???	1) NS.
Neural network simulation	???	1) Xnbc. 2) Stuttgart Neural Network Simulator (SNNS).
"Sensor for LCD"	???	1) Sensors-lcd.
Electrocardiogram viewer	???	1) ecg2png.
A software technology, that turns x86 computer into a full-function PLC-like process controller	SoftPLC	1) MatPLC.
Catalog of the software for translators	-	Linux for translators.
Translation memory	1) Trados Translators Workbench 2) Deja Vu 3) Star Transit 4) SDLX 5) OmegaT	1) OmegaT.
Catalog of educational software	-	1) SchoolForge. 2) Seul / EDU.
Designing and viewing DTDs	NearFar Designer [Prop]	???
10) Emulators.
Virtual machine emulator	VMWare for Windows	1) VMWare for Linux. [Prop] 2) Win4Lin. [Proprietary, $89]. 3) Bochs. 4) Plex86. 5) User Mode Linux.
Linux emulator	1) CygWin. 2) MKS Toolkit. 3) Bash for Windows. 4) Minimalist GNU For Windows.	1) User Mode Linux.
X Window System (XFree) emulator	XFree under CygWin.	-
Windows emulator	-	1) Wine. (GUI: gwine, tkwine) 2) Transgaming WineX. (GUI: tqgui) [Prop] 3) Crossover Office.
Sony PlayStation emulator	ePSXe, ...	1) ePSXe. 2) Pcsx.
ZX Spectrum emulator	X128, Speccyal, SpecX, SpecEmu, UnrealSpeccy, ...	1) Xzx. 2) Glukalka. 3) Fuse. 4) ZXSP-X. 5) FBZX. 6) SpectEmu.
Arcade machines emulator	???	1) Xmame / Xmess. 2) Advancemame. Frontends: advancemenu. ckmame. flynn. gmame. gnomame. grok. grustibus. gxmame. it. it's quit. fancy. kmamerun. kmamu. qmamecat. startxmame. setcleaner. tkmame.
ST emulator	???	1) StonX.
C64 emulator	???	1) Vice.
Amiga emulator	???	1) UAE.
Mac 68k emulator	???	1) Basilisk II.
Game boy emulator	???	1) Vboy. 2) VGBA. (GUI: vgb-gui)
Atari 2600 Video Computer System emulator	1) Stella	1) Stella.
NES emulator	1) Zsnes. 2) Snes9x.	1) Zsnes. 2) Snes9x. 3) FWNes. 4) GTuxNes.
M680x0 Arcade emulator	1) Rainemu.	1) Rainemu.
Multi / other emulators	???	1) M.E.S.S. 2) Zinc.
11) Other / Humour :)
Space simulator	1) Openuniverse. 2) Celestia. 3) Zetadeck.	1) Openuniverse. 2) Celestia. 3) Kstars. 4) Zetadeck.
TV driver	-	RivaTV.
System, running from CD without installing (Live CD)	Impossible	1) Knoppix. 2) Cool Linux. 3) Blin. 4) DemoLinux. 5) DyneBolic. 6) Gentoo (live CD). 7) Lonix. 8) Virtual Linux. 9) Bootable Business Card (LNX-BBC). 10) ByzantineOS. 11) FreeLoader Linux. 12) MoviX. 13) Freeduc CD. 14) SuSE live-eval CD. 15) Freedom Linux. 16) Eagle Linux.
Boot rescue/tools diskette	Windows system diskette	1) Linux system diskette. 2) Tomsrtbt. 3) BanShee Linux.
Local file systems mount	ext2fs (driver), explore2fs (program) - ext2/3 under Windows	Linux-NTFS. (driver for NTFS partitions mounting)
Installing software and uninstalling	InstallShield, WISE, GhostInstaller, Microsoft Installer - the analog of rpm	1) Rpm. 2) Urpmi. 3) GnoRpm. 4) Nautilus RPM. 5) Apt-get & frontends (synaptic, aptitude, ...). 6) Apt-rpm. (for RedHat, SuSE, ALT Linux, etc) 7) yum (Yellowdog Updater Modified) 8) yum enhanced by ASPLinux.
Installing software from source and uninstalling	Minimalist GNU For Windows	1) make install, make uninstall 2) CheckInstall. 3) Sinstall. 4) Emerge (Gentoo). 5) Apt-get & frontends (synaptic, aptitude, ...).
System update	Windows Update	1) Ximian Red Carpet. 2) Red Hat Network. 3) MandrakeOnline. 4) SuSE YaST Online Update. 5) Caldera Volution Online. 6) Apt. 7) Gentoo ebuilds (portage). 8) Debian GNU/Linux package search. 9) Yum.
Certification	MCSD, MCT, MCSE	1) Red Hat Certification. 2) Sair Linux and GNU Certification. 3) Linux Professional Institute Certification (LPIC). 4) Linux+. 5) Prometric. 6) VUE.
Icons on desktop	Explorer	1) Desktop File Manager. 2) Idesk.
Work with screensavers	Desktop properties	1) xset. 2) xlockmore. 3) xscreensaver. 4) kscreensaver.
Place for keeping "removed" files	Trash	1) Trash Can. 2) Libtrash.
Checking the hard disk	Scandisk	fsck -check or reiserfsck -check. Not needed with journaled file systems (reiserfs, ext3, jfs, xfs).
Defragmentation	defrag	Not needed.
GUI of the system	Windows Explorer	Kde, Gnome, IceWM, Windowmaker, Blackbox, Fluxbox, ...
Windows XP GUI	Windows XP	XPde.
Flavors of the system	9x, NT, XP	RedHat, Mandrake, Knoppix, Debian, SuSE, ALT, ASP, Gentoo, Slackware, Linux From Scratch, ...
Tactics	FUD (fear, uncertainty, doubt)	Open Source! "First they ignore you, then they laugh at you, then they fight you, then you win".
Source code of the kernel freely available	No	Of course :)
Command line	1) command.com :). 2) cmd.exe 3) Windows Scripting Host 4) 4DOS / 4NT 5) Minimalist GNU For Windows 6) Unix tools for Windows (AT&T)	1) Bash. 2) Csh. 3) Zsh. 4) Ash. 5) Tcsh.
Free of charge operating system	Microsoft Windows. (Imagine yourself that in Russia there are 95% of users having a pirate copy of Windows :).	Linux - the Free operating system!!
-	Nimda	Slapper.
-	Wincih, klez, etc	No analogs
Backdoors and hidden keys	Decide it yourself :).	-
Easter eggs, undocumented possibilities	Logo with Windows developers, Doom in Excel 95, 3D-racing in Excel 2000, etc, etc...	-
The magazines	Windows Magazine	1) Linux Journal. 2) Linux Gazette. 3) Linux magazine. 4) Linux pratico (Italy).
-	Blue Screen Of Death (BSOD)	1) Kernel panic. 2) Screensaver "bsod" :).
Whom it is necessary to curse for bugs and defects of the system	M$, Bill Gates personally	1) Developers of the distribution. 2) All the Linux people and Linus Torvalds personally :). 3) Yourself and your own /dev/hands :)).
-	M$.com	GNU.org, FSF.org
-	Windows.com	Linux.org
-	Bill Gates, "Road ahead"	Linus Torvalds, "Just for fun" :).
-	Bill Gates, "Business @ the speed of thought"	Richard M. Stallman, "The right to read".

Installing Python on Linux

2008. 2. 13. 21:49

Learning SQL

2008. 2. 13. 13:59

http://sql.1keydata.com/kr/

http://www.igotit.co.kr/zbxe/PAGE_SPECIAL
