Recover MySQL root password

2008.02.28

You can recover the MySQL database server root password with the following five easy steps.

Step # 1: Stop the MySQL server process.

Step # 2: Start the MySQL (mysqld) server/daemon process with the --skip-grant-tables option so that it will not prompt for a password

Step # 3: Connect to mysql server as the root user

Step # 4: Setup new root password

Step # 5: Exit and restart MySQL server

Here are the commands you need to type for each step (log in as the root user):

Step # 1 : Stop mysql service

# /etc/init.d/mysql stop
Output:

Stopping MySQL database server: mysqld.

Step # 2: Start the MySQL server without a password:

# mysqld_safe --skip-grant-tables &
Output:

[1] 5988
Starting mysqld daemon with databases from /var/lib/mysql
mysqld_safe[6025]: started

Step # 3: Connect to mysql server using mysql client:

# mysql -u root
Output:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1 to server version: 4.1.15-Debian_1-log
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql>

Step # 4: Setup new MySQL root user password

mysql> use mysql;
mysql> update user set password=PASSWORD("NEW-ROOT-PASSWORD") where User='root';
mysql> flush privileges;
mysql> quit

Step # 5: Stop MySQL Server:

# /etc/init.d/mysql stop
Output:

Stopping MySQL database server: mysqld
STOPPING server from pid file /var/run/mysqld/mysqld.pid
mysqld_safe[6186]: ended
[1]+  Done                    mysqld_safe --skip-grant-tables

Step # 6: Start MySQL server and test it

# /etc/init.d/mysql start
# mysql -u root -p


Beautiful Soup

2008. 2. 26. 13:17



You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser.

Beautiful Soup

"A tremendous boon." -- Python411 Podcast

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."

Valuable data that was once locked up in poorly-designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.

Download Beautiful Soup

The latest version is Beautiful Soup version 3.0.5, released December 12, 2007. You can download it as a single, self-contained file, or as a tarball with installer script and unit tests. Beautiful Soup is licensed under the same terms as Python itself, so you can drop it into almost any Python application (or into your library path) and start using it immediately.

Beautiful Soup works with Python versions 2.3 and up. It works best with Python versions 2.4 and up. If you don't have Python 2.4, you should install the cjkcodecs, iconvcodec, and chardet libraries. If you don't do this, Beautiful Soup will still work, but it won't be very good at parsing documents in Asian encodings.

Older versions are still available: the 1.x series works with Python 1.5, and the 2.x series has a fairly large installed base.

This document (source) is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Friday, December 21 2007, 18:57:10 Nowhere Standard Time and last built on Monday, February 25 2008, 23:00:01 Nowhere Standard Time.


Python and HTML Processing

2008. 2. 26. 13:05

Abstract

Various Web surfing tasks that I regularly perform could be made much easier, and less tedious, if I could only use Python to fetch the HTML pages and to process them, yielding the information I really need. In this document I attempt to describe HTML processing in Python using readily available tools and libraries.

NOTE: This document is not quite finished. I aim to include sections on using mxTidy to deal with broken HTML as well as some tips on cleaning up text retrieved from HTML resources.

Prerequisites

Depending on the methods you wish to follow in this tutorial, you need the following things:

For the "SGML parser" method, a recent release of Python is probably enough. You can find one at the Python download page.
For the "XML parser" method, a recent release of Python is required, along with a capable XML processing library. I recommend using libxml2dom, since it can handle badly-formed HTML documents as well as well-formed XML or XHTML documents. However, PyXML also provides support for such documents.
For fetching Web pages over secure connections, it is important that SSL support is enabled either when building Python from source, or in any packaged distribution of Python that you might acquire. Information about this is given in the source distribution of Python, but you can download replacement socket libraries with SSL support for older versions of Python for Windows from Robin Dunn's site.

Activities

Accessing sites, downloading content, and processing such content, either to extract useful information for archiving or to use such content to navigate further into the site, require combinations of the following activities. Some activities can be chosen according to preference: whether the SGML parser or the XML parser (or parsing framework) is used depends on which style of programming seems nicer to a given developer (although one parser may seem to work better in some situations). However, technical restrictions usually dictate whether certain libraries are to be used instead of others: when handling HTTP redirects, it appears that certain Python modules are easier to use, or even more suited to handling such situations.

Fetching Web Pages

Fetching standard Web pages over HTTP is very easy with Python:

import urllib
# Get a file-like object for the Python Web site's home page.
f = urllib.urlopen("http://www.python.org")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()

Supplying Data

Sometimes, it is necessary to pass information to the Web server, such as information which would come from an HTML form. Of course, you need to know which fields are available in a form, but assuming that you already know this, you can supply such data in the urlopen function call:

# Search the Vaults of Parnassus for "XMLForms".
# First, encode the data.
data = urllib.urlencode({"find" : "XMLForms", "findtype" : "t"})
# Now get that file-like object again, remembering to mention the data.
f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py", data)
# Read the results back.
s = f.read()
f.close()

The above example passed data to the server as an HTTP POST request. Fortunately, the Vaults of Parnassus is happy about such requests, but this is not always the case with Web services. We can instead choose to use a different kind of request, however:

# We have the encoded data. Now get the file-like object...
f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py?" + data)
# And the rest...

The only difference is the use of a ? (question mark) character and the appending of data to the end of the Vaults of Parnassus URL, but this constitutes an HTTP GET request, where the query (our additional data) is included in the URL itself.
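The POST/GET distinction can be checked without contacting any server. The following is a minimal Python 3 sketch (the listings above target Python 2, where these functions live in urllib itself); it only builds request objects and inspects them. The variable names are illustrative, and the URL and field names are simply reused from the example above.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://www.vex.net/parnassus/apyllo.py"

# Encode the form fields into a query string.
query = urlencode({"find": "XMLForms", "findtype": "t"})

# Supplying a data payload makes urllib issue a POST request.
post_request = Request(BASE_URL, data=query.encode("ascii"))

# Appending the query to the URL (and sending no payload) yields a GET request.
get_request = Request(BASE_URL + "?" + query)

print(post_request.get_method(), post_request.full_url)
print(get_request.get_method(), get_request.full_url)
```

Nothing is sent over the network here; passing either object to urllib.request.urlopen would perform the actual request.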

Fetching Secure Web Pages

Fetching secure Web pages using HTTPS is also very easy, provided that your Python installation supports SSL:

import urllib
# Get a file-like object for a site.
f = urllib.urlopen("https://www.somesecuresite.com")
# NOTE: At the interactive Python prompt, you may be prompted for a username
# NOTE: and password here.
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()

Including data which forms the basis of a query, as illustrated above, is also possible with URLs starting with https.

Handling Redirects

Many Web services use HTTP redirects for various straightforward or even bizarre purposes. For example, a fairly common technique employed on "high traffic" Web sites is the HTTP redirection load balancing strategy where the initial request to the publicised Web site (eg. http://www.somesite.com) is redirected to another server (eg. http://www1.somesite.com) where a user's session is handled.

Fortunately, urlopen handles redirects, at least in Python 2.1, and therefore any such redirection should be handled transparently by urlopen without your program needing to be aware that it is happening. It is possible to write code to deal with redirection yourself, and this can be done using the httplib module; however, the interfaces provided by that module are more complicated than those provided above, if somewhat more powerful.
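To give an idea of what dealing with redirection yourself involves, here is a rough Python 3 sketch using http.client (the Python 3 name of httplib). So that it runs without any external site, it starts a throwaway local server whose /old path redirects to /new; the handler class, the paths and the fetch_following_redirects helper are all invented for this illustration, and a real client would also need to cope with absolute Location URLs pointing at other hosts.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectingHandler(BaseHTTPRequestHandler):
    "A toy handler: /old redirects to /new, anything else returns a page."

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.end_headers()
        else:
            body = b"final page"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep the demonstration quiet.


def fetch_following_redirects(host, port, path, max_hops=5):
    "Request 'path', following same-host redirects by hand."
    hops = []
    for _ in range(max_hops):
        connection = http.client.HTTPConnection(host, port, timeout=5)
        connection.request("GET", path)
        response = connection.getresponse()
        body = response.read()
        connection.close()
        hops.append((response.status, path))
        if response.status in (301, 302, 303, 307, 308):
            # The server tells us where to look next.
            path = response.getheader("Location")
            continue
        return hops, body
    raise RuntimeError("too many redirects")


server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
hops, body = fetch_following_redirects("127.0.0.1", server.server_port, "/old")
server.shutdown()
server.server_close()
print(hops, body)
```

The extra bookkeeping (status codes, the Location header, a hop limit) is exactly what urlopen hides from the caller.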

Using the SGML Parser

Given a character string from a Web service, such as the value held by s in the above examples, how can one understand the content provided by the service in such a way that an "intelligent" response can be made? One method is by using an SGML parser, since HTML is a relation of SGML, and HTML is probably the content type most likely to be experienced when interacting with a Web service.

In the standard Python library, the sgmllib module contains an appropriate parser class called SGMLParser. Unfortunately, it is of limited use to us unless we customise its activities somehow. Fortunately, Python's object-oriented features, combined with the design of the SGMLParser class, provide a means of customising it fairly easily.

Defining a Parser Class

First of all, let us define a new class inheriting from SGMLParser with a convenience method that I find very convenient indeed:

import sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    # More to come...

What the parse method does is provide an easy way of passing some text (as a string) to the parser object. I find this nicer than having to remember calling the feed method, and since I always tend to have the entire document ready for parsing, I do not need to use feed many times - passing many pieces of text which comprise an entire document is an interesting feature of SGMLParser (and its derivatives) which could be used in other situations.

Deciding What to Remember

Of course, implementing our own customised parser is only of interest if we are looking to find things in a document. Therefore, we should aim to declare these things before we start parsing. We can do this in the __init__ method of our class:

    # Continuing from above...

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []

    # More to come...

Here, we initialise new objects by passing information to the __init__ method of the superclass (SGMLParser); this makes sure that the underlying parser is set up properly. We also initialise an attribute called hyperlinks which will be used to record the hyperlinks found in the document that any given object will parse.

Care should be taken when choosing attribute names: if we reuse a name that the superclass already defines, our attribute replaces the superclass's one, and the superclass may then manipulate our data for its own internal parsing purposes. We might hope that the SGMLParser class uses attribute names with leading double underscores (__), since this isolates such attributes from access by subclasses such as our own MyParser class.
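The effect of the double underscore can be seen in a small, self-contained Python 3 sketch. The Base and Child classes below are hypothetical stand-ins, not SGMLParser and MyParser; they only show that a plain attribute name is shared between superclass and subclass, whereas a name with two leading underscores is stored separately for each class.

```python
class Base:
    def __init__(self):
        self.count = 0             # Plain name: visible to, and replaceable by, subclasses.
        self.__private = "base"    # Mangled: actually stored as _Base__private.

    def base_view(self):
        "Return the attributes as the superclass sees them."
        return self.count, self.__private


class Child(Base):
    def __init__(self):
        super().__init__()
        self.count = "overwritten"   # Clobbers the superclass's attribute.
        self.__private = "child"     # Stored as _Child__private: no clash.


child = Child()
print(child.base_view())
print(sorted(vars(child)))
```

The superclass still sees its own private value, but its count attribute has been replaced by the subclass.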

Remembering Document Details

We now need to define a way of extracting data from the document, but SGMLParser provides a mechanism which notifies us when an interesting part of the document has been read. SGML and HTML are textual formats which are structured by the presence of so-called tags, and in HTML, hyperlinks may be represented in the following way:

<a href="http://www.python.org">The Python Web site</a>

How SGMLParser Operates

An SGMLParser object which is parsing a document recognises starting and ending tags for things such as hyperlinks, and it issues a method call on itself based on the name of the tag found and whether the tag is a starting or ending tag. So, as the above text is recognised by an SGMLParser object (or an object derived from SGMLParser, like MyParser), the following method calls are made internally:

self.start_a([("href", "http://www.python.org")])
self.handle_data("The Python Web site")
self.end_a()

Note that the text between the tags is considered as data, and that the ending tag does not provide any information. The starting tag, however, does provide information in the form of a sequence of attribute names and values, where each name/value pair is placed in a 2-tuple:

# The form of attributes supplied to start tag methods:
# (name, value)
# Examples:
# ("href", "http://www.python.org")
# ("target", "python")
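The sequence of calls can be observed directly. Since sgmllib was removed in Python 3, this sketch substitutes html.parser.HTMLParser from the Python 3 standard library, which reports events through the generic handle_starttag and handle_endtag methods rather than through per-tag start_a and end_a methods. The EventLogger class is a name made up for the illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    "Record every event reported by the parser, in order."

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # 'attrs' is a list of (name, value) 2-tuples.
        self.events.append(("start", tag, attrs))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed('<a href="http://www.python.org">The Python Web site</a>')
logger.close()

for event in logger.events:
    print(event)
```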

Why SGMLParser Works

Why does SGMLParser issue a method call on itself, effectively telling itself that a tag has been encountered? The basic SGMLParser class surely does not know what to do with such information. Well, if another class inherits from SGMLParser, then such calls are no longer confined to SGMLParser and instead act on methods in the subclass, such as MyParser, where such methods exist. Thus, a customised parser class (eg. MyParser) once instantiated (made into an object) acts like a stack of components, with the lowest level of the stack doing the hard parsing work and passing items of interest to the upper layers - it is a bit like a factory with components being made on the ground floor and inspection of those components taking place in the laboratories in the upper floors!

Class	Activity
...	Listens to reports, records other interesting things
`MyParser`	Listens to reports, records interesting things
`SGMLParser`	Parses documents, issuing reports at each step
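The layering can be imitated in a few lines. This Python 3 sketch is a toy and not sgmllib's actual implementation: the hypothetical TinyDispatcher class plays the part of the lowest layer, looking for a method named after the tag on whatever object it happens to be, and the LinkRecorder subclass supplies such a method.

```python
class TinyDispatcher:
    "The lowest layer: reports tags by calling a method on itself, if one exists."

    def report_start(self, tag, attributes):
        handler = getattr(self, "start_" + tag, None)
        if handler is None:
            return False    # Nobody in the stack is interested in this tag.
        handler(attributes)
        return True


class LinkRecorder(TinyDispatcher):
    "The upper layer: listens to reports about 'a' tags only."

    def __init__(self):
        self.hyperlinks = []

    def start_a(self, attributes):
        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)


recorder = LinkRecorder()
handled_a = recorder.report_start("a", [("href", "http://www.python.org")])
handled_p = recorder.report_start("p", [])
print(handled_a, handled_p, recorder.hyperlinks)
```

Because the lookup happens on self, the base class never needs to know which subclasses exist or which tags they care about.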

Introducing Our Customisations

Now, if we want to record the hyperlinks in the document, all we need to do is to define a method called start_a which extracts the hyperlink from the attributes which are provided in the starting a tag. This can be defined as follows:

    # Continuing from above...

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    # More to come...

All we need to do is traverse the attributes list, find appropriately named attributes, and record the value of those attributes.

Retrieving the Details

A nice way of providing access to the retrieved details is to define a method, although Python 2.2 provides additional features to make this more convenient. We shall use the old approach:

    # Continuing from above...

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks

Trying it Out

Now that we have defined our class, we can instantiate it, making a new MyParser object. After that, it is just a matter of giving it a document to work with:

import urllib, sgmllib

# Get something to work with.
f = urllib.urlopen("http://www.python.org")
s = f.read()

# Try and process the page.
# The class should have been defined first, remember.
myparser = MyParser()
myparser.parse(s)

# Get the hyperlinks.
print myparser.get_hyperlinks()

The print statement should cause a list to be displayed, containing various hyperlinks to locations on the Python home page and other sites.
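For readers on Python 3, where sgmllib and the print statement are no longer available, a roughly equivalent sketch can be written with html.parser.HTMLParser. This is a substitution rather than the approach described above: the single handle_starttag method takes the place of start_a, and a small literal page (invented for the example) stands in for the downloaded home page so that the code runs offline.

```python
from html.parser import HTMLParser


class MyParser(HTMLParser):
    "A simple parser class which records hyperlinks."

    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def handle_starttag(self, tag, attrs):
        "Record the 'href' attribute of each starting a tag."
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."
        return self.hyperlinks


page = """<html><body>
<a href="http://www.python.org">The Python Web site</a>
<a name="top">A target, not a link</a>
<a href="/about/">About</a>
</body></html>"""

myparser = MyParser()
myparser.parse(page)
links = myparser.get_hyperlinks()
print(links)
```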

The Example File

The above example code can be downloaded and executed to see the results.

Finding More Specific Content

Of course, if it is sufficient for you to extract information from a document without worrying about where in the document it came from, then the above level of complexity should suit you perfectly. However, one might want to extract information which only appears in certain places or constructs - a good example of this is the text between starting and ending tags of hyperlinks which we saw above. If we just acquired every piece of text using a handle_data method which recorded everything it saw, then we would not know which piece of text described a hyperlink and which piece of text appeared in any other place in a document.

    # An extension of the above class.
    # This is not very useful.

    def handle_data(self, data):
        "Handle the textual 'data'."

        self.descriptions.append(data)

Here, the descriptions attribute (which we would need to initialise in the __init__ method) would be filled with lots of meaningless textual data. So how can we be more specific? The best approach is to remember not only the content that SGMLParser discovers, but also to remember what kind of content we have seen already.

Remembering Our Position

Let us add some new attributes to the __init__ method.

        # At the end of the __init__ method...

        self.descriptions = []
        self.inside_a_element = 0

The descriptions attribute is defined as we anticipated, but the inside_a_element attribute is used for something different: it will indicate whether or not SGMLParser is currently investigating the contents of an a element - that is, whether SGMLParser is between the starting a tag and the ending a tag.

Let us now add some "logic" to the start_a method, redefining it as follows:

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1

Now, we should know when a starting a tag has been seen, but to avoid confusion, we should also change the value of the new attribute when the parser sees an ending a tag. We do this by defining a new method for this case:

    def end_a(self):
        "Record the end of a hyperlink."

        self.inside_a_element = 0

Fortunately, it is not permitted to "nest" hyperlinks, so it is not relevant to wonder what might happen if an ending tag were to be seen after more than one starting tag had been seen in succession.

Recording Relevant Data

Now, given that we can be sure of our position in a document and whether we should record the data that is being presented, we can define the "real" handle_data method as follows:

    def handle_data(self, data):
        "Handle the textual 'data'."

        if self.inside_a_element:
            self.descriptions.append(data)

This method is not perfect, as we shall see, but it does at least avoid recording every last piece of text in the document.

We can now define a method to retrieve the description data:

    def get_descriptions(self):
        "Return a list of descriptions."

        return self.descriptions

And we can add the following line to our test program in order to display the descriptions:

print myparser.get_descriptions()

The Example File

The example code with these modifications can be downloaded and executed to see the results.

Problems with Text

Upon running the modified example, one thing is apparent: there are a few descriptions which do not make sense. Moreover, the number of descriptions does not match the number of hyperlinks. The reason for this is the way that text is found and presented to us by the parser - we may be presented with more than one fragment of text for a particular region of text, so that more than one fragment of text may be signalled between a starting a tag and an ending a tag, even though it is logically one block of text.

We may modify our example by adding another attribute to indicate whether we are just beginning to process a region of text. If this new attribute is set, then we add a description to the list; if not, then we add any text found to the most recent description recorded.

The __init__ method is modified still further:

        # At the end of the __init__ method...

        self.starting_description = 0

Since we can only be sure that a description is being started immediately after a starting a tag has been seen, we redefine the start_a method as follows:

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1
                self.starting_description = 1

Now, the handle_data method needs redefining as follows:

    def handle_data(self, data):
        "Handle the textual 'data'."

        if self.inside_a_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

Clearly, the method becomes more complicated. We need to detect whether the description is being started and act in the manner discussed above.
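The same state-keeping logic can be tried out in Python 3 with html.parser.HTMLParser standing in for SGMLParser (an assumption made so that the sketch runs on current Python; the class name LinkDescriptionParser and the sample markup are invented). The sample hyperlink contains an em element, so the parser reports its text in three separate fragments, which the starting_description flag stitches back into one description.

```python
from html.parser import HTMLParser


class LinkDescriptionParser(HTMLParser):
    "Record hyperlinks together with the text found inside them."

    def __init__(self):
        super().__init__()
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = False
        self.starting_description = False

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = True
                self.starting_description = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.inside_a_element = False

    def handle_data(self, data):
        if not self.inside_a_element:
            return
        if self.starting_description:
            # First fragment after the starting tag: begin a new description.
            self.descriptions.append(data)
            self.starting_description = False
        else:
            # Later fragment: extend the most recent description.
            self.descriptions[-1] += data


sample = ('<p>Intro <a href="http://www.python.org">A <em>really</em> '
          'important page.</a> and <a name="top">a target</a> trailing text</p>')

parser = LinkDescriptionParser()
parser.feed(sample)
parser.close()
print(parser.hyperlinks)
print(parser.descriptions)
```

With the fragments merged, the hyperlinks and descriptions lists stay the same length for this sample, so the two can be paired up with zip.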

The Example File

The example code with these modifications can be downloaded and executed to see the results.

Conclusions

Although the final example file produces some reasonable results - there are still some strange descriptions, however, and we have not taken images used within hyperlinks into consideration - the modifications that were required illustrate that as more attention is paid to the structure of the document, the more effort is required to monitor the origins of information. As a result, we need to maintain state information within the MyParser object in a not-too-elegant way.

For application purposes, the SGMLParser class, its derivatives, and related approaches (such as SAX) are useful for casual access to information, but for certain kinds of querying, they can become more complicated to use than one would initially believe. However, these approaches can be used for another purpose: that of building structures which can be accessed in a more methodical fashion, as we shall see below.

Using XML Parsers

Given a character string s, containing an HTML document which may have been retrieved from a Web service (using an approach described in an earlier section of this document), let us now consider an alternative method of interpreting the contents of this document so that we do not have to manage the complexity of remembering explicitly the structure of the document that we have seen so far. One of the problems with SGMLParser was that access to information in a document happened "serially" - that is, information was presented to us in the order in which it was found - but it may have been more appropriate to access the document information according to the structure of the document, so that we could request all parts of the document corresponding to the hyperlink elements present in that document, before examining each document portion for the text within each hyperlink element.

In the XML world, a standard called the Document Object Model (DOM) has been devised to provide a means of access to document information which permits us to navigate the structure of a document, requesting different sections of that document, and giving us the ability to revisit such sections at any time; the use of Python with XML and the DOM is described in another document. If all Web pages were well-formed XML - that is, they all complied with the expectations and standards set out by the XML specifications - then any XML parser would be sufficient to process any HTML document found on the Web. Unfortunately, many Web pages use less formal variants of HTML which are rejected by XML parsers. Thus, we need to employ particular tools and additional techniques to convert such pages to DOM representations.

Below, we describe how Web pages may be processed using the PyXML toolkit and with the libxml2dom package to obtain a top-level document object. Since both approaches yield an object which is broadly compatible with the DOM standard, the subsequent description of how we then inspect such documents applies regardless of whichever toolkit or package we have chosen.

Using PyXML

It is possible to use Python's XML framework with the kind of HTML found on the Web by employing a special "reader" class which builds a DOM representation from an HTML document, and the consequences of this are described below.

Creating the Reader

An appropriate class for reading HTML documents is found deep in the xml package, and we shall instantiate this class for subsequent use:

from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()

Of course, there are many different ways of accessing the Reader class concerned, but I have chosen not to import Reader into the common namespace. One good reason for deciding this is that I may wish to import other Reader classes from other packages or modules, and we clearly need a way to distinguish between them. Therefore, I import the HtmlLib name and access the Reader class from within that module.

Loading a Document

Unlike SGMLParser, we do not need to customise any class before we load a document. Therefore, we can "postpone" any consideration of the contents of the document until after the document has been loaded, although it is very likely that you will have some idea of the nature of the contents in advance and will have written classes or functions to work on the DOM representation once it is available. After all, real programs extracting particular information from a certain kind of document do need to know something about the structure of the documents they process, whether that knowledge is put in a subclass of a parser (as in SGMLParser) or whether it is "encoded" in classes and functions which manipulate the DOM representation.

Anyway, let us load the document and obtain a Document object:

doc = reader.fromString(s)

Note that the "top level" of a DOM representation is always a Document node object, and this is what doc refers to immediately after the document is loaded.

Using libxml2dom

Obtaining documents using libxml2dom is slightly more straightforward:

import libxml2dom
doc = libxml2dom.parseString(s, html=1)

If the document text is well-formed XML, we could omit the html parameter or set it to have a false value. However, if we are not sure whether the text is well-formed, no significant issues will arise from setting the parameter in the above fashion.

Deciding What to Extract

Now, it is appropriate to decide which information is to be found and retrieved from the document, and this is where some tasks appear easier than with SGMLParser (and related frameworks). Let us consider the task of extracting all the hyperlinks from the document; we can certainly find all the hyperlink elements as follows:

a_elements = doc.getElementsByTagName("a")

Since hyperlink elements comprise the starting a tag, the ending a tag, and all data between them, the value of the a_elements variable should be a list of objects representing regions in the document which would appear like this:

<a href="http://www.python.org">The Python Web site</a>

Querying Elements

To make the elements easier to deal with, each object in the list is not the textual representation of the element as given above. Instead, an object is created for each element which provides a more convenient level of access to the details. We can therefore obtain a reference to such an object and find out more about the element it represents:

# Get the first element in the list. We don't need to use a separate variable,
# but it makes it clearer.
first = a_elements[0]
# Now display the value of the "href" attribute.
print first.getAttribute("href")

What is happening here is that the first object (being the first a element in the list of those found) is being asked to return the value of the attribute whose name is href, and if such an attribute exists, a string is returned containing the contents of the attribute: in the case of the above example, this would be...

http://www.python.org

If the href attribute had not existed, such as in the following example element, then a value of None would have been returned.

<a name="Example">This is not a hyperlink. It is a target.</a>
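The same queries can be tried with xml.dom.minidom from the Python 3 standard library. This is a stand-in chosen so the sketch needs no extra packages; it is neither PyXML nor libxml2dom, and it only accepts well-formed input, so the sample below is a small hand-written XHTML fragment. One difference worth knowing about: minidom's getAttribute returns an empty string rather than None for a missing attribute, so hasAttribute is the safer test there.

```python
from xml.dom import minidom

XHTML = (
    '<html><body>'
    '<p><a href="http://www.python.org">The Python Web site</a></p>'
    '<p><a name="Example">This is not a hyperlink. It is a target.</a></p>'
    '</body></html>'
)

doc = minidom.parseString(XHTML)
a_elements = doc.getElementsByTagName("a")

# Ask each element for its "href" attribute, and whether it has one at all.
hrefs = [element.getAttribute("href") for element in a_elements]
has_href = [element.hasAttribute("href") for element in a_elements]

print(hrefs)
print(has_href)
```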

Namespaces

Previously, this document recommended the usage of namespaces and the getAttributeNS method, rather than the getAttribute method. Whilst XML processing may involve extensive use of namespaces, some HTML parsers do not appear to expose them quite as one would expect: for example, not associating the XHTML namespace with XHTML elements in a document. Thus, it can be advisable to ignore namespaces unless their usage is unavoidable in order to distinguish between elements in mixed-content documents (XHTML combined with SVG, for example).

Finding More Specific Content

We are already being fairly specific, in a sense, in the way that we have chosen to access the a elements within the document, since we start from a particular point in the document's structure and search for elements from there. In the SGMLParser examples, we decided to look for descriptions of hyperlinks in the text which is enclosed between the starting and ending tags associated with hyperlinks, and we were largely successful with that, although there were some issues that could have been handled better. Here, we shall attempt to find everything that is descriptive within hyperlink elements.

Elements, Nodes and Child Nodes

Each hyperlink element is represented by an object whose attributes can be queried, as we did above in order to get the href attribute's value. However, elements can also be queried about their contents, and such contents take the form of objects which represent "nodes" within the document. (The nature of XML documents is described in another introductory document which discusses the DOM.) In this case, it is interesting for us to inspect the nodes which reside within (or under) each hyperlink element, and since these nodes are known generally as "child nodes", we access them through the childNodes attribute on each so-called Node object.

# Get the child nodes of the first "a" element.
nodes = first.childNodes

Node Types

Nodes are the basis of any particular piece of information found in an XML document, so any element found in a document is based on a node and can be explicitly identified as an element by checking its "node type":

print first.nodeType
# A number is returned which corresponds to one of the special values listed in
# the xml.dom.Node class. Since elements inherit from that class, we can access
# these values on 'first' itself!
print first.nodeType == first.ELEMENT_NODE
# If first is an element (it should be) then display the value 1.

One might wonder how this is useful, since the list of hyperlink elements, for example, is clearly a list of elements - that is, after all, what we asked for. However, if we ask an element for a list of "child nodes", we cannot immediately be sure which of these nodes are elements and which are, for example, pieces of textual data. Let us therefore examine the "child nodes" of first to see which of them are textual:

for node in first.childNodes:
    if node.nodeType == node.TEXT_NODE:
        print "Found a text node:", node.nodeValue
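The same checks can be tried without PyXML or libxml2dom. The following is a minimal sketch in Python 3 that assumes the standard library's xml.dom.minidom as the parser; the sample markup and the helper name text_children are made up for illustration and are not part of the original examples.

```python
from xml.dom import minidom

# Sample markup for illustration only (an assumption, not the tutorial's page).
doc = minidom.parseString(
    '<p>See <a href="http://www.example.com/">An <b>example</b> link</a>.</p>'
)
first = doc.getElementsByTagName("a")[0]


def text_children(element):
    """Return the values of the direct child nodes that are text nodes."""
    return [
        node.nodeValue
        for node in element.childNodes
        if node.nodeType == node.TEXT_NODE
    ]


# True when 'first' is an element node.
print(first.nodeType == first.ELEMENT_NODE)
# Only the direct text children are listed; the text inside <b> is not.
print(text_children(first))
```

Note that the text inside the b element does not appear in the result, since it is a child of that element rather than of the hyperlink itself.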

Navigating the Document Structure

If we wanted only to get the descriptive text within each hyperlink element, then we would need to visit all nodes within each element (the "child nodes") and record the value of the textual elements. However, this would not quite be enough - consider the following document region:

<a href="http://www.python.org">A <em>really</em> important page.</a>

Within the a element, there are text nodes and an em element - the text within that element is not directly available as a "child node" of the a element. If we did not consider textual child nodes of each child node, then we would miss important information. Consequently, it becomes essential to recursively descend inside the a element collecting child node values. This is not as hard as it sounds, however:

def collect_text(node):
    "A function which collects text inside 'node', returning that text."

    s = ""
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            s += child_node.nodeValue
        else:
            s += collect_text(child_node)
    return s

# Call 'collect_text' on 'first', displaying the text found.
print collect_text(first)

To contrast this with the SGMLParser approach, we see that much of the work done in that example to extract textual information is distributed throughout the MyParser class, whereas the above function, which looks quite complicated, gathers the necessary operations into a single place, thus making it look complicated.
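As a sketch of the same idea on a current interpreter, the function below is written for Python 3 and, as an assumption, uses xml.dom.minidom rather than the parsers discussed in this document. It is applied to the hyperlink region shown above.

```python
from xml.dom import minidom


def collect_text(node):
    """Collect the text inside 'node', descending into child elements."""
    parts = []
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            parts.append(child_node.nodeValue)
        else:
            # Not a text node: look for text further down the tree.
            parts.append(collect_text(child_node))
    return "".join(parts)


doc = minidom.parseString(
    '<a href="http://www.python.org">A <em>really</em> important page.</a>'
)
print(collect_text(doc.documentElement))  # A really important page.
```

Collecting the pieces in a list and joining them once is a small stylistic change from the string concatenation used above; the traversal itself is the same.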

Getting Document Regions as Text

Interestingly, it is easier to retrieve whole sections of the original document as text for each of the child nodes, thus collecting the complete contents of the a element as text. For this, we just need to make use of a function provided in the xml.dom.ext package:

from xml.dom.ext import PrettyPrint
from StringIO import StringIO
# PrettyPrint writes to a stream rather than returning a string, so collect
# its output in a StringIO object.
out = StringIO()
# In order to avoid getting the "a" starting and ending tags, prettyprint the
# child nodes.
for child_node in a_elements[0].childNodes:
    PrettyPrint(child_node, out)
# Display the region of the original document between the tags.
print out.getvalue()

Unfortunately, documents produced by libxml2dom do not work with PrettyPrint. However, we can use a method on each node object instead:

# In order to avoid getting the "a" starting and ending tags, prettyprint the
# child nodes.
s = ""
for child_node in a_elements[0].childNodes:
    s += child_node.toString(prettyprint=1)
# Display the region of the original document between the tags.
print s

It is envisaged that libxml2dom will eventually work better with such functions and tools.
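Where neither PrettyPrint nor libxml2dom is available, a comparable result can probably be had from the standard library. This sketch assumes Python 3 and xml.dom.minidom, whose nodes offer a toxml method; it is an alternative to, not a description of, the approaches above.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<a href="http://www.python.org">A <em>really</em> important page.</a>'
)
a_element = doc.documentElement

# Serialise each child node rather than the element itself, so that the
# "a" starting and ending tags are left out of the result.
region = "".join(child.toxml() for child in a_element.childNodes)
print(region)
```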


Build a Web spider on Linux (Korean)

2008. 2. 16. 22:01

Collecting Internet content with simple spiders and scrapers

Level: Intermediate

M. Tim Jones, Consultant Engineer, Emulex

17 April 2007

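A Web spider is a software agent that crawls the Internet to collect, filter, and aggregate information for a user. With common scripting languages and their Web modules, a Web spider is easy to build. This article shows how to build spiders and scrapers for Linux® that crawl Web sites and gather information.

A spider is a program that crawls the Internet in a specific way for a specific purpose. The purpose may be to gather information or to understand the structure and validity of a Web site. Spiders are the foundation of modern search engines such as Google and AltaVista. These spiders automatically retrieve data from the Web and pass it to other applications that index the content of Web sites for the best match to search terms.

Web spiders as agents

Web spiders and scrapers are another form of software robot or agent (a term coined by Alan Kay in the early 1980s). Alan's idea of an agent was a proxy for the user in the computer world. The agent could be given a goal and would work toward it. If it got stuck, it could ask the user for advice and then carry on toward the goal.

Today, agents are classified by attributes such as autonomy, adaptiveness, communication, and collaboration with other agents. Other attributes, such as mobility and personality, are goals of current agent research. The Web spiders described in this article fall into the Task-Specific Agents category of the agent taxonomy.

Similar to a spider is the Web scraper. A scraper is a type of spider that targets specific content on the Web, such as the cost of products or services. One use is competitive pricing: finding the price of a given product so that you can adjust your own price and advertise accordingly. A scraper can also aggregate data from many Web sources and present that information to a user.

Biological motivation

When you think about a spider in nature, you think of it in terms of its interaction with its environment rather than in isolation. A spider sees and senses its way and moves from one place to another in a meaningful way. Web spiders operate similarly. A Web spider is a program written in a high-level language that interacts with its environment through networking protocols such as the Hypertext Transfer Protocol (HTTP). If a spider wants to communicate with you, it can send an e-mail message using the Simple Mail Transfer Protocol (SMTP).

Spiders are not limited to HTTP or SMTP. Some spiders use Web services technologies such as SOAP or the Extensible Markup Language Remote Procedure Call (XML-RPC) protocol. Other spiders cruise newsgroups through the Network News Transfer Protocol (NNTP) or look for interesting news items in Really Simple Syndication (RSS) feeds. Whereas most spiders in nature can see only changes in light-dark intensity and movement, Web spiders can see and sense through many types of protocols.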

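Applications of spiders and scrapers

The eyes and legs of a spider

A Web spider's primary means of seeing and moving around the Internet is HTTP, a message-oriented protocol in which a client connects to a server and issues a request, and the server provides a response. Each request and response consists of a header and a body; the header carries status information and a description of the body's content.

Three HTTP request types matter most to a spider. The first is HEAD, which requests information about an asset on the server. The second is GET, which requests an asset such as a file or an image. Finally, a POST request lets the client interact with the server through a Web page (typically through a Web form).

Web spiders and scrapers are useful applications, so there are many different uses for them, good and bad. Let's look at a few applications that use these technologies.

Search engine Web crawlers

Web spiders make searching the Internet easy and efficient. A search engine uses many Web spiders to crawl pages on the Internet, return their content, and index it. When this is done, the search engine can quickly search its local index to find the most relevant results for a query. Google uses the PageRank algorithm, in which the rank of a Web page in the search results reflects how many other pages link to it. Links act as votes: pages with the most votes get the highest rank.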
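The voting idea can be illustrated with a small calculation. This is only a textbook-style sketch of iterative link voting in Python, not Google's actual implementation; the damping value of 0.85, the function name pagerank, and the four-page link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a rank per page for a dict mapping page -> list of linked pages.

    Every linked page must itself be a key of 'links'.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site in which every other page links to "home".
links = {
    "home": ["about"],
    "about": ["home"],
    "news": ["home", "about"],
    "contact": ["home"],
}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best)
```

In this toy graph the page with the most incoming links ends up with the highest rank, which is the behaviour the paragraph above describes.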

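Searching the Internet this way is expensive, both in the bandwidth needed to carry Web content to the indexer and in the computational cost of indexing the results. It also requires a great deal of storage, although that seems a small problem when you consider that Google offers Gmail users 1,000 megabytes of storage.

Web spiders use a set of policies to minimize their traffic on the Internet. Consider that Google indexes more than 8 billion Web pages. The policies define which pages a crawler brings back to the indexer, how often it returns to a Web site to check it again, and a politeness policy. Web servers can exclude crawlers using a file called robots.txt.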
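A crawler can check such exclusion rules before fetching a page. The sketch below uses Python's standard urllib.robotparser module; the robots.txt content, the agent name minispider, and the helper allowed are hypothetical, and a real crawler would fetch the file from the site instead of embedding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally downloaded from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="minispider"):
    """Return True if the rules permit 'agent' to fetch 'path'."""
    return parser.can_fetch(agent, path)


print(allowed("/index.html"))
print(allowed("/private/report.html"))
```

With these rules the first path is permitted and the second is not.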

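Enterprise Web crawlers

Like a standard search engine spider, an enterprise Web spider indexes content, but content that is not available to the general public. For example, companies typically have internal Web sites for their employees. This type of spider is restricted to the local environment. Because its search is restricted, more computing power is usually available, and specialized and more complete indexes are possible. Google has taken this a step further by providing a desktop search engine that indexes the content of your personal computer.

Specialized crawlers

There are also specialized crawlers, for example for archiving content or generating statistics. An archive crawler crawls a Web site and pulls the content locally to be kept on long-term storage media. This can be used for backup or, on a larger scale, to take a snapshot of Internet content. Statistics help in understanding the content of the Internet and what is missing from it. Crawlers can be used to determine how many Web servers are running, how many servers of a given type there are, the number of Web pages available, and even the number of broken links (those returning HTTP 404, page not found).

Other specialized crawlers include Web site checkers. These crawlers look for missing content, validate all the links, and check that your Hypertext Markup Language (HTML) is valid.

E-mail harvesting crawlers

Now for the dark side. Unfortunately, a few bad apples spoil the Internet for everyone. E-mail harvesting crawlers search Web sites for e-mail addresses, which are then used to generate massive amounts of spam. According to a Postini report (August 2005), 70 percent of all e-mail messages for Postini users were spam.

E-mail harvesting is one of the most common crawler activities, and the final crawler example in this article illustrates the mechanism.

So far I have described Web spiders and scrapers. The next four examples show how to build spiders and scrapers for Linux using modern scripting languages such as Ruby and Python.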

Example 1: Generic scraper

In this example, I show how to identify what kind of Web server is running at a given Web site. This can be quite interesting and reveals what kinds of Web servers government, academia, and industry use.

Listing 1 is a Ruby script that scrapes a Web site to identify its HTTP server. The Net::HTTP class implements an HTTP client and the GET, HEAD, and POST HTTP methods. Whenever a request is made to an HTTP server, part of the HTTP response message identifies the server that delivers the content. Rather than download a page from the site, I use the HEAD method to get information about the root page ('/'). As long as the HTTP server returns a successful response (indicated by a "200" response code), I iterate through each line of the response looking for the server key and, when I find it, print its value. The value of this key is a string that identifies the HTTP server.

Listing 1. Ruby script for simple metadata scraping (srvinfo.rb)

#!/usr/local/bin/ruby
require 'net/http'

# Get the first argument from the command-line (the URL)
url = ARGV[0]

begin

  # Create a new HTTP connection
  httpCon = Net::HTTP.new( url, 80 )

  # Perform a HEAD request
  resp, data = httpCon.head( "/", nil )

  # If it succeeded (200 is success)
  if resp.code == "200" then

    # Iterate through the response hash
    resp.each {|key,val|

      # If the key is the server, print the value
      if key == "server" then
        print "  The server at "+url+" is "+val+"\n"
      end

    }

  end

end
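For comparison, the same HEAD request might look as follows in Python 3 with the standard http.client module. This is a rough equivalent under stated assumptions rather than a port of Listing 1: the function name server_header and the timeout value are invented here.

```python
import http.client


def server_header(host, port=80, timeout=10):
    """Send HEAD / to host:port and return the Server header, or None.

    None is returned when the response code is not 200 or when the
    server does not report a Server header.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        resp = conn.getresponse()
        if resp.status == 200:
            return resp.getheader("Server")
        return None
    finally:
        # Always release the socket, even if the request fails.
        conn.close()
```

Calling server_header with a host name returns whatever string that server chooses to report, so the result depends entirely on the site being queried.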

Besides showing how to use the srvinfo script, Listing 2 shows results from a number of government, academic, and business Web sites. They range from Apache (68%) to Sun and Microsoft® Internet Information Services (IIS). In one case, no server is reported. It's interesting that the Federated States of Micronesia runs an old version of Apache (an update is due), while Apache.org is on the leading edge.

Listing 2. Sample usage of the server scraper

[mtj@camus]$ ./srvrinfo.rb www.whitehouse.gov
  The server at www.whitehouse.gov is Apache
[mtj@camus]$ ./srvrinfo.rb www.cisco.com
  The server at www.cisco.com is Apache/2.0 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.gov.ru
  The server at www.gov.ru is Apache/1.3.29 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.gov.cn
[mtj@camus]$ ./srvrinfo.rb www.kantei.go.jp
  The server at www.kantei.go.jp is Apache
[mtj@camus]$ ./srvrinfo.rb www.pmo.gov.to
  The server at www.pmo.gov.to is Apache/2.0.46 (Red Hat Linux)
[mtj@camus]$ ./srvrinfo.rb www.mozambique.mz
  The server at www.mozambique.mz is Apache/1.3.27
    (Unix) PHP/3.0.18 PHP/4.2.3
[mtj@camus]$ ./srvrinfo.rb www.cisco.com
  The server at www.cisco.com is Apache/1.0 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.mit.edu
  The server at www.mit.edu is MIT Web Server Apache/1.3.26 Mark/1.5
    (Unix) mod_ssl/2.8.9 OpenSSL/0.9.7c
[mtj@camus]$ ./srvrinfo.rb www.stanford.edu
  The server at www.stanford.edu is Apache/2.0.54 (Debian GNU/Linux)
    mod_fastcgi/2.4.2 mod_ssl/2.0.54 OpenSSL/0.9.7e WebAuth/3.2.8
[mtj@camus]$ ./srvrinfo.rb www.fsmgov.org
  The server at www.fsmgov.org is Apache/1.3.27 (Unix) PHP/4.3.1
[mtj@camus]$ ./srvrinfo.rb www.csuchico.edu
  The server at www.csuchico.edu is Sun-ONE-Web-Server/6.1
[mtj@camus]$ ./srvrinfo.rb www.sun.com
  The server at www.sun.com is Sun Java System Web Server 6.1
[mtj@camus]$ ./srvrinfo.rb www.microsoft.com
  The server at www.microsoft.com is Microsoft-IIS/6.0
[mtj@camus]$ ./srvrinfo.rb www.apache.org
  The server at www.apache.org is Apache/2.2.3 (Unix)
    mod_ssl/2.2.3 OpenSSL/0.9.7g

This is useful data, and it's fun to see what governments and universities use for their Web servers. The next example looks at something a little less useful: a stock quote scraper.

Example 2: Stock quote scraper

In this example, I build a simple Web scraper (also called a screen scraper) to gather stock quote information. I do this by exploiting a pattern in the response Web page, as follows.

Listing 3. Web scraper for stock quotes

#!/usr/local/bin/ruby
require 'net/http'

host = "www.smartmoney.com"
link = "/eqsnaps/index.cfm?story=snapshot&symbol="+ARGV[0]

begin

  # Create a new HTTP connection
  httpCon = Net::HTTP.new( host, 80 )

  # Perform a GET request
  resp = httpCon.get( link, nil )

  stroffset = resp.body =~ /class="price">/

  subset = resp.body.slice(stroffset+14, 10)

  limit = subset.index('<')

  print ARGV[0] + " current stock price " + subset[0..limit-1] +
          " (from stockmoney.com)\n"

end

In this Ruby script, I open an HTTP client connection to the server (in this case, www.smartmoney.com) and build a link that requests a quote for the stock symbol passed in by the user (through &symbol=<symbol>). I request this link with the HTTP GET method (to retrieve the full response page) and then search for class="price">, which is immediately followed by the current price of the stock. That value is sliced from the Web page and displayed to the user.
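The pattern-based extraction can also be sketched with a regular expression, which avoids the fixed offsets used in Listing 3 and makes the "pattern not found" case explicit. The example is in Python 3; the names extract_price and PRICE_PATTERN are hypothetical, and the sample fragment is made up, since the real page layout is not shown in this article.

```python
import re

# Capture everything between class="price"> and the next tag.
PRICE_PATTERN = re.compile(r'class="price">\s*([^<]+)<')


def extract_price(html):
    """Return the text following class="price"> up to the next tag, or None."""
    match = PRICE_PATTERN.search(html)
    if match is None:
        return None
    return match.group(1).strip()


# Made-up page fragment for illustration only.
sample = '<td class="price">79.28</td>'
print(extract_price(sample))
```

Returning None when the marker is missing lets the caller decide what to do if the site changes its markup, which is the usual weak point of screen scraping.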

To use the stock quote scraper, I invoke the script with the stock symbol of interest (Listing 4).

Listing 4. Sample usage of the stock quote scraper

[mtj@camus]$ ./stockprice.rb ibm
ibm current stock price 79.28 (from stockmoney.com)
[mtj@camus]$ ./stockprice.rb intl
intl current stock price 21.69 (from stockmoney.com)
[mtj@camus]$ ./stockprice.rb nt
nt current stock price 2.07 (from stockmoney.com)
[mtj@camus]$

Example 3: Communicating with the stock quote scraper

The stock quote scraper of Example 2 is neat, but it would be far more useful if it monitored the price continuously and told you when a stock of interest rose above or fell below a given level. The wait is over. In Listing 5, I've updated the simple Web scraper to monitor a stock continually and send an e-mail message when the price moves outside a defined band.

Listing 5. Stock scraper that can send an e-mail alert

#!/usr/local/bin/ruby
require 'net/http'
require 'net/smtp'

#
# Given a web-site and link, return the stock price
#
def getStockQuote(host, link)

    # Create a new HTTP connection
    httpCon = Net::HTTP.new( host, 80 )

    # Perform a GET request
    resp = httpCon.get( link, nil )

    stroffset = resp.body =~ /class="price">/
    subset = resp.body.slice(stroffset+14, 10)
    limit = subset.index('<')

    return subset[0..limit-1].to_f

end

#
# Send a message (msg) to a user.
# Note: assumes the SMTP server is on the same host.
#
def sendStockAlert( user, msg )

    lmsg = [ "Subject: Stock Alert\n", "\n", msg ]
    Net::SMTP.start('localhost') do |smtp|
      smtp.sendmail( lmsg, "rubystockmonitor@localhost.localdomain", [user] )
    end

end

#
# Our main program, checks the stock within the price band every two
# minutes, emails and exits if the stock price strays from the band.
#
# Usage: ./monitor_sp.rb <symbol> <high> <low> <email_address>
#
begin

  host = "www.smartmoney.com"
  link = "/eqsnaps/index.cfm?story=snapshot&symbol="+ARGV[0]
  user = ARGV[3]

  high = ARGV[1].to_f
  low = ARGV[2].to_f

  while 1

    price = getStockQuote(host, link)

    print "current price ", price, "\n"

    if (price > high) || (price < low) then

      if (price > high) then
        msg = "Stock "+ARGV[0]+" has exceeded the price of "+high.to_s+
               "\n"+host+link+"\n"
      end

      if (price < low) then
        msg = "Stock "+ARGV[0]+" has fallen below the price of "+low.to_s+
               "\n"+host+link+"\n"
      end

      sendStockAlert( user, msg )

      exit

    end

    sleep 120

  end

end

This Ruby script is a little longer, but it builds on the stock scraping script in Listing 3. A new function, getStockQuote, encapsulates the stock scraping function. Another function, sendStockAlert, sends a message to an e-mail address (both functions are user-defined). The main program is simply a loop that fetches the current price, checks whether it is still within the band, and, if it is not, sends an e-mail alert to the user. I put a delay between price checks because I don't want to overload the server.
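The decision made inside the loop can be isolated in a small function. This Python 3 sketch covers only the band check, not the fetching or the e-mail; band_alert is a hypothetical name, and the prices are the ones printed in Listing 6.

```python
def band_alert(symbol, price, high, low):
    """Return an alert message if price is outside [low, high], else None."""
    if price > high:
        return "Stock %s has exceeded the price of %s" % (symbol, high)
    if price < low:
        return "Stock %s has fallen below the price of %s" % (symbol, low)
    return None


# Replay a series of quotes against the band 75.00 to 83.00.
for price in (82.06, 82.32, 82.75, 83.36):
    message = band_alert("ibm", price, 83.00, 75.00)
    if message:
        # The monitor would e-mail this message and exit at this point.
        print(message)
        break
```

Keeping the check separate from the network code makes it possible to test the alert logic without contacting the quote server.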

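Listing 6 shows a sample run of the stock monitor. The stock is checked and printed every two minutes. When the price exceeds the high limit, an e-mail alert is sent and the script exits.

Listing 6. Demonstration of the stock monitor script

[mtj@camus]$ ./monitor_sp.rb ibm 83.00 75.00 mtj@mtjones.com
current price 82.06
current price 82.32
current price 82.75
current price 83.36

The resulting e-mail is shown in Figure 1, complete with a link to the source of the scraped data.

Figure 1. E-mail alert sent by the Ruby script in Listing 5

Now let's leave scrapers behind and look at the construction of a Web spider.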

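Example 4: Web site crawler

In this final example, I explore a Web spider that crawls a Web site. To be safe, I avoid straying outside the site and explore only a single Web site.

To crawl a Web site and follow the links it provides, you must parse its HTML pages. If you can parse a Web page successfully, you can identify its links to other resources. Some of them point to local resources (files), and others point to non-local resources (links to other Web pages).

To crawl the Web, you start with a given Web page, identify all the links on that page, place them on a to-visit queue, and then repeat the process with the first item from the to-visit queue. This yields a breadth-first traversal (as opposed to a depth-first traversal, which follows the first link found each time).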
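The queue-driven traversal can be sketched without any networking. The following Python 3 example walks a small in-memory link graph in breadth-first order; the function name crawl_order and the site dictionary are hypothetical stand-ins for pages that a real crawler would fetch and parse.

```python
from collections import deque


def crawl_order(start, links):
    """Return pages in breadth-first order, given a dict of page -> linked pages."""
    to_visit = deque([start])
    seen = {start}
    order = []
    while to_visit:
        page = to_visit.popleft()      # take the first item in the queue
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:     # never queue a page twice
                seen.add(target)
                to_visit.append(target)
    return order


# Hypothetical site structure standing in for fetched pages.
site = {
    "/": ["/a.html", "/b.html"],
    "/a.html": ["/c.html"],
    "/b.html": ["/c.html", "/"],
    "/c.html": [],
}
print(crawl_order("/", site))  # ['/', '/a.html', '/b.html', '/c.html']
```

Taking items from the end of the queue instead of the front would change the order toward a depth-first traversal.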

If I avoid non-local links and navigate only to local Web pages, I get a Web crawler for a single Web site (Listing 7). In this case, I switch from Ruby to Python to take advantage of Python's useful HTMLParser class.

Listing 7. Python Web site crawler (minispider.py)

#!/usr/local/bin/python

import httplib
import sys
import re
from HTMLParser import HTMLParser


class miniHTMLParser( HTMLParser ):

  viewedQueue = []
  instQueue = []

  def get_next_link( self ):
    if self.instQueue == []:
      return ''
    else:
      return self.instQueue.pop(0)

  def gethtmlfile( self, site, page ):
    try:
      httpconn = httplib.HTTPConnection(site)
      httpconn.request("GET", page)
      resp = httpconn.getresponse()
      resppage = resp.read()
    except:
      resppage = ""

    return resppage

  def handle_starttag( self, tag, attrs ):
    if tag == 'a':
      newstr = str(attrs[0][1])
      if re.search('http', newstr) == None:
        if re.search('mailto', newstr) == None:
          if re.search('htm', newstr) != None:
            if (newstr in self.viewedQueue) == False:
              print "  adding", newstr
              self.instQueue.append( newstr )
              self.viewedQueue.append( newstr )
          else:
            print "  ignoring", newstr
        else:
          print "  ignoring", newstr
      else:
        print "  ignoring", newstr


def main():

  if sys.argv[1] == '':
    print "usage is ./minispider.py site link"
    sys.exit(2)

  mySpider = miniHTMLParser()

  link = sys.argv[2]

  while link != '':

    print "\nChecking link ", link

    # Get the file from the site and link
    retfile = mySpider.gethtmlfile( sys.argv[1], link )

    # Feed the file into the HTML parser
    mySpider.feed(retfile)

    # Search the retfile here

    # Get the next link in level traversal order
    link = mySpider.get_next_link()

  mySpider.close()

  print "\ndone\n"

if __name__ == "__main__":
  main()

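The basic design of this crawler is to load the first link to check onto a queue. This queue acts as the next-to-interrogate queue. As links are checked, any new links found are loaded onto the same queue.

First, I derive a new class called miniHTMLParser from Python's HTMLParser class. This class does a few things. First, it is my HTML parser, with a callback method (handle_starttag) that is invoked whenever a start HTML tag is encountered. I also use the class to access the links found during the crawl (get_next_link) and to retrieve the file that a link represents (in this case, an HTML file).

The class holds two queues (defined as class-level attributes in Listing 7): viewedQueue contains the links seen so far, and instQueue contains the links still to be interrogated.

As you can see, the class methods are simple. The get_next_link method checks whether instQueue is empty and, if it is, returns an empty string. Otherwise, the next item is returned through the pop method. The gethtmlfile method uses HTTPConnection to connect to the site and returns the content of the given page. Finally, handle_starttag is called for every start tag in the Web page (which is fed to the HTML parser through the feed method). In this function, I check whether the link is non-local (it contains http), whether it is an e-mail address (mailto), and whether it contains 'htm', which suggests that it is a Web page. I also check that the link has not been seen before; if it has not, it is loaded onto both the to-interrogate queue and the viewed queue.

The main method is simple. I create a new miniHTMLParser instance and start with the user-defined site (argv[1]) and link (argv[2]). I fetch the content of the link, feed it to the HTML parser, and get the next link to visit, if there is one. The loop continues as long as there are links to visit.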
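The link filter can be sketched in Python 3 with the standard html.parser module. This is a variation on the checks in Listing 7, not a port: it looks up the href attribute by name instead of reading the first attribute, tests prefixes rather than searching the whole string, and omits the fetching. Listing 7 reads the value of the first attribute, whatever its name, which may be why values such as hiddenStructure show up in its output. The class name LinkCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect local HTML links from the pages fed to the parser."""

    def __init__(self):
        super().__init__()
        self.queue = []      # links still to visit
        self.viewed = set()  # links that have been queued at some point

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")  # look up href by name, not position
        if not href:
            return
        # Skip non-local links, e-mail links, and anything that does not
        # look like an HTML page.
        if href.startswith(("http", "mailto")) or "htm" not in href:
            return
        if href not in self.viewed:
            self.viewed.add(href)
            self.queue.append(href)


collector = LinkCollector()
collector.feed(
    '<a name="top"></a>'
    '<a href="http://www.gnu.org">GNU</a>'
    '<a href="mailto:webmaster@fsf.org">mail</a>'
    '<a class="x" href="licensing/essays/free-sw.html">essay</a>'
    '<a href="licensing/essays/free-sw.html">again</a>'
)
print(collector.queue)  # ['licensing/essays/free-sw.html']
```

Keeping the queues on the instance rather than on the class means that two collectors created in the same program do not share state.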

To invoke the Web spider, provide a Web site address and a link:

./minispider.py www.fsf.org /

In this case, I request the root file from the Free Software Foundation. The result of this command is shown in Listing 8. You can see new links being added to the interrogate queue and links being ignored, such as non-local links. Toward the bottom of the listing, you can see the links found on the root page being interrogated in turn.

Listing 8. Output of the minispider script

[mtj@camus]$ ./minispider.py www.fsf.org /

Checking link  /
  ignoring hiddenStructure
  ignoring http://www.fsf.org
  ignoring http://www.fsf.org
  ignoring http://www.fsf.org/news
  ignoring http://www.fsf.org/events
  ignoring http://www.fsf.org/campaigns
  ignoring http://www.fsf.org/resources
  ignoring http://www.fsf.org/donate
  ignoring http://www.fsf.org/associate
  ignoring http://www.fsf.org/licensing
  ignoring http://www.fsf.org/blogs
  ignoring http://www.fsf.org/about
  ignoring https://www.fsf.org/login_form
  ignoring http://www.fsf.org/join_form
  ignoring http://www.fsf.org/news/fs-award-2005.html
  ignoring http://www.fsf.org/news/fsfsysadmin.html
  ignoring http://www.fsf.org/news/digital-communities.html
  ignoring http://www.fsf.org/news/patents-defeated.html
  ignoring /news/RSS
  ignoring http://www.fsf.org/news
  ignoring http://www.fsf.org/blogs/rms/entry-20050802.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050712.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050601.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050526.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050513.html
  ignoring http://www.fsf.org/index_html/SimpleBlogFullSearch
  ignoring documentContent
  ignoring http://www.fsf.org/index_html/sendto_form
  ignoring javascript:this.print();
  adding licensing/essays/free-sw.html
  ignoring /licensing/essays
  ignoring http://www.gnu.org/philosophy
  ignoring http://www.freesoftwaremagazine.com
  ignoring donate
  ignoring join_form
  adding associate/index_html
  ignoring http://order.fsf.org
  adding donate/patron/index_html
  adding campaigns/priority.html
  ignoring http://r300.sf.net/
  ignoring http://developer.classpath.org/mediation/OpenOffice2GCJ4
  ignoring http://gcc.gnu.org/java/index.html
  ignoring http://www.gnu.org/software/classpath/
  ignoring http://gplflash.sourceforge.net/
  ignoring campaigns
  adding campaigns/broadcast-flag.html
  ignoring http://www.gnu.org
  ignoring /fsf/licensing
  ignoring http://directory.fsf.org
  ignoring http://savannah.gnu.org
  ignoring mailto:webmaster@fsf.org
  ignoring http://www.fsf.org/Members/root
  ignoring http://www.plonesolutions.com
  ignoring http://www.enfoldtechnology.com
  ignoring http://blacktar.com
  ignoring http://plone.org
  ignoring http://www.section508.gov
  ignoring http://www.w3.org/WAI/WCAG1AA-Conformance
  ignoring http://validator.w3.org/check/referer
  ignoring http://jigsaw.w3.org/css-validator/check/referer
  ignoring http://plone.org/browsersupport

Checking link  licensing/essays/free-sw.html
  ignoring mailto:webmaster

Checking link  associate/index_html
  ignoring mailto:webmaster

Checking link  donate/patron/index_html
  ignoring mailto:webmaster

Checking link  campaigns/priority.html
  ignoring mailto:webmaster

Checking link  campaigns/broadcast-flag.html
  ignoring mailto:webmaster

done

[mtj@camus]$

This example illustrates the crawling phase of a Web spider. After the client has read a file, the content of the page can be examined.

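Linux spidering tools

You have now seen how to implement two scrapers and a spider. There are also Linux tools that provide this functionality.

The wget command, which stands for Web get, is a useful command that works recursively through a Web site and retrieves content of interest. You can specify the Web site, the content you are interested in, and other administrative options. The command copies the files to your local host. For example, the following command connects to the URL you specify, recurses no more than three levels deep, and retrieves any files with the mp3, mpg, mpeg, or avi extension.

wget -A mp3,mpg,mpeg,avi -r -l 3 http://<some URL>

The curl command works in a similar way and is under active development. Other similar commands include snarf, fget, and fetch.

Legal issues

There have been lawsuits over data mining on the Internet with Web spiders, and they have not gone well. Farechase, Inc. was recently sued by American Airlines over screen scraping. The suit first alleged that gathering the data violated American Airlines' user agreement. When that claim failed, American Airlines claimed trespass, which succeeded. Other suits have concerned the bandwidth that spiders and scrapers take away from legitimate users. All are well-founded claims, and they make it all the more important to establish politeness policies. (See Resources.)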

Conclusion

Crawling and scraping the Web can be both fun and profitable. But, as noted above, there are legal issues. When spidering or scraping, obey the robots.txt file available on the server and make it part of your politeness policy. Newer protocols such as SOAP make spidering easier and less disruptive to normal Web operations. As efforts such as the Semantic Web simplify spidering further, solutions and approaches to spidering will continue to grow.

Original article

Build a Web spider on Linux

Resources

Learn

Wikipedia Web crawler
Email Spiders
The Web Robots Pages
"Scrapers, Robots and Spiders: The Battle Over Internet Data Mining" (Gesmer Updegrove LLP, 2006)
Korea developerWorks Linux zone: technical resources for Linux developers
developerWorks technical events and webcasts

Get products and technologies

Source Code for Web Robot Spiders at Searchtools.com
Order the SEK for Linux: the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®
Download IBM trial software (Korea developerWorks)

Discuss

developerWorks blogs and the Korea developerWorks community

About the author

M. Tim Jones is an embedded software architect and the author of GNU/Linux Application Programming, AI Application Programming, and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. He is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.


[Tip] Useful tips for the vi editor [reposted]

2008. 2. 14. 17:35

Posted by: 예진맘


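Handy Windows commands to know

2008. 2. 14. 12:35

When an Internet connection fails and you call your ISP, the first thing they usually ask is that you open [Start] - [Run] and try ping. ping is a simple way to check whether you can communicate with a particular network. Let's look in detail at which commands of this kind are worth knowing.

Windows uses a GUI (Graphical User Interface), which presents the computer through pictures so that anyone can use it easily. The predecessor of Windows, however, was the operating system called MS-DOS, which, unlike today's graphical environment, took commands as text and returned results as text. These commands still exist inside current versions of Windows, and the environment they run in is called the console. Because commands are given as text, the console is direct and simple to use, and the more of these commands you know, the more they help in using the computer.

PING
Ping is mainly used to check whether a particular host on the network can be reached. It sends an echo request to the host, and if the host is able to communicate it sends a reply, so the check is easy. Conversely, by sending an echo request to a host that can be assumed to be always reachable, such as your ISP's domain name server (the system that converts text Internet addresses into IP addresses), you can check whether your own computer has Internet access.

The usage is "ping [options] target-host". As a real example, to test against home.ahnlab.com, AhnLab's Korean home page, enter the first command below. If you know the IP address, you can enter it directly instead.

ping home.ahnlab.com
ping 211.233.80.22

By default, however, ping sends only four requests. To keep checking continuously, use the [-t] option as shown below, and press [Ctrl] + [C] to stop.

ping -t home.ahnlab.com
ping -t 211.233.80.22

[Figure 1] ping home.ahnlab.com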

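SHUTDOWN
Shutdown is, as the name says, the command for shutting down Windows. You can shut the computer down through [Start] -> [Turn Off Computer], but the shutdown command offers more functions. It can also abort a shutdown in progress, which is useful when an RPC attack makes Windows shut down unexpectedly, because it buys time to take further action. The most common uses of shutdown are as follows.

(1) Shutting down Windows after a set time
Enter [shutdown -s -t seconds]. The time is given in seconds, so to shut down in 10 minutes enter [shutdown -s -t 600].

[Figure 2] shutdown -s -t 600 -c "윈도우 예약 종료 시험"

(2) Aborting a Windows shutdown
You may schedule a shutdown as in (1) and then need to cancel it. A shutdown message like the one in Figure 2 can also appear as the result of an RPC attack if Windows security patches have not been applied. In either case, entering [shutdown -a] aborts the shutdown.

(3) Running with the GUI (Graphical User Interface)
If typing the commands in (1) and (2) is inconvenient, enter [shutdown -i] to schedule a shutdown through a familiar GUI dialog. In this case you must enter a reason for the shutdown in the [Comment] field before the [OK] button becomes active. This is a good alternative for users who find typing commands awkward or difficult.

[Figure 3] shutdown -i

There are other functions as well, such as shutting down a remote computer, but they are rarely used in ordinary environments.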

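IPCONFIG
Ipconfig displays and refreshes TCP/IP network settings. Because it can reconfigure the network, you can renew the network settings without rebooting Windows. Since you need to read its output, it is best run from the command prompt: choose [Start] -> [Run], enter CMD, and click [OK]. The black text window that opens is the command prompt. The most frequently used forms are as follows.

(1) Checking the IP address configuration
You can see the IP address assigned to each network device installed in your computer by entering [ipconfig /all].
The output lists, in detail, the network card description, the physical address (the card's MAC address), whether DHCP (Dynamic Host Configuration Protocol) is enabled, the assigned IP address, the subnet mask, the default gateway, and the DNS servers.

[Figure 4] ipconfig /all

(2) Releasing and renewing the network connection
Network settings sometimes end up wrong for one reason or another. Most people then shut Windows down and reboot, but if you know how to use ipconfig you can simply release or renew the network configuration.
To release it, enter [ipconfig /release], which drops the current network connection. Networking is then unavailable, so the computer cannot communicate with the outside. If you are running an unpatched copy of Windows and suspect an attack from outside, releasing the network this way before applying security patches lets you work more safely.
To start using the network again, enter [ipconfig /renew]. When the renewal completes, you see output similar to that of [ipconfig /all].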

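Registry Editor
The registry is where Windows stores its important system settings and the settings of application programs. The Registry Editor is the program that lets you view and modify the contents of the registry directly. Windows includes two registry editors: [regedit], which works in a graphical environment, and [reg], which carries out text commands.

To start the graphical registry editor, choose [Start] -> [Run], enter [regedit], and click [OK].

[Figure 5] regedit

When the Registry Editor starts, you see the following screen.

[Figure 6] The GUI registry editor

Note that changes made in the Registry Editor take effect immediately, with no separate save step, so edit with care.

The text-mode registry editor is run by entering reg at the command prompt. Every command has to be typed in full, which makes it somewhat demanding, but it is a great help when the GUI cannot be started. Enter [reg /?] to list the available commands.

[Figure 7] The command-line registry editor (output of reg /?)

Remember that the registry holds important Windows system information and the settings of application software. If this data is edited incorrectly, Windows may stop working properly, so the registry should be edited only by users with expert knowledge of Windows, and only when a specific change is required.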

Windows Task Manager
Windows runs many programs at the same time. If one of them monopolizes the CPU, the whole of Windows slows down. At other times you may simply want to see how much of the computer's resources is in use. The program for both jobs is Windows Task Manager. It is normally started by pressing [CTRL] + [ALT] + [DEL] together, but there is an easier way: choose [Start] -> [Run], enter [taskmgr], and click [OK].

[Figure 8] taskmgr (Windows Task Manager)

Windows Task Manager has five tabs: [Applications], [Processes], [Performance], [Networking], and [Users]. Each shows the following.
(1) Applications lists the programs that are running visibly on the Windows desktop. An entry whose [Status] is "Not Responding" is a program that is not running normally. Such a program is closed automatically after a while, but it gets in the way until then; select the entry, right-click, and choose [End Task] to close it at once.
(2) Processes lists the programs that are currently executing. Unlike Applications, it includes programs that run without being visible to the user, so you will see quite a lot of entries. For each process you can see which user started it, how much CPU it is using, and how much memory it is using. If the computer has slowed down for no obvious reason, check this tab, and if a particular program is using too much CPU, end it to solve the problem. Take care, though: ending an important Windows process such as explorer.exe leaves Windows unusable.
(3) Performance shows at a glance how much CPU and memory are in use.
(4) Networking shows how heavily the computer's network connection is being used.
(5) Users shows the users currently logged on to Windows. If sessions left over from user switching are consuming resources needlessly, you can [Log Off] those users. If a user who should not have access is logged on, you can block the session with [Disconnect].

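Telnet
A telnet client is built into Windows. It was once used often but is seldom used now. It comes in very handy when you need to connect to a UNIX or Linux server in a hurry.
To connect, enter telnet [host] [port number on the host].
For example, to connect to a server at telnet://server.home.com, enter the following.

Example) telnet server.home.com

FTP (File Transfer Protocol)
FTP is also used less often now that the Web is widespread, but for users who have their own server it is still the most effective way to send and receive files, so it is used from time to time. It is used in the same way as telnet.
For example, to connect to ftp://fileserver.home.com, enter the following.

Example) ftp fileserver.home.com

NETSTAT
Netstat shows statistics for network protocols and the current state of TCP/IP network connections. It has many functions; the most frequently used are as follows.

(1) Viewing connections and listening ports
Shows the connections that are established with your computer or waiting to be established.
Enter [netstat -a] to see the list.
(2) Viewing the programs that are using the network
Shows the programs on your computer that are using the network. It is mainly used to find spyware waiting for an attacker's commands.
Enter [netstat -b] to see the list.
(3) Viewing network statistics
Shows statistics for the data sent and received since the network card became available.
Enter [netstat -e] to see the statistics.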

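Checking information about your computer
Most users do not know the hardware configuration and key settings of their own computer. One command shows them all at a glance: [systeminfo].
Choose [Start] -> [Run], enter [cmd], and at the command prompt enter [systeminfo] to see a wide range of information about the computer. There are so many items that much of the output scrolls past before you can read it; to go through it one screen at a time, enter the command as shown below.

[Figure 9] systeminfo | more

Other commands for running important Windows policy tools
Windows has many built-in tools for viewing and changing important policies, but they are tucked away and hard to find. The following commands open them directly: choose [Start] -> [Run], enter the command, and click [OK]. Because a wrong policy change can make Windows unusable, it is best not to modify anything you do not understand.

COMPMGMT.MSC : Computer Management. Lets you change detailed computer settings directly.
DEVMGMT.MSC : Device Manager. Lets you manage the devices connected to the computer.
DFRG.MSC : Disk Defragmenter. Joins files that are stored in fragments on the disk into contiguous pieces, which helps disk read performance and disk life.
EVENTVWR.MSC : Event Viewer. Shows the various Windows event logs, which can provide clues when something goes wrong with Windows.
FSMGMT.MSC : Shared Folders. Shows shared folders and files at a glance, so that you can find unnecessary shares and remove them to improve security.
GPEDIT.MSC : Local Computer Policy. Lets you change a wide range of Windows policies, including security policies.
LUSRMGR.MSC : Local Users and Groups. Lets you add, delete, and manage Windows users, set up groups, and assign security permissions.
PERFMON.MSC : Performance. Shows the results of system performance monitoring.
RSOP.MSC : Resultant Set of Policy. Shows the policies that have been or will be applied to the user logged on to Windows.
SECPOL.MSC : Local Security Policy. Lets you view and change the security policy of the computer.
SERVICES.MSC : Services. Lets you view and change the services installed in Windows and their status. Stopping unnecessary services saves resources.

Source: AhnLab (안철수연구소), 2008-02-12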


Linux programs equivalent to Windows programs 4/4

2008. 2. 13. 21:58

Posted by: 화산폭발  Date: 06-11-11 15:20  Views: 2404

9) Scientific and special programs
10) Emulators
11) Other

9) Scientific and special programs.
Useful links:	-	Scientific Applications on Linux - many links to both OSS and proprietary applications.
Math system in MathCad style	Mathcad	Gap.
Math system in Matlab style	Matlab	1) Matlab for Linux. [FTP] 2) Octave. (+ Gnuplot) 3) Scilab. 4) R. 5) Yorick. 6) rlab. 7) Yacas. 8) Euler.
Math system in Mathematica style	Mathematica	1) Mathematica for Linux. [Prop] 2) Maxima. 3) MuPad. 4) NumExp.
Math system in Maple style	Maple	1) Maple for Linux. [Prop] 2) Maxima. 3) MuPad.
Equation / math editor	Mathtype, MS Equation Editor	1) OpenOffice Math. 2) MathMLed. 3) Kformula (Koffice). 4) LyX. 5) Texmacs.
Programs for three-dimensional modeling	SolidWorks, ...	ProEngineer Linux. [Prop]
Programs for three-dimensional modeling	CATIA for Windows	CATIA. It was designed under Unix, and from version 4 (2000) it was ported under Windows (not too successfully).
Programs for three-dimensional modeling	SolidEdge for Windows	SolidEdge (part of more powerful package Unigraphics).
Engineering	ANSYS for Windows	ANSYS.
CAD/CAM/CAE	Autocad, Microstation	1) Varkon. 2) Linuxcad. [Prop, ~100$] 3) Varicad. [Prop] 4) Cycas. [Prop] 5) Tomcad. 6) Thancad. 7) Fandango. 8) Lignumcad. 9) Giram. 10) Jcad. 11) QSCad. 12) FreeEngineer. 13) Ocadis. 14) PythonCAD.
CAD/CAM/CAE, simplified	ArchiCAD	Qcad.
Desktop Publishing Systems	Adobe PageMaker, QuarkXPress	Adobe Framemaker. [Proprietary, cancelled]
Small desktop publishing systems	MS Publisher	1) Scribus - Desktop Publishing for Linux. 2) KWord.
Diagram and chart designer	Microsoft Visio	1) Kivio (Koffice). 2) Dia. 3) KChart. 4) xfig. 5) Tgif + dotty. 6) Tulip.
Geographic image processing software	Erdas Imagine, ER Mapper, ENVI	ENVI for Linux.
GIS (Geographical information system)	ArcView	1) Grass. 2) Quantum GIS. 3) PostGIS.
Vectorization of bitmaps	MapEdit, Easy Trace	1) Autotrace.
Software CNC, controlling machine tools	OpenCNC [Prop]	EMC.
Advanced text processing system in TeX style	MikTex, emTeX (DOS)	1) TeX. 2) TeTeX / LaTeX 3) LyX (WYSIWYM). 4) Kile.
Convenient, functional and user-friendly TeX-files / dvi-files editor.	WinEdt	1) Kile (KDE Integrated LaTeX Environment). 2) Ktexmaker2. 3) Tk LaTeX Editor.
Statistical Computing Language and Environment	S-PLUS	R.
Statistical analysis	SPSS, Statistica	Many links - here. 1) PSPP. 2) OpenStat2. 3) "Probability and Statistics Utilities for Linux users"
Econometrics Software	Eviews, Gretl	1) Gretl.
Emulation of the circuit	Electronic Workbench	1) Geda. 2) Oregano. 3) Xcircuit. 4) Gnome Assisted Electronics. 5) SPICE. 6) SPICE OPUS. 7) NG-SPICE.
Program to draw chemical structures	Chemdraw, Isisdraw	Xdrawchem.
Downloader and player for Olympus dictophone	Olympus DSS Player	???
Market analysis	MetaStock	???
Electronics scheme design	PCAD	1) Eagle. 2) Geda.
The oscilloscope emulation	Winoscillo	Xoscope.
Measurement of the temperature and voltages on motherboard	MBMonitor, PCAlert	KHealthCare (KDE).
S.M.A.R.T-attributes and temperature of the hard disk	Come on CD with mainboard, Active SMART	1) smartctl. 2) Hddtemp-0.3. 3) IDEload-0.2. 4) Smartsuite-2.1. 5) Smartmontools. 6) Ide-smart. 7) Smartsuite.
Memory testing	SiSoft SANDRA	Memtest86.
Program for watching temperatures, fanspeeds, etc	SiSoft SANDRA, SiSoft SAMANTHA	1) Ksensors. 2) Lm_sensors. 3) xsensors. 4) wmsensormon and other applets for AfterStep / WindowMaker / FluxBox.
HDD testing / benchmarking	SiSoft SANDRA, SiSoft SAMANTHA	1) hdparm. 2) Bonnie++. 3) IOzone. 4) Dbench. 5) Bonnie. 6) IO Bench. 7) Nhfsstone. 8) SPEC SFS. [Prop]
Video testing / benchmarking	Final Reality	1) X11perf. 2) Viewperf.
Realtime Control	SHA Sybera Hardware Access	DIAPM RTAI - Realtime Application Interface.
Simulator of nets	???	1) NS.
Neural network simulation	???	1) Xnbc. 2) Stuttgart Neural Network Simulator (SNNS).
"Sensor for LCD"	???	1) Sensors-lcd.
Electrocardiogram viewer	???	1) ecg2png.
A software technology, that turns x86 computer into a full-function PLC-like process controller	SoftPLC	1) MatPLC.
Catalog of the software for translators	-	Linux for translators.
Translation memory	1) Trados Translators Workbench 2) Deja Vu 3) Star Transit 4) SDLX 5) OmegaT	1) OmegaT.
Catalog of educational software	-	1) SchoolForge. 2) Seul / EDU.
Designing and viewing DTDs	NearFar Designer [Prop]	???
10) Emulators.
Virtual machine emulator	VMWare for Windows	1) VMWare for Linux. [Prop] 2) Win4Lin. [Proprietary, $89]. 3) Bochs. 4) Plex86. 5) User Mode Linux.
Linux emulator	1) CygWin. 2) MKS Toolkit. 3) Bash for Windows. 4) Minimalist GNU For Windows.	1) User Mode Linux.
X Window System (XFree) emulator	XFree under CygWin.	-
Windows emulator	-	1) Wine. (GUI: gwine, tkwine) 2) Transgaming WineX. (GUI: tqgui) [Prop] 3) Crossover Office.
Sony PlayStation emulator	ePSXe, ...	1) ePSXe. 2) Pcsx.
ZX Spectrum emulator	X128, Speccyal, SpecX, SpecEmu, UnrealSpeccy, ...	1) Xzx. 2) Glukalka. 3) Fuse. 4) ZXSP-X. 5) FBZX. 6) SpectEmu.
Arcade machines emulator	???	1) Xmame / Xmess. 2) Advancemame. Frontends: advancemenu. ckmame. flynn. gmame. gnomame. grok. grustibus. gxmame. it. it's quit. fancy. kmamerun. kmamu. qmamecat. startxmame. setcleaner. tkmame.
ST emulator	???	1) StonX.
C64 emulator	???	1) Vice.
Amiga emulator	???	1) UAE.
Mac 68k emulator	???	1) Basilisk II.
Game boy emulator	???	1) Vboy. 2) VGBA. (GUI: vgb-gui)
Atari 2600 Video Computer System emulator	1) Stella	1) Stella.
NES emulator	1) Zsnes. 2) Snes9x.	1) Zsnes. 2) Snes9x. 3) FWNes. 4) GTuxNes.
M680x0 Arcade emulator	1) Rainemu.	1) Rainemu.
Multi / other emulators	???	1) M.E.S.S. 2) Zinc.
11) Other / Humour :)
Space simulator	1) Openuniverse. 2) Celestia. 3) Zetadeck.	1) Openuniverse. 2) Celestia. 3) Kstars. 4) Zetadeck.
TV driver	-	RivaTV.
System, running from CD without installing (Live CD)	Impossible	1) Knoppix. 2) Cool Linux. 3) Blin. 4) DemoLinux. 5) DyneBolic. 6) Gentoo (live CD). 7) Lonix. 8) Virtual Linux. 9) Bootable Business Card (LNX-BBC). 10) ByzantineOS. 11) FreeLoader Linux. 12) MoviX. 13) Freeduc CD. 14) SuSE live-eval CD. 15) Freedom Linux. 16) Eagle Linux.
Boot rescue/tools diskette	Windows system diskette	1) Linux system diskette. 2) Tomsrtbt. 3) BanShee Linux.
Local file systems mount	ext2fs (driver), explore2fs (program) - ext2/3 under Windows	Linux-NTFS. (driver for NTFS partitions mounting)
Installing software and uninstalling	InstallShield, WISE, GhostInstaller, Microsoft Installer - the analog of rpm	1) Rpm. 2) Urpmi. 3) GnoRpm. 4) Nautilus RPM. 5) Apt-get & frontends (synaptic, aptitude, ...). 6) Apt-rpm. (for RedHat, SuSE, ALT Linux, etc) 7) yum (Yellowdog Updater Modified) 8) yum enhanced by ASPLinux.
Installing software from source and uninstalling	Minimalist GNU For Windows	1) make install, make uninstall 2) CheckInstall. 3) Sinstall. 4) Emerge (Gentoo). 5) Apt-get & frontends (synaptic, aptitude, ...).
System update	Windows Update	1) Ximian Red Carpet. 2) Red Hat Network. 3) MandrakeOnline. 4) SuSE YaST Online Update. 5) Caldera Volution Online. 6) Apt. 7) Gentoo ebuilds (portage). 8) Debian GNU/Linux package search. 9) Yum.
Certification	MCSD, MCT, MCSE	1) Red Hat Certification. 2) Sair Linux and GNU Certification. 3) Linux Professional Institute Certification (LPIC). 4) Linux+. 5) Prometric. 6) VUE.
Icons on desktop	Explorer	1) Desktop File Manager. 2) Idesk.
Work with screensavers	Desktop properties	1) xset. 2) xlockmore. 3) xscreensaver. 4) kscreensaver.
Place for keeping "removed" files	Trash	1) Trash Can. 2) Libtrash.
Checking the hard disk	Scandisk	fsck -check or reiserfsck -check. Not needed with journaled file systems (reiserfs, ext3, jfs, xfs).
Defragmentation	defrag	Not needed.
GUI of the system	Windows Explorer	Kde, Gnome, IceWM, Windowmaker, Blackbox, Fluxbox, ...
Windows XP GUI	Windows XP	XPde.
Flavors of the system	9x, NT, XP	RedHat, Mandrake, Knoppix, Debian, SuSE, ALT, ASP, Gentoo, Slackware, Linux From Scratch, ...
Tactics	FUD (fear, uncertainty, doubt)	Open Source! "First they ignore you, then they laugh at you, then they fight you, then you win".
Source code of the kernel freely available	No	Of course :)
Command line	1) command.com :). 2) cmd.exe 3) Windows Scripting Host 4) 4DOS / 4NT 5) Minimalist GNU For Windows 6) Unix tools for Windows (AT&T)	1) Bash. 2) Csh. 3) Zsh. 4) Ash. 5) Tcsh.
Free of charge operating system	Microsoft Windows. (Imagine yourself that in Russia there are 95% of users having a pirate copy of Windows :).	Linux - the Free operating system!!
-	Nimda	Slapper.
-	Wincih, klez, etc	No analogs
Backdoors and hidden keys	Decide it yourself :).	-
Easter eggs, undocumented possibilities	Logo with Windows developers, Doom in Excel 95, 3D-racing in Excel 2000, etc, etc...	-
The magazines	Windows Magazine	1) Linux Journal. 2) Linux Gazette. 3) Linux magazine. 4) Linux pratico (Italy).
-	Blue Screen Of Death (BSOD)	1) Kernel panic. 2) Screensaver "bsod" :).
Whom it is necessary to curse for bugs and defects of the system	M$, Bill Gates personally	1) Developers of the distribution. 2) All the Linux people and Linus Torvalds personally :). 3) Yourself and your own /dev/hands :)).
-	M$.com	GNU.org, FSF.org
-	Windows.com	Linux.org
-	Bill Gates, "Road ahead"	Linus Torvalds, "Just for fun" :).
-	Bill Gates, "Business @ the speed of thought"	Richard M. Stallman, "The right to read".


Installing Python on Linux

2008. 2. 13. 21:49


Learning SQL

2008. 2. 13. 13:59

http://sql.1keydata.com/kr/

http://www.igotit.co.kr/zbxe/PAGE_SPECIAL


MySQL5

2008. 2. 5. 11:11

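Database/MySQL 2007/08/13 04:41

Author: shin-gosoo (hchshin@chol.com)
Written: 2007.04.10

This is a basic MySQL 5 installation guide for a Java development environment on Windows.

Installing the database
MySQL configuration - Korean development environment (euckr)
MySQL configuration - multilingual development environment (utf-8)

1. Installing the database

Version installed: 5.0.37 (the latest version as of 2007.04.10)
Download URL: http://dev.mysql.com/downloads/mysql/5.0.html#win32
Select and download "Without installer (unzip in C:\)": mysql-noinstall-5.0.37-win32.zip (45.6M).
It is a matter of taste, but I dislike the installer version.
Also, when the hard drive is split into C and D, I install the development environment on the D drive, in case Windows has to be reinstalled from time to time.
Unzipping mysql-noinstall-5.0.37-win32.zip creates a mysql-5.0.37-win32 folder. Rename it to mysql-5.0.37 and install it as follows.
Installation example)
- Installation directory: D:\dev\mysql-5.0.37
- Windows system environment variable
  - Path: append D:\dev\mysql-5.0.37\bin;
- Register as a Windows service
  - Command prompt: D:\dev\mysql-5.0.37\bin> mysqld --install (to remove the service: mysqld --remove)
  - Start the MySQL service from Control Panel - Administrative Tools - Services
- If C:\>mysql -uroot connects from the command prompt, the installation succeeded.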

2. MySQL configuration - Korean development environment (euckr)

my.ini settings
Create a my.ini file under C:\Windows

    [mysql]
    default-character-set = euckr

    [mysqld]
    character-set-client-handshake=FALSE
    init_connect="SET collation_connection = euckr_korean_ci"
    init_connect="SET NAMES euckr"
    default-character-set = euckr
    character-set-server = euckr
    collation-server = euckr_korean_ci

    [client]
    default-character-set = euckr

    [mysqldump]
    default-character-set = euckr

Restart MySQL
Connect to mysql as root and run
mysql>status
If the output looks like the following, the settings are OK.

    mysql> status
    --------------
    mysql Ver 14.12 Distrib 5.0.37, for Win32 (ia32)

    Connection id: 1
    Current database:
    Current user: root@localhost
    SSL: Not in use
    Using delimiter: ;
    Server version: 5.0.37-community MySQL Community Edition (GPL)
    Protocol version: 10
    Connection: localhost via TCP/IP
    Server characterset: euckr
    Db characterset: euckr
    Client characterset: euckr
    Conn. characterset: euckr
    TCP port: 3306
    Uptime: 10 sec

    Threads: 1 Questions: 4 Slow queries: 0 Opens: 12 Flush tables: 1 Open tables: 6 Queries per second avg: 0.400
    --------------

    mysql>

Setting the initial root password

    C:\>mysql -uroot mysql

    mysql>update user set password=password('new-password') where user='root';
    mysql>flush privileges;
    mysql>exit

    C:\>mysql -uroot -pnew-password

Creating a database and a user

    C:\>mysql -uroot -ppassword

    mysql>CREATE DATABASE myproject_kr DEFAULT CHARACTER SET euckr COLLATE euckr_korean_ci;

    mysql>GRANT ALL PRIVILEGES ON *.* TO 'javamaster'@'localhost' IDENTIFIED BY '1234' WITH GRANT OPTION;

    mysql>GRANT ALL PRIVILEGES ON *.* TO 'javamaster'@'%' IDENTIFIED BY '1234' WITH GRANT OPTION;

    mysql>FLUSH PRIVILEGES;

    mysql>exit

    C:\>mysql -ujavamaster -p1234 myproject_kr

CREATE DATABASE : creates a database named myproject_kr in the euckr environment
First GRANT : creates a user with ID javamaster and password 1234 who may connect from the local machine only
Second GRANT : creates a user with ID javamaster and password 1234 who may also connect remotely
FLUSH PRIVILEGES : applies the privileges
Last command : connects with the newly created account

3. MySQL configuration - multilingual development environment (utf-8)

my.ini settings
Create a my.ini file under C:\Windows

    [mysql]
    default-character-set = utf8

    [mysqld]
    character-set-client-handshake=FALSE
    init_connect="SET collation_connection = utf8_general_ci"
    init_connect="SET NAMES utf8"
    default-character-set = utf8
    character-set-server = utf8
    collation-server = utf8_general_ci

    [client]
    default-character-set = utf8

    [mysqldump]
    default-character-set = utf8

Restart MySQL
Connect to mysql as root and run
mysql>status
If the output looks like the following, the settings are OK.

    mysql> status
    --------------
    mysql Ver 14.12 Distrib 5.0.37, for Win32 (ia32)

    Connection id: 1
    Current database:
    Current user: root@localhost
    SSL: Not in use
    Using delimiter: ;
    Server version: 5.0.37-community MySQL Community Edition (GPL)
    Protocol version: 10
    Connection: localhost via TCP/IP
    Server characterset: utf8
    Db characterset: utf8
    Client characterset: utf8
    Conn. characterset: utf8
    TCP port: 3306
    Uptime: 10 sec

    Threads: 1 Questions: 4 Slow queries: 0 Opens: 12 Flush tables: 1 Open tables: 6 Queries per second avg: 0.400
    --------------

    mysql>

Setting the initial root password

    C:\>mysql -uroot mysql

    mysql>update user set password=password('new-password') where user='root';
    mysql>flush privileges;
    mysql>exit

    C:\>mysql -uroot -pnew-password

Creating a database and a user

    C:\>mysql -uroot -ppassword

    mysql>CREATE DATABASE myproject_utf8 DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

    mysql>GRANT ALL PRIVILEGES ON *.* TO 'javamaster'@'localhost' IDENTIFIED BY '1234' WITH GRANT OPTION;

    mysql>GRANT ALL PRIVILEGES ON *.* TO 'javamaster'@'%' IDENTIFIED BY '1234' WITH GRANT OPTION;

    mysql>FLUSH PRIVILEGES;

    mysql>exit

    C:\>mysql -ujavamaster -p1234 myproject_utf8

    mysql>set names euckr;

CREATE DATABASE : creates a database named myproject_utf8 in the utf8 environment
First GRANT : creates a user with ID javamaster and password 1234 who may connect from the local machine only
Second GRANT : creates a user with ID javamaster and password 1234 who may also connect remotely
FLUSH PRIVILEGES : applies the privileges
mysql -ujavamaster : connects with the newly created account
set names euckr : in a utf8 environment, Korean text appears garbled at the command prompt when it is inserted into or selected from a table. Switching the setting with set names euckr; makes Korean display correctly (for MySQL 5).


Regular Expression Basics

2008. 2. 1. 13:41

Author: 전정호 (mahajjh@myscan.org)

Copyright (c) 2001 Jeon, Jeongho. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

This article explains regular expressions, which are essential for using and administering Unix. At the end it also describes the C library for processing regular expressions.

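1. What is a regular expression?

If you have used MS-DOS, you are probably familiar with symbols such as * and ?, called wildcards (Unix calls them glob patterns). They let you specify, for example, every GIF file starting with a as a*.gif, without listing similar file names one by one. A regular expression (abbreviated regexp, regex, or re) is likewise a way of describing patterns, like the * and ? of MS-DOS. Unlike MS-DOS wildcards, however, regular expressions are not limited to file names: they serve general purposes, including file contents, and they are more powerful.

Because Unix fundamentally offers a text-based rather than a graphical interface, regular expressions, a tool for finding text and replacing it with other text, are very important. In fact, they apply to so much of Unix use and administration that one may wonder whether Unix can be used at all without knowing them. The editors vi and emacs, the everyday tools grep, sed, and awk, Perl (sometimes called the portable shell), and procmail, which sorts mail automatically: regular expressions are involved in nearly every Unix tool. After some painful experiences, I personally count regular expressions as a fitting example of the saying(?) "if you are dim, your hands and feet suffer".

Unfortunately, tools differ somewhat in how much of the regular expression syntax they support, but a few trials will reveal the differences. Let us start with the basic, widely used regular expressions.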

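2. Regular expression basics

A regular expression can basically be divided into three kinds of parts.

Parts that correspond to characters
Parts that repeat the preceding part
Parts that correspond to no character but express position or combination

From here on you will meet characters with special meanings, like the * of MS-DOS. Learning regular expressions means learning these special characters and what they mean.

2.1. Parts that correspond to characters

First, ordinary letters and digits stand for themselves. Upper and lower case are, of course, distinguished.

    $ egrep 'GNU' COPYING
                        GNU GENERAL PUBLIC LICENSE
    freedom to share and change it.  By contrast, the GNU General Public
    the GNU Library General Public License instead.)  You can apply it to
    ...(omitted)...
    $

Above, egrep is a tool that finds the desired text in files. (It is a variant of the commonly used grep and accepts a richer set of regular expressions.) Its first argument is the regular expression describing the text to find. Here GNU is a regular expression that finds the places in the files that follow where the three characters G, N, U occur consecutively. The file used here, COPYING, is the GPL text that is easy to find in free software source code. In the original article the matched words are shown in bold to make the result clear.

Why the quotation marks around GNU? Characters used in regular expressions, such as *, ?, and |, also have special functions in the shell, and the quotes keep the shell from processing them. Also, when the pattern contains a space, as in egrep 'modified work' COPYING, the quotes make it a single argument. The quotes around GNU above are not actually needed, but I recommend always using them as a rule.

Instead of one particular character, you can also specify several possible characters.

    $ egrep '[Tt]he' COPYING
      The licenses for most software are designed to take away your
    freedom to share and change it.  By contrast, the GNU General Public
    software--to make sure the software is free for all its users.  This
    ...(omitted)...
    $

Above, [Tt] means that either T or t may appear at that position. In this way you list the possible characters between [ and ].

You can also specify a range of characters with - inside [], as in [a-z]. For example, [a-zA-Z0-9] covers all upper- and lower-case English letters and digits. In addition, putting ^ first inside [], as in [^a-z], designates any character other than the ones listed after it. That is, [^a-z] means any character except a lower-case English letter.

(In [a-z] the range means the ASCII code values from a (97) to z (122). Do not put the larger value first, as in [z-a]. ASCII code values can be viewed with man ascii.)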
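The bracket expressions above can be tried out in Python as well. This is only a sketch: Python's re module is Perl-style rather than POSIX egrep, although [Tt] and [^a-z] behave the same way in both, and the sample lines are made up for illustration (they are not taken from COPYING).

```python
import re

# Made-up sample lines (assumption: not the real COPYING file).
lines = ["The quick fox", "then it ran", "THE END", "x9"]

# [Tt]he: a 'T' or a 't' followed by "he"; matching is case-sensitive.
the_lines = [s for s in lines if re.search(r"[Tt]he", s)]

# [^a-z]: any single character that is NOT a lower-case ASCII letter.
non_lower = re.findall(r"[^a-z]", "ab-C3")

print(the_lines)
print(non_lower)
```

"THE END" is not selected because neither "The" nor "the" occurs in it.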

Finally, there is ., which matches any character (usually except the newline character). (It is the same as ? in MS-DOS.)

    $ egrep 'th..' COPYING
     of this license document, but changing it is not allowed.
    freedom to share and change it.  By contrast, the GNU General Public
    software--to make sure the software is free for all its users.  This
    General Public License applies to most of the Free Software
    Foundation's software and to any other program whose authors commit to
    ...(omitted)...
    $

This finds th followed by two characters. Note that This at the end of the third line does not match, because case is significant, and that in "the " the space counts as one character. A line such as program will individually obtain patent licenses, in effect making the is not printed: in its final the, th and one following character are found, but the next character is the newline, so the condition is not satisfied.
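The behavior of . described above can be checked with a small Python sketch. The input string is a made-up example, and Python's re is used here only as a stand-in for egrep; like egrep, its . does not match a newline by default.

```python
import re

# Two made-up lines joined by a newline; the last word "the" ends the text.
text = "the other\nthat the"

# th..: "th" followed by any two characters (a space counts, a newline does not).
matches = re.findall(r"th..", text)

# Case matters: "This" contains "Th", not "th".
no_match = re.search(r"th..", "This")

print(matches)
print(no_match)
```

The final "the" is not reported because only one character follows "th" before the text ends.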

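2.2. Parts that repeat the preceding part

This section covers *, ?, and +.

* means that the character just before it may be repeated zero or more times. For example, abc* matches all of

abccccccccc
abc
ab

The point to note is that, because it means "zero or more", the preceding character may be absent, as in the last case. (This is why the MS-DOS * is written .* as a regular expression.)

Similarly to *, ? means that the preceding character is absent or present once, and + means that the preceding character is repeated one or more times. So a+ is the same as aa*.

This raises the question of how to repeat abc as a whole. In that case, group the characters with the parentheses ( and ). So (abc)* matches all of

abcabcabcabc
abc
(an empty string)

The last example is the case of zero repetitions: an empty string with no characters at all. The phrase "the preceding character" used above now has to be corrected. *, ?, and + apply not to "the preceding character" but to "the preceding unit". By default a single character is a unit by itself, so in abc* the * applies to c, which is both the preceding character and the preceding unit. Parentheses, however, group characters into a unit, and in (abc)* the * applies to the preceding unit abc.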
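A short Python sketch of the repetition operators above. re.fullmatch is used so that the whole candidate string has to match, which is an illustrative choice (egrep searches for a match anywhere in a line); the candidate strings are made up.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that match the pattern in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = matching(r"abc*", ["ab", "abc", "abccc", "abd"])      # c zero or more times
opt = matching(r"abc?", ["ab", "abc", "abcc"])               # c zero or one time
plus = matching(r"abc+", ["ab", "abc", "abcc"])              # c one or more times
group = matching(r"(abc)*", ["", "abc", "abcabc", "abcab"])  # the unit abc repeated

print(star, opt, plus, group)
```

In the last line the empty string is accepted, which is the zero-repetition case, while "abcab" is rejected because the second unit is incomplete.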

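Caution
Note that (abc)* above can stand for zero repetitions, that is, for no characters at all. In that case a regular expression is considered to match regardless of the input, so a command such as egrep '(abc)*' COPYING prints every line even though COPYING contains no abc. In other words, egrep '(abc)*' COPYING | wc -l and wc -l COPYING report the same number of lines.

Another point to note is that a regular expression finds the longest part that satisfies the pattern. For abababab, the regular expression (ab)+ matches not ab or abab but all of abababab. This behavior is natural in a way, but it can cause problems if you are careless. For example, for <B>compiler</B> and <B>interpreter</B>, the regular expression <B>.*</B> does not find (the presumably intended) <B>compiler</B> but the whole of <B>compiler</B> and <B>interpreter</B>. To solve this, use <B>[^<]*</B>. As with [^<] in place of . here, the [^...] form is often used to restrict what is matched.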
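Both cautions above, the empty match and the longest match, can be reproduced in Python. This is a sketch under one assumption worth stating: Python's re is a backtracking engine with greedy quantifiers rather than a POSIX leftmost-longest matcher, but for these particular patterns the result is the same.

```python
import re

html = "<B>compiler</B> and <B>interpreter</B>"

# .* is greedy: it runs to the LAST </B> in the text.
greedy = re.search(r"<B>.*</B>", html).group()

# [^<]* cannot cross a '<', so the match stops at the first </B>.
narrow = re.search(r"<B>[^<]*</B>", html).group()

# (ab)+ consumes as many repetitions as it can.
longest = re.search(r"(ab)+", "abababab").group()

# (abc)* may match zero repetitions, so it "succeeds" on any input.
empty = re.search(r"(abc)*", "no such text here")

print(greedy)
print(narrow)
print(longest)
print(repr(empty.group()))
```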

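2.3. Parts that correspond to no character but express position or combination

The ^, $, and | covered here do not, unlike the parts above, correspond to a particular character, but they play an important role in expressing position and combination.

First, ^ means the start of a line.

    $ egrep '^[Tt]he ' COPYING
    the GNU Library General Public License instead.)  You can apply it to
    the term "modification".)  Each licensee is addressed as "you".
    the scope of this License.
    The source code for a work means the preferred form of the work for
    ...(omitted)...
    $

Note that the last character of the regular expression is a space. Without this space, lines starting with These or themselves, would also be found. As this shows, when writing a regular expression you must take care neither to miss parts you want to find nor to include anything you do not want. This is not a problem when, as now, you type a regular expression and inspect the results one by one, but regular expressions are often used in scripts that process many documents at once, so care is needed. A badly written regular expression can not only fail to produce the desired result but also ruin the original.

Thus ^ does not correspond to a particular character the way [Tt] does, but it helps select the desired text. Conversely, $ stands for the end of a line. So a regular expression such as ^$ finds empty lines.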
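The effect of the anchors, and of the trailing space discussed above, can be seen in a short Python sketch. The sample lines are made up, and each string is treated as one line, so ^ and $ mean its start and end.

```python
import re

# Made-up sample lines; the third one is empty.
lines = ["the scope", "These terms", "", "see the end"]

# With the trailing space only the article "the"/"The" at line start matches.
strict = [s for s in lines if re.search(r"^[Tt]he ", s)]

# Without it, words that merely begin with "The"/"the" match too.
loose = [s for s in lines if re.search(r"^[Tt]he", s)]

# ^$ matches a line with nothing between its start and its end.
blank = [i for i, s in enumerate(lines) if re.search(r"^$", s)]

print(strict, loose, blank)
```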

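| means choosing one of the items before and after the symbol. That is, to find this (This) or that (That) in a document,

this|This|that|That
[tT]his|[tT]hat
[tT]his|hat - Wrong! This regular expression finds [tT]his or hat.
[tT](his|hat)
[tT]h(is|at)

are all possible, except the third. Comparing the third and fourth cases shows what the parentheses do.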
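The difference between the wrong form and the parenthesized forms can be checked with a small Python sketch (the word list is made up; re.fullmatch requires the whole word to match).

```python
import re

words = ["this", "That", "hat", "what"]

# Without parentheses, | splits the WHOLE expression: "[tT]his" or "hat".
wrong = [w for w in words if re.fullmatch(r"[tT]his|hat", w)]

# Parentheses limit the alternation to the part after "[tT]h".
right = [w for w in words if re.fullmatch(r"[tT]h(is|at)", w)]

print(wrong)
print(right)
```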

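2.4. Ordinary characters and special characters

By now it may seem odd that ^ is used with two meanings. The characters used in regular expressions can be broadly divided into ordinary characters and special characters. Special characters are those that are not interpreted literally in a regular expression, such as the ones covered so far (in order): [, ], -, ^, ., *, ?, +, (, ), $, |. Conversely, characters that are not special are ordinary characters and have their literal meaning, like G, N, U.

Which characters are special depends on where they are used. More precisely, the set of special characters differs inside and outside [].

Outside [], the special characters are ^, ., *, ?, +, (, ), $, and |, that is, all of the above except -. Here ^ means the start of a line.

Inside [], however, only - and ^ are special, and the other characters become ordinary. That is, [*?+] does not mean repetition but literally one of *, ?, or +. Inside [], a ^ (in first position) means matching a character that does not satisfy the condition that follows.

2.5. Using characters that are special characters

What if the text you want to find contains a special character? Suppose you want to find what? ending in a question mark. If you write egrep 'what?' ..., then, because ? is special, you also find whale, which contains wha. Likewise, searching with 3.14 also finds 3+14 and the like.

Keeping in mind that the special characters differ inside and outside [], let us look at each case. First, outside []:

Put \ in front of the special character. Examples: what\?, 3\.14
Use []. Examples: what[?], 3[.]14

The first method is usually called escaping: putting \ before a special character removes its special function. The second method exploits the fact that many characters that are special outside [] become ordinary inside []. The first method is the more common one.

Note that the \ used in the first method is itself a special character, one that turns the following special character into an ordinary one, so a literal \ must be written \\. Of course [\] also works.

Special characters inside [] are handled by changing their position. First, ^ is meaningful only when it comes first, as in [^abc], so place it elsewhere, as in [abc^]. - is meaningful only between two characters, as in [a-z], so place it at the very start or end, as in [-abc] or [abc-].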
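The escaping rules above can be sketched in Python. Two caveats, both assumptions about the tool rather than statements from the article: Python's re follows Perl-style syntax, so a lone backslash inside brackets ([\]) is not accepted there, and re.escape is a Python convenience with no direct egrep counterpart.

```python
import re

# Unescaped ? makes the 't' optional, so "wha" inside "whale" matches.
hit_whale = re.search(r"what?", "whale") is not None

# Escaped with a backslash or placed in brackets, ? is a literal question mark.
esc_whale = re.search(r"what\?", "whale") is not None
bracket_q = re.search(r"what[?]", "what? now") is not None

# Unescaped . matches any character, so "3+14" matches 3.14.
dot_loose = re.fullmatch(r"3.14", "3+14") is not None
dot_exact = re.fullmatch(r"3\.14", "3+14") is not None

# re.escape adds the backslashes for you.
escaped = re.escape("3.14")

# Inside [], a leading - and a non-leading ^ are ordinary characters.
literal = re.findall(r"[-abc^]", "a-^z")

print(hit_whale, esc_whale, bracket_q, dot_loose, dot_exact, escaped, literal)
```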

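(Some tools, such as grep, work the other way round: putting \ in front of an ordinary character makes it special. See the description of each tool below.)

3. Advanced regular expressions

Although the title says advanced, this section covers features that differ between tools or may be missing from some of them.

3.1. Detailed repetition

The number of repetitions can be controlled precisely.

{n} - exactly n repetitions. a{3} is the same as aaa.
{n,} - n or more repetitions. a{3,} is the same as aaaa*.
{n,m} - at least n and at most m repetitions. a{2,4} is the same as aaa?a?.

Of course, parentheses can designate the unit to repeat, as in (abc){2,4}. Note that { and } are special characters here, like *, ?, and +. (Strictly speaking, } is not a special character.)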
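The equivalences listed above can be verified with a brief Python sketch that compares each counted form against its spelled-out form on a few made-up strings.

```python
import re

candidates = ["a", "aa", "aaa", "aaaa", "aaaaa"]

def matching(pattern):
    """Return the candidates that match the pattern in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

counted = matching(r"a{2,4}")    # two to four a's
spelled = matching(r"aaa?a?")    # the equivalent form given above
at_least = matching(r"a{3,}")    # three or more a's
at_least2 = matching(r"aaaa*")   # the equivalent form given above
grouped = re.fullmatch(r"(abc){2,4}", "abcabcabc") is not None

print(counted, at_least, grouped)
```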

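3.2. Remembering

The parentheses that group several characters into a unit are also used to remember the part found by the regular expression so that it can be used elsewhere. For example, an HTML heading tag can be found (in egrep) with <[Hh]([1-6])>.*</[Hh]\1>. Here the ( and ) of ([1-6]) remember the part matched between them, and \1 uses it (the first remembered content). That is, in <H2>Conclusion</H2> only </H2> satisfies the pattern; </H1>, </H3>, and so on do not.

(...) can be used several times (even nested), and \n refers to the n-th remembered part. The order is the order of the opening ( where remembering starts.

Here ( and ) are special characters and plain \( and \) are ordinary characters, but in some tools it is the other way round.

This feature is also used frequently in substitution. See the vi and sed sections below.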
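The heading-tag pattern above works unchanged in Python, which also uses ( ) for remembering and \1 for the back-reference. A minimal sketch with made-up input strings:

```python
import re

# ([1-6]) remembers the heading level; \1 requires the same level in the closing tag.
heading = re.compile(r"<[Hh]([1-6])>.*</[Hh]\1>")

ok = heading.search("<H2>Conclusion</H2>")
bad = heading.search("<H2>Conclusion</H3>")
mixed = heading.search("<h1>Title</H1>")

print(ok.group(1))   # the remembered level
print(bad)
print(mixed.group(1))
```

The third case matches because [Hh] is not remembered; only the digit has to be repeated.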

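3.3. Finding words

We saw earlier that searching for the also finds them and the like. So how do you find only the definite article the?

A simple idea is ' the ', with a space added before and after. This regular expression has two problems, however. First, there are other whitespace characters, such as the tab. Second, it fails to find the when it is the first or last word on a line. You could write a regular expression that handles all of these cases with a complicated combination of [], ^, $, and |, but because this is such a common need, a simpler way is provided.

That way is \< and \>. \< stands for the position at the start of a word (going from whitespace or another non-word character to a word character), and \> for the position at the end of a word. Like ^ and $, they correspond to no character and stand for a position only. The answer is therefore \<the\>.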
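A Python sketch of the word-search problem above. Note the substitution: Python's re does not support \< and \>, so the sketch uses \b, its word-boundary assertion, which covers both the start and the end of a word. The input text is made up.

```python
import re

# Made-up text: "the" at the start (followed by a tab), inside two other
# words, and at the very end of the line.
text = "the\tthem other the"

plain = re.findall(r"the", text)         # also hits "them" and "other"
spaced = re.findall(r" the ", text)      # misses tab-separated and edge cases
whole = re.findall(r"\bthe\b", text)     # only the standalone word

print(len(plain), spaced, len(whole))
```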

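3.4. Shorthand expressions

Regular expressions also provide shorthand forms for frequently used expressions. For example, in vim \i is the same as [_a-zA-Z0-9] (the characters used in C identifiers). Such shorthand forms vary widely between tools, however, so consult the relevant documentation.

The shorthand expressions defined by POSIX.2 are as follows. (They resemble the is*() functions declared in <ctype.h> in C.) The exact values a shorthand stands for change with the locale. The values shown here are those used in English-speaking locales. Other languages may have different values, as with the German umlaut (ä).

[:alnum:] - letters and digits. [a-zA-Z0-9]
[:alpha:] - letters. [a-zA-Z]
[:cntrl:] - control characters. ASCII values 0x00-0x1F and 0x7F
[:digit:] - digits. [0-9]
[:graph:] - characters other than control characters and space. ASCII values 0x21-0x7E
[:lower:] - lower-case letters. [a-z]
[:print:] - characters other than control characters. ASCII values 0x20-0x7E
[:punct:] - characters in [:graph:] that are not in [:alnum:], such as !, @, #, :, and ,
[:space:] - space, tab, carriage return, new line, vertical tab, formfeed. ASCII values 0x09-0x0D and 0x20
[:upper:] - upper-case letters. [A-Z]
[:xdigit:] - characters used in hexadecimal numbers. [0-9a-fA-F]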
** Summary

Symbol	Meaning	Example
^ (caret)	Marks the start of a line or string	^aaa (true if the string starts with aaa, false otherwise)
$ (dollar)	Marks the end of a line or string	aaa$ (true if the string ends with aaa, false otherwise)
. (period)	Stands for any single character	^a.c (true for abc, adc, aZc, etc. at the start of the string; false for aa)
		a..b$ (true if the string ends with aaab, abbb, azzb, etc.)
[] (bracket)	A set or range of characters; "-" between two characters denotes a range	A leading "^" inside [] means not
{} (brace)	The number inside {} is the count, or range of counts, of the preceding character	a{3} (only aaa, three repetitions of 'a')
* (asterisk)	The character before "*" appears zero or more times	ab*c ('b' zero or more times: ac, ackdddd, abc, abbc, abbbbbbbc, etc.)
+	The character before "+" appears one or more times	ab+c ('b' one or more times: abc, abckdddd, abbc, abbbbbbbc, etc.; not ac)
?	The character before "?" appears zero times or once	ab?c ('b' zero times or once: ac, abc, abcd, etc.)
() (parenthesis)	Groups a pattern within a regular expression
| (bar)	Means or	a|b|c (one of a, b, c; the same as [a-c])
\ (backslash)	Put '\' before any of the special characters above to treat it as a literal character	filename\.ext (stands for "filename.ext")
\s	Whitespace

In a regular expression, all characters other than the special characters mentioned above are treated as ordinary characters.

Some of the operators above are not supported in vi; vi has no + operator, for instance. The set of supported operators differs considerably between regular expression libraries. The general trend these days is for Perl-style regular expressions to become the standard, and Perl's regular expressions are among the most complex and feature-rich.

 * [abc] (any one of a, b, c; the same as [a-c])
 * [Yy] (Y or y)
 * [A-Za-z0-9] (any letter or digit)
 * [-A-Z] ("-" (hyphen) and all upper-case letters)
 * [^a-z] (a character other than a lower-case letter)
 * [^0-9] (a character other than a digit)
 * [[:digit:]] (the same as [0-9])
 * a{3,} ('a' repeated three or more times: aaa, aaaa, aaaaa, ...)
 * a{3,5} (only aaa, aaaa, aaaaa)
 * ab{2,3} (only abb and abbb)
 * [0-9]{2} (a two-digit number)
 * doc[7-9]{2} (doc77, doc87, doc97, etc.)
 * [^Zz]{5} (five characters containing no Z or z: abcde, ttttt, etc.)
 * .{3,4}er (three or four characters before 'er': Peter, mother, etc.)
 * * (there is no preceding character, so the meaning depends on the tool; POSIX basic regular expressions treat a leading * as a literal asterisk)
 * .* (the preceding character is ".", so zero or more of any character: any string, including the empty string)
 * ab* ('b' zero or more times: found in a, accc, abb, abbbbbbb, etc.)
 * a* ('a' zero or more times; because zero is allowed, it matches in k, kdd, sdfrrt, a, aaaa, abb, the empty string, etc.)
 * doc[7-9]* (doc7, doc777, doc778989, doc, etc.)
 * [A-Z].* (an upper-case letter followed by any characters)
 * like.* (the preceding character is '.', so like followed by zero or more characters: like, likely, liker, likelihood, etc.)
 * ab+ ('b' one or more times: found in ab, abccc, abb, abbbbbbb, etc.)
 * like.+ (the preceding character is '.', so like followed by one or more characters: likely, liker, likelihood, etc., but not like)
 * [A-Z]+ (one or more upper-case letters)
 * yes|Yes (yes or Yes; the same as [yY]es)
 * korea|japan|chinese (one of korea, japan, chinese)
 * [\?\[\\\]] (one of '?', '[', '\', ']')

3.5. Regular expressions you can see

There are programs that show visually how a regular expression finds a pattern.

Visual REGEXP (uses Tcl/Tk), or a previously downloaded executable and its usage notes
RegExplorer (uses Qt)

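4. Using regular expressions

Now let us see how these regular expressions are used in practice, with examples from the tools used most often: vi, grep/egrep/fgrep, and sed/awk.

4.1. In vi

In vi, regular expressions are used in the ':' state. (Commands run in this state are actually handled by a program called ed or ex, so this state is usually called "ed mode".) To find a pattern in a document, use /pattern (to search after the cursor) or ?pattern (to search before the cursor).

Regular expressions become powerful when combined with text substitution. The substitution command is as follows.

:range s/before/after/modifiers

"range" is the range over which the command runs. Usually it is 1,$ (from the first line to the last), or % for short, which designates the whole document being edited.

The "s" that follows is the substitute command.

"before" and "after" hold what is to be replaced. The regular expression goes in "before". You can use ., *, ^, $, [], \(...\), \<...\>, and the POSIX.2 shorthand expressions. Note that here the special characters that group several characters into a unit and remember what was found are \( and \). Conversely, ( and ) are ordinary characters. vim (VI iMproved) additionally supports |, +, = (equivalent to ?), and {n,m}, but they must be preceded by \. vim also has shorthand expressions such as \i, \k, \p, and \s.

"after" may contain \n and &. \n is the part corresponding to the n-th \(...\) in "before", and & stands for everything that "before" matched. For example, :%s/\([0-9][0-9]*\) \([Cc]hapter\)/\2 \1/ replaces text such as 12 Chapter with Chapter 12, and :%s/F[1-9][12]*/<B>&<\/B>/g makes every word from "F1" to "F12" in an HTML document bold. (Caution! & is not a regular expression special character, but it is special in vi, so use \& for a literal &.) There are also features such as \U (everything that follows in upper case) and \L (everything that follows in lower case); the lower-case forms \u and \l change only the next character.

"modifiers" determine the details of the substitute command. Append only the ones you need.

g (global) - when several parts of a line satisfy the regular expression, replace them all. Without this modifier only the first is replaced.
c (confirm) - when a match is found, ask for confirmation before replacing.
i (ignore-case) - ignore case when searching. That is, :%s/abc/XXX/i can be used instead of :%s/[aA][bB][cC]/XXX/.

A final point to note: the substitute command separates its parts with the / character, so to use / in "before" or "after" you must write \/. If necessary, another character can be used instead of /. For example, :%s#/usr/local/bin/#/usr/bin/#g is easier to read than :%s/\/usr\/local\/bin\//\/usr\/bin\//g.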
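For comparison, the same substitutions can be sketched in Python with re.sub. This is not vi syntax: Python groups with plain ( ), refers to the whole match with \g<0> instead of &, replaces every match by default (count=1 mimics leaving out the g modifier), and ignores case through a flag. The input strings are made up.

```python
import re

# \1 and \2 refer to the first and second remembered groups.
swapped = re.sub(r"([0-9][0-9]*) ([Cc]hapter)", r"\2 \1", "12 Chapter")

# \g<0> is the whole match, like & in vi.
bold = re.sub(r"F[1-9][12]*", r"<B>\g<0></B>", "Press F1 or F12")

# Only the first match on the line, as without the g modifier.
first_only = re.sub(r"a", "X", "aaa", count=1)

# Case-insensitive matching, as with the i modifier.
nocase = re.sub(r"abc", "XXX", "ABC abc", flags=re.IGNORECASE)

# No delimiter character is involved, so slashes need no escaping.
path = re.sub(r"/usr/local/bin/", "/usr/bin/", "/usr/local/bin/perl")

print(swapped, bold, first_only, nocase, path, sep="\n")
```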

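4.2. In grep/egrep/fgrep

grep is short for Global Regular Expression Print (the ed command :g/re/p) and is a command that finds a regular expression in its input. grep has the variants egrep and fgrep. Traditionally, egrep supports a richer set of regular expressions than grep, and fgrep does not support regular expressions at all and is meant for fast searching. In GNU grep, egrep is the same as grep -E and fgrep the same as grep -F.

Both grep and egrep support ., *, ?, +, {n,m}, ^, $, |, [], (...), \n, \<...\>, and the POSIX.2 shorthand expressions. However, grep treats ?, +, {, |, (, and ) as ordinary characters, so to use them as special characters you must put \ in front of them.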

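4.3. In sed/awk

...

5. Perl regular expressions

...

6. Applying regular expressions

7. Programming with regular expressions

Regardless of the programming language, programming with regular expressions follows much the same pattern. First, "compile" the regular expression to be used. Compiling here does not mean turning the regular expression into an executable; it means building an internal data structure for processing it. With that data structure, the regular expression can be processed quickly. After compiling, use the compiled data structure to perform the searches and substitutions you need. Finally, release the data structure when you are done with it. In some programming languages these steps are not needed.

7.1. C

glibc (the GNU C Library) provides the following functions for regular expressions.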

#include <regex.h>

int regcomp(regex_t *compiled, const char *pattern, int cflags);
int regexec(const regex_t *compiled, const char *string, size_t nmatch, regmatch_t matchptr[], int eflags);
void regfree(regex_t *compiled);
size_t regerror(int errcode, const regex_t *compiled, char *buffer, size_t length);

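First, include regex.h, which declares the functions and data types. regcomp() compiles the regular expression given in pattern and stores the result in compiled. The cflags argument specifies options for how the regular expression is processed. regcomp() returns 0 on success and a nonzero value if an error occurs.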
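
The compile, search, and release cycle might look roughly like the sketch below. The helper name search_once() and its return convention are assumptions made for illustration; only regcomp(), regexec(), regerror(), and regfree() come from regex.h. When regcomp() returns a nonzero code, regerror() turns that code into a readable message.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles `pattern` (extended syntax), searches
 * `text` once, and releases the compiled pattern again.
 * Returns 1 on a match, 0 if regexec() reports anything other than a
 * match, and -1 if the pattern does not compile; in that case a message
 * from regerror() is written to errbuf. */
static int search_once(const char *pattern, const char *text,
                       char *errbuf, size_t errbuf_size)
{
    regex_t compiled;

    assert(errbuf != NULL && errbuf_size > 0);
    errbuf[0] = '\0';

    /* Step 1: compile. A nonzero return value is an error code. */
    int rc = regcomp(&compiled, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        regerror(rc, &compiled, errbuf, errbuf_size);
        return -1;
    }

    /* Step 2: search with the compiled pattern. */
    rc = regexec(&compiled, text, 0, NULL, 0);

    /* Step 3: release the internal data structure. */
    regfree(&compiled);

    return rc == 0 ? 1 : 0;
}
```

A caller would pass a small buffer, for example char err[128], and print its contents only when the function returns -1.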

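Table 1. cflags argument

`REG_EXTENDED`	Use the POSIX extended regular expression syntax instead of the basic syntax
`REG_ICASE`	Ignore case
`REG_NOSUB`	Do not record the positions of the parts matched by parentheses
`REG_NEWLINE`	Handle multiple lines. Without this option, `.` also matches a newline character, and `^` and `$` mean only the start and end of the whole string being searched (even if it contains newline characters).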
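
The effect of these flags can be checked with a small experiment. This is a sketch under assumptions: the helper name matches_with() is invented here, and the behaviour described is what POSIX specifies for REG_ICASE and REG_NEWLINE.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles `pattern` with the given cflags and
 * returns 1 if `text` contains a match, 0 if not, -1 on a compile error. */
static int matches_with(const char *pattern, const char *text, int cflags)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    int found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}

/* Two-line sample input used to show the REG_NEWLINE difference. */
static const char two_lines[] = "first\nsecond";
```

For example, matches_with("abc", "xABCx", 0) finds nothing while the same call with REG_ICASE matches. On two_lines, "^second" matches only with REG_NEWLINE, because `^` then also anchors after the newline, whereas "first.second" matches only without it, because `.` then no longer matches the newline.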

The actual search is done with regexec(). Pass the string to be searched in string, and ...
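
As an illustration of regexec() reporting match positions, the sketch below redoes the "12 Chapter" to "Chapter 12" swap from section 4.1 in C. The function name swap_chapter() and its return convention are assumptions for this example. With nmatch set to 3, regexec() fills matchptr[0] with the offsets of the whole match and matchptr[1] and matchptr[2] with the offsets of the two parenthesized groups.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies `text` to `out`, swapping the first
 * "<digits> Chapter" into "Chapter <digits>".
 * Returns 1 if a swap was made, 0 if nothing matched (out is then a plain
 * copy of text), and -1 on a compile error or if out is too small. */
static int swap_chapter(const char *text, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[3]; /* m[0]: whole match, m[1] and m[2]: the two groups */

    assert(text != NULL && out != NULL && out_size > 0);

    /* The swapped text has the same length as the input. */
    if (strlen(text) >= out_size)
        return -1;
    if (regcomp(&re, "([0-9]+) ([Cc]hapter)", REG_EXTENDED) != 0)
        return -1;

    int rc = regexec(&re, text, 3, m, 0);
    regfree(&re);
    if (rc != 0) {
        snprintf(out, out_size, "%s", text);
        return 0;
    }

    /* rm_so and rm_eo are byte offsets into text: start and one past end.
     * Output: text before the match, group 2, a space, group 1, the rest. */
    snprintf(out, out_size, "%.*s%.*s %.*s%s",
             (int)m[0].rm_so, text,
             (int)(m[2].rm_eo - m[2].rm_so), text + m[2].rm_so,
             (int)(m[1].rm_eo - m[1].rm_so), text + m[1].rm_so,
             text + m[0].rm_eo);
    return 1;
}
```

Unlike the vi command, this sketch rewrites only the first match; handling every match would mean calling regexec() again on the text after m[0].rm_eo.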

See http://www.onjava.com/pub/a/onjava/2003/11/26/regex.html

7.2. Java

7.3. Python

7.4. PHP

References

grep(1), regex(3), regex(7), fnmatch(3) man pages
GNU C Library documentation
Learning the vi Editor, 6th ed., Linda Lamb, O'Reilly, 1998
sed & awk, 2nd ed., Dale Dougherty & Arnold Robbins, O'Reilly, 1997


A font for editors

2008. 2. 1. 13:02


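Ever since I started working with text files on Windows, I have been using GulimChe (굴림체). As a default Windows font it is very readable and a pretty decent typeface, but it has a few flaws for coding: lowercase l and uppercase I look almost the same, the letter O and the digit 0 look exactly the same, and the backslash is rendered as the won sign. So I had long meant to switch to a different coding font, but I never found one I liked and kept using GulimChe until now.

The conditions I wanted the font to meet were:

1. It must be a fixed-width font
2. I, l, 1, O, and 0 must be clearly distinguishable
3. Its readability must rival GulimChe's
4. The backslash must be displayed as \
5. It must render at 9pt without anti-aliasing
6. Hangul must be rendered in Gulim at 9pt

There are plenty of excellent Latin fonts, but most of them failed condition 6: when mixed with Hangul, the Hangul often came out in BatangChe, or in GulimChe but enlarged to about 10pt. In the end, wondering whether no font met all six conditions, I got as far as thinking "why not just make one myself?" and spent a weekend tinkering.

Because it has to be mixed with Hangul, I set the glyph width to 5 pixels, the same as GulimChe, and built the font by blending the Latin glyphs of Verdana and GulimChe where each seemed to fit. Maybe because I am so used to GulimChe, it looks a little messy to me, but the longer I look at it the more I seem to get used to it..;;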

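GulimChe

The newly made Crizin Code font (what a name..;;)

Comparison with Courier, Courier New, Lucida Sans Typewriter, and GulimChe

Comparison with other fonts

(2006.05.23) Fixed a bug where some special characters were not displayed (huge thanks to KYG!!)

CrizinCode.ttf

Version : 2006.05.23.01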
