How Internet Search Engines Work
"Spiders" take a
Web page's content and create key search words that enable online users to find
pages they're looking for.
Web Crawling
When most
people talk about Internet search engines, they really mean World Wide Web
search engines. Before the Web became the most visible part of the Internet,
there were already search engines in place to help people find information on
the Net. Programs with names like "gopher" and "Archie" kept
indexes of files stored on servers connected to the Internet, and dramatically reduced the
amount of time required to find programs and documents. In the late 1980s,
getting serious value from the Internet meant knowing how to use gopher,
Archie, Veronica and the rest.
Today, most
Internet users limit their searches to the Web, so we'll limit this article to search
engines that focus on the contents of Web pages.
Before a
search engine can tell you where a file or document is, that file must be found. To
find information on the hundreds of millions of Web pages that exist, a search
engine employs special software robots, called spiders,
to build lists of the words found on Web sites. When a spider is building its
lists, the process is called Web crawling.
(There are some disadvantages to calling part of the Internet the World Wide
Web -- a large set of arachnid-centric names for tools is one of them.) In
order to build and maintain a useful list of words, a search engine's spiders
have to look at a lot of pages.
How does
any spider start its travels over the Web? The usual starting points are lists
of heavily used servers and very popular pages. The spider
will begin with a popular site, indexing the words on its pages and following
every link found within the site. In this way, the spidering system quickly
begins to travel, spreading out across the most widely used portions of the
Web.
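The crawl loop itself is simple to sketch. The following Python fragment is a toy illustration of the idea, not any engine's actual code: it starts from a hand-picked seed list (standing in for those heavily used servers and popular pages), indexes the words on each page, and queues every link it finds. All names and limits here are invented for the example.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkAndTextParser(HTMLParser):
    """Collects the hyperlinks and visible text found on one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.split())

def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl: index each page's words, then follow its links."""
    frontier = list(seed_urls)     # pages waiting to be visited
    seen = set()                   # pages already visited
    index = {}                     # word -> set of URLs containing it
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue               # skip pages that fail to load
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word.lower(), set()).add(url)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).scheme in ("http", "https"):
                frontier.append(absolute)
    return index

# The seed list plays the role of the popular starting pages.
# index = crawl(["https://example.com/"])
```

A production spider adds politeness delays, robots.txt checks and duplicate detection, all omitted here.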
Google began as an academic search engine.
In the paper that describes how the system was built, Sergey Brin and Lawrence
Page give an example of how quickly their spiders can work. They built their
initial system to use multiple spiders, usually three at one time. Each spider
could keep about 300 connections to Web pages open at a time. At its peak
performance, using four spiders, their system could crawl over 100 pages per
second, generating around 600 kilobytes of data each second.
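Those figures come from keeping many network connections open simultaneously rather than fetching one page at a time. Brin and Page's crawler used its own asynchronous I/O; the sketch below substitutes a Python thread pool to show the same idea, with the pool size echoing the roughly 300 connections each of their spiders held. It is an illustration under those assumptions, not a reconstruction of their system.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    """Download one page; a real spider would parse and index the result."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return url, resp.read()
    except Exception:
        return url, b""            # failed fetches return an empty body

def crawl_batch(urls, connections=300):
    """Hold many connections open at once instead of fetching serially."""
    with ThreadPoolExecutor(max_workers=connections) as pool:
        return list(pool.map(fetch, urls))
```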
Keeping
everything running quickly meant building a system to feed necessary
information to the spiders. The early Google system had a server dedicated to
providing URLs to the spiders. Rather than depending on an Internet service provider for the domain name server (DNS) that translates a
server's name into an address, Google had its own DNS, in order to keep delays
to a minimum.
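Running your own resolver matters because every fetch begins with a DNS lookup, and waiting on an outside server for each one adds up across millions of pages. A local cache captures the essence of the trick; the sketch below memoizes lookups in-process, as a stand-in for Google's dedicated DNS rather than a description of it.

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=100_000)
def resolve(hostname):
    """Cache hostname-to-IP lookups so repeat visits to a server
    never wait on an external DNS round trip."""
    return socket.gethostbyname(hostname)

# First call pays for a real network lookup; later calls return instantly.
# ip = resolve("example.com")
```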
When the
Google spider looked at an HTML page, it took note of two things:
· The words within the page
· Where the words were found
Words
occurring in the title, subtitles, meta tags
and other positions of relative importance were noted for special consideration
during a subsequent user search. The Google spider was built to index every
significant word on a page, leaving out the articles "a," "an"
and "the." Other spiders take different approaches.
These different approaches usually attempt to make the spider
operate faster, allow users to search more efficiently, or both. For example,
some spiders will keep track of the words in the title, sub-headings and links,
along with the 100 most frequently used words on the page and each word in the
first 20 lines of text. Lycos is said to use this approach to spidering the
Web.
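Under that reported Lycos-style approach, a page boils down to its title words, its 100 most frequent words, and every word in the first 20 lines. The sketch below shows one way to compute such a summary; it leaves out sub-headings and link text for brevity, and its function name is invented.

```python
from collections import Counter

def summarize_page(title, lines):
    """Keep the title words, the 100 most frequent words on the page,
    and every word from the first 20 lines of text."""
    all_words = [w.lower() for line in lines for w in line.split()]
    top_words = [w for w, _ in Counter(all_words).most_common(100)]
    early_words = {w.lower() for line in lines[:20] for w in line.split()}
    return {
        "title": [w.lower() for w in title.split()],
        "frequent": top_words,
        "early": sorted(early_words),
    }
```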
Other
systems, such as AltaVista, go in the other direction, indexing every single
word on a page, including "a," "an," "the" and
other "insignificant" words. The push to completeness in this
approach is matched by other systems in the attention given to the unseen
portion of the Web page, the meta tags.
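Meta tags sit in a page's head section and are never displayed to readers, which is why they are the unseen portion of the page. A spider can read them with a few lines of parsing; the class below is a hypothetical example using Python's standard HTML parser.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Pull the keyword and description meta tags out of a page's head."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

parser = MetaTagParser()
parser.feed('<meta name="keywords" content="search, spider, index">')
# parser.meta -> {'keywords': 'search, spider, index'}
```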