About the Web Crawler and Indexer
I wrote this indexing software to help me understand how to
speed up big Web indices with special hardware, as part of
CS245.
You can use it to search for words in web pages at Harvard.
The Crawler
Here is the Perl source for my web crawler.
You'll also want webget.c, web.c, web.h, and weblinks.c.
The crawler fetches pages, accumulates them in a huge file, and follows
the links in each fetched page.
It keeps track of the URLs it has yet to follow, and the
URLs it has already followed, in a DB file.
It avoids talking to any one server more than once per minute.
It remembers which servers seem to be down and avoids
them for half an hour.
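The per-server politeness rules can be sketched as a small in-memory table
recording when each server was last contacted and, if it seemed down, when to
try it again. This is only an illustration with invented names; the real
crawler keeps its state in Perl, not C.

```c
/* Sketch (not the crawler's actual code) of per-server politeness:
   fetch from any one server at most once a minute, and avoid a
   server that seemed down for half an hour. */
#include <string.h>
#include <time.h>

#define MAX_SERVERS  1024
#define POLITE_GAP   60          /* at most one fetch per minute   */
#define DOWN_PENALTY (30 * 60)   /* avoid a dead server for 30 min */

struct server {
    char name[128];
    time_t last_fetch;   /* 0 if never contacted */
    time_t down_until;   /* 0 if believed up     */
};

static struct server servers[MAX_SERVERS];
static int nservers;

static struct server *lookup(const char *name)
{
    for (int i = 0; i < nservers; i++)
        if (strcmp(servers[i].name, name) == 0)
            return &servers[i];
    struct server *s = &servers[nservers++];
    strncpy(s->name, name, sizeof s->name - 1);
    s->name[sizeof s->name - 1] = '\0';
    s->last_fetch = s->down_until = 0;
    return s;
}

/* May we fetch from this server right now? */
int may_fetch(const char *name, time_t now)
{
    struct server *s = lookup(name);
    if (now < s->down_until)
        return 0;
    if (s->last_fetch && now - s->last_fetch < POLITE_GAP)
        return 0;
    return 1;
}

void note_fetch(const char *name, time_t now) { lookup(name)->last_fetch = now; }
void note_down(const char *name, time_t now)  { lookup(name)->down_until = now + DOWN_PENALTY; }
```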
It does not know about the robot exclusion convention,
which makes it inappropriate for serious use.
It is not particularly high-performance, because it is
not multi-threaded.
The Indexer
indexer.c
builds an inverted index from
the output of the crawler (or any other set of files).
The basic idea:
- Replace every non-alphanumeric character with space.
- Mark each word of input with its document number and
offset in the document. View the input as word, document,
offset tuples.
- Sort the tuples (this takes a long time).
- Delete the word from each tuple.
Now you have a large
postings
file containing document/offset pairs;
the pairs for each word are contiguous and sorted.
- Build a hash table mapping each word to the location of
its first entry in the postings file.
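The tokenizing and sorting steps above can be sketched in a few lines of C.
This is an in-memory toy with invented names, not indexer.c itself, which
works on files and sorts on disk.

```c
/* Sketch of the tuple scheme: tokenize each document into
   (word, document, offset) tuples, then sort them so that each
   word's document/offset pairs end up contiguous and sorted. */
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct tuple { char word[32]; int doc, off; };

/* Order by word, then document, then offset. */
static int cmp(const void *a, const void *b)
{
    const struct tuple *x = a, *y = b;
    int c = strcmp(x->word, y->word);
    if (c) return c;
    if (x->doc != y->doc) return x->doc - y->doc;
    return x->off - y->off;
}

/* Split one document into tuples; every non-alphanumeric
   character acts as a separator.  Returns the tuple count. */
int tokenize(const char *text, int doc, struct tuple *out, int max)
{
    int n = 0;
    const char *p = text;
    while (*p && n < max) {
        while (*p && !isalnum((unsigned char)*p))
            p++;
        if (!*p)
            break;
        int off = (int)(p - text), len = 0;
        while (isalnum((unsigned char)*p) && len < 31)
            out[n].word[len++] = (char)tolower((unsigned char)*p++);
        out[n].word[len] = '\0';
        out[n].doc = doc;
        out[n].off = off;
        n++;
        while (isalnum((unsigned char)*p))   /* skip any overlong tail */
            p++;
    }
    return n;
}
```

After tokenizing every document, one call to qsort() with cmp puts the tuples
in postings order; writing out only the document and offset fields yields the
postings file.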
The design was driven by a desire to avoid random disk accesses.
The only data structure that is random access and might not fit
in memory is the hash table of words.
This indexer burns up a lot of temporary disk space.
See Edward Fox and Whay Lee,
FAST-INV: A Fast Algorithm for Building Large Inverted Files,
Virginia Polytechnic Institute TR 91-10
for some more parsimonious approaches.
The Querier
querier.c
takes a set of terms (words) from the user as input.
It performs joins on the postings lists for the terms;
these involve linear scans since the postings
entries are sorted.
The result is a sorted list of document/offset tuples,
one for each occurrence of each term in every document
that contained all the terms.
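The linear-scan join works like a merge: walk both sorted lists in step,
advancing whichever cursor is behind. A minimal sketch, reduced to document
numbers only (querier.c also carries offsets), with invented names:

```c
/* Intersect two sorted postings lists in a single linear pass.
   Writes documents present in both lists into out; returns count. */
int join(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])
            i++;                     /* a is behind: advance it */
        else if (a[i] > b[j])
            j++;                     /* b is behind: advance it */
        else {
            out[n++] = a[i];         /* match: keep it, advance both */
            i++;
            j++;
        }
    }
    return n;
}
```

For more than two terms, the querier can fold this pairwise: join the first
two lists, then join the result with the third, and so on.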
The querier then selects the ten highest-scoring documents
to present to the user.
A document's score is the sum over the query terms of:
- 10 points if the term appears in the document.
- 7 points if the term appears just before a different query term.
- 7 points if the term appears in the first 400 characters
of the document.
- 5 points if the term occurs more than once in the document.
- 3 points if the term occurs within 150 characters
of a different term.
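The scoring rule above can be sketched as a small function. To keep it
self-contained it works on word positions rather than raw character offsets,
treating "just before" as position + 1 and "near" as within 150 positions;
querier.c's exact units may differ, and all names here are invented.

```c
#define NEAR 150   /* "within 150" window            */
#define HEAD 400   /* "first 400" of the document    */

/* Score one document against the query.  occ[t] holds the nt[t]
   sorted positions at which query term t appears in the document. */
int score(int nterms, const int *const *occ, const int *nt)
{
    int s = 0;
    for (int t = 0; t < nterms; t++) {
        if (nt[t] == 0)
            continue;
        s += 10;                       /* term appears at all      */
        if (nt[t] > 1)
            s += 5;                    /* appears more than once   */
        if (occ[t][0] < HEAD)
            s += 7;                    /* appears early            */
        int before = 0, near = 0;
        for (int i = 0; i < nt[t]; i++)
            for (int u = 0; u < nterms; u++) {
                if (u == t)
                    continue;
                for (int j = 0; j < nt[u]; j++) {
                    if (occ[u][j] == occ[t][i] + 1)
                        before = 1;    /* just before another term */
                    int d = occ[u][j] - occ[t][i];
                    if (d < 0)
                        d = -d;
                    if (d <= NEAR)
                        near = 1;      /* close to another term    */
                }
            }
        if (before) s += 7;
        if (near)   s += 3;
    }
    return s;
}
```

For a two-term query where the terms appear once each at adjacent positions
early in the document, the first term earns 10+7+7+3 and the second 10+7+3.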
The querier should be taught about and/or queries: that is,
if it doesn't find documents that contain all the terms,
it should look for documents that contain only some of them.
The standard reference on ranking query results by relevance is
Gerard Salton's
Introduction to Modern Information Retrieval,
McGraw-Hill, 1983.
-- Robert Morris, rtm@eecs.harvard.edu