About the Web Crawler and Indexer
I wrote this indexing software to help me understand how to
speed up big Web indices with special hardware, as part of
CS245.
You can use it to search for words in web pages at Harvard.
The Crawler
Here is the Perl source for my web crawler.
You'll also want webget.c, web.c, web.h, and weblinks.c.
The crawler fetches pages, accumulates them in a huge file, and follows
the links in each fetched page.
It keeps track of the URLs it has yet to follow, and the
URLs it has already followed, in a DB file.
It avoids talking to any one server more than once per minute.
It remembers which servers seem to be down and avoids
them for half an hour.
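The per-server politeness rules can be sketched as a small in-memory table
recording when each server was last contacted and, if it seemed down, when to
try it again. This is only an illustration with invented names; the real
crawler keeps its state in Perl, not C.

```c
/* Sketch (not the crawler's actual code) of per-server politeness:
   fetch from any one server at most once a minute, and avoid a
   server that seemed down for half an hour. */
#include <string.h>
#include <time.h>

#define MAX_SERVERS  1024
#define POLITE_GAP   60          /* at most one fetch per minute   */
#define DOWN_PENALTY (30 * 60)   /* avoid a dead server for 30 min */

struct server {
    char name[128];
    time_t last_fetch;   /* 0 if never contacted */
    time_t down_until;   /* 0 if believed up     */
};

static struct server servers[MAX_SERVERS];
static int nservers;

static struct server *lookup(const char *name)
{
    for (int i = 0; i < nservers; i++)
        if (strcmp(servers[i].name, name) == 0)
            return &servers[i];
    struct server *s = &servers[nservers++];
    strncpy(s->name, name, sizeof s->name - 1);
    s->name[sizeof s->name - 1] = '\0';
    s->last_fetch = s->down_until = 0;
    return s;
}

/* May we fetch from this server right now? */
int may_fetch(const char *name, time_t now)
{
    struct server *s = lookup(name);
    if (now < s->down_until)
        return 0;
    if (s->last_fetch && now - s->last_fetch < POLITE_GAP)
        return 0;
    return 1;
}

void note_fetch(const char *name, time_t now) { lookup(name)->last_fetch = now; }
void note_down(const char *name, time_t now)  { lookup(name)->down_until = now + DOWN_PENALTY; }
```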
It does not know about the robot exclusion convention,
which makes it inappropriate for serious use.
It is not particularly high-performance, because it is
not multi-threaded.
The Indexer
indexer.c
builds an inverted index from
the output of the crawler (or any other set of files).
The basic idea:
- Replace every non-alphanumeric character with space.
- Mark each word of input with its document number and
offset in the document. View the input as word, document,
offset tuples.
- Sort the tuples (this takes a long time).
- Delete the word from each tuple.
Now you have a large
postings
file containing document/offset pairs;
the pairs for each word are contiguous and sorted.
- Build a hash table mapping each word to the location of
its first entry in the postings file.
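The tokenizing and sorting steps above can be sketched in a few lines of C.
This is an in-memory toy with invented names, not indexer.c itself, which
works on files and sorts on disk.

```c
/* Sketch of the tuple scheme: tokenize each document into
   (word, document, offset) tuples, then sort them so that each
   word's document/offset pairs end up contiguous and sorted. */
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct tuple { char word[32]; int doc, off; };

/* Order by word, then document, then offset. */
static int cmp(const void *a, const void *b)
{
    const struct tuple *x = a, *y = b;
    int c = strcmp(x->word, y->word);
    if (c) return c;
    if (x->doc != y->doc) return x->doc - y->doc;
    return x->off - y->off;
}

/* Split one document into tuples; every non-alphanumeric
   character acts as a separator.  Returns the tuple count. */
int tokenize(const char *text, int doc, struct tuple *out, int max)
{
    int n = 0;
    const char *p = text;
    while (*p && n < max) {
        while (*p && !isalnum((unsigned char)*p))
            p++;
        if (!*p)
            break;
        int off = (int)(p - text), len = 0;
        while (isalnum((unsigned char)*p) && len < 31)
            out[n].word[len++] = (char)tolower((unsigned char)*p++);
        out[n].word[len] = '\0';
        out[n].doc = doc;
        out[n].off = off;
        n++;
        while (isalnum((unsigned char)*p))   /* skip any overlong tail */
            p++;
    }
    return n;
}
```

After tokenizing every document, one call to qsort() with cmp puts the tuples
in postings order; writing out only the document and offset fields yields the
postings file.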
The design was driven by a desire to avoid random disk accesses.
The only data structure that is random access and might not fit
in memory is the hash table of words.
This indexer burns up a lot of temporary disk space.
See Edward Fox and Whay Lee,
FAST-INV: A Fast Algorithm for Building Large Inverted Files,
Virginia Polytechnic Institute TR 91-10
for some more parsimonious approaches.
The Querier
querier.c
takes a set of terms (words) from the user as input.
It performs joins on the postings lists for the terms;
these involve linear scans since the postings
entries are sorted.
The result is a sorted list of document/offset tuples,
one for each occurrence of each term in every document
that contained all the terms.
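The linear-scan join works like a merge: walk both sorted lists in step,
advancing whichever cursor is behind. A minimal sketch, reduced to document
numbers only (querier.c also carries offsets), with invented names:

```c
/* Intersect two sorted postings lists in a single linear pass.
   Writes documents present in both lists into out; returns count. */
int join(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])
            i++;                     /* a is behind: advance it */
        else if (a[i] > b[j])
            j++;                     /* b is behind: advance it */
        else {
            out[n++] = a[i];         /* match: keep it, advance both */
            i++;
            j++;
        }
    }
    return n;
}
```

For more than two terms, the querier can fold this pairwise: join the first
two lists, then join the result with the third, and so on.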
The querier then selects the ten highest-scoring documents
to present to the user.
A document's score is the sum over the query terms of:
- 10 points if the term appears in the document.
- 7 points if the term appears just before a different query term.
- 7 points if the term appears in the first 400 characters
of the document.
- 5 points if the term occurs more than once in the document.
- 3 points if the term occurs within 150 characters
of a different term.
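The scoring rule above can be sketched as a small function. To keep it
self-contained it works on word positions rather than raw character offsets,
treating "just before" as position + 1 and "near" as within 150 positions;
querier.c's exact units may differ, and all names here are invented.

```c
#define NEAR 150   /* "within 150" window            */
#define HEAD 400   /* "first 400" of the document    */

/* Score one document against the query.  occ[t] holds the nt[t]
   sorted positions at which query term t appears in the document. */
int score(int nterms, const int *const *occ, const int *nt)
{
    int s = 0;
    for (int t = 0; t < nterms; t++) {
        if (nt[t] == 0)
            continue;
        s += 10;                       /* term appears at all      */
        if (nt[t] > 1)
            s += 5;                    /* appears more than once   */
        if (occ[t][0] < HEAD)
            s += 7;                    /* appears early            */
        int before = 0, near = 0;
        for (int i = 0; i < nt[t]; i++)
            for (int u = 0; u < nterms; u++) {
                if (u == t)
                    continue;
                for (int j = 0; j < nt[u]; j++) {
                    if (occ[u][j] == occ[t][i] + 1)
                        before = 1;    /* just before another term */
                    int d = occ[u][j] - occ[t][i];
                    if (d < 0)
                        d = -d;
                    if (d <= NEAR)
                        near = 1;      /* close to another term    */
                }
            }
        if (before) s += 7;
        if (near)   s += 3;
    }
    return s;
}
```

For a two-term query where the terms appear once each at adjacent positions
early in the document, the first term earns 10+7+7+3 and the second 10+7+3.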
The querier should be taught about and/or queries: that is,
if it doesn't find documents that contain all the terms,
it should look for documents that contain only some of them.
The standard reference on ranking query results by relevance is
Gerard Salton's
Introduction to Modern Information Retrieval,
McGraw-Hill, 1983.
-- Robert Morris, rtm@eecs.harvard.edu