Secrets of the Spider

by Triad@Efnet

Let me say this: the idea for the spider is not mine.  In 2005, I read a Ph.D. paper written by two researchers from the University of Chicago.

They called this spider a weapon and would not give out the code.  But they did give me one clue, and that was that it was written in Perl.  I did not know Perl, nor did I know how to build spiders (or web crawlers, if you will).  With what I read and what I researched, I built the weapon and it works - and it works well.

That was in 2005, and I think the paper was written in 1998 (give or take a year).  Now the spider's weapon is mostly obsolete, or rather the warhead it carried is mostly outdated.  The plain links it used on the target page have been replaced with JavaScript by most high-level web developers.  So it is time to retire the main weapon used on the spider.  The one thing I can say is that the code is mine.  The researchers gave me the idea and the framework, and I did the coding and made the spider work.

Like the researchers, I will not give you the weapon code.  But I will give you the spider code.  It is Perl and it is easy to understand, especially if you know Perl (you will note that I use Perl like BASIC).

Looking at the code from the top, the first thing you see is the variables.  Most of the variables used in the warheads are gone, to make the spider faster and more efficient, so if you see a variable declared but can't find it anywhere else in the code, it was probably used by a warhead.  The $file variable holds the name of the searchdata.txt file, which serves as an ammunition dump for the warhead: it is loaded with URLs that are used one at a time for processing and for stripping links for the Level Two warhead processing.
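
The ones that matter to the framework itself are these, pulled straight from the listing below:

my $p = HTML::LinkExtor->new(\&callback);  # parser that hands every link to the callback sub
my @harvestedURLs = ();                    # the ammo dump: seed URLs read from searchdata.txt
my @links = ();                            # links stripped off the current Level One page
my $file = "searchdata.txt";               # name of the seed file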

The next section is the spider/agent setup area.

This area uses the Perl libraries (LWP::UserAgent, HTML::LinkExtor, and URI::URL) to set up the spider.  The spider will not work if the agent libraries are not listed.
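
Boiled down, the setup section in the listing below amounts to this (the proxy line stays commented out unless you are actually running through Tor/Privoxy):

require LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

my $ua = LWP::UserAgent->new;
$ua->timeout(5);                  # give up on a slow page after five seconds
$ua->agent('Mozilla/4.75');       # identify the spider as an old Netscape browser
#$ua->proxy(http => 'http://127.0.0.1:8118');  # TOR TOR TOR

The next section loads the URLs from searchdata.txt.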

Again, this array feeds the spider URLs to keep it crawling.  Once the array is filled with URLs, the file is closed and not used again unless the spider is stopped and restarted.
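
In the listing below, that looks like this.  Note that the while condition reads one line (the blank ones) and the read into $input keeps the next line, which is exactly why searchdata.txt needs the blank lines shown in Figure 1:

unless (open(A, $file)) {
    print "\n\n\nSHIT !!! Cannot open the file :( \n\n\n";
    exit(-1);
}
while (<A>) {                      # this read eats a line (the blanks)...
    $input = <A>;                  # ...and this read keeps the URL that follows
    last unless defined $input;    # stop cleanly at end of file
    chomp($input);
    push(@harvestedURLs, $input);
}
close(A);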

Okay, now it's time to launch the spider.

In the next section, the spider begins by grabbing a URL from the array, building a request with routines from the Perl libraries, calling the URL, and checking whether we get a response.

If so, the spider strips the links off the first page and stores them.  It then releases the warhead on the first page to do whatever it's supposed to do (looking for certain data, etc.).
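
Stripped of the print statements and sleeps, the Level One step in the listing below is just a GET, a parse, and a pass through URI::URL to make every link absolute:

$req = new HTTP::Request GET => $harvestedURLs[$x];
$response = $ua->request($req);
my $base = $response->base;

if ($response->is_success) {
    $p->parse($response->content);                 # the callback sub pushes each href into @links
    @links = map { url($_, $base)->abs } @links;   # relative links become full URLs
    $sizeoflinks = $#links;
    # release the Level One warhead on the page here
}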

When Level One is complete, Level Two begins its job.  Level Two uses an array that was filled with the links stripped off the Level One page.

I am showing you Level Two very scaled down.  The truth is, it can be set up to run a second-level warhead, strip links off the second-level URLs, and feed a third-level warhead.  I did go to three levels and it worked very well.  All of my levels used the same warhead, which made it easy to watch for problems.
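
Scaled down, Level Two is nothing more than a walk through that array.  A third level would repeat the same request-and-strip routine on each $url2:

while ($c <= $sizeoflinks) {
    $url2 = $links[$c++];
    print "$url2\n";
    print "Level 2 STRIPPED URL\n\n";
    # run the Level Two warhead on $url2 here;
    # strip its links into a second array if you want a Level Three
}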

So, to show you the spider, I just plucked out the warhead scripts, eliminated their variables, and scaled down Level Two.  The two files I will give you are the spider framework and searchdata.txt.

I had 3,000 URLs in my searchdata.txt and I have never seen the end of the file.  Stripping links off pages and running the warhead scripts on those links across two or three levels can make for a very, very long run.  The searchdata.txt file can have any URL in it, but it has to be in a certain format for HTTP::Request: each URL needs to begin with http://
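
If you want to be defensive about it, one extra line (not in the listing, just a suggestion) will patch up seeds that are missing the scheme before the request is built:

$url = "http://$url" unless $url =~ m{^https?://};   # HTTP::Request wants a complete URL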

I will leave you with a few examples in Figure 1.

Most of the spider's time will be spent working through the stripped links.  This is because there is only one home page, and it can have anywhere from one to 500 links to other pages; with the sleep(10) in the Level Two loop, 300 stripped links is already 50 minutes on a single home page before any warhead work.  If you have a third layer, it could be hours before the spider comes back to the ammo dump and grabs another home URL.

As always, I want to stress that this is just a teaching article, which is why I took out any of the scripts that might be used for malicious purposes.

I also wish to apologize to the two researchers who gave me the framework, because I cannot give them their rightful due for their article.  The truth is, I looked for hours for the PDF file that taught me about this spider.  I have been looking for it for years and finally I just gave up, hoping I would run across the article by mistake one day.  If I am contacted by them, I will surely let them know.  Again, they gave me no code.  The code is mine.

Another thing to say is to always use good spider/crawler practices and abide by each site's robots.txt rules.  That said, I got myself a good lesson in regular expressions and Perl.
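
My listing does not do this, but a minimal sketch of a polite crawler would swap LWP::UserAgent for LWP::RobotUA, which fetches each site's robots.txt and refuses to hit pages it is not allowed to touch (the agent name and e-mail address below are placeholders):

use LWP::RobotUA;

my $ua = LWP::RobotUA->new('triad-spider/1.0', 'you@example.com');  # name and contact are required
$ua->delay(10/60);    # wait ten seconds between requests to the same host (delay is in minutes)
$ua->timeout(5);

my $response = $ua->get('http://www.example.com/');
print $response->status_line, "\n";   # a page forbidden by robots.txt comes back as 403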

Figure 1 will show you the setup of the URL feeding file for Level One.

One thing to remember is to leave a blank line at the top of the URL list.  The reason is in the loading loop: the while condition reads a line and throws it away, and the read into $input keeps the line after it, so only every other line makes it into the array.  If you change how the loop reads the file, then by all means make it run your way.

Make sure you have spider.pl and the searchdata.txt file in the same directory or you'll get one of my colorful error texts.  Any URL you want can be listed.  If the spider fails in the middle of a run, look at the URL; it probably has something in it that the spider doesn't like.  Don't blame the spider right off.  And again, the spider will start back at the top of the searchdata.txt list if it is stopped for any problem.

Figure 2 shows the start of a spider run, with Level One URLs and Level Two URLs and what the beginning of a run looks like.  I also want to say that this program was written in Windows Perl (ActivePerl).  Don't throw rocks at me yet; I just didn't know Linux at the time.  I am porting it now, and it should be a breeze because ActivePerl stays very close to standard Perl.  The code is also commented very well.

Good luck.

Figure 1

Example searchdata.txt:


http://www.example.com

http://url.example.com

http://13.url.example.edu

This is how searchdata.txt should be set up: one blank line at the top and one blank line between each URL (the blank lines are there on purpose; see the loading loop).  Keep it as its own file, in the same directory as spider.pl.

Figure 2

******** Loading URL's *******
Seed URL's = 1
 Begin Spider run .....
-- Home Page -- Level I -- http://example.com
https://www.iana.org/domains/example
Level 2 STRIPPED URL
-- Home Page -- Level I -- http://example.edu
http://example.edu
Level 2 STRIPPED URL
https://www.iana.org/domains/example
Level 2 STRIPPED URL

spider.pl:

#TDM 2005
#
#
my $x=0; #used on the FORM FILL Area on $sizeofharvestedURLs index
my $y=0; #used to index thru FORMS on page
my $q=0;
my $z=0; #Level I index
my $a=0;
my $b=0;
my $c=0;
my $d=0;
my $e=0; #Level II index
my $p = HTML::LinkExtor->new(\&callback);
my $input = 0; #Used to input data from files
my @harvestedURLs = ();
my $sizeofharvestedURLs = 0;
my $sizeofinput = 0;
my $url = "";  #Level I
my $url2 = ""; #Level II
my @links = ();#stripped links array
my $sizeoflinks = -1;  # -1 means no stripped links yet
my $counter = 0;
my $file = "searchdata.txt"; #DOT.COMS from searchdata.txt file

#----------------------------- Set up Agent ------------------------------------
require LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

my $ua = LWP::UserAgent->new;
      $ua->timeout(5); #not sure of this number. Ex. code had 5, I put in 5
      $ua->agent('Mozilla/4.75'); 
#     $ua->proxy(http => 'http://127.0.0.1:8118'); # TOR TOR TOR
      $ua->from('www.xxxxx.com'); # from() is really meant to hold an e-mail address
      
#----------------Load URL's array with links --------------------

print "\n\n******** Loading URL's *******\n\n";
unless (open(A, $file)){
    print "\n\n\nSHIT !!! Cannot open the file :( \n\n\n";
    exit(-1);
} #endunless()
while(<A>){                        # this read eats a line (the blank ones)...
      $input = <A>;                # ...and this read keeps the URL on the next line
      last unless defined $input;  # stop cleanly at end of file
      chomp($input);               # strip the newline so HTTP::Request gets a clean URL
      push(@harvestedURLs, $input);
}#endwhile()
close(A);
$sizeofharvestedURLs = $#harvestedURLs;   # last index of the array, used as the loop bound below
print "Seed URL's = $sizeofharvestedURLs\n\n";
sleep(2); #used to let the array settle in

########################### Begin Spider ###############################

print "\n\n Begin Spider run .....\n\n";
while($x <= $sizeofharvestedURLs){#aa  #Loop for harvestedURLs
      $url = $harvestedURLs[$x];    #uses $x for indexing
      print "-- Home Page -- Level I -- $url\n\n";
      sleep(1); # used to slow down for TOR.
      #$counter++;
      #print "$counter\n";
      $req = new HTTP::Request GET => $harvestedURLs[$x];
      $response = $ua->request($req);
      my $base = $response->base;
      
      if($response->is_success) {#bb
          sleep(2); # Used to slow down for TOR
          $p->parse($response->content);   # the callback sub below pushes each href into @links

#                  **  LINK STRIPPING **

          @links = map { url($_, $base)->abs } @links;  # make every stripped link absolute
          #print "@links"; # test point for link stripping
          $sizeoflinks = $#links;

#                  ** End LINK STRIPPING  **

          # Here is where you set up for a run on home page #

      }#bb#
  
#****************** LVL 2 -  BEGIN *********************************************
  
  while($c <= $sizeoflinks ){#xxx
       $url2 = $links[$c++];      
       print "$url2\n";
       print "Level 2 STRIPPED URL\n\n";
       sleep(10); #used to slow down for viewing the spider operation
       
                 # Enter into level 3 #
#                      ***
                  # Exiting Level 3 #
              
       # Here is where you set up for a run on Level 2 #  
       
  }#xxx Exit Level 2
#******************** LVL 2 -  END *********************************************

  
 $c = 0;              # reset the Level 2 index
 $x++;                # move on to the next $harvestedURLs[$x]
 @links = ();         # make sure the stripped-links array really is empty
 $sizeoflinks = -1;   # so a failed request cannot re-run the previous page's links
}#aa  Exit Level 1
######################### END Spider ###########################################

#----------------------Link Stripping Sub-Routine-------------------------------
  sub callback { #999
     my($tag, %attr) = @_;
     return if $tag ne 'a';  # Tag to strip <a>, <img>, ....etc
     push(@links, values %attr);   # for <a> tags, %attr is normally just (href => link)
} #999 End sub callback
#-------------------------------------------------------------------------------
# TDM 2005
# Updated Feb. 01, 2008 --  Triad
# Update Apr.29.2010 - Triad
# Updated June 19 2010 - Triad
################################################################################

Code: spider.pl
