Moderator
From: Deland, FL
Registered: 2005-10-25
Posts: 1314
I've been thanked 23 times.
Offline
Ok, so...I've been learning Perl lately and now I've got a project to do for work.
I have to create a spider, pretty much. What I need is a program that looks at all the links on a page, follows internal links and checks the links on those pages also, and reports the HTTP Response code (404, 500, etc.), logged to a file.
It also has to follow 301 and 302 redirects and report the HTTP Response code of the page it is redirected to.
I've been told I need to use the Perl module LWP, but I can't figure it out.
Any help would be GREAT!
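For the redirect requirement, here is a minimal sketch of one way to report each hop, assuming LWP::UserAgent's default behaviour of following redirects for HEAD and GET requests. The helper name `describe_response` and the output format are made up for illustration. Some servers reject HEAD, so a real spider might need to fall back to GET.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: turn a response into report lines, one per hop.
sub describe_response {
    my ($res) = @_;
    my @lines;
    # redirects() lists the intermediate 301/302 responses, oldest first
    for my $hop ($res->redirects) {
        push @lines, sprintf '%s %s (redirect)', $hop->code, $hop->request->uri;
    }
    # The final response, and the URL it was actually fetched from
    push @lines, sprintf '%s %s', $res->status_line, $res->request->uri;
    return @lines;
}

# Only hit the network when a URL is given on the command line
if (@ARGV) {
    my $ua  = LWP::UserAgent->new(agent => 'MySpider/0.1 ');
    my $res = $ua->head($ARGV[0]);
    print "$_\n" for describe_response($res);
}
```

Run it as `perl script.pl http://www.example.com/` (placeholder URL). The last line printed is the response code of the page the request finally landed on.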
Moderator
From: Deland, FL
Registered: 2005-10-25
Posts: 1314
I've been thanked 23 times.
Offline
Here's what I've got so far. I guess it's a start...
Code: Perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent("MySpider/0.1 ");

my @urls = (
    'http://www.google.com',
    'http://www.yahoo.com',
);

foreach my $url (@urls) {
    getResponseCode($url);
}

#----- SUBROUTINES -----#
sub getResponseCode {
    my ($url) = @_;
    # Create a HEAD request (headers only, no body)
    my $req = HTTP::Request->new(HEAD => $url);
    # Pass request to the user agent and get a response back
    my $res = $ua->request($req);
    # Print the status line (e.g. "200 OK") whether or not the request succeeded
    print $res->status_line, "\n";
}
Now, that works: it gets the HTTP response code for each URL in the @urls array. My problem is gathering all the links and following them, etc. Like, I should be able to give a base URL to this program (probably through $ARGV[0]) and have the spider go from there....
Argh...confusing....
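For gathering the links, a rough sketch using HTML::LinkExtor (from the HTML::Parser distribution) and URI might look like the following. It assumes "internal" means "same host as the base URL". The helper name `internal_links` and the example.com addresses are placeholders, and the HTML is a hard-coded string so the snippet runs without touching the network.

```perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Hypothetical helper: return the absolute <a href> links in $html that
# stay on the same host as $base_url.
sub internal_links {
    my ($html, $base_url) = @_;
    my $base = URI->new($base_url);

    # With a base URL, HTML::LinkExtor hands back absolute URI objects
    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($html);
    $parser->eof;

    my (@links, %seen);
    for my $found ($parser->links) {
        my ($tag, %attr) = @$found;
        next unless $tag eq 'a' && $attr{href};
        my $uri = $attr{href};
        next unless $uri->can('host');               # skips mailto: and the like
        next unless lc $uri->host eq lc $base->host; # other hosts are external
        $uri->fragment(undef);                       # page.html#top is the same page
        push @links, "$uri" unless $seen{"$uri"}++;
    }
    return @links;
}

my $html = '<a href="/about.html">About</a> '
         . '<a href="http://other.example.org/">Elsewhere</a> '
         . '<a href="news.html#top">News</a>';
print "$_\n" for internal_links($html, 'http://www.example.com/index.html');
```

In a real spider the `$html` string would come from the body of a GET response, and external links would still get their response code checked, just not followed any further.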
Administrator
From: you know you want a caricature
Registered: 2004-11-08
Posts: 3454
I've been thanked 34 times.
Offline
toooo f%$#*Ng hard for me!!
Moderator
From: Deland, FL
Registered: 2005-10-25
Posts: 1314
I've been thanked 23 times.
Offline
matte wrote:
toooo f%$#*Ng hard for me!!
I hear ya....
Not the easiest thing in the world...actually not sure if anyone here knows Perl. But, figured I'd give it a shot 
Moderator
From: Yorkshire, UK
Registered: 2006-08-19
Posts: 2818
I've been thanked 81 times.
Offline
Look on SourceForge for the Snoopy project.
It's written in PHP, but it has a method for extracting links, so you can rip off the regex from there.
The spidering logic is pretty basic:
Create an array with the start page as the first element.
Loop over the array, going to the next page and getting a list of links back. For each link you get back, check whether it's in your array or not, adding new/unique ones to your array.
Increment the array pointer, then rinse and repeat.
I've got a PHP version somewhere if you'd like to see the logic for everything.
I know PHP's loosely typed approach and dynamic array lengths make it very easy to do it this way. Not all languages work like that.
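Perl arrays grow dynamically too, so the logic above carries over almost directly. Here is a minimal sketch of it, not a finished spider: `crawl`, `%fake_site`, and the `$get_links` callback are made-up names, and the fake site map stands in for "fetch the page and return its links" so the loop can be run on its own. A hash is used alongside the array so the "is it already in the array?" check doesn't have to scan the whole list every time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for "fetch a page and return its links".
# A real spider would fetch the page and parse its HTML instead.
my %fake_site = (
    '/'      => [ '/about', '/news' ],
    '/about' => [ '/', '/team' ],
    '/news'  => [ '/about' ],
    '/team'  => [],
);

sub crawl {
    my ($start, $get_links) = @_;
    my @queue = ($start);        # the array, start page first
    my %seen  = ($start => 1);   # pages that are already in the array

    # @queue grows while we walk it, so loop by index, not with foreach
    for (my $i = 0; $i < @queue; $i++) {
        for my $link ($get_links->($queue[$i])) {
            next if $seen{$link}++;   # skip links we already know about
            push @queue, $link;       # new/unique link: add it to the array
        }
    }
    return @queue;
}

my @pages = crawl('/', sub { @{ $fake_site{ $_[0] } || [] } });
print "$_\n" for @pages;
```

With the fake site map above this prints `/`, `/about`, `/news`, and `/team`, each on its own line. Every page is visited exactly once even though the pages link back to each other.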


