CrawlerCommons::RobotRulesParser — a Perl implementation of the crawler-commons RobotRulesParser



use 5.10.1;
use CrawlerCommons::RobotRulesParser;

my $rules_parser = CrawlerCommons::RobotRulesParser->new;

my $content = "User-agent: *\r\nDisallow: *images";
my $content_type = "text/plain";
my $robot_names = "any-old-robot";
my $url = "http://www.example.com/robots.txt";  # placeholder URL for this example

my $robot_rules =
   $rules_parser->parse_content($url, $content, $content_type, $robot_names);

say "We're allowed to crawl the index :)"
    if $robot_rules->is_allowed("http://www.example.com/index.html");

# The "Disallow: *images" rule above should block image URLs
for my $img_url ("http://www.example.com/images/some_image.jpg") {
    say "Not allowed to crawl: $img_url"
        unless $robot_rules->is_allowed($img_url);
}


This module is a fairly close reproduction of the crawler-commons SimpleRobotRulesParser (Java).

From BaseRobotsParser javadoc:

Parse the robots.txt file in content, and return rules appropriate
for processing paths by userAgent. Note that multiple agent names
may be provided as comma-separated values; the order of these shouldn't
matter, as the file is parsed in order, and each agent name found in the
file will be compared to every agent name found in robotNames.
Also note that names are lower-cased before comparison, and that any
robot name you pass shouldn't contain commas or spaces; if the name has
spaces, it will be split into multiple names, each of which will be
compared against agent names in the robots.txt file. An agent name is
considered a match if it's a prefix match on the provided robot name. For
example, if you pass in "Mozilla Crawlerbot-super 1.0", this would match
"crawlerbot" as the agent name, because of splitting on spaces,
lower-casing, and the prefix match rule.

The method failedFetch is not implemented.
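The splitting, lower-casing, and prefix-match behaviour described above can be sketched in plain Perl. This is an illustration of the matching rule only, not the module's internal code; `agent_matches` is a hypothetical helper written for this sketch:

```perl
use strict;
use warnings;
use 5.10.1;

# Hypothetical illustration of the agent-name matching described above:
# the caller's robot name is split on spaces and lower-cased, and an
# agent name from robots.txt matches if it is a prefix of any piece.
sub agent_matches {
    my ($robot_names, $agent_name) = @_;
    my @pieces = map { lc } split /\s+/, $robot_names;
    $agent_name = lc $agent_name;
    return scalar grep { index($_, $agent_name) == 0 } @pieces;
}

# "crawlerbot" is a prefix of the piece "crawlerbot-super", so it matches
say agent_matches("Mozilla Crawlerbot-super 1.0", "crawlerbot")
    ? "matches" : "no match";
```

Here "Mozilla Crawlerbot-super 1.0" splits into "mozilla", "crawlerbot-super", and "1.0", and the agent name "crawlerbot" prefix-matches the second piece.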


Could you explain a little more about the rationale? Perhaps say what it is about this module that would make people choose it over WWW::RobotRules. At present I don't really see the big distinguishing factor or a clear use case. Thanks.
