Package 

Class HttpRobotRulesParser

  • All Implemented Interfaces:
    ai.platon.pulsar.common.config.Configurable

    
    public class HttpRobotRulesParser
    extends RobotRulesParser
                        

    This class parses robots.txt rules for URLs that use the HTTP protocol. It extends the generic RobotRulesParser class and adds the HTTP-specific logic for obtaining the robots.txt file.

    • Method Summary

      Modifier and Type Method Description
      BaseRobotRules getRobotRulesSet(Protocol protocol, URL url) Get the rules from robots.txt that apply to the given URL.
      • Methods inherited from class ai.platon.pulsar.crawl.protocol.http.HttpRobotRulesParser

        getConf, getRobotRulesSet, parseRules, setConf
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • HttpRobotRulesParser

        HttpRobotRulesParser(ImmutableConfig conf)
    • Method Detail

      • getRobotRulesSet

         BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)

        Get the rules from robots.txt that apply to the given URL. Robot rules are cached for each unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch protocol://host:port/robots.txt. The robots.txt file is then parsed and the resulting rules are cached, so that later lookups for the same host, protocol, and port do not fetch or parse it again.

        Parameters:
        protocol - The Protocol object
        url - The URL for which the robots.txt rules are requested
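
The caching behaviour described for getRobotRulesSet can be illustrated with a minimal sketch. It is not the actual implementation: the class RobotsCacheSketch, its methods cacheKey, robotsUrl and rulesFor, and the pluggable fetcher function are all hypothetical names introduced here, and the raw robots.txt text stands in for a parsed BaseRobotRules object. It assumes the cache key is built from protocol, host and port, with the protocol's default port substituted when the URL does not name one.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical illustration of per-host robots.txt caching; not the library's code. */
class RobotsCacheSketch {
    // One cache entry per protocol/host/port combination.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Assumption: the fetch step is injected so the sketch runs without network access.
    private final Function<URL, String> fetcher;

    RobotsCacheSketch(Function<URL, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Builds the cache key, using the protocol's default port when none is given. */
    static String cacheKey(URL url) {
        int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
        String key = url.getProtocol() + ":" + url.getHost() + ":" + port;
        return key.toLowerCase(Locale.ROOT);
    }

    /** Derives protocol://host:port/robots.txt from any URL on that host. */
    static URL robotsUrl(URL url) throws MalformedURLException {
        return new URL(url.getProtocol(), url.getHost(), url.getPort(), "/robots.txt");
    }

    /** Returns the cached robots.txt text, fetching it only on a cache miss. */
    String rulesFor(URL url) {
        return cache.computeIfAbsent(cacheKey(url), key -> {
            try {
                return fetcher.apply(robotsUrl(url));
            } catch (MalformedURLException e) {
                // Treat an unbuildable robots.txt URL as "no rules" in this sketch.
                return "";
            }
        });
    }
}
```

With this sketch, two lookups for different paths on the same host trigger a single fetch, while the same host on a different port gets its own cache entry and its own fetch.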