Package 

Class AbstractFetchSchedule

  • All Implemented Interfaces:
    ai.platon.pulsar.common.config.Parameterized, ai.platon.pulsar.crawl.schedule.FetchSchedule

    
    public abstract class AbstractFetchSchedule
     implements FetchSchedule
                        

    This class provides common methods for implementations of FetchSchedule.

    • Constructor Detail

      • AbstractFetchSchedule

        AbstractFetchSchedule(ImmutableConfig conf, MiscMessageWriter messageWriter)
    • Method Detail

      • getConf

         final ImmutableConfig getConf()
      • initializeSchedule

         Unit initializeSchedule(WebPage page)

        Initializes fetch-schedule-related data. Implementations should set at least fetchTime and fetchInterval. The default implementation sets fetchTime to the current time and applies the default fetchInterval.
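
        The default behavior described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: PageSchedule is a hypothetical stand-in for the scheduling fields of WebPage, and the 30-day default interval is an assumed value (the real one comes from configuration).

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in for the scheduling fields of a page; not the real WebPage class.
final class PageSchedule {
    Instant fetchTime;
    Duration fetchInterval;
}

final class InitializeScheduleSketch {
    // Assumed default for illustration only; the real value is read from configuration.
    static final Duration DEFAULT_INTERVAL = Duration.ofDays(30);

    // Mirrors the documented default: fetchTime = now, fetchInterval = default interval.
    static void initializeSchedule(PageSchedule page, Instant now) {
        page.fetchTime = now;
        page.fetchInterval = DEFAULT_INTERVAL;
    }

    public static void main(String[] args) {
        PageSchedule page = new PageSchedule();
        initializeSchedule(page, Instant.now());
        System.out.println(page.fetchTime + " / " + page.fetchInterval);
    }
}
```

        Passing the reference time in explicitly, rather than calling Instant.now() inside the method, is a choice made here to keep the sketch deterministic.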

      • setFetchSchedule

         Unit setFetchSchedule(WebPage page, ModifyInfo m)

        Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter; extending classes should call super.setFetchSchedule() to preserve this behavior.

        Parameters:
        page - The Web page
        m - The modification info
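
        The note about calling super.setFetchSchedule() can be illustrated with a small sketch. All class names below are hypothetical, and the method signature is simplified (an Instant stands in for the ModifyInfo parameter); only the pattern of delegating to the base class first is the point.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in for a page; not the real WebPage class.
final class FetchedPage {
    Instant fetchTime;
    Duration fetchInterval = Duration.ofDays(30);
    int retries = 3;
}

// Hypothetical base class modelling the documented reset of the retry counter.
class BaseScheduleSketch {
    void setFetchSchedule(FetchedPage page, Instant fetchTime) {
        page.retries = 0;
    }
}

// A subclass that adds its own policy but keeps the base behavior via super.
class FixedIntervalScheduleSketch extends BaseScheduleSketch {
    @Override
    void setFetchSchedule(FetchedPage page, Instant fetchTime) {
        super.setFetchSchedule(page, fetchTime); // keeps the retry-counter reset
        page.fetchTime = fetchTime.plus(page.fetchInterval); // subclass-specific policy
    }
}
```

        If the subclass omitted the super call, the retry counter in this sketch would stay at its previous value after a successful fetch.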
      • setPageRetrySchedule

         Unit setPageRetrySchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)

        Adjusts the fetch schedule when a fetch must be retried because of a transient error. The default implementation sets the next fetch time one day in the future and increments the retry counter.

        Parameters:
        page - WebPage to retry
        prevFetchTime - previous fetch time
        prevModifiedTime - previous modified time
        fetchTime - current fetch time
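
        As a rough sketch of the default retry policy described above (next fetch one day later, retry counter incremented): RetryPage is a hypothetical stand-in for WebPage, and the unused prevFetchTime and prevModifiedTime parameters are left out for brevity.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in for a page; not the real WebPage class.
final class RetryPage {
    Instant fetchTime;
    int retries;
}

final class RetryScheduleSketch {
    // Schedules the next attempt one day after the current fetch time
    // and counts the failed attempt.
    static void setPageRetrySchedule(RetryPage page, Instant fetchTime) {
        page.fetchTime = fetchTime.plus(Duration.ofDays(1));
        page.retries += 1;
    }
}
```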
      • setPageGoneSchedule

         Unit setPageGoneSchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)

        Specifies how to schedule refetching of pages marked as GONE. The default implementation increases fetchInterval by 50%, but the value never exceeds maxInterval.
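
        The interval arithmetic for GONE pages can be sketched as below. The helper name nextGoneInterval is hypothetical; it only demonstrates the "increase by 50%, capped at maxInterval" rule stated above.

```java
import java.time.Duration;

final class GoneScheduleSketch {
    // Returns the current interval increased by 50%, but never more than max.
    static Duration nextGoneInterval(Duration current, Duration max) {
        Duration increased = current.plus(current.dividedBy(2));
        return increased.compareTo(max) > 0 ? max : increased;
    }
}
```

        For example, a 30-day interval with a 90-day maximum becomes 45 days, while an 80-day interval is capped at 90 days.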

      • shouldFetch

         Boolean shouldFetch(WebPage page, Instant now)

        Indicates whether the page is suitable for selection in the current fetchlist. NOTE: a true return value does not guarantee that the page will be fetched; it only allows the page to enter the further, score-based selection process. The default implementation checks fetchTime, if it is higher than the

        Parameters:
        page - Web page to fetch
        now - reference time (usually set to the time when the fetchlist generation process was started).
      • forceRefetch

         Unit forceRefetch(WebPage page, Instant prevFetchTime, Boolean asap)

        Resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and the page text in order to force refetching.

        Parameters:
        asap - if true, forces a refetch as soon as possible by setting fetchTime to now.
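
        A loose sketch of what such a reset might look like follows. RefetchPage is a hypothetical stand-in for WebPage, and the concrete reset values (30-day interval, epoch modified time, empty text) are assumptions chosen for illustration; the source only states which fields are reset and that asap sets fetchTime to now. Leaving fetchTime untouched when asap is false is likewise an assumption.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in for a page; not the real WebPage class.
final class RefetchPage {
    Instant fetchTime = Instant.parse("2030-01-01T00:00:00Z");
    Duration fetchInterval = Duration.ofDays(365);
    Instant modifiedTime = Instant.parse("2024-01-01T00:00:00Z");
    int retriesSinceFetch = 2;
    String text = "old content";
}

final class ForceRefetchSketch {
    // Assumed reset value for illustration only.
    static final Duration DEFAULT_INTERVAL = Duration.ofDays(30);

    static void forceRefetch(RefetchPage page, Instant now, boolean asap) {
        page.fetchInterval = DEFAULT_INTERVAL; // assumed reset value
        page.modifiedTime = Instant.EPOCH;     // assumed "unknown" marker
        page.retriesSinceFetch = 0;
        page.text = "";                        // assumed: drop cached text
        if (asap) {
            page.fetchTime = now;              // documented: asap sets fetchTime to now
        }
    }
}
```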