Package 

Class AdaptiveFetchSchedule

  • All Implemented Interfaces:
    ai.platon.pulsar.common.config.Parameterized, ai.platon.pulsar.crawl.schedule.FetchSchedule

    
    public class AdaptiveFetchSchedule
    extends AbstractFetchSchedule
                        

    This class implements an adaptive re-fetch algorithm. It works as follows:

    • for pages that have changed since the last fetchTime, decrease their fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).

    • for pages that haven't changed since the last fetchTime, increase their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).

    If the SYNC_DELTA property is true, then:

    • calculate a delta = fetchTime - modifiedTime

    • try to synchronize with the time of change, by shifting the next fetchTime by a fraction of the difference between the last modification time and the last fetch time. I.e. the next fetch time will be set to fetchTime + fetchInterval - delta * SYNC_DELTA_RATE

    • if the adjusted fetch interval is bigger than the delta, then fetchInterval = delta.

    • fetchInterval may not be smaller than MIN_INTERVAL (default is 1 minute).

    • fetchInterval may not be bigger than MAX_INTERVAL (default is 365 days).

    NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize the algorithm, so that the fetch interval increases or decreases without bound, with little relevance to the page changes. Please test the values before applying them in a production system.
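
    The rules above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the class's actual implementation: the class name AdaptiveIntervalSketch and its method names are hypothetical, "by a factor of" is read as multiplying by (1 - DEC_FACTOR) or (1 + INC_FACTOR), and the SYNC_DELTA_RATE value of 0.5 is chosen only for illustration because the documentation does not state a default.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the documented rules; not the real AdaptiveFetchSchedule.
final class AdaptiveIntervalSketch {
    static final double DEC_FACTOR = 0.2;       // documented default
    static final double INC_FACTOR = 0.2;       // documented default
    static final double SYNC_DELTA_RATE = 0.5;  // assumed value, for illustration only
    static final Duration MIN_INTERVAL = Duration.ofMinutes(1);  // documented default
    static final Duration MAX_INTERVAL = Duration.ofDays(365);   // documented default

    // Shrinks the interval for a changed page, grows it for an unchanged one.
    static Duration adjust(Duration interval, boolean modified) {
        double factor = modified ? 1.0 - DEC_FACTOR : 1.0 + INC_FACTOR;
        return Duration.ofMillis(Math.round(interval.toMillis() * factor));
    }

    // Keeps the interval within [MIN_INTERVAL, MAX_INTERVAL].
    static Duration clamp(Duration interval) {
        if (interval.compareTo(MIN_INTERVAL) < 0) return MIN_INTERVAL;
        if (interval.compareTo(MAX_INTERVAL) > 0) return MAX_INTERVAL;
        return interval;
    }

    // Computes the next fetch time, optionally synchronizing with the time of change.
    static Instant nextFetchTime(Instant fetchTime, Instant modifiedTime,
                                 Duration interval, boolean modified, boolean syncDelta) {
        Duration adjusted = adjust(interval, modified);
        Instant refTime = fetchTime;
        if (syncDelta) {
            // delta = fetchTime - modifiedTime
            Duration delta = Duration.between(modifiedTime, fetchTime);
            // Follows the rule as worded above: cap the adjusted interval at delta.
            if (adjusted.compareTo(delta) > 0) adjusted = delta;
            // Shift the reference time back by delta * SYNC_DELTA_RATE.
            refTime = fetchTime.minusMillis(Math.round(delta.toMillis() * SYNC_DELTA_RATE));
        }
        return refTime.plus(clamp(adjusted));
    }

    public static void main(String[] args) {
        Instant fetchTime = Instant.parse("2024-01-01T00:00:00Z");
        Instant modifiedTime = fetchTime.minus(Duration.ofHours(4));
        System.out.println(adjust(Duration.ofHours(10), true));
        System.out.println(nextFetchTime(fetchTime, modifiedTime, Duration.ofHours(10), true, true));
    }
}
```

    Running a loop over adjust with candidate DEC_FACTOR and INC_FACTOR values is one simple way to see how quickly the interval reaches MIN_INTERVAL or MAX_INTERVAL for a given pattern of page changes.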

    • Constructor Detail

      • AdaptiveFetchSchedule

        AdaptiveFetchSchedule(ImmutableConfig conf, MiscMessageWriter messageWriter)
    • Method Detail

      • getConf

         final ImmutableConfig getConf()
      • setFetchSchedule

         Unit setFetchSchedule(WebPage page, ModifyInfo m)

        Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter; extending classes should call super.setFetchSchedule() to preserve this behavior.

        Parameters:
        page - The Web page
        m - The modification info
      • setPageGoneSchedule

         Unit setPageGoneSchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)

        This method specifies how to schedule refetching of pages marked as GONE. The default implementation increases fetchInterval by 50%, but the value never exceeds maxInterval.
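
        The interval arithmetic described here can be sketched as follows. This is an illustration only: the class PageGoneIntervalSketch and the method nextGoneInterval are hypothetical names, and the real method also receives the page and its timestamps, which this sketch leaves out.

```java
import java.time.Duration;

// Hypothetical sketch of the "increase by 50%, capped at maxInterval" rule.
final class PageGoneIntervalSketch {
    // Returns the current interval grown by half, never exceeding maxInterval.
    static Duration nextGoneInterval(Duration current, Duration maxInterval) {
        Duration grown = current.plus(current.dividedBy(2));
        return grown.compareTo(maxInterval) > 0 ? maxInterval : grown;
    }

    public static void main(String[] args) {
        Duration max = Duration.ofDays(365);
        System.out.println(nextGoneInterval(Duration.ofDays(10), max));
        System.out.println(nextGoneInterval(Duration.ofDays(300), max));
    }
}
```

        Under this rule a 10-day interval becomes 15 days, while a 300-day interval would grow to 450 days and is therefore capped at a 365-day maximum.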