Package 

Interface FetchSchedule

  • All Superinterfaces:
    ai.platon.pulsar.common.config.Parameterized

    
    public interface FetchSchedule
     extends Parameterized

    This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals.

    • Method Summary

      Modifier and Type | Method | Description
      abstract Unit | initializeSchedule(WebPage page) | Initialize fetch schedule related data.
      abstract Unit | setFetchSchedule(WebPage page, ModifyInfo m) | Sets the fetchInterval and fetchTime on a successfully fetched page.
      abstract Unit | setPageGoneSchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime) | Specifies how to schedule refetching of pages marked as GONE.
      abstract Unit | setPageRetrySchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime) | Adjusts the fetch schedule if fetching needs to be retried due to transient errors.
      abstract Instant | estimatePrevFetchTime(WebPage page) | Calculates the last fetch time of the given page.
      abstract Boolean | shouldFetch(WebPage page, Instant now) | Indicates whether the page is suitable for selection in the current fetchlist.
      abstract Unit | forceRefetch(WebPage page, Instant prevFetchTime, Boolean asap) | Resets fetchTime, fetchInterval, modifiedTime and page text, so that it forces refetching.
      abstract Duration | getMaxFetchInterval() |
      • Methods inherited from interface ai.platon.pulsar.common.config.Parameterized

        getParams
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • initializeSchedule

         abstract Unit initializeSchedule(WebPage page)

        Initializes fetch schedule related data. Implementations should at least set the fetchTime and fetchInterval. The default implementation sets the fetchTime to now and uses the default fetchInterval.

      • setFetchSchedule

         abstract Unit setFetchSchedule(WebPage page, ModifyInfo m)

        Sets the fetchInterval and fetchTime on a successfully fetched page. Implementations may use supplied arguments to support different re-fetching schedules.

        Parameters:
        page - The Web page
      • setPageGoneSchedule

         abstract Unit setPageGoneSchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)

        This method specifies how to schedule refetching of pages marked as GONE. The default implementation increases fetchInterval by 50% and, if the result exceeds maxInterval, calls forceRefetch.

        Parameters:
        page - The page
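
        The back-off described above can be sketched as follows. This is a minimal illustration, not the library's implementation: PageState, MAX_INTERVAL and the one-argument forceRefetch stub are hypothetical stand-ins, and scheduling the next fetch one interval after the current attempt is an assumption the text does not state.

```java
import java.time.Duration;
import java.time.Instant;

class GoneScheduleSketch {
    /** Hypothetical stand-in for the scheduling fields of WebPage. */
    static class PageState {
        Instant fetchTime;
        Duration fetchInterval;
        boolean refetchForced; // set by the stub forceRefetch below
    }

    // Assumed ceiling; a real schedule would report it via getMaxFetchInterval().
    static final Duration MAX_INTERVAL = Duration.ofDays(90);

    static void setPageGoneSchedule(PageState page, Instant fetchTime) {
        // Back off: grow the interval by 50%.
        page.fetchInterval = page.fetchInterval.multipliedBy(3).dividedBy(2);
        if (page.fetchInterval.compareTo(MAX_INTERVAL) > 0) {
            // The grown interval exceeds the ceiling: take the reset path.
            forceRefetch(page);
        } else {
            // Assumption: the next fetch is one interval after this attempt.
            page.fetchTime = fetchTime.plus(page.fetchInterval);
        }
    }

    // Stub: only records that the reset path was taken.
    static void forceRefetch(PageState page) {
        page.refetchForced = true;
    }

    public static void main(String[] args) {
        PageState page = new PageState();
        page.fetchInterval = Duration.ofDays(30);
        setPageGoneSchedule(page, Instant.parse("2024-01-01T00:00:00Z"));
        System.out.println(page.fetchInterval + " forced=" + page.refetchForced);
    }
}
```

        With these assumed values, a 30-day interval grows to 45 days and stays below the ceiling, while an 80-day interval grows to 120 days and triggers the reset path.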
      • setPageRetrySchedule

         abstract Unit setPageRetrySchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)

        This method adjusts the fetch schedule if fetching needs to be retried due to transient errors. The default implementation sets the next fetch time 1 day in the future and increases the retry counter.

        Parameters:
        page - The page
        prevModifiedTime - previous modified time
        fetchTime - current fetch time
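
        A minimal sketch of the default retry behaviour described above. PageState and its fetchRetries field are hypothetical stand-ins for the real WebPage, and measuring the one-day delay from the fetchTime argument is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

class RetryScheduleSketch {
    /** Hypothetical stand-in for the scheduling fields of WebPage. */
    static class PageState {
        Instant fetchTime;
        int fetchRetries;
    }

    static void setPageRetrySchedule(PageState page, Instant fetchTime) {
        // Try again one day after the failed attempt (assumed reference point).
        page.fetchTime = fetchTime.plus(Duration.ofDays(1));
        // Count the failure so callers can give up after too many retries.
        page.fetchRetries++;
    }

    public static void main(String[] args) {
        PageState page = new PageState();
        setPageRetrySchedule(page, Instant.parse("2024-01-01T00:00:00Z"));
        System.out.println(page.fetchTime + " retries=" + page.fetchRetries);
    }
}
```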
      • shouldFetch

         abstract Boolean shouldFetch(WebPage page, Instant now)

        This method indicates whether the page is suitable for selection in the current fetchlist. NOTE: a true return value does not guarantee that the page will be fetched; it only allows the page to be included in the further selection process based on scores. The default implementation checks fetchTime, if it is higher than the

        Parameters:
        page - The Web page
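
        The note above, that eligibility does not guarantee fetching, can be illustrated with a two-stage selection. This is a sketch under assumptions: Candidate, select and the slot limit are hypothetical names, and the due-time rule used here (fetchTime not after now) is an assumed simplification, not the documented default.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

class FetchListSketch {
    /** Hypothetical candidate; not the real WebPage API. */
    static class Candidate {
        final String url;
        final Instant fetchTime;
        final double score;

        Candidate(String url, Instant fetchTime, double score) {
            this.url = url;
            this.fetchTime = fetchTime;
            this.score = score;
        }
    }

    // Assumed rule: a page is due once its fetchTime is not after `now`.
    static boolean shouldFetch(Candidate c, Instant now) {
        return !c.fetchTime.isAfter(now);
    }

    // Eligible pages still compete on score for a limited number of slots.
    static List<String> select(List<Candidate> all, Instant now, int limit) {
        return all.stream()
                .filter(c -> shouldFetch(c, now))
                .sorted((a, b) -> Double.compare(b.score, a.score)) // highest score first
                .limit(limit)
                .map(c -> c.url)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-10T00:00:00Z");
        List<Candidate> all = List.of(
                new Candidate("a", now.minusSeconds(60), 0.2),
                new Candidate("b", now.minusSeconds(60), 0.9),
                new Candidate("c", now.plusSeconds(60), 1.0));
        System.out.println(select(all, now, 1));
    }
}
```

        In this sketch page "a" passes shouldFetch but is not selected when only one slot is available, because "b" has the higher score; page "c" is excluded earlier because it is not yet due.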
      • forceRefetch

         abstract Unit forceRefetch(WebPage page, Instant prevFetchTime, Boolean asap)

        This method resets fetchTime, fetchInterval, modifiedTime and page text, so that it forces refetching.

        Parameters:
        page - The Web page
        asap - if true, force refetch as soon as possible - this sets the fetchTime to now.