All Implemented Interfaces:
ai.platon.pulsar.common.config.Parameterized
public interface FetchSchedule implements Parameterized
This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals.
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
public class | FetchSchedule.Companion |
-
Method Summary
Modifier and Type | Method | Description
abstract Unit | initializeSchedule(WebPage page) | Initialize fetch schedule related data.
abstract Unit | setFetchSchedule(WebPage page, ModifyInfo m) | Sets the fetchInterval and fetchTime on a successfully fetched page.
abstract Unit | setPageGoneSchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime) | This method specifies how to schedule refetching of pages marked as GONE.
abstract Unit | setPageRetrySchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime) | This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
abstract Instant | estimatePrevFetchTime(WebPage page) | Calculates the last fetch time of the given page.
abstract Boolean | shouldFetch(WebPage page, Instant now) | This method provides information whether the page is suitable for selection in the current fetchlist.
abstract Unit | forceRefetch(WebPage page, Instant prevFetchTime, Boolean asap) | This method resets fetchTime, fetchInterval, modifiedTime and page text, so that it forces refetching.
abstract Duration | getMaxFetchInterval() |
-
Method Detail
-
initializeSchedule
abstract Unit initializeSchedule(WebPage page)
Initialize fetch schedule related data. Implementations should at least set the fetchTime and fetchInterval. The default implementation sets the fetchTime to now, using the default fetchInterval.
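The documented default behavior can be sketched as follows. This is a minimal illustration, not the library's implementation: `InitPage` is a hypothetical stand-in for `WebPage`, and the 30-day `assumedDefaultInterval` is an assumed value chosen only for the example.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical stand-in for WebPage; it carries only the two fields
// that the description says an implementation should set.
data class InitPage(
    var fetchTime: Instant = Instant.EPOCH,
    var fetchInterval: Duration = Duration.ZERO
)

// Assumed default interval; a real schedule would read this from configuration.
val assumedDefaultInterval: Duration = Duration.ofDays(30)

// Sketch of the documented default: fetchTime becomes "now" and
// fetchInterval becomes the default interval.
fun initializeScheduleSketch(page: InitPage, now: Instant = Instant.now()) {
    page.fetchTime = now
    page.fetchInterval = assumedDefaultInterval
}
```

Passing `now` as a parameter keeps the sketch deterministic, which makes it easier to test than a direct call to the system clock.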
-
setFetchSchedule
abstract Unit setFetchSchedule(WebPage page, ModifyInfo m)
Sets the fetchInterval and fetchTime on a successfully fetched page. Implementations may use supplied arguments to support different re-fetching schedules.
- Parameters:
page - The Web page
-
setPageGoneSchedule
abstract Unit setPageGoneSchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. The default implementation increases fetchInterval by 50%, and if the result exceeds maxInterval it calls forceRefetch.
- Parameters:
page - The page
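The "increase by 50%, then check against maxInterval" rule described for the default implementation might look like the sketch below. `GonePage` is a hypothetical stand-in for `WebPage`, the `forceRefetch` callback stands in for the interface method, and scheduling the next fetch at `fetchTime + fetchInterval` in the non-exceeding case is an assumption that the text above does not state.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical stand-in for WebPage with only the fields used here.
data class GonePage(var fetchTime: Instant, var fetchInterval: Duration)

// Sketch of the documented default for GONE pages.
fun setPageGoneScheduleSketch(
    page: GonePage,
    fetchTime: Instant,
    maxInterval: Duration,
    forceRefetch: (GonePage) -> Unit // stand-in for FetchSchedule.forceRefetch
) {
    // Grow the interval by 50% using exact Duration arithmetic (x * 3 / 2).
    page.fetchInterval = page.fetchInterval.multipliedBy(3).dividedBy(2)
    if (page.fetchInterval > maxInterval) {
        // The interval grew past the maximum: delegate to forceRefetch.
        forceRefetch(page)
    } else {
        // Assumption: otherwise the next fetch is one interval after this fetch.
        page.fetchTime = fetchTime.plus(page.fetchInterval)
    }
}
```

With a 10-day interval and a 90-day maximum, the interval becomes 15 days and the callback is not invoked; with an 80-day interval, the grown value of 120 days exceeds the maximum and the callback runs.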
-
setPageRetrySchedule
abstract Unit setPageRetrySchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. The default implementation sets the next fetch time 1 day in the future and increases the retry counter.
- Parameters:
page - The page
prevModifiedTime - previous modified time
fetchTime - current fetch time
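A minimal sketch of the retry behavior described above, under stated assumptions: `RetryPage` and its `retries` field are hypothetical stand-ins for `WebPage` and its retry counter, and "1 day in the future" is interpreted here as one day after the supplied `fetchTime`.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical stand-in for WebPage; "retries" models the retry counter.
data class RetryPage(var fetchTime: Instant, var retries: Int = 0)

// Sketch of the documented default: push the next fetch one day out
// and count the failed attempt.
fun setPageRetryScheduleSketch(page: RetryPage, fetchTime: Instant) {
    // Assumption: the one-day delay is measured from the current fetch time.
    page.fetchTime = fetchTime.plus(Duration.ofDays(1))
    page.retries += 1
}
```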
-
estimatePrevFetchTime
abstract Instant estimatePrevFetchTime(WebPage page)
Calculates the last fetch time of the given page.
-
shouldFetch
abstract Boolean shouldFetch(WebPage page, Instant now)
This method provides information whether the page is suitable for selection in the current fetchlist. NOTE: a true return value does not guarantee that the page will be fetched; it only allows the page to be included in the further selection process based on scores. The default implementation checks fetchTime, if it is higher than the …
- Parameters:
page - The Web page
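The note about a true return value can be illustrated with a two-stage selection: candidates first pass the schedule check, then compete on score. Everything in this sketch is hypothetical (`Candidate`, `selectFetchList`, the `limit` parameter), and the predicate is passed in as a lambda because the default check is not fully described above.

```kotlin
import java.time.Instant

// Hypothetical stand-in for WebPage with a score used for ranking.
data class Candidate(val url: String, val fetchTime: Instant, val score: Double)

// Stage 1: keep pages the schedule allows. Stage 2: rank by score and cut
// to the list size, so an allowed page can still be left out.
fun selectFetchList(
    candidates: List<Candidate>,
    now: Instant,
    limit: Int,
    shouldFetch: (Candidate, Instant) -> Boolean // stand-in for FetchSchedule.shouldFetch
): List<Candidate> =
    candidates
        .filter { shouldFetch(it, now) }
        .sortedByDescending { it.score }
        .take(limit)
```

For example, with a predicate that treats a page as due once its fetchTime is not after `now` (an assumed rule), two due pages and a `limit` of 1 yield only the higher-scoring page, even though the predicate returned true for both.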
-
forceRefetch
abstract Unit forceRefetch(WebPage page, Instant prevFetchTime, Boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime and page text, so that it forces refetching.
- Parameters:
page - The Web page
asap - if true, force refetch as soon as possible; this sets the fetchTime to now.
-
getMaxFetchInterval
abstract Duration getMaxFetchInterval()