All Implemented Interfaces:
ai.platon.pulsar.common.config.Parameterized, ai.platon.pulsar.crawl.schedule.FetchSchedule
public abstract class AbstractFetchSchedule implements FetchSchedule
This class provides common methods for implementations of FetchSchedule.
-
-
Field Summary
Fields (Modifier and Type, Field, Description)
private final Duration maxFetchInterval
private final ImmutableConfig conf
private final MiscMessageWriter messageWriter
-
Constructor Summary
Constructors (Constructor, Description)
AbstractFetchSchedule(ImmutableConfig conf, MiscMessageWriter messageWriter)
-
Method Summary
Methods (Modifier and Type, Method, Description)
Duration getMaxFetchInterval()
final ImmutableConfig getConf()
final MiscMessageWriter getMessageWriter()
Params getParams()
Unit initializeSchedule(WebPage page)
    Initialize fetch schedule related data.
Unit setFetchSchedule(WebPage page, ModifyInfo m)
    Sets the fetchInterval and fetchTime on a successfully fetched page.
Unit setPageRetrySchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)
    This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
Unit setPageGoneSchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)
    This method specifies how to schedule refetching of pages marked as GONE.
Instant estimatePrevFetchTime(WebPage page)
    This method returns the last fetch time of the WebPage.
Boolean shouldFetch(WebPage page, Instant now)
    This method indicates whether the page is suitable for selection in the current fetchlist.
Unit forceRefetch(WebPage page, Instant prevFetchTime, Boolean asap)
    This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page text, so that it forces refetching.
-
Constructor Detail
-
AbstractFetchSchedule
AbstractFetchSchedule(ImmutableConfig conf, MiscMessageWriter messageWriter)
-
-
Method Detail
-
getMaxFetchInterval
Duration getMaxFetchInterval()
-
getConf
final ImmutableConfig getConf()
-
getMessageWriter
final MiscMessageWriter getMessageWriter()
-
getParams
Params getParams()
-
initializeSchedule
Unit initializeSchedule(WebPage page)
Initialize fetch schedule related data. Implementations should at least set the fetchTime and fetchInterval. The default implementation sets the fetchTime to now, using the default fetchInterval.
-
setFetchSchedule
Unit setFetchSchedule(WebPage page, ModifyInfo m)
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter; extending classes should call super.setFetchSchedule() to preserve this behavior.
- Parameters:
page - The Web page
m - The modification info
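The note about calling super.setFetchSchedule() can be illustrated with a small sketch. Page, BaseSchedule and HalvingSchedule are hypothetical stand-ins introduced only for this example; they are not the library's WebPage, ModifyInfo or AbstractFetchSchedule, and the halving policy is an arbitrary choice.

```java
// Sketch only: Page, BaseSchedule and HalvingSchedule are hypothetical stand-ins,
// not the library's types.
class SuperCallSketch {
    static class Page {
        int retries = 3;
        long fetchIntervalSeconds = 3600;
    }

    // Stand-in for the documented base behavior: reset the retry counter.
    static class BaseSchedule {
        void setFetchSchedule(Page page) {
            page.retries = 0;
        }
    }

    // Hypothetical subclass: keeps the reset by calling super, then applies its own policy.
    static class HalvingSchedule extends BaseSchedule {
        @Override
        void setFetchSchedule(Page page) {
            super.setFetchSchedule(page); // without this call the retry counter would stay at 3
            page.fetchIntervalSeconds = Math.max(1, page.fetchIntervalSeconds / 2);
        }
    }

    public static void main(String[] args) {
        Page page = new Page();
        new HalvingSchedule().setFetchSchedule(page);
        System.out.println("retries=" + page.retries + ", interval=" + page.fetchIntervalSeconds + "s");
    }
}
```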
-
setPageRetrySchedule
Unit setPageRetrySchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. The default implementation sets the next fetch time 1 day in the future and increases the retry counter.
- Parameters:
page - WebPage to retry
prevFetchTime - previous fetch time
prevModifiedTime - previous modified time
fetchTime - current fetch time
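As a rough illustration of the default retry behavior described above (next fetch one day later, retry counter increased), the following sketch uses a hypothetical Page stand-in with only two fields. It assumes the one-day offset is measured from the current fetch time, which the description does not state explicitly.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch only: Page is a hypothetical stand-in for WebPage.
class RetryScheduleSketch {
    static class Page {
        Instant fetchTime;
        int retries;
    }

    // Assumption: "1 day in the future" is counted from the current fetch time.
    static void setPageRetrySchedule(Page page, Instant fetchTime) {
        page.fetchTime = fetchTime.plus(Duration.ofDays(1));
        page.retries++;
    }

    public static void main(String[] args) {
        Page page = new Page();
        setPageRetrySchedule(page, Instant.now());
        System.out.println("next fetch at " + page.fetchTime + ", retries=" + page.retries);
    }
}
```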
-
setPageGoneSchedule
Unit setPageGoneSchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. The default implementation increases fetchInterval by 50%, but the value may never exceed maxInterval.
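The 50% increase with an upper bound can be worked through in a few lines. This is a sketch of the arithmetic only; nextInterval is a hypothetical helper name, and the 30, 80 and 90 day values are made up for the example.

```java
import java.time.Duration;

// Sketch only: nextInterval is a hypothetical helper, not part of the library.
class GoneScheduleSketch {
    // Grows the interval by 50% and clamps the result to maxInterval.
    static Duration nextInterval(Duration current, Duration maxInterval) {
        Duration grown = current.multipliedBy(3).dividedBy(2);
        return grown.compareTo(maxInterval) > 0 ? maxInterval : grown;
    }

    public static void main(String[] args) {
        Duration max = Duration.ofDays(90);
        // 30 days grows to 45 days, which is below the cap.
        System.out.println(nextInterval(Duration.ofDays(30), max).toDays());
        // 80 days would grow to 120 days, so it is clamped to 90 days.
        System.out.println(nextInterval(Duration.ofDays(80), max).toDays());
    }
}
```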
-
estimatePrevFetchTime
Instant estimatePrevFetchTime(WebPage page)
This method returns the last fetch time of the WebPage.
-
shouldFetch
Boolean shouldFetch(WebPage page, Instant now)
This method indicates whether the page is suitable for selection in the current fetchlist. NOTE: a true return value does not guarantee that the page will be fetched; it just allows the page to be included in the further selection process based on scores. The default implementation checks fetchTime, if it is higher than the …
- Parameters:
page - Web page to fetch
now - reference time (usually set to the time when the fetchlist generation process was started).
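A minimal sketch of a time-based eligibility check follows. It assumes that a page qualifies once its scheduled fetchTime is not later than the reference time now. That rule is an assumption made for illustration, not a statement of what the default implementation does, and the method here takes a bare Instant rather than a WebPage.

```java
import java.time.Instant;

// Sketch only: takes the scheduled fetch time directly instead of a WebPage.
class ShouldFetchSketch {
    // Assumption: eligible when the scheduled fetchTime is not after the reference time.
    static boolean shouldFetch(Instant fetchTime, Instant now) {
        return !fetchTime.isAfter(now);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(shouldFetch(now.minusSeconds(60), now)); // scheduled in the past
        System.out.println(shouldFetch(now.plusSeconds(60), now));  // scheduled in the future
    }
}
```

A true result only marks the page as a candidate; as noted above, score-based selection still decides whether it is fetched.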
-
forceRefetch
Unit forceRefetch(WebPage page, Instant prevFetchTime, Boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page text, so that it forces refetching.
- Parameters:
asap - if true, force refetch as soon as possible; this sets the fetchTime to now.
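The reset can be pictured with the sketch below. Page is a hypothetical stand-in for WebPage, and the reset values (a 30-day DEFAULT_INTERVAL, Instant.EPOCH for modifiedTime, null page text) are assumptions chosen for the example. What happens to fetchTime when asap is false is not stated above, so the sketch leaves it unchanged in that case.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch only: Page and all reset values are hypothetical, not the library's.
class ForceRefetchSketch {
    static class Page {
        Instant fetchTime;
        Duration fetchInterval;
        Instant modifiedTime;
        int retriesSinceFetch;
        String pageText;
    }

    // Assumed default interval; the real value would come from configuration.
    static final Duration DEFAULT_INTERVAL = Duration.ofDays(30);

    static void forceRefetch(Page page, boolean asap, Instant now) {
        page.fetchInterval = DEFAULT_INTERVAL;
        page.modifiedTime = Instant.EPOCH;
        page.retriesSinceFetch = 0;
        page.pageText = null;
        if (asap) {
            page.fetchTime = now; // documented behavior for asap = true
        }
    }

    public static void main(String[] args) {
        Page page = new Page();
        page.retriesSinceFetch = 2;
        page.pageText = "stale";
        forceRefetch(page, true, Instant.now());
        System.out.println("fetchTime=" + page.fetchTime + ", retries=" + page.retriesSinceFetch);
    }
}
```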