-
- All Implemented Interfaces:
-
ai.platon.pulsar.common.config.Parameterized, ai.platon.pulsar.crawl.schedule.FetchSchedule
public class AdaptiveFetchSchedule extends AbstractFetchSchedule
This class implements an adaptive re-fetch algorithm. It works as follows:
- for pages that have changed since the last fetchTime, decrease their fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
- for pages that haven't changed since the last fetchTime, increase their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
- if the SYNC_DELTA property is true, then:
  - calculate delta = fetchTime - modifiedTime
  - try to synchronize with the time of change, by shifting the next fetchTime by a fraction of the difference between the last modification time and the last fetch time, i.e. the next fetch time will be set to fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
  - if the adjusted fetch interval is bigger than the delta, then fetchInterval = delta.
- the value of fetchInterval may not be smaller than MIN_INTERVAL (default is 1 minute).
- the value of fetchInterval may not be bigger than MAX_INTERVAL (default is 365 days).
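The rules above can be sketched as a small standalone Java class. This is an illustration of the description only, not the library's code: the class name `AdaptiveIntervalSketch`, the `Result` holder, the `adjust` signature and the value of `SYNC_DELTA_RATE` are assumptions, and "by a factor of" is read as multiplying by `1 - DEC_FACTOR` or `1 + INC_FACTOR`. The synchronization step follows the text literally, so the real implementation may differ in detail.

```java
import java.time.Duration;
import java.time.Instant;

/** Standalone sketch of the adaptive re-fetch rules; not the library's code. */
final class AdaptiveIntervalSketch {
    static final double DEC_FACTOR = 0.2;       // default stated in the description
    static final double INC_FACTOR = 0.2;       // default stated in the description
    static final double SYNC_DELTA_RATE = 0.2;  // assumed value, for illustration only
    static final Duration MIN_INTERVAL = Duration.ofMinutes(1);
    static final Duration MAX_INTERVAL = Duration.ofDays(365);

    /** Hypothetical holder for the adjusted interval and the next fetch time. */
    static final class Result {
        final Duration fetchInterval;
        final Instant nextFetchTime;

        Result(Duration fetchInterval, Instant nextFetchTime) {
            this.fetchInterval = fetchInterval;
            this.nextFetchTime = nextFetchTime;
        }
    }

    static Result adjust(Duration fetchInterval, Instant fetchTime, Instant modifiedTime,
                         boolean changed, boolean syncDelta) {
        // Shrink the interval for changed pages, grow it for unchanged ones.
        double millis = fetchInterval.toMillis()
                * (changed ? 1.0 - DEC_FACTOR : 1.0 + INC_FACTOR);
        Instant refTime = fetchTime;

        if (syncDelta) {
            // delta = fetchTime - modifiedTime
            long delta = Duration.between(modifiedTime, fetchTime).toMillis();
            // If the adjusted interval is bigger than delta, use delta instead.
            if (millis > delta) {
                millis = delta;
            }
            // Shift the reference time back by delta * SYNC_DELTA_RATE.
            refTime = fetchTime.minusMillis(Math.round(delta * SYNC_DELTA_RATE));
        }

        // Keep the interval within [MIN_INTERVAL, MAX_INTERVAL].
        long clamped = Math.max(MIN_INTERVAL.toMillis(),
                Math.min(MAX_INTERVAL.toMillis(), Math.round(millis)));
        Duration adjusted = Duration.ofMillis(clamped);
        return new Result(adjusted, refTime.plus(adjusted));
    }
}
```

For example, with a 10 minute interval, an unchanged page, synchronization enabled and a modification 5 minutes before the fetch, the sketch first grows the interval to 12 minutes, then caps it at delta (5 minutes) and schedules the next fetch at fetchTime + 5 minutes - 1 minute.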
NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize the algorithm, so that the fetch interval either increases or decreases infinitely, with little relevance to the page changes. Please use . method to test the values before applying them in a production system.
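One way to try out factor values before production use is a small offline simulation. The harness below is a hypothetical, standalone sketch and not the test method the note refers to: the class name `FactorTrialSketch`, the `simulate` signature and the fixed-period page-change model are all assumptions. It applies only the multiplicative increase and decrease rules with the MIN_INTERVAL and MAX_INTERVAL bounds, and returns the interval chosen after each simulated fetch so the trajectory can be inspected.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical harness for trying INC_FACTOR / DEC_FACTOR values offline. */
final class FactorTrialSketch {
    static final long MIN_INTERVAL_SEC = 60L;                 // 1 minute
    static final long MAX_INTERVAL_SEC = 365L * 24 * 3600;    // 365 days

    /**
     * Simulates fetches of a page that changes every changePeriodSec seconds
     * (an assumed, simplified change model) and returns the fetch interval,
     * in seconds, chosen after each fetch.
     */
    static List<Long> simulate(double incFactor, double decFactor,
                               long startIntervalSec, long changePeriodSec, int fetches) {
        List<Long> intervals = new ArrayList<>();
        double interval = startIntervalSec;
        long lastFetch = 0L;
        for (int i = 0; i < fetches; i++) {
            long now = lastFetch + Math.round(interval);
            // The page counts as changed if a change boundary fell in (lastFetch, now].
            boolean changed = (now / changePeriodSec) > (lastFetch / changePeriodSec);
            interval *= changed ? (1.0 - decFactor) : (1.0 + incFactor);
            interval = Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, interval));
            intervals.add(Math.round(interval));
            lastFetch = now;
        }
        return intervals;
    }
}
```

Calling `FactorTrialSketch.simulate(0.2, 0.2, 3600, 86400, 50)` returns the interval trajectory for a page that changes daily; comparing trajectories for different factor values shows how quickly the interval settles or runs into the bounds under this simplified model.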
-
-
Field Summary
Fields
Modifier and Type | Field | Description
private final Duration | maxFetchInterval |
private final ImmutableConfig | conf |
private final MiscMessageWriter | messageWriter |
-
Constructor Summary
Constructors
Constructor | Description
AdaptiveFetchSchedule(ImmutableConfig conf, MiscMessageWriter messageWriter) |
-
Method Summary
Modifier and Type | Method | Description
Duration | getMaxFetchInterval() |
final ImmutableConfig | getConf() |
final MiscMessageWriter | getMessageWriter() |
Params | getParams() |
Unit | setFetchSchedule(WebPage page, ModifyInfo m) | Sets the fetchInterval and fetchTime on a successfully fetched page.
Unit | setPageGoneSchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime) | This method specifies how to schedule refetching of pages marked as GONE.
-
Methods inherited from class ai.platon.pulsar.crawl.schedule.AbstractFetchSchedule
estimatePrevFetchTime, forceRefetch, initializeSchedule, setPageRetrySchedule, shouldFetch
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
-
Constructor Detail
-
AdaptiveFetchSchedule
AdaptiveFetchSchedule(ImmutableConfig conf, MiscMessageWriter messageWriter)
-
-
Method Detail
-
getMaxFetchInterval
Duration getMaxFetchInterval()
-
getConf
final ImmutableConfig getConf()
-
getMessageWriter
final MiscMessageWriter getMessageWriter()
-
getParams
Params getParams()
-
setFetchSchedule
Unit setFetchSchedule(WebPage page, ModifyInfo m)
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter - extending classes should call super.setFetchSchedule() to preserve this behavior.
- Parameters:
page - The Web page
m - The modification info
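The advice about calling super.setFetchSchedule() can be illustrated with stand-in types. Everything in this sketch is hypothetical: `PageStub`, `BaseScheduleStub` and `CustomScheduleStub` only mimic the behavior described above (the base method resets the retry counter) and are not the library's WebPage or schedule classes.

```java
import java.time.Duration;

/** Stand-in types illustrating why an override should call super first. */
final class SuperCallSketch {
    /** Hypothetical stand-in for WebPage, holding only what this sketch needs. */
    static class PageStub {
        int fetchRetries;
        Duration fetchInterval = Duration.ofDays(1);
    }

    /** Hypothetical stand-in for the base schedule: resets the retry counter. */
    static class BaseScheduleStub {
        void setFetchSchedule(PageStub page) {
            page.fetchRetries = 0;
        }
    }

    /** A custom schedule that keeps the reset by delegating to super. */
    static class CustomScheduleStub extends BaseScheduleStub {
        @Override
        void setFetchSchedule(PageStub page) {
            super.setFetchSchedule(page);   // preserves the retry-counter reset
            page.fetchInterval = page.fetchInterval.multipliedBy(2);  // custom rule
        }
    }
}
```

If the override skipped the super call, the stand-in page would keep its old retry count after a successful fetch, which is the situation the note warns about.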
-
setPageGoneSchedule
Unit setPageGoneSchedule(WebPage page, Instant prevFetchTime, Instant prevModifiedTime, Instant fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. The default implementation increases fetchInterval by 50%, but the value may never exceed maxInterval.
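The interval rule for GONE pages can be sketched as follows. This is a minimal standalone illustration, assuming the rule is exactly "add 50%, then cap"; the class and method names are hypothetical, and the other parameters of the real method (prevFetchTime, prevModifiedTime, fetchTime) are not modelled.

```java
import java.time.Duration;

/** Sketch of the "increase by 50%, capped at the maximum" rule for GONE pages. */
final class PageGoneSketch {
    static Duration nextGoneInterval(Duration fetchInterval, Duration maxInterval) {
        // Increase the current interval by 50%.
        Duration increased = fetchInterval.plus(fetchInterval.dividedBy(2));
        // Never exceed the configured maximum.
        return increased.compareTo(maxInterval) > 0 ? maxInterval : increased;
    }
}
```

With a maximum of 365 days, a 10 day interval becomes 15 days, while a 300 day interval is capped at 365 days.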
-