All Implemented Interfaces:
ai.platon.pulsar.context.PulsarContext, java.lang.AutoCloseable
public abstract class AbstractPulsarContext implements PulsarContext, AutoCloseable
The main entry point for pulsar functionality.
A PulsarContext can be used to inject, fetch, load, parse, and store webpages.
-
-
Nested Class Summary
public class AbstractPulsarContext.Companion
-
Field Summary
private final Integer id
private final ImmutableConfig unmodifiedConfig
private final CrawlUrlNormalizers urlNormalizers
private final WebDb webDb
private final GlobalCacheFactory globalCacheFactory
private final InjectComponent injectComponent
private final BatchFetchComponent fetchComponent
private final ParseComponent parseComponent
private final UpdateComponent updateComponent
private final LoadComponent loadComponent
private final UrlPool crawlPool
private final CrawlLoops crawlLoops
private final Long startTime
private final Boolean isActive
private final ConcurrentSkipListMap<Integer, PulsarSession> sessions
private final AbstractApplicationContext applicationContext
private final PulsarEnvironment pulsarEnvironment
-
Constructor Summary
AbstractPulsarContext(AbstractApplicationContext applicationContext, PulsarEnvironment pulsarEnvironment)
-
Method Summary
Integer getId()
ImmutableConfig getUnmodifiedConfig()
CrawlUrlNormalizers getUrlNormalizers()
WebDb getWebDb()
GlobalCacheFactory getGlobalCacheFactory()
InjectComponent getInjectComponent()
BatchFetchComponent getFetchComponent()
ParseComponent getParseComponent()
UpdateComponent getUpdateComponent()
LoadComponent getLoadComponent()
UrlPool getCrawlPool()
CrawlLoops getCrawlLoops()
final Long getStartTime()
    The start time
final Boolean getIsActive()
final ConcurrentSkipListMap<Integer, PulsarSession> getSessions()
    All open sessions
AbstractApplicationContext getApplicationContext()
PulsarEnvironment getPulsarEnvironment()
final <T extends Any> T getBean(KClass<T> requiredType)
final <T extends Any> T getBean()
final <T extends Any> T getBeanOrNull(KClass<T> requiredType)
final <T extends Any> T getBeanOrNull()
abstract AbstractPulsarSession createSession()
Unit closeSession(PulsarSession session)
Unit registerClosable(AutoCloseable closable)
    Close objects when the session closes
final Unit clearCaches()
NormUrl normalize(String url, LoadOptions options, Boolean toItemOption)
    Normalize a url. The url can be one of the following: a normal url, a configured url, a base64 encoded url, a base64 encoded configured url
List<NormUrl> normalize(Iterable<String> urls, LoadOptions options, Boolean toItemOption)
    Normalize urls, remove invalid ones
NormUrl normalize(UrlAware url, LoadOptions options, Boolean toItemOption)
    Normalize a url.
List<NormUrl> normalize(Collection<UrlAware> urls, LoadOptions options, Boolean toItemOption)
    Normalize urls, remove invalid ones
NormUrl normalizeOrNull(String url, LoadOptions options, Boolean toItemOption)
NormUrl normalizeOrNull(UrlAware url, LoadOptions options, Boolean toItemOption)
WebPage inject(String url)
    Inject a url
WebPage inject(NormUrl url)
WebPage get(String url)
    Get a webpage from the storage
WebPage getOrNull(String url)
    Get a webpage from the storage
Boolean exists(String url)
    Check if a page exists in the storage
CheckState fetchState(WebPage page, LoadOptions options)
    Check the fetch state of a page
Iterator<WebPage> scan(String urlPrefix)
    Scan pages in the storage
Iterator<WebPage> scan(String urlPrefix, Iterable<GWebPage.Field> fields)
    Scan pages in the storage
Iterator<WebPage> scan(String urlPrefix, Array<String> fields)
    Scan pages in the storage
WebPage load(String url, LoadOptions options)
    Load a page with specified options, see LoadOptions for all options
WebPage load(URL url, LoadOptions options)
    Load a url with specified options, see LoadOptions for all options
WebPage load(NormUrl url)
    Load a url, options can be specified following the url, see LoadOptions for all options
WebPage loadDeferred(NormUrl url)
List<WebPage> loadAll(Iterable<String> urls, LoadOptions options)
    Load a batch of urls with the specified options.
List<WebPage> loadAll(Iterable<NormUrl> urls)
CompletableFuture<WebPage> loadAsync(NormUrl url)
List<CompletableFuture<WebPage>> loadAllAsync(Iterable<NormUrl> urls)
AbstractPulsarContext submit(UrlAware url)
AbstractPulsarContext submitAll(Iterable<UrlAware> urls)
FeaturedDocument parse(WebPage page)
    Parse the WebPage using parseComponent
Unit persist(WebPage page)
    Persist the page into the storage
Unit delete(String url)
    Delete the page from the storage
Unit delete(WebPage page)
    Delete the page from the storage
Unit flush()
    Flush the storage
Unit await()
    Wait until there are no tasks in the main loop
Unit registerShutdownHook()
    Register a shutdown hook with the JVM runtime, closing this context on JVM shutdown unless it has already been closed at that time.
Unit close()
    Close this pulsar context. Delegates to doClose() for the actual closing procedure.
-
Constructor Detail
-
AbstractPulsarContext
AbstractPulsarContext(AbstractApplicationContext applicationContext, PulsarEnvironment pulsarEnvironment)
-
-
Method Detail
-
getUnmodifiedConfig
ImmutableConfig getUnmodifiedConfig()
-
getUrlNormalizers
CrawlUrlNormalizers getUrlNormalizers()
-
getWebDb
WebDb getWebDb()
-
getGlobalCacheFactory
GlobalCacheFactory getGlobalCacheFactory()
-
getInjectComponent
InjectComponent getInjectComponent()
-
getFetchComponent
BatchFetchComponent getFetchComponent()
-
getParseComponent
ParseComponent getParseComponent()
-
getUpdateComponent
UpdateComponent getUpdateComponent()
-
getLoadComponent
LoadComponent getLoadComponent()
-
getCrawlPool
UrlPool getCrawlPool()
-
getCrawlLoops
CrawlLoops getCrawlLoops()
-
getStartTime
final Long getStartTime()
The start time
-
getIsActive
final Boolean getIsActive()
-
getSessions
final ConcurrentSkipListMap<Integer, PulsarSession> getSessions()
All open sessions
-
getApplicationContext
AbstractApplicationContext getApplicationContext()
-
getPulsarEnvironment
PulsarEnvironment getPulsarEnvironment()
-
getBeanOrNull
final <T extends Any> T getBeanOrNull(KClass<T> requiredType)
-
getBeanOrNull
final <T extends Any> T getBeanOrNull()
-
createSession
abstract AbstractPulsarSession createSession()
-
closeSession
Unit closeSession(PulsarSession session)
-
registerClosable
Unit registerClosable(AutoCloseable closable)
Close objects when the session closes
-
clearCaches
final Unit clearCaches()
-
normalize
NormUrl normalize(String url, LoadOptions options, Boolean toItemOption)
Normalize a url. The url can be one of the following:
a normal url
a configured url
a base64 encoded url
a base64 encoded configured url
A url can be configured by appending arguments to it, and it can also be used with a LoadOptions. If both trailing arguments and LoadOptions are present, the LoadOptions overrides the trailing arguments, but default values in LoadOptions are ignored.
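The override rule above can be illustrated with a small self-contained model. This is only a sketch of the stated rule, not Pulsar's implementation: the class `OptionMergeSketch`, the `-expires` and `-retry` argument names, their default values, and the use of plain maps in place of LoadOptions are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

final class OptionMergeSketch {
    // Hypothetical defaults; the real LoadOptions defines its own fields and defaults.
    static final Map<String, String> DEFAULTS = Map.of("-expires", "30d", "-retry", "3");

    /** Parses the trailing "-name value" pairs of a configured url; the first token is the url itself. */
    static Map<String, String> parseArgs(String configuredUrl) {
        String[] tokens = configuredUrl.trim().split("\\s+");
        Map<String, String> args = new LinkedHashMap<>();
        for (int i = 1; i + 1 < tokens.length; i += 2) {
            args.put(tokens[i], tokens[i + 1]);
        }
        return args;
    }

    /** Explicit options override url arguments, except where the option still holds its default value. */
    static Map<String, String> merge(Map<String, String> urlArgs, Map<String, String> options) {
        Map<String, String> merged = new LinkedHashMap<>(urlArgs);
        for (Map.Entry<String, String> e : options.entrySet()) {
            boolean isDefault = Objects.equals(DEFAULTS.get(e.getKey()), e.getValue());
            if (!isDefault) {
                merged.put(e.getKey(), e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> urlArgs = parseArgs("https://example.com/ -expires 1d -retry 5");
        Map<String, String> options = new LinkedHashMap<>();
        options.put("-expires", "30d"); // equals the default, so the url argument "1d" is kept
        options.put("-retry", "1");     // non-default, so it overrides the url argument "5"
        System.out.println(merge(urlArgs, options)); // prints {-expires=1d, -retry=1}
    }
}
```

In this model an option set to its default value is indistinguishable from an option that was never set, which is why trailing arguments win in that case.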
-
normalize
List<NormUrl> normalize(Iterable<String> urls, LoadOptions options, Boolean toItemOption)
Normalize urls, remove invalid ones
- Parameters:
urls - The urls to normalize
options - The LoadOptions applied to each url
toItemOption - Whether the LoadOptions is converted to item load options
-
normalize
NormUrl normalize(UrlAware url, LoadOptions options, Boolean toItemOption)
Normalize a url.
If both url arguments and LoadOptions are present, the LoadOptions overrides the trailing arguments, but default values in LoadOptions are ignored.
-
normalize
List<NormUrl> normalize(Collection<UrlAware> urls, LoadOptions options, Boolean toItemOption)
Normalize urls, remove invalid ones
- Parameters:
urls - The urls to normalize
options - The LoadOptions applied to each url
toItemOption - Whether the LoadOptions is converted to item load options
-
normalizeOrNull
NormUrl normalizeOrNull(String url, LoadOptions options, Boolean toItemOption)
-
normalizeOrNull
NormUrl normalizeOrNull(UrlAware url, LoadOptions options, Boolean toItemOption)
-
fetchState
CheckState fetchState(WebPage page, LoadOptions options)
Check the fetch state of a page
-
scan
Iterator<WebPage> scan(String urlPrefix, Iterable<GWebPage.Field> fields)
Scan pages in the storage
-
load
WebPage load(String url, LoadOptions options)
Load a page with specified options, see LoadOptions for all options
- Parameters:
url - The url followed by options
options - The options
-
load
WebPage load(URL url, LoadOptions options)
Load a url with specified options, see LoadOptions for all options
- Parameters:
url - The url followed by options
options - The options
-
load
WebPage load(NormUrl url)
Load a url, options can be specified following the url, see LoadOptions for all options
- Parameters:
url - The url followed by options
-
loadDeferred
WebPage loadDeferred(NormUrl url)
-
loadAll
List<WebPage> loadAll(Iterable<String> urls, LoadOptions options)
Load a batch of urls with the specified options.
If the options indicate that parallel fetching is preferred, urls are fetched in parallel whenever applicable. If the batch is too large, only a random subset of the urls is fetched immediately; the remaining urls are put into a pending fetch list and are fetched in the background later.
If a page exists neither in local storage nor at the given remote location, WebPage.NIL is returned.
- Parameters:
urls - The urls to load
options - The options
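The handling of oversized batches described above can be pictured as a random split. The following is a minimal sketch under stated assumptions, not the library's code: the class `BatchSplitSketch`, its nested `Split` type, and the `maxImmediate` threshold are hypothetical, since the documentation does not say how large a batch must be to count as too large.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class BatchSplitSketch {
    /** Holds the urls fetched right away and the urls deferred to a pending list. */
    static final class Split {
        final List<String> immediate;
        final List<String> pending;

        Split(List<String> immediate, List<String> pending) {
            this.immediate = immediate;
            this.pending = pending;
        }
    }

    /** Picks a random subset of at most maxImmediate urls; everything else becomes pending. */
    static Split split(List<String> urls, int maxImmediate, Random random) {
        List<String> shuffled = new ArrayList<>(urls);
        Collections.shuffle(shuffled, random);
        int cut = Math.max(0, Math.min(maxImmediate, shuffled.size()));
        List<String> immediate = new ArrayList<>(shuffled.subList(0, cut));
        List<String> pending = new ArrayList<>(shuffled.subList(cut, shuffled.size()));
        return new Split(immediate, pending);
    }

    public static void main(String[] args) {
        List<String> urls = List.of("https://a.example", "https://b.example",
                "https://c.example", "https://d.example", "https://e.example");
        Split s = split(urls, 2, new Random());
        System.out.println("fetch now: " + s.immediate);
        System.out.println("pending:   " + s.pending);
    }
}
```

Every url ends up in exactly one of the two lists, so nothing is dropped; only the timing of the fetch differs.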
-
loadAsync
CompletableFuture<WebPage> loadAsync(NormUrl url)
-
loadAllAsync
List<CompletableFuture<WebPage>> loadAllAsync(Iterable<NormUrl> urls)
-
submit
AbstractPulsarContext submit(UrlAware url)
-
submitAll
AbstractPulsarContext submitAll(Iterable<UrlAware> urls)
-
parse
FeaturedDocument parse(WebPage page)
Parse the WebPage using parseComponent
-
registerShutdownHook
Unit registerShutdownHook()
Register a shutdown hook with the JVM runtime, closing this context on JVM shutdown unless it has already been closed at that time.
Delegates to doClose() for the actual closing procedure.
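The shutdown-hook pattern described here can be sketched with the standard `Runtime.addShutdownHook` API. This is a simplified model, not the class's actual code: `ClosableContextSketch`, the reverse-order closing of registered objects, and the private `doClose()` body are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicBoolean;

final class ClosableContextSketch implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final Deque<AutoCloseable> closables = new ArrayDeque<>();
    private Thread shutdownHook;

    /** Registers an object to be closed together with this context. */
    synchronized void registerClosable(AutoCloseable closable) {
        closables.push(closable);
    }

    /** Registers a JVM shutdown hook once; the hook simply calls close(). */
    synchronized void registerShutdownHook() {
        if (shutdownHook == null) {
            shutdownHook = new Thread(this::close, "context-shutdown-hook");
            Runtime.getRuntime().addShutdownHook(shutdownHook);
        }
    }

    boolean isClosed() {
        return closed.get();
    }

    /** Idempotent: only the first call reaches doClose(), so a later hook run is a no-op. */
    @Override
    public void close() {
        if (closed.compareAndSet(false, true)) {
            doClose();
        }
    }

    /** Closes registered objects in reverse registration order (an assumption of this sketch). */
    private synchronized void doClose() {
        while (!closables.isEmpty()) {
            try {
                closables.pop().close();
            } catch (Exception e) {
                System.err.println("Failed to close resource: " + e);
            }
        }
    }

    public static void main(String[] args) {
        ClosableContextSketch context = new ClosableContextSketch();
        context.registerClosable(() -> System.out.println("resource closed"));
        context.registerShutdownHook();
        context.close(); // closes now; the hook finds the context already closed at JVM exit
    }
}
```

The atomic flag is what makes "unless it has already been closed at that time" hold: whichever of the explicit call or the hook runs first performs the closing procedure, and the other does nothing.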
-
-
-