-
- All Implemented Interfaces:
-
ai.platon.pulsar.common.collect.CrawlableFatLinkCollector, ai.platon.pulsar.common.collect.collector.DataCollector, ai.platon.pulsar.common.collect.collector.PriorityDataCollector, kotlin.Comparable
public class HyperlinkCollector extends AbstractPriorityDataCollector<UrlAware> implements CrawlableFatLinkCollector
Collects hyperlinks from the given seeds. The URLs are restricted by loadArguments and urlNormalizer:
all URLs are restricted by the CSS outLinkSelector
all URLs are restricted by urlPattern
all URLs must either not have been fetched before or have expired against the last version
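The last two restrictions listed above can be illustrated with a small, self-contained sketch. It does not use the pulsar API: `LinkFilterSketch`, `selectCollectable`, and all of its parameters are hypothetical names, and reading "expired" as "the last fetch is older than a fixed expiry window" is an assumption made only for illustration. The CSS outLinkSelector step is left out because it needs an HTML parser.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration; not part of the pulsar API. */
final class LinkFilterSketch {

    /**
     * Keeps the URLs that match urlPattern and that were either never
     * fetched or were last fetched longer ago than the expiry window.
     */
    static List<String> selectCollectable(
            List<String> candidates,
            Pattern urlPattern,
            Map<String, Instant> lastFetchTimes,
            Duration expires,
            Instant now) {
        List<String> result = new ArrayList<>();
        for (String url : candidates) {
            if (!urlPattern.matcher(url).matches()) {
                continue; // rejected by the URL pattern
            }
            Instant lastFetch = lastFetchTimes.get(url);
            boolean neverFetched = lastFetch == null;
            // Assumption: a page is expired once lastFetch + expires lies before now.
            boolean expired = !neverFetched && lastFetch.plus(expires).isBefore(now);
            if (neverFetched || expired) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-02T00:00:00Z");
        List<String> candidates = List.of(
                "https://example.com/item/1",   // never fetched
                "https://example.com/item/2",   // fetched one hour ago
                "https://example.com/item/3",   // fetched three days ago
                "https://example.com/about");   // does not match the pattern
        Map<String, Instant> lastFetchTimes = Map.of(
                "https://example.com/item/2", now.minus(Duration.ofHours(1)),
                "https://example.com/item/3", now.minus(Duration.ofDays(3)));
        List<String> picked = selectCollectable(
                candidates,
                Pattern.compile("https://example\\.com/item/\\d+"),
                lastFetchTimes,
                Duration.ofDays(1),
                now);
        // prints [https://example.com/item/1, https://example.com/item/3]
        System.out.println(picked);
    }
}
```

In this sketch the clock is passed in as a parameter so that the expiry decision is deterministic and easy to test.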
-
-
Field Summary
Fields
Modifier and Type, Field
private UrlNormalizerPipeline urlNormalizer
private String name
private final Integer size
private final Integer estimatedSize
private final ConcurrentSkipListMap<String, CrawlableFatLink> fatLinks
private final PulsarSession session
private final Queue<NormUrl> seeds
private final Integer capacity
private Integer collectCount
private final Duration collectTime
private Integer collectedCount
private String country
private final Instant createTime
private Instant deadTime
private String district
private final Integer estimatedExternalSize
private final Integer externalSize
private Instant firstCollectTime
private final Integer id
private final Boolean isDead
private final Set<String> labels
private String lang
private Instant lastCollectedTime
private final Integer priority
-
Constructor Summary
Constructors
HyperlinkCollector(PulsarSession session, Queue<NormUrl> seeds, Priority13 priority)
-
Method Summary
-
Methods inherited from class ai.platon.pulsar.common.collect.collector.AbstractPriorityDataCollector
collectTo, collectTo, collectTo, compareTo
-
Methods inherited from class ai.platon.pulsar.common.collect.collector.AbstractDataCollector
deepClear
-
Methods inherited from class ai.platon.pulsar.common.collect.HyperlinkCollector
removeAll, removeAll, toString
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
-
Constructor Detail
-
HyperlinkCollector
HyperlinkCollector(PulsarSession session, Queue<NormUrl> seeds, Priority13 priority)
-
-
Method Detail
-
getUrlNormalizer
final UrlNormalizerPipeline getUrlNormalizer()
-
setUrlNormalizer
final Unit setUrlNormalizer(UrlNormalizerPipeline urlNormalizer)
-
getEstimatedSize
Integer getEstimatedSize()
-
getFatLinks
ConcurrentSkipListMap<String, CrawlableFatLink> getFatLinks()
Tracks the status of this batch; a notification is needed when the batch is finished.
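One way to picture this kind of batch tracking is sketched below. It is not the library's implementation: `BatchTrackerSketch`, its `add` and `remove` methods, and the `onFinished` callback are hypothetical names, and treating "the batch is finished" as "the map has become empty" is an assumption.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for illustration; not part of the pulsar API. */
final class BatchTrackerSketch<V> {
    // Pending entries keyed by URL, kept sorted and safe for concurrent access.
    private final ConcurrentSkipListMap<String, V> pending = new ConcurrentSkipListMap<>();
    // Guards the callback so it runs at most once, even with concurrent removals.
    private final AtomicBoolean finished = new AtomicBoolean(false);
    private final Runnable onFinished;

    BatchTrackerSketch(Runnable onFinished) {
        this.onFinished = onFinished;
    }

    /** Registers an entry as part of the batch. */
    void add(String url, V item) {
        pending.put(url, item);
    }

    /**
     * Removes an entry and returns it, or null if it was not tracked.
     * When the removal leaves the batch empty, the callback fires once.
     */
    V remove(String url) {
        V removed = pending.remove(url);
        if (removed != null && pending.isEmpty() && finished.compareAndSet(false, true)) {
            onFinished.run();
        }
        return removed;
    }

    int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        BatchTrackerSketch<String> tracker =
                new BatchTrackerSketch<>(() -> System.out.println("batch finished"));
        tracker.add("https://example.com/a", "A");
        tracker.add("https://example.com/b", "B");
        tracker.remove("https://example.com/a"); // one entry still pending
        tracker.remove("https://example.com/b"); // prints "batch finished"
    }
}
```

The sketch assumes that every entry is added before the last one is removed; a tracker that accepts entries after the batch has emptied would need to reset the flag.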
-
getSession
final PulsarSession getSession()
The pulsar session to use
-
getSeeds
final Queue<NormUrl> getSeeds()
The URLs of the portal pages from which hyperlinks are extracted.
-
getCapacity
Integer getCapacity()
-
getCollectCount
Integer getCollectCount()
-
setCollectCount
Unit setCollectCount(Integer collectCount)
-
getCollectTime
Duration getCollectTime()
-
getCollectedCount
Integer getCollectedCount()
-
setCollectedCount
Unit setCollectedCount(Integer collectedCount)
-
getCountry
String getCountry()
-
setCountry
Unit setCountry(String country)
-
getCreateTime
Instant getCreateTime()
-
getDeadTime
Instant getDeadTime()
-
setDeadTime
Unit setDeadTime(Instant deadTime)
-
getDistrict
String getDistrict()
-
setDistrict
Unit setDistrict(String district)
-
getEstimatedExternalSize
Integer getEstimatedExternalSize()
-
getExternalSize
Integer getExternalSize()
-
getFirstCollectTime
Instant getFirstCollectTime()
-
setFirstCollectTime
Unit setFirstCollectTime(Instant firstCollectTime)
-
getLastCollectedTime
Instant getLastCollectedTime()
-
setLastCollectedTime
Unit setLastCollectedTime(Instant lastCollectedTime)
-
getPriority
Integer getPriority()
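Because the class is listed as implementing kotlin.Comparable and exposes an integer priority, collectors can presumably be ordered against each other. The sketch below shows that idea in isolation. `PriorityOrderSketch` and its nested `Collector` class are hypothetical stand-ins, and the rule that a smaller priority value is served first is an assumption, not something this page states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for illustration; not part of the pulsar API. */
final class PriorityOrderSketch {

    /** A minimal collector-like object carrying a name and an integer priority. */
    static final class Collector implements Comparable<Collector> {
        final String name;
        final int priority;

        Collector(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }

        // Assumption: a smaller priority value sorts first.
        @Override
        public int compareTo(Collector other) {
            return Integer.compare(priority, other.priority);
        }
    }

    /** Returns the collector names in the order a priority queue would serve them. */
    static List<String> drainInOrder(List<Collector> collectors) {
        PriorityQueue<Collector> queue = new PriorityQueue<>(collectors);
        List<String> names = new ArrayList<>();
        while (!queue.isEmpty()) {
            names.add(queue.poll().name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> order = drainInOrder(List.of(
                new Collector("normal", 0),
                new Collector("urgent", -100),
                new Collector("lazy", 100)));
        // prints [urgent, normal, lazy]
        System.out.println(order);
    }
}
```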
-
remove
CrawlableFatLink remove(FatLink fatLink)
-
-
-
-