-
- All Implemented Interfaces:
ai.platon.pulsar.common.config.Parameterized, ai.platon.pulsar.crawl.common.JobInitialized, java.lang.AutoCloseable
public final class PageParser implements Parameterized, JobInitialized, AutoCloseable
-
-
Nested Class Summary
Nested Classes
Modifier and Type    Class                   Description
public enum          PageParser.Counter
public class         PageParser.Companion
-
Field Summary
Fields
Modifier and Type                                    Field              Description
private final ConcurrentSkipListSet<CharSequence>    unparsableTypes
private final LinkFilter                             linkFilter
private final ParserFactory                          parserFactory
private final ImmutableConfig                        conf
private final CrawlFilters                           crawlFilters
private final Signature                              signature
private final MiscMessageWriter                      messageWriter
-
Constructor Summary
Constructors
Constructor    Description
PageParser(ParserFactory parserFactory, ImmutableConfig conf)
PageParser(ImmutableConfig conf)
PageParser(ParserFactory parserFactory, ImmutableConfig conf, CrawlFilters crawlFilters, Signature signature, MiscMessageWriter messageWriter)
-
Method Summary
Modifier and Type                            Method                         Description
final ConcurrentSkipListSet<CharSequence>    getUnparsableTypes()
final LinkFilter                             getLinkFilter()
final ParserFactory                          getParserFactory()
final ImmutableConfig                        getConf()
final CrawlFilters                           getCrawlFilters()
final Signature                              getSignature()
final MiscMessageWriter                      getMessageWriter()
Unit                                         setup(ImmutableConfig jobConf)
Params                                       getParams()
final ParseResult                            parse(WebPage page)            Parses the given web page and stores the parsed content within the page.
Unit                                         close()
-
Constructor Detail
-
PageParser
PageParser(ParserFactory parserFactory, ImmutableConfig conf)
-
PageParser
PageParser(ImmutableConfig conf)
- Parameters:
conf - The configuration
-
PageParser
PageParser(ParserFactory parserFactory, ImmutableConfig conf, CrawlFilters crawlFilters, Signature signature, MiscMessageWriter messageWriter)
-
-
Method Detail
-
getUnparsableTypes
final ConcurrentSkipListSet<CharSequence> getUnparsableTypes()
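The getter returns a `ConcurrentSkipListSet<CharSequence>`. The sketch below shows how a set of that type behaves, using only the Java standard library. The helper `newTypeSet`, the explicit `CharSequence::compare` ordering (Java 11+), and the content-type strings are illustrative assumptions; this page does not say what the set's entries look like or how PageParser constructs it.

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;

public class UnparsableTypesSketch {
    // Hypothetical helper: builds a set of the same type that
    // getUnparsableTypes() returns. An explicit comparator lets String
    // and other CharSequence implementations share one ordering.
    static ConcurrentSkipListSet<CharSequence> newTypeSet() {
        Comparator<CharSequence> order = CharSequence::compare;
        return new ConcurrentSkipListSet<>(order);
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<CharSequence> types = newTypeSet();

        // Assumed entries: content types that could not be parsed.
        types.add("image/png");
        types.add("application/zip");
        types.add("image/png"); // duplicate, ignored by the set

        System.out.println(types.size());                   // 2
        System.out.println(types.first());                  // application/zip
        System.out.println(types.contains("image/png"));    // true
    }
}
```

A skip-list set keeps its entries sorted and supports concurrent reads and writes without external locking, which fits a collection that several parsing threads might update at once.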
-
getLinkFilter
final LinkFilter getLinkFilter()
-
getParserFactory
final ParserFactory getParserFactory()
-
getConf
final ImmutableConfig getConf()
-
getCrawlFilters
final CrawlFilters getCrawlFilters()
-
getSignature
final Signature getSignature()
-
getMessageWriter
final MiscMessageWriter getMessageWriter()
-
getParams
Params getParams()
-
parse
final ParseResult parse(WebPage page)
Parses the given web page and stores the parsed content within the page. If the page has a meta-redirect, it is added to the outlinks.
- Parameters:
page - The web page
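Because PageParser implements AutoCloseable, a caller can scope it with try-with-resources. The sketch below illustrates that pattern only. `WebPage`, `ParseResult`, and `PageParser` are minimal stand-ins declared in the block so that it compiles on its own; their fields, the no-argument constructor, and the tag-stripping "parser" are placeholders, not the library's implementation.

```java
public class PageParserUsageSketch {
    // Stand-in for WebPage: holds raw content and the parsed result.
    static final class WebPage {
        final String rawContent;
        String parsedContent; // filled in by parse()
        WebPage(String rawContent) { this.rawContent = rawContent; }
    }

    // Stand-in for ParseResult: only a success flag (assumed field).
    static final class ParseResult {
        final boolean success;
        ParseResult(boolean success) { this.success = success; }
    }

    // Stand-in for PageParser: mirrors parse(WebPage) and close().
    static final class PageParser implements AutoCloseable {
        boolean closed = false;

        ParseResult parse(WebPage page) {
            // Placeholder parsing: strip tags and store the text in the page.
            page.parsedContent = page.rawContent.replaceAll("<[^>]*>", "").trim();
            return new ParseResult(!page.parsedContent.isEmpty());
        }

        @Override
        public void close() { closed = true; }
    }

    // Parses one page and closes the parser when the block exits.
    static ParseResult parseOnce(WebPage page) {
        try (PageParser parser = new PageParser()) {
            return parser.parse(page);
        }
    }

    public static void main(String[] args) {
        WebPage page = new WebPage("<p>hello</p>");
        ParseResult result = parseOnce(page);
        System.out.println(result.success + " " + page.parsedContent);
    }
}
```

The caller reads the parsed content back from the same `WebPage` it passed in, which matches the description that `parse` stores its result within the page.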
-