public final class CrawlUrlNormalizers
This class uses a "chained filter" pattern to run the defined normalizers. Different lists of normalizers may be defined for different "scopes", that is, the contexts in which they are used (note, however, that normalizers must first be activated through the "plugin.include" property).
There is one global scope defined by default, which consists of all active normalizers. The order in which these normalizers are executed may be defined in the "urlnormalizer.order" property, which lists implementation classes separated by spaces (if this property is missing, normalizers are run in random order). If more normalizers are activated than are explicitly named on this list, the remaining ones are run in random order after the ones specified on the list have been executed.
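The ordering rule above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the class `NormalizerOrderSketch`, its `order` method, and the `com.example.*` class names are hypothetical, and the sketch keeps the unnamed normalizers in activation order as one possible arbitrary order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: orders active normalizer class names according to
// a space-separated "urlnormalizer.order" value.
final class NormalizerOrderSketch {

    static List<String> order(List<String> active, String orderProperty) {
        List<String> result = new ArrayList<>();
        if (orderProperty != null && !orderProperty.trim().isEmpty()) {
            // Named normalizers come first, in the order they are listed.
            for (String name : orderProperty.trim().split("\\s+")) {
                if (active.contains(name) && !result.contains(name)) {
                    result.add(name);
                }
            }
        }
        // Remaining active normalizers follow; their relative order is arbitrary
        // (here: activation order, which is an assumption of this sketch).
        for (String name : active) {
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> active =
                Arrays.asList("com.example.A", "com.example.B", "com.example.C");
        // prints [com.example.C, com.example.A, com.example.B]
        System.out.println(order(active, "com.example.C com.example.A"));
    }
}
```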
You can define a set of contexts (or scopes) in which normalizers may be called. Each scope can have its own list of normalizers (defined in the "urlnormalizer.scope.<scope_name>" property) and its own order (defined in the "urlnormalizer.order.<scope_name>" property). If any of these properties are missing, default settings are used for the global scope.
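The fallback from a scope-specific property to the global one might look like the sketch below. It uses a plain `Map` in place of the real configuration object; the class `ScopeConfigSketch`, the method `orderFor`, the scope name "fetcher", and the `com.example.*` values are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: resolve the normalizer order for a scope, falling back
// to the global "urlnormalizer.order" property when no scoped value exists.
final class ScopeConfigSketch {

    static String orderFor(Map<String, String> conf, String scope) {
        String scoped = conf.get("urlnormalizer.order." + scope);
        return scoped != null ? scoped : conf.get("urlnormalizer.order");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("urlnormalizer.order", "com.example.A com.example.B");
        conf.put("urlnormalizer.order.fetcher", "com.example.B");

        System.out.println(orderFor(conf, "fetcher")); // scoped value
        System.out.println(orderFor(conf, "other"));   // falls back to the global value
    }
}
```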
If no normalizers are required for a given scope, ai.platon.pulsar.crawl.net.urlnormalizer.pass.PassURLNormalizer should be used.
Each normalizer may further select among many configurations, depending on the scope in which it is called, because the scope name is passed as a parameter to each normalizer. You can also use the same normalizer for many scopes.
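To illustrate how a normalizer can use the scope name it receives, here is a small sketch. The interface `ScopedNormalizerSketch` and both implementations are hypothetical stand-ins; the real CrawlUrlNormalizer interface and PassURLNormalizer class may have different signatures. The scope name "fetcher" is likewise an assumption.

```java
// Hypothetical stand-in for a normalizer that receives the scope name.
interface ScopedNormalizerSketch {
    String normalize(String url, String scope);
}

// A pass-through normalizer: returns every URL unchanged, for scopes
// that need no normalization.
final class PassThroughSketch implements ScopedNormalizerSketch {
    @Override
    public String normalize(String url, String scope) {
        return url;
    }
}

// A normalizer that picks its behaviour from the scope: it strips the
// fragment ("#...") only when called in the hypothetical "fetcher" scope.
final class FragmentStripperSketch implements ScopedNormalizerSketch {
    @Override
    public String normalize(String url, String scope) {
        if (!"fetcher".equals(scope)) {
            return url;
        }
        int hash = url.indexOf('#');
        return hash >= 0 ? url.substring(0, hash) : url;
    }
}
```

The same `FragmentStripperSketch` instance can be registered for several scopes, since the behaviour is chosen per call rather than per instance.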
Several scopes have been defined, and various AppConstants CLI tools will first try to use scope-specific normalizers (falling back to the default configuration if the scope-specific configuration is missing).
Normalizers may be run several times, so that modifications introduced by normalizers at the end of the list can be further reduced by normalizers executed at the beginning. By default this loop is executed just once; if you want to ensure that all possible combinations have been applied, you may want to run this loop up to as many times as there are activated normalizers. The loop count can be configured through the "urlnormalizer.loop.count" property. As soon as a pass leaves the URL unchanged, the loop stops and the result is returned.
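The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: normalizers are modelled as `UnaryOperator<String>`, and the class `NormalizeLoopSketch`, the method `demoChain`, and the two toy normalizers are hypothetical. The example shows why a second pass can matter: lowercasing runs last, so the index-page stripper only matches on the next pass.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the chained-filter loop with early termination.
final class NormalizeLoopSketch {

    static String normalize(String url, List<UnaryOperator<String>> chain, int loopCount) {
        String current = url;
        for (int pass = 0; pass < loopCount; pass++) {
            String before = current;
            for (UnaryOperator<String> normalizer : chain) {
                current = normalizer.apply(current);
            }
            // Stop as soon as a full pass leaves the URL unchanged.
            if (current.equals(before)) {
                break;
            }
        }
        return current;
    }

    // Two toy normalizers: strip a trailing "/index.html", then lowercase.
    static List<UnaryOperator<String>> demoChain() {
        String suffix = "/index.html";
        UnaryOperator<String> stripIndex =
                u -> u.endsWith(suffix) ? u.substring(0, u.length() - suffix.length()) : u;
        UnaryOperator<String> lowerCase = u -> u.toLowerCase(Locale.ROOT);
        return Arrays.asList(stripIndex, lowerCase);
    }

    public static void main(String[] args) {
        String url = "http://example.com/a/INDEX.HTML";
        // One pass: prints http://example.com/a/index.html
        System.out.println(normalize(url, demoChain(), 1));
        // Up to three passes: prints http://example.com/a
        System.out.println(normalize(url, demoChain(), 3));
    }
}
```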
-
-
Nested Class Summary
Nested Classes
Modifier and Type    Class                            Description
public class         CrawlUrlNormalizers.Companion
-
Field Summary
Fields
Modifier and Type                         Field             Description
private final List<CrawlUrlNormalizer>    urlNormalizers
private final String                      scope
private final ImmutableConfig             conf
-
Constructor Summary
Constructors
Constructor                                                                                         Description
CrawlUrlNormalizers(ImmutableConfig conf)
CrawlUrlNormalizers(List<CrawlUrlNormalizer> urlNormalizers, String scope, ImmutableConfig conf)
-
Method Summary
Modifier and Type                 Method                                 Description
final List<CrawlUrlNormalizer>    getUrlNormalizers()
final String                      getScope()
final ImmutableConfig             getConf()
final List<CrawlUrlNormalizer>    getURLNormalizers(String scope)        TODO : not implemented
final CrawlUrlNormalizer          findByClassName(String name)
final String                      normalize(String url, String scope)    Normalize
final String                      normalize(String url)                  Normalize
String                            toString()
-
Constructor Detail
-
CrawlUrlNormalizers
CrawlUrlNormalizers(ImmutableConfig conf)
-
CrawlUrlNormalizers
CrawlUrlNormalizers(List<CrawlUrlNormalizer> urlNormalizers, String scope, ImmutableConfig conf)
-
-
Method Detail
-
getUrlNormalizers
final List<CrawlUrlNormalizer> getUrlNormalizers()
-
getConf
final ImmutableConfig getConf()
-
getURLNormalizers
final List<CrawlUrlNormalizer> getURLNormalizers(String scope)
TODO : not implemented
-
findByClassName
final CrawlUrlNormalizer findByClassName(String name)
-
normalize
@JvmOverloads() final String normalize(String url, String scope)
Normalizes the given URL using the normalizers of the given scope.
-
normalize
@JvmOverloads() final String normalize(String url)
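Normalizes the given URL using the default value of the scope parameter (this overload is generated by @JvmOverloads).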