public class CrawlFilter.Companion
-
Field Summary
Fields
Modifier and Type                            Field                       Description
private final Logger                         LOG
private final Array<String>                  MEDIA_URL_SUFFIXES
private final Array<Pattern>                 INDEX_PAGE_URL_PATTERNS
private final Pattern                        SEARCH_PAGE_URL_PATTERN
private final Array<Pattern>                 DETAIL_PAGE_URL_PATTERNS
private final Pattern                        MEDIA_PAGE_URL_PATTERN
public final static CrawlFilter.Companion    INSTANCE
-
Method Summary
Modifier and Type, Method, Description
final PageCategory getPageCategory(String url)
final PageCategory guessPageCategory(String url)
    A simple regex rule to sniff the possible category of a web page.
final Boolean keyGreaterEqual(String test, String bound)
final Boolean keyLessEqual(String test, String bound)
final Logger getLOG()
final Array<String> getMEDIA_URL_SUFFIXES()
    TODO: use suffix-urlfilter instead
final Array<Pattern> getINDEX_PAGE_URL_PATTERNS()
    The following patterns are simple rules that indicate a URL's category. This is a very simple solution, and the result is not accurate.
final Pattern getSEARCH_PAGE_URL_PATTERN()
final Array<Pattern> getDETAIL_PAGE_URL_PATTERNS()
final Pattern getMEDIA_PAGE_URL_PATTERN()
-
Method Detail
-
getPageCategory
final PageCategory getPageCategory(String url)
-
guessPageCategory
final PageCategory guessPageCategory(String url)
Guesses the likely category of a web page by applying simple regex rules to its URL.
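The page does not show the actual patterns or the values of PageCategory, so the following is only a rough sketch of how regex-based URL sniffing of this kind might look. The Category enum, the pattern strings, the suffix list, and the order in which the rules are tried are all assumptions made for illustration; they are not taken from CrawlFilter.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class PageCategorySketch {
    // Hypothetical stand-in for the library's PageCategory type.
    enum Category { INDEX, SEARCH, DETAIL, MEDIA, UNKNOWN }

    // Illustrative values only; the real MEDIA_URL_SUFFIXES are not shown on this page.
    static final String[] MEDIA_SUFFIXES = { ".jpg", ".png", ".gif", ".mp3", ".mp4" };

    // Illustrative patterns only; the real *_PAGE_URL_PATTERN(S) are not shown on this page.
    static final Pattern SEARCH_PATTERN = Pattern.compile("[?&](q|query|keyword)=");
    static final Pattern[] DETAIL_PATTERNS = {
        Pattern.compile("/(item|detail|article)s?/\\d+"),
    };
    static final Pattern[] INDEX_PATTERNS = {
        Pattern.compile("/(index|list|category)([/._-]|$)"),
    };

    // Tries each rule in turn and returns the first category whose rule matches.
    static Category guess(String url) {
        if (url == null || url.isEmpty()) {
            return Category.UNKNOWN;
        }
        String lower = url.toLowerCase(Locale.ROOT);
        for (String suffix : MEDIA_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return Category.MEDIA;
            }
        }
        if (SEARCH_PATTERN.matcher(lower).find()) {
            return Category.SEARCH;
        }
        for (Pattern p : DETAIL_PATTERNS) {
            if (p.matcher(lower).find()) {
                return Category.DETAIL;
            }
        }
        for (Pattern p : INDEX_PATTERNS) {
            if (p.matcher(lower).find()) {
                return Category.INDEX;
            }
        }
        return Category.UNKNOWN;
    }
}
```

Because rules like these look only at the URL string, a page whose URL does not follow the expected naming scheme falls through to UNKNOWN or lands in the wrong category, which is why the result should be treated as a guess.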
-
keyGreaterEqual
final Boolean keyGreaterEqual(String test, String bound)
-
keyLessEqual
final Boolean keyLessEqual(String test, String bound)
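Neither keyGreaterEqual nor keyLessEqual is described on this page. A plausible reading, and only an assumption, is that they compare a key against a bound lexicographically, which is the usual way to test whether a row key falls inside a scan range. The sketch below assumes String.compareTo semantics; the inRange helper is hypothetical and not part of CrawlFilter.

```java
class KeyRangeSketch {
    // Assumed semantics: true when test sorts at or after bound.
    static boolean keyGreaterEqual(String test, String bound) {
        return test.compareTo(bound) >= 0;
    }

    // Assumed semantics: true when test sorts at or before bound.
    static boolean keyLessEqual(String test, String bound) {
        return test.compareTo(bound) <= 0;
    }

    // Hypothetical helper: checks that key lies in the closed range [start, end].
    static boolean inRange(String key, String start, String end) {
        return keyGreaterEqual(key, start) && keyLessEqual(key, end);
    }
}
```

Under these assumptions, inRange("com.example/b", "com.example/a", "com.example/c") is true, while a key that sorts before the start bound is rejected.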
-
getLOG
final Logger getLOG()
-
getMEDIA_URL_SUFFIXES
final Array<String> getMEDIA_URL_SUFFIXES()
TODO: use suffix-urlfilter instead
-
getINDEX_PAGE_URL_PATTERNS
final Array<Pattern> getINDEX_PAGE_URL_PATTERNS()
The following patterns are simple rules that indicate a URL's category. This is a very simple solution, and the result is not accurate.
-
getSEARCH_PAGE_URL_PATTERN
final Pattern getSEARCH_PAGE_URL_PATTERN()
-
getDETAIL_PAGE_URL_PATTERNS
final Array<Pattern> getDETAIL_PAGE_URL_PATTERNS()
-
getMEDIA_PAGE_URL_PATTERN
final Pattern getMEDIA_PAGE_URL_PATTERN()