-
public final class PrimerParser
A very simple DOM parser.
A collection of methods for extracting content from DOM trees.
This class holds a few utility methods for pulling content out of DOM nodes, such as collectLinks and getPageText.
-
-
Field Summary
| Modifier and Type | Field | Description |
| --- | --- | --- |
| private final ImmutableConfig | conf | |
-
Constructor Summary
| Constructor | Description |
| --- | --- |
| PrimerParser(ImmutableConfig conf) | |
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| final ImmutableConfig | getConf() | |
| final Unit | detectEncoding(WebPage page) | |
| final ParseContext | parseHTMLDocument(WebPage page) | |
| final Boolean | getPageText(StringBuilder sb, Node root, Boolean abortOnNestedAnchors) | This method takes a StringBuilder and a DOM Node, and appends all the content text found beneath the DOM node to the StringBuilder. |
| final Unit | getPageText(StringBuilder sb, Node root) | This is a convenience method, equivalent to calling the three-argument getPageText. |
| final String | getPageText(Node root) | |
| final String | getPageTitle(Node root) | |
| final Map<String, String> | getMetadata(Node root) | |
| final URL | getBaseURLFromTag(Node root) | If Node contains a BASE tag then its HREF is returned. |
| final Set<HyperlinkPersistable> | collectLinks(URL base, Node root) | This method finds all anchors below the supplied DOM root, creates appropriate HyperlinkPersistable records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList. |
| final Set<HyperlinkPersistable> | collectLinks(URL base, Node root, CrawlFilters crawlFilters) | |
| final Set<HyperlinkPersistable> | collectLinks(URL base, Set<HyperlinkPersistable> hyperlinks, Node root, CrawlFilters crawlFilters) | |
-
Method Detail
-
getConf
final ImmutableConfig getConf()
-
detectEncoding
final Unit detectEncoding(WebPage page)
-
parseHTMLDocument
final ParseContext parseHTMLDocument(WebPage page)
-
getPageText
final Boolean getPageText(StringBuilder sb, Node root, Boolean abortOnNestedAnchors)
This method takes a StringBuilder and a DOM Node, and appends all the content text found beneath the DOM node to the StringBuilder.
If abortOnNestedAnchors is true, DOM traversal is aborted when a nested anchor is found, and the StringBuilder will not contain any text encountered after it.
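The traversal described above can be illustrated with a minimal sketch built only on the JDK's org.w3c.dom API. This is not the PrimerParser implementation: the class PageTextSketch, its methods appendText and parse, the whitespace handling, and the meaning of the boolean return value (true when traversal was aborted) are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical helper, not part of PrimerParser.
final class PageTextSketch {

    /**
     * Appends the text found beneath node to sb.
     * Assumption: returns true if traversal stopped at a nested anchor.
     */
    static boolean appendText(StringBuilder sb, Node node, boolean abortOnNestedAnchors) {
        return walk(sb, node, abortOnNestedAnchors, 0);
    }

    private static boolean walk(StringBuilder sb, Node node, boolean abort, int anchorDepth) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            anchorDepth++;
            // An anchor inside another anchor: stop before reading its text.
            if (abort && anchorDepth > 1) {
                return true;
            }
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (walk(sb, child, abort, anchorDepth)) {
                return true; // propagate the abort to the caller
            }
        }
        return false;
    }

    /** Parses a well-formed XHTML fragment and returns its root element. */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }
}
```

With abortOnNestedAnchors set to false, the same walk visits every text node and the sketch returns false.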
-
getPageText
final Unit getPageText(StringBuilder sb, Node root)
This is a convenience method, equivalent to calling the three-argument getPageText.
-
getPageText
final String getPageText(Node root)
-
getPageTitle
final String getPageTitle(Node root)
-
getMetadata
final Map<String, String> getMetadata(Node root)
-
getBaseURLFromTag
final URL getBaseURLFromTag(Node root)
If Node contains a BASE tag then its HREF is returned.
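As a rough illustration of this lookup, the sketch below searches a DOM subtree depth-first for the first BASE element and converts its HREF attribute to a URL. The class BaseTagSketch and its methods are hypothetical, and returning null when no usable BASE tag exists is an assumption; the documentation above does not say what getBaseURLFromTag returns in that case.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical helper, not part of PrimerParser.
final class BaseTagSketch {

    /** Returns the HREF of the first BASE element under root, or null (assumed behavior). */
    static URL findBaseHref(Node root) {
        if (root.getNodeType() == Node.ELEMENT_NODE
                && "base".equalsIgnoreCase(root.getNodeName())) {
            NamedNodeMap attrs = root.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                // Match the attribute name case-insensitively (HREF or href).
                if ("href".equalsIgnoreCase(attr.getNodeName())) {
                    try {
                        return new URI(attr.getNodeValue().trim()).toURL();
                    } catch (Exception e) {
                        return null; // malformed or non-absolute HREF
                    }
                }
            }
        }
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            URL found = findBaseHref(child);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Parses a well-formed XHTML fragment and returns its root element. */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }
}
```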
-
collectLinks
final Set<HyperlinkPersistable> collectLinks(URL base, Node root)
This method finds all anchors below the supplied DOM root, creates appropriate HyperlinkPersistable records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc.) are discarded, as are links which contain only single nested links and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
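The filtering rules above can be sketched as follows. Because the HyperlinkPersistable and CrawlFilters types are not described here, this illustration collects resolved URL strings instead; the class CollectLinksSketch, its method names, and the exact reading of "empty text nodes" as whitespace-only text are assumptions, not the library's behavior.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of PrimerParser.
final class CollectLinksSketch {

    /** Collects anchor targets below root, resolved against base, in document order. */
    static Set<String> collect(URI base, Node root) {
        Set<String> links = new LinkedHashSet<>();
        walk(base, root, links);
        return links;
    }

    private static void walk(URI base, Node node, Set<String> links) {
        if (isAnchor(node) && !isDiscarded(node)) {
            String href = ((Element) node).getAttribute("href").trim();
            if (!href.isEmpty()) {
                try {
                    links.add(base.resolve(href).toString());
                } catch (IllegalArgumentException e) {
                    // Unparseable href: skip this anchor.
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(base, child, links);
        }
    }

    private static boolean isAnchor(Node node) {
        return node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName());
    }

    /** True for anchors with no children, or with one nested anchor plus only blank text. */
    private static boolean isDiscarded(Node anchor) {
        if (!anchor.hasChildNodes()) {
            return true;
        }
        int nestedAnchors = 0;
        for (Node c = anchor.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isAnchor(c)) {
                nestedAnchors++;
            } else if (c.getNodeType() != Node.TEXT_NODE || !c.getNodeValue().trim().isEmpty()) {
                return false; // real content: keep the link
            }
        }
        return nestedAnchors == 1;
    }

    /** Parses a well-formed XHTML fragment and returns its root element. */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }
}
```

A LinkedHashSet is used in the sketch so that duplicates are dropped while document order is preserved.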
-
collectLinks
final Set<HyperlinkPersistable> collectLinks(URL base, Node root, CrawlFilters crawlFilters)
-
collectLinks
final Set<HyperlinkPersistable> collectLinks(URL base, Set<HyperlinkPersistable> hyperlinks, Node root, CrawlFilters crawlFilters)
-