Package 

Interface ScoringFilter

  • All Superinterfaces:
    ai.platon.pulsar.common.config.Parameterized

    
    public interface ScoringFilter
     extends Parameterized
                        

    A contract defining the behavior of scoring plugins.

    A scoring filter manipulates scoring variables in CrawlDatum and in the resulting search indexes. Filters can be chained in a specific order to provide multi-stage scoring adjustments.

    • Method Summary

      Modifier and Type | Method | Description
      Unit | injectedScore(WebPage page) | Set an initial score for newly injected pages.
      Unit | initialScore(WebPage page) | Set an initial score for newly discovered pages.
      ScoreVector | generatorSortValue(WebPage page, ScoreVector initSort) | This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
      Unit | distributeScoreToOutlinks(WebPage page, WebGraph graph, Collection<WebEdge> outgoingEdges, Integer allCount) | Distribute score value from the current page to all its outlinked pages.
      Unit | updateScore(WebPage page, WebGraph graph, Collection<WebEdge> incomingEdges) | This method calculates a new score during table update, based on the values contributed by inlinked pages.
      Unit | updateContentScore(WebPage page) |
      Float | indexerScore(String url, IndexDocument doc, WebPage page, Float initScore) | This method calculates a Lucene document boost.
      • Methods inherited from interface ai.platon.pulsar.common.config.Parameterized

        getParams
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • injectedScore

         Unit injectedScore(WebPage page)

        Set an initial score for newly injected pages. Note: newly injected pages may have no inlinks, so filter implementations may wish to set this score to a non-zero value, to give newly injected pages some initial credit.

        Parameters:
        page - new page.
      • initialScore

         Unit initialScore(WebPage page)

        Set an initial score for newly discovered pages. Note: a newly discovered page has at least one inlink, which carries a score contribution. Filter implementations may therefore set the initial score to zero (unknown value) and let the inlink score contributions establish the "real" value of the new page.

        Parameters:
        page - page row.
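
        The difference between the two initial-score hooks can be sketched as follows. This is a minimal illustration, not the library's implementation: PageStub is a hypothetical stand-in for WebPage, the methods are static for brevity, and INJECTED_SCORE is an assumed constant.

```java
/** Sketch of the two initial-score hooks; not the real ScoringFilter API. */
final class InitialScoreSketch {
    // Assumed constant: credit given to injected pages, which have no inlinks yet.
    static final float INJECTED_SCORE = 1.0f;

    /** Injected pages get a non-zero score so they carry some initial credit. */
    static void injectedScore(PageStub page) {
        page.score = INJECTED_SCORE;
    }

    /** Discovered pages start at zero; inlink contributions supply the real value later. */
    static void initialScore(PageStub page) {
        page.score = 0.0f;
    }

    public static void main(String[] args) {
        PageStub seed = new PageStub("https://example.com/");
        PageStub found = new PageStub("https://example.com/about");
        injectedScore(seed);
        initialScore(found);
        System.out.println(seed.url + " -> " + seed.score);
        System.out.println(found.url + " -> " + found.score);
    }
}

/** Hypothetical stand-in for WebPage: only a URL and a mutable score. */
final class PageStub {
    final String url;
    float score;

    PageStub(String url) {
        this.url = url;
    }
}
```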
      • generatorSortValue

         ScoreVector generatorSortValue(WebPage page, ScoreVector initSort)

        Prepares a sort value used to sort pages and select the top N scoring pages during fetch list generation.

        Parameters:
        page - page row.
        initSort - initial sort value, or the value returned by previous filters in the chain
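
        The way initSort is threaded through a chain of filters can be sketched as below. This is a simplified illustration under stated assumptions: SortValueFilter is a hypothetical single-method interface, and plain float values stand in for both WebPage and ScoreVector.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of chained sort-value computation; all types here are hypothetical stand-ins. */
final class SortValueChainSketch {
    /** Simplified stand-in for one filter's generatorSortValue (float instead of ScoreVector). */
    interface SortValueFilter {
        float generatorSortValue(float pageScore, float initSort);
    }

    /** Runs the filters in order, feeding each result into the next filter as initSort. */
    static float apply(List<SortValueFilter> chain, float pageScore, float initSort) {
        float sort = initSort;
        for (SortValueFilter filter : chain) {
            sort = filter.generatorSortValue(pageScore, sort);
        }
        return sort;
    }

    public static void main(String[] args) {
        SortValueFilter byScore = (pageScore, initSort) -> pageScore * initSort;
        SortValueFilter bonus = (pageScore, initSort) -> initSort + 0.5f;
        // (2.0 * 1.0) + 0.5
        System.out.println(apply(Arrays.asList(byScore, bonus), 2.0f, 1.0f)); // prints 2.5
    }
}
```

        Because each filter receives the previous result, the order of the chain matters: swapping the two example filters gives (1.0 + 0.5) * 2.0 instead.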
      • distributeScoreToOutlinks

         Unit distributeScoreToOutlinks(WebPage page, WebGraph graph, Collection<WebEdge> outgoingEdges, Integer allCount)

        Distribute score value from the current page to all its outlinked pages.

        Parameters:
        page - page row
        allCount - number of all collected outlinks from the source page
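
        One plausible distribution policy is an equal split, sketched below. This is an assumption for illustration only; the interface does not prescribe any formula. EdgeStub is a hypothetical stand-in for WebEdge, and the page is reduced to its score.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of an equal-split score distribution; not the library's implementation. */
final class DistributeScoreSketch {
    /** Hypothetical stand-in for WebEdge: a target URL plus the score carried along the link. */
    static final class EdgeStub {
        final String targetUrl;
        float score;

        EdgeStub(String targetUrl) {
            this.targetUrl = targetUrl;
        }
    }

    /**
     * Gives each outgoing edge an equal share of the page score.
     * The divisor is allCount (all collected outlinks), which may be larger
     * than the number of edges actually passed in.
     */
    static void distributeScoreToOutlinks(float pageScore, List<EdgeStub> outgoingEdges, int allCount) {
        if (allCount <= 0) {
            return; // no outlinks were collected, so there is nothing to share
        }
        float share = pageScore / allCount;
        for (EdgeStub edge : outgoingEdges) {
            edge.score = share;
        }
    }

    public static void main(String[] args) {
        List<EdgeStub> edges = Arrays.asList(new EdgeStub("https://example.com/a"),
                                             new EdgeStub("https://example.com/b"));
        // Four outlinks were collected, but only two edges are passed on.
        distributeScoreToOutlinks(1.0f, edges, 4);
        for (EdgeStub edge : edges) {
            System.out.println(edge.targetUrl + " -> " + edge.score);
        }
    }
}
```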
      • updateScore

         Unit updateScore(WebPage page, WebGraph graph, Collection<WebEdge> incomingEdges)

        This method calculates a new score during table update, based on the values contributed by inlinked pages.

        Parameters:
        page - page row
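
        A simple additive update, where the new score is the current score plus the sum of the inlink contributions, might look like the sketch below. The additive rule is an assumption, and the list of floats is a hypothetical stand-in for the scores carried by the incoming WebEdge objects.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of an additive score update; not the library's implementation. */
final class UpdateScoreSketch {
    /** Returns the current score plus the sum of all inlink contributions. */
    static float updateScore(float currentScore, List<Float> inlinkContributions) {
        float added = 0.0f;
        for (float contribution : inlinkContributions) {
            added += contribution;
        }
        return currentScore + added;
    }

    public static void main(String[] args) {
        // A newly discovered page (score 0) with two inlinks.
        System.out.println(updateScore(0.0f, Arrays.asList(0.25f, 0.5f))); // prints 0.75
    }
}
```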
      • indexerScore

         Float indexerScore(String url, IndexDocument doc, WebPage page, Float initScore)

        This method calculates a Lucene document boost.

        Parameters:
        url - URL of the page
        doc - document.
        page - page row
        initScore - initial boost value for the Lucene document.
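
        As an illustration, a filter could scale the incoming boost by a damped page score. The formula and the SCORE_POWER exponent below are assumptions chosen for the sketch, and the url, doc, and page parameters are reduced to a single pageScore value.

```java
/** Sketch of a document boost computation; not the library's implementation. */
final class IndexerScoreSketch {
    // Assumed tuning knob: an exponent below 1 dampens very large page scores.
    static final float SCORE_POWER = 0.5f;

    /** Multiplies the incoming boost by pageScore raised to SCORE_POWER. */
    static float indexerScore(float pageScore, float initScore) {
        // Clamp to zero so a negative score cannot produce NaN under a fractional power.
        float safeScore = Math.max(pageScore, 0.0f);
        return (float) Math.pow(safeScore, SCORE_POWER) * initScore;
    }

    public static void main(String[] args) {
        System.out.println(indexerScore(4.0f, 1.0f));
        System.out.println(indexerScore(4.0f, 3.0f));
    }
}
```

        Since initScore may already hold a value produced by earlier filters, multiplying rather than overwriting keeps their adjustments in the result.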