Package 

Class TextProfileSignature


    public final class TextProfileSignature
    extends Signature
                        

    An implementation of a page signature. It calculates an MD5 hash of a plain text "profile" of a page. If the page has no text, it calculates a hash using the MD5Signature.

    The algorithm to calculate a page "profile" takes the plain text version of a page and performs the following steps:

    • remove all characters except letters and digits, and bring all characters to lower case,

    • split the text into tokens (all consecutive non-whitespace characters),

    • discard tokens equal to or shorter than MIN_TOKEN_LEN (default 2 characters),

    • sort the list of tokens by decreasing frequency,

    • round down the counts of tokens to the nearest multiple of QUANT (QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default and maxFreq is the maximum token frequency). If maxFreq is higher than 1, then QUANT is always at least 2, which means that tokens with frequency 1 are always discarded,

    • discard tokens whose frequency after quantization falls below QUANT,

    • create a list of tokens and their quantized frequency, separated by spaces, in the order of decreasing frequency.

    This list is then submitted to an MD5 hash calculation.
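
    The steps above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the class's actual implementation: the class name TextProfileSketch and its methods are hypothetical, characters other than letters and digits are treated as token boundaries, QUANT is given a floor of 2 when maxFreq is higher than 1 (and 1 otherwise), ties in frequency are broken alphabetically to keep the output deterministic, and the fallback to MD5Signature for pages without text is not modelled.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the documented API.
class TextProfileSketch {
    // Defaults quoted in the description above.
    static final int MIN_TOKEN_LEN = 2;
    static final float QUANT_RATE = 0.01f;

    /** Builds the "token frequency" profile string for a plain text. */
    static String profile(String text) {
        // Keep letters and digits (lower-cased); anything else ends the
        // current token (assumption). Drop tokens of length <= MIN_TOKEN_LEN.
        Map<String, Integer> counts = new HashMap<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
            } else {
                if (cur.length() > MIN_TOKEN_LEN) {
                    counts.merge(cur.toString(), 1, Integer::sum);
                }
                cur.setLength(0);
            }
        }

        // QUANT = QUANT_RATE * maxFreq, with an assumed floor of 2
        // when maxFreq > 1 and of 1 otherwise.
        int maxFreq = 0;
        for (int n : counts.values()) {
            maxFreq = Math.max(maxFreq, n);
        }
        int quant = Math.round(QUANT_RATE * maxFreq);
        if (quant < 2) {
            quant = maxFreq > 1 ? 2 : 1;
        }

        // Round each count down to a multiple of QUANT and discard
        // tokens whose quantized count falls below QUANT.
        Map<String, Integer> quantized = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q >= quant) {
                quantized.put(e.getKey(), q);
            }
        }

        // Order by decreasing frequency; alphabetical tie-break (assumption).
        List<String> tokens = new ArrayList<>(quantized.keySet());
        tokens.sort((a, b) -> {
            int byFreq = Integer.compare(quantized.get(b), quantized.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });

        // Emit "token frequency" pairs separated by spaces.
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t).append(' ').append(quantized.get(t));
        }
        return sb.toString();
    }

    /** Hashes the profile string with MD5 (16 bytes). */
    static byte[] signature(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(profile(text).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

    With these assumptions, the text "The cat sat. The cat ran; the CAT!" yields the profile "cat 2 the 2": maxFreq is 3, QUANT is raised to 2, the counts of "the" and "cat" are rounded down from 3 to 2, and the single-occurrence tokens "sat" and "ran" are discarded. Two pages that differ only in rare words therefore produce the same profile and the same hash.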

    • Method Summary

      Modifier and Type    Method                     Description
      ByteArray            calculate(WebPage page)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • TextProfileSignature

        TextProfileSignature(ImmutableConfig conf)