public final class TextProfileSignature extends Signature
An implementation of a page signature. It calculates an MD5 hash of a plain text "profile" of a page. If the page has no text, it calculates a hash using MD5Signature.
The algorithm to calculate a page "profile" takes the plain text version of a page and performs the following steps:
remove all characters except letters and digits, and convert all characters to lower case,
split the text into tokens (all consecutive non-whitespace characters),
discard tokens of length equal to or shorter than MIN_TOKEN_LEN (default 2 characters),
sort the list of tokens by decreasing frequency,
round down the counts of tokens to the nearest multiple of QUANT (QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default and maxFreq is the maximum token frequency). If maxFreq is higher than 1, then QUANT is always at least 2 (which means that tokens with frequency 1 are always discarded),
discard tokens whose frequency after quantization falls below QUANT,
create a list of tokens and their quantized frequency, separated by spaces, in the order of decreasing frequency.
This list is then submitted to an MD5 hash calculation.
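The steps above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the TextProfileSignature source: the class name `TextProfileSketch` and the methods `profile` and `md5Hex` are hypothetical, QUANT is assumed to be rounded to an integer and raised to 2 when maxFreq is above 1 (1 otherwise), ties in frequency are broken alphabetically to keep the output deterministic, and entries in the final list are joined by single spaces.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and tie-breaking are assumptions.
class TextProfileSketch {
    static final int MIN_TOKEN_LEN = 2;     // default named in the description
    static final float QUANT_RATE = 0.01f;  // default named in the description

    /** Builds the "token frequency" list that is later hashed. */
    static String profile(String text) {
        // Keep letters and digits (lower-cased); keep whitespace as a separator.
        StringBuilder cleaned = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                cleaned.append(Character.toLowerCase(c));
            } else if (Character.isWhitespace(c)) {
                cleaned.append(' ');
            }
        }

        // Split on whitespace, drop tokens of length <= MIN_TOKEN_LEN, count the rest.
        Map<String, Integer> counts = new HashMap<>();
        int maxFreq = 0;
        for (String token : cleaned.toString().split(" +")) {
            if (token.length() > MIN_TOKEN_LEN) {
                maxFreq = Math.max(maxFreq, counts.merge(token, 1, Integer::sum));
            }
        }

        // QUANT = QUANT_RATE * maxFreq; the lower bound of 2 (or 1) is an assumption.
        int quant = Math.round(QUANT_RATE * maxFreq);
        if (quant < 2) {
            quant = maxFreq > 1 ? 2 : 1;
        }

        // Round each count down to a multiple of QUANT; drop counts below QUANT.
        Map<String, Integer> quantized = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q >= quant) {
                quantized.put(e.getKey(), q);
            }
        }

        // Order by decreasing frequency; alphabetical tie-break is an assumption.
        List<String> tokens = new ArrayList<>(quantized.keySet());
        tokens.sort((a, b) -> {
            int byFreq = Integer.compare(quantized.get(b), quantized.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });

        StringBuilder out = new StringBuilder();
        for (String t : tokens) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(t).append(' ').append(quantized.get(t));
        }
        return out.toString();
    }

    /** Hashes the profile with MD5 and returns it as 32 hex digits. */
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With these assumptions, `TextProfileSketch.profile("The quick fox. The QUICK dog! the end, a fox")` yields `fox 2 quick 2 the 2`: the one-letter token is dropped for length, QUANT is 2, the count of 3 rounds down to 2, and the tokens that occur once are discarded. Two pages that differ only in rare words, punctuation, or letter case therefore share a profile and get the same MD5 value.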
Nested Class Summary
Nested Classes
Modifier and Type    Class                             Description
public class         TextProfileSignature.Companion
Constructor Summary
Constructors
Constructor                                   Description
TextProfileSignature(ImmutableConfig conf)