-
public class EncodingDetector
A simple class for detecting character encodings.
Broadly, this encompasses two distinct functions:
1. Auto-detecting a set of "clues" from the input text.
2. Taking a set of clues and making a "best guess" as to the "real" encoding.
A caller will often have extra information about what the encoding might be (e.g. from the HTTP header or HTML meta tags; these are often wrong but still potentially useful clues). The types of clues may differ from caller to caller. Thus a typical calling sequence is:
- Run step (1) to generate a set of auto-detected clues;
- Combine these clues with the caller-dependent "extra clues" available;
- Run step (2) to guess the most probable answer.
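The calling sequence above can be sketched with a small stand-in class. Everything in it is hypothetical: `ClueFlowSketch`, its `Clue` type, the `NO_CONFIDENCE` marker and, in particular, the selection rule (highest-confidence clue at or above the threshold, then the first clue without a confidence score, then the default) are assumptions made for illustration, not the documented behaviour of EncodingDetector.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that mimics the clue flow; NOT the real EncodingDetector. */
class ClueFlowSketch {
    /** Assumed marker for clues that carry no confidence score (e.g. an HTTP header). */
    static final int NO_CONFIDENCE = -1;

    /** A candidate encoding, where it came from, and how confident the source is. */
    static final class Clue {
        final String value;
        final String source;
        final int confidence;

        Clue(String value, String source, int confidence) {
            this.value = value;
            this.source = source;
            this.confidence = confidence;
        }
    }

    private final List<Clue> clues = new ArrayList<>();

    /** Records a clue; empty values are ignored. */
    void addClue(String value, String source, int confidence) {
        if (value != null && !value.isEmpty()) {
            clues.add(new Clue(value, source, confidence));
        }
    }

    /** Step (2) under the assumed rule described in the lead-in. */
    String guess(String defaultValue, int minConfidence) {
        Clue best = null;
        for (Clue c : clues) {
            boolean scored = c.confidence != NO_CONFIDENCE;
            if (scored && c.confidence >= minConfidence
                    && (best == null || c.confidence > best.confidence)) {
                best = c;
            }
        }
        if (best != null) {
            return best.value;
        }
        // No detected clue was confident enough: fall back to caller-supplied clues.
        for (Clue c : clues) {
            if (c.confidence == NO_CONFIDENCE) {
                return c.value;
            }
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        ClueFlowSketch detector = new ClueFlowSketch();
        detector.addClue("windows-1252", "detect", 40);        // step (1): auto-detected clue
        detector.addClue("utf-8", "header", NO_CONFIDENCE);    // caller's extra clue
        System.out.println(detector.guess("iso-8859-1", 50));  // step (2)
    }
}
```

Keeping clue collection separate from guessing lets each caller mix in whatever extra clues it has before the final decision is made.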
TODO: Use Tika's EncodingDetector
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
public class | EncodingDetector.EncodingClue |
-
Field Summary
Fields
Modifier and Type | Field | Description
public final static Logger | LOG |
public final static int | NO_THRESHOLD |
public final static String | MIN_CONFIDENCE_KEY |
private final List<EncodingDetector.EncodingClue> | clues |
private int | minConfidence |
private String | defaultCharEncoding |
-
Constructor Summary
Constructors
Constructor | Description
EncodingDetector() |
EncodingDetector(ImmutableConfig conf) |
-
Method Summary
Modifier and Type | Method | Description
List<EncodingDetector.EncodingClue> | getClues() |
int | getMinConfidence() |
void | setMinConfidence(int minConfidence) |
String | getDefaultCharEncoding() |
void | setDefaultCharEncoding(String defaultCharEncoding) |
static String | resolveEncodingAlias(String encoding) |
static String | parseCharacterEncoding(CharSequence contentTypeUtf8) | Parse the character encoding from the specified content type header.
String | sniffEncoding(WebPage page) |
String | getCluesAsString() |
void | autoDetectClues(WebPage page, boolean filter) |
String | sniffCharacterEncoding(Array<byte> content) | Given a byte[] representing an html file of an unknown encoding, read out 'charset' parameter in the meta tag from the first CHUNK_SIZE bytes.
void | addClue(String value, String source) |
String | guessEncoding(WebPage page, String defaultValue) | Guess the encoding with the previously specified list of clues.
void | clearClues() | Clears all clues.
-
Method Detail
-
getClues
List<EncodingDetector.EncodingClue> getClues()
-
getMinConfidence
int getMinConfidence()
-
setMinConfidence
void setMinConfidence(int minConfidence)
-
getDefaultCharEncoding
String getDefaultCharEncoding()
-
setDefaultCharEncoding
void setDefaultCharEncoding(String defaultCharEncoding)
-
resolveEncodingAlias
static String resolveEncodingAlias(String encoding)
-
parseCharacterEncoding
static String parseCharacterEncoding(CharSequence contentTypeUtf8)
Parse the character encoding from the specified content type header. If the content type is null, or there is no explicit character encoding, null is returned.
This method was copied from org.apache.catalina.util.RequestUtil, which is licensed under the Apache License, Version 2.0 (the "License").
- Parameters:
contentTypeUtf8 - utf8 encoded content
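As an illustration of the behaviour described above, here is a minimal sketch of pulling the charset parameter out of a Content-Type header. The class name `ContentTypeCharsetSketch` and the `String` parameter are assumptions for this example; the logic follows the general shape of such a parser and may differ in detail from the method documented here (for instance, the `charset=` match below is case-sensitive).

```java
/** Illustrative sketch only; not the actual EncodingDetector code. */
class ContentTypeCharsetSketch {

    /** Returns the charset named in a Content-Type header, or null if none is given. */
    static String parse(String contentType) {
        if (contentType == null) {
            return null;
        }
        int start = contentType.indexOf("charset=");
        if (start < 0) {
            return null; // no explicit character encoding
        }
        String encoding = contentType.substring(start + "charset=".length());
        // Cut off any parameters that follow the charset.
        int end = encoding.indexOf(';');
        if (end >= 0) {
            encoding = encoding.substring(0, end);
        }
        encoding = encoding.trim();
        // Strip surrounding double quotes, e.g. charset="utf-8".
        if (encoding.length() > 2 && encoding.startsWith("\"") && encoding.endsWith("\"")) {
            encoding = encoding.substring(1, encoding.length() - 1);
        }
        return encoding.trim();
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html; charset=UTF-8"));
        System.out.println(parse("text/html"));
    }
}
```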
-
sniffEncoding
String sniffEncoding(WebPage page)
-
getCluesAsString
String getCluesAsString()
-
autoDetectClues
void autoDetectClues(WebPage page, boolean filter)
-
sniffCharacterEncoding
String sniffCharacterEncoding(Array<byte> content)
Given a byte[] representing an html file of an unknown encoding, read out 'charset' parameter in the meta tag from the first CHUNK_SIZE bytes. If there's no meta tag for Content-Type or no charset is specified, the content is checked for a Unicode Byte Order Mark (BOM). This will also cover non-byte oriented character encodings (UTF-16 only). If no character set can be determined, null is returned.
See also http://www.w3.org/International/questions/qa-html-encoding-declarations, http://www.w3.org/TR/2011/WD-html5-diff-20110405/#character-encoding, and http://www.w3.org/TR/REC-xml/#sec-guessing
- Parameters:
content - byte[] representation of an html file
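The two-stage lookup described above (meta tag first, then BOM) might look roughly like the following sketch. The class name `MetaCharsetSniffSketch`, the `CHUNK_SIZE` value of 2000, the single combined regular expression, and the charset names returned for each BOM are all assumptions chosen for illustration; the real method may use different patterns and names.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the actual EncodingDetector code. */
class MetaCharsetSniffSketch {
    /** Assumed number of leading bytes to inspect. */
    static final int CHUNK_SIZE = 2000;

    /** Loosely matches charset=... inside a meta tag (http-equiv form or HTML5 form). */
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the sniffed charset name, or null if none can be determined. */
    static String sniff(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);

        // Meta tags are plain ASCII in any ASCII-compatible encoding,
        // so an ASCII decode of the first chunk is enough to find them.
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher matcher = META_CHARSET.matcher(head);
        if (matcher.find()) {
            return matcher.group(1);
        }

        // No usable meta tag: fall back to a Byte Order Mark check.
        if (length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (length >= 2) {
            int b0 = content[0] & 0xFF;
            int b1 = content[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE";
            }
        }
        return null;
    }
}
```

The BOM fallback is what makes UTF-16 input detectable: in UTF-16 every ASCII character is interleaved with zero bytes, so the meta-tag pattern cannot match the raw bytes.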
-
guessEncoding
String guessEncoding(WebPage page, String defaultValue)
Guess the encoding with the previously specified list of clues.
- Parameters:
page - URL's row
defaultValue - Default encoding to return if no encoding can be detected with enough confidence.
-
clearClues
void clearClues()
Clears all clues.