Class EncodingDetector


    public class EncodingDetector

    A simple class for detecting character encodings.

    Broadly this encompasses two functions, which are distinctly separate:

    1. Auto-detecting a set of "clues" from input text.
    2. Taking a set of clues and making a "best guess" as to the "real" encoding.

    A caller will often have some extra information about what the encoding might be (e.g. from the HTTP header or HTML meta-tags, often wrong but still potentially useful clues). The types of clues may differ from caller to caller. Thus a typical calling sequence is:

    • Run step (1) to generate a set of auto-detected clues;
    • Combine these clues with the caller-dependent "extra clues" available;
    • Run step (2) to guess what the most probable answer is.

    TODO: Use Tika's EncodingDetector
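
The calling sequence above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the class's actual implementation: the `ClueSketch` class, the `Clue` type, its `confidence` field, and the `minConfidence` threshold are all hypothetical names introduced here. It assumes each clue carries a numeric confidence and that the best guess is simply the most confident clue above a threshold.

```java
import java.util.ArrayList;
import java.util.List;

public class ClueSketch {

    /** Hypothetical clue: a candidate encoding, its origin, and a confidence score. */
    public static final class Clue {
        final String encoding;
        final String source;
        final int confidence;

        public Clue(String encoding, String source, int confidence) {
            this.encoding = encoding;
            this.source = source;
            this.confidence = confidence;
        }
    }

    /**
     * Returns the encoding of the most confident clue whose confidence is at
     * least minConfidence, or defaultValue if no clue qualifies.
     */
    public static String guess(List<Clue> clues, String defaultValue, int minConfidence) {
        Clue best = null;
        for (Clue c : clues) {
            if (c.confidence >= minConfidence
                    && (best == null || c.confidence > best.confidence)) {
                best = c;
            }
        }
        return best == null ? defaultValue : best.encoding;
    }

    public static void main(String[] args) {
        List<Clue> clues = new ArrayList<>();
        // Step (1): clues auto-detected from the input text (values made up).
        clues.add(new Clue("windows-1252", "detect", 60));
        // Caller-dependent extra clue, e.g. taken from an HTTP header.
        clues.add(new Clue("UTF-8", "header", 40));
        // Step (2): make the best guess from the combined clues.
        System.out.println(guess(clues, "UTF-8", 50));
    }
}
```

Keeping clue collection separate from the final guess lets each caller add whatever extra clues it has before a single decision is made.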

    • Constructor Detail

      • EncodingDetector

        EncodingDetector()
      • EncodingDetector

        EncodingDetector(ImmutableConfig conf)
    • Method Detail

      • parseCharacterEncoding

         static String parseCharacterEncoding(CharSequence contentTypeUtf8)

        Parse the character encoding from the specified content type header. If the content type is null, or there is no explicit character encoding, null is returned. This method was copied from org.apache.catalina.util.RequestUtil, which is licensed under the Apache License, Version 2.0 (the "License").

        Parameters:
        contentTypeUtf8 - the UTF-8 encoded content type header value
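
The behaviour described for parseCharacterEncoding can be illustrated with a small sketch. This is not the copied Catalina code; the `ContentTypeCharset` class and its `parse` method are hypothetical names. The sketch assumes a case-sensitive search for `charset=`, stops at the next `;`, and strips surrounding double quotes.

```java
public class ContentTypeCharset {

    private static final String KEY = "charset=";

    /**
     * Extracts the charset parameter from a content type header value.
     * Returns null if the header is null or carries no explicit charset.
     */
    public static String parse(CharSequence contentType) {
        if (contentType == null) {
            return null;
        }
        String header = contentType.toString();
        int start = header.indexOf(KEY);
        if (start < 0) {
            return null; // no explicit character encoding
        }
        String encoding = header.substring(start + KEY.length());
        // Drop any parameters that follow the charset.
        int end = encoding.indexOf(';');
        if (end >= 0) {
            encoding = encoding.substring(0, end);
        }
        encoding = encoding.trim();
        // Remove surrounding double quotes, e.g. charset="utf-8".
        if (encoding.length() > 2 && encoding.startsWith("\"") && encoding.endsWith("\"")) {
            encoding = encoding.substring(1, encoding.length() - 1).trim();
        }
        return encoding.isEmpty() ? null : encoding;
    }
}
```

For example, `ContentTypeCharset.parse("text/html; charset=UTF-8")` yields `UTF-8`, while `ContentTypeCharset.parse("text/html")` yields null.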
      • sniffCharacterEncoding

         String sniffCharacterEncoding(byte[] content)

        Given a byte[] representing an html file of an unknown encoding, read out 'charset' parameter in the meta tag from the first CHUNK_SIZE bytes. If there's no meta tag for Content-Type or no charset is specified, the content is checked for a Unicode Byte Order Mark (BOM). This will also cover non-byte oriented character encodings (UTF-16 only). If no character set can be determined, null is returned.

        See also http://www.w3.org/International/questions/qa-html-encoding-declarations, http://www.w3.org/TR/2011/WD-html5-diff-20110405/#character-encoding, and http://www.w3.org/TR/REC-xml/#sec-guessing

        Parameters:
        content - byte[] representation of an html file
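
The sniffing order described above (meta tag first, then BOM) might look like the sketch below. It is an illustration only, not the method's actual source: the `SniffSketch` class, the regular expression, and the `CHUNK_SIZE` value of 2000 are assumptions made for this example. The leading bytes are decoded as US-ASCII purely so that an ASCII-compatible meta tag becomes searchable.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SniffSketch {

    // Assumed value; the real constant is not given in this documentation.
    static final int CHUNK_SIZE = 2000;

    // Matches charset=... inside a <meta ...> tag, with or without quotes.
    static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_:.\\-]+)",
            Pattern.CASE_INSENSITIVE);

    public static String sniff(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);

        // 1. Look for a charset declared in a meta tag within the first chunk.
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            return m.group(1);
        }

        // 2. Fall back to a Unicode Byte Order Mark.
        if (length >= 3 && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB && (content[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (length >= 2) {
            int b0 = content[0] & 0xFF;
            int b1 = content[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE";
            }
        }

        // 3. Nothing could be determined.
        return null;
    }
}
```

The BOM fallback is what covers UTF-16 input: its meta tag is not readable as ASCII bytes, so the regular expression in the first step does not match.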
      • guessEncoding

         String guessEncoding(WebPage page, String defaultValue)

        Guess the encoding with the previously specified list of clues.

        Parameters:
        page - URL's row
        defaultValue - Default encoding to return if no encoding can be detected with enough confidence.