
Class com.jclark.xml.tok.Encoding

java.lang.Object
   |
   +----com.jclark.xml.tok.Encoding

public abstract class Encoding
extends Object
An Encoding object corresponds to a possible encoding (a mapping from characters to sequences of bytes). It provides operations on byte arrays that represent all or part of a parsed XML entity in that encoding.

The set of ASCII characters excluding $@\^`{}~ has a special status; these are called XML significant characters.

This class imposes certain restrictions on an encoding:

Several methods operate on byte subarrays. The subarray is specified by a byte array buf and two integers, off and end; off gives the index in buf of the first byte of the subarray and end gives the index in buf of the byte immediately after the last byte.
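
The off/end convention describes a half-open range, so a subarray holds end - off bytes and is empty when off equals end. The following is a minimal sketch in plain Java that does not use this class; SubarrayDemo and its methods are hypothetical names, and the bytes are assumed to be ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of com.jclark.xml.tok.
final class SubarrayDemo {
    // Number of bytes in the subarray buf[off..end); end is exclusive.
    static int length(int off, int end) {
        return end - off;
    }

    // Decodes buf[off..end) as ASCII text (an assumption made for this sketch).
    static String asciiText(byte[] buf, int off, int end) {
        return new String(buf, off, end - off, StandardCharsets.US_ASCII);
    }
}
```

For the bytes of <doc>text</doc>, off = 5 and end = 9 select the four bytes of text.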

Use the getInitialEncoding method to get an Encoding object to use to start parsing an entity.

The main operations provided by Encoding are tokenizeProlog, tokenizeContent and tokenizeCdataSection; these are used to divide up an XML entity into tokens. tokenizeProlog is used for the prolog of an XML document as well as for the external subset and parameter entities (except when referenced in an EntityValue); it can also be used for parsing the Misc* that follows the document element. tokenizeContent is used for the document element and for parsed general entities that are referenced in content except for CDATA sections. tokenizeCdataSection is used for CDATA sections, following the <![CDATA[ up to and including the ]]>.

tokenizeAttributeValue and tokenizeEntityValue are used to further divide up tokens returned by tokenizeProlog and tokenizeContent; they are also used to divide up entities referenced in attribute values or entity values.


Variable Index

 o TOK_ATTRIBUTE_VALUE_S
Represents a white space character in an attribute value, excluding white space characters that are part of line boundaries.
 o TOK_CDATA_SECT_CLOSE
Represents the end of a CDATA section ]]>.
 o TOK_CDATA_SECT_OPEN
Represents the start of a CDATA section <![CDATA[.
 o TOK_CHAR_PAIR_REF
Represents a numeric character reference (decimal or hexadecimal), when the referenced character is greater than 0xFFFF and so is represented by a pair of chars.
 o TOK_CHAR_REF
Represents a numeric character reference (decimal or hexadecimal), when the referenced character is less than or equal to 0xFFFF and so is represented by a single char.
 o TOK_CLOSE_BRACKET
Represents ] in the prolog.
 o TOK_CLOSE_PAREN
Represents a ) in the prolog that is not followed immediately by any of *, + or ?.
 o TOK_CLOSE_PAREN_ASTERISK
Represents )* in the prolog.
 o TOK_CLOSE_PAREN_PLUS
Represents )+ in the prolog.
 o TOK_CLOSE_PAREN_QUESTION
Represents )? in the prolog.
 o TOK_COMMA
Represents , in the prolog.
 o TOK_COMMENT
Represents a comment <!-- comment -->.
 o TOK_COND_SECT_CLOSE
Represents ]]> in the prolog.
 o TOK_COND_SECT_OPEN
Represents <![ in the prolog.
 o TOK_DATA_CHARS
Represents one or more characters of data.
 o TOK_DATA_NEWLINE
Represents a newline (CR, LF or CR followed by LF) in data.
 o TOK_DECL_CLOSE
Represents > in the prolog.
 o TOK_DECL_OPEN
Represents <!NAME in the prolog.
 o TOK_EMPTY_ELEMENT_NO_ATTS
Represents an empty element tag <name/>, that doesn't have any attribute specifications.
 o TOK_EMPTY_ELEMENT_WITH_ATTS
Represents an empty element tag <name att="val"/>, that contains one or more attribute specifications.
 o TOK_END_TAG
Represents a complete end-tag </name>.
 o TOK_ENTITY_REF
Represents a general entity reference.
 o TOK_LITERAL
Represents a literal (EntityValue, AttValue, SystemLiteral or PubidLiteral).
 o TOK_MAGIC_ENTITY_REF
Represents a general entity reference to one of the five predefined entities amp, lt, gt, quot, apos.
 o TOK_NAME
Represents a name in the prolog.
 o TOK_NAME_ASTERISK
Represents a name followed immediately by *.
 o TOK_NAME_PLUS
Represents a name followed immediately by +.
 o TOK_NAME_QUESTION
Represents a name followed immediately by ?.
 o TOK_NMTOKEN
Represents a name token in the prolog that is not a name.
 o TOK_OPEN_BRACKET
Represents [ in the prolog.
 o TOK_OPEN_PAREN
Represents a ( in the prolog.
 o TOK_OR
Represents | in the prolog.
 o TOK_PARAM_ENTITY_REF
Represents a parameter entity reference in the prolog.
 o TOK_PERCENT
Represents a % in the prolog that does not start a parameter entity reference.
 o TOK_PI
Represents a processing instruction.
 o TOK_POUND_NAME
Represents #NAME in the prolog.
 o TOK_PROLOG_S
Represents whitespace in the prolog.
 o TOK_START_TAG_NO_ATTS
Represents a complete start-tag <name>, that doesn't have any attribute specifications.
 o TOK_START_TAG_WITH_ATTS
Represents a complete start-tag <name att="val">, that contains one or more attribute specifications.
 o TOK_XML_DECL
Represents an XML declaration or text declaration (a processing instruction whose target is xml).

Method Index

 o convert(byte[], int, int, char[], int)
Convert bytes to characters.
 o getEncoding(String)
Returns an Encoding corresponding to the specified IANA character set name.
 o getFixedBytesPerChar()
Returns the number of bytes required to represent each char, or zero if different chars are represented by different numbers of bytes.
 o getInitialEncoding(byte[], int, int, Token)
Returns an encoding object to be used to start parsing an external entity.
 o getInternalEncoding()
Returns an Encoding object for use with internal entities.
 o getMinBytesPerChar()
Returns the minimum number of bytes required to represent a single character in this encoding.
 o getPublicId(byte[], int, int)
Checks that a literal contained in the specified byte subarray is a legal public identifier and returns a string with the normalized content of the public id.
 o getSingleByteEncoding(String)
Returns an Encoding for entities encoded with a single-byte encoding (an encoding in which each byte represents exactly one character).
 o matchesXMLString(byte[], int, int, String)
Returns true if the specified byte subarray is equal to the string.
 o movePosition(byte[], int, int, Position)
Moves a position forward.
 o skipIgnoreSect(byte[], int, int)
Skips over an ignored conditional section.
 o skipS(byte[], int, int)
Skips over XML whitespace characters at the start of the specified subarray.
 o tokenizeAttributeValue(byte[], int, int, Token)
Scans the first token of a byte subarray that contains part of a literal attribute value.
 o tokenizeCdataSection(byte[], int, int, Token)
Scans the first token of a byte subarray that starts with the content of a CDATA section.
 o tokenizeContent(byte[], int, int, ContentToken)
Scans the first token of a byte subarray that contains content.
 o tokenizeEntityValue(byte[], int, int, Token)
Scans the first token of a byte subarray that contains part of a literal entity value.
 o tokenizeProlog(byte[], int, int, Token)
Scans the first token of a byte subarray that contains part of a prolog.

Variables

 o TOK_DATA_CHARS
 public static final int TOK_DATA_CHARS
Represents one or more characters of data.

 o TOK_DATA_NEWLINE
 public static final int TOK_DATA_NEWLINE
Represents a newline (CR, LF or CR followed by LF) in data.

 o TOK_START_TAG_NO_ATTS
 public static final int TOK_START_TAG_NO_ATTS
Represents a complete start-tag <name>, that doesn't have any attribute specifications.

 o TOK_START_TAG_WITH_ATTS
 public static final int TOK_START_TAG_WITH_ATTS
Represents a complete start-tag <name att="val">, that contains one or more attribute specifications.

 o TOK_EMPTY_ELEMENT_NO_ATTS
 public static final int TOK_EMPTY_ELEMENT_NO_ATTS
Represents an empty element tag <name/>, that doesn't have any attribute specifications.

 o TOK_EMPTY_ELEMENT_WITH_ATTS
 public static final int TOK_EMPTY_ELEMENT_WITH_ATTS
Represents an empty element tag <name att="val"/>, that contains one or more attribute specifications.

 o TOK_END_TAG
 public static final int TOK_END_TAG
Represents a complete end-tag </name>.

 o TOK_CDATA_SECT_OPEN
 public static final int TOK_CDATA_SECT_OPEN
Represents the start of a CDATA section <![CDATA[.

 o TOK_CDATA_SECT_CLOSE
 public static final int TOK_CDATA_SECT_CLOSE
Represents the end of a CDATA section ]]>.

 o TOK_ENTITY_REF
 public static final int TOK_ENTITY_REF
Represents a general entity reference.

 o TOK_MAGIC_ENTITY_REF
 public static final int TOK_MAGIC_ENTITY_REF
Represents a general entity reference to one of the five predefined entities amp, lt, gt, quot, apos.

 o TOK_CHAR_REF
 public static final int TOK_CHAR_REF
Represents a numeric character reference (decimal or hexadecimal), when the referenced character is less than or equal to 0xFFFF and so is represented by a single char.

 o TOK_CHAR_PAIR_REF
 public static final int TOK_CHAR_PAIR_REF
Represents a numeric character reference (decimal or hexadecimal), when the referenced character is greater than 0xFFFF and so is represented by a pair of chars.
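
The split between TOK_CHAR_REF and TOK_CHAR_PAIR_REF follows from how Java represents characters: a code point above 0xFFFF needs two chars (a surrogate pair). A minimal sketch using only java.lang.Character; CharRefDemo is a hypothetical name and is not part of this package.

```java
// Hypothetical helper, not part of com.jclark.xml.tok.
final class CharRefDemo {
    // Number of Java chars needed for the referenced code point:
    // 1 for code points up to 0xFFFF, 2 (a surrogate pair) above that.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Expands a code point into its UTF-16 char sequence.
    static char[] expand(int codePoint) {
        return Character.toChars(codePoint);
    }
}
```

A reference such as &#x1F600; therefore expands to two chars, while &#x41; expands to one.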

 o TOK_PI
 public static final int TOK_PI
Represents a processing instruction.

 o TOK_XML_DECL
 public static final int TOK_XML_DECL
Represents an XML declaration or text declaration (a processing instruction whose target is xml).

 o TOK_COMMENT
 public static final int TOK_COMMENT
Represents a comment <!-- comment -->. This can occur both in the prolog and in content.

 o TOK_ATTRIBUTE_VALUE_S
 public static final int TOK_ATTRIBUTE_VALUE_S
Represents a white space character in an attribute value, excluding white space characters that are part of line boundaries.

 o TOK_PARAM_ENTITY_REF
 public static final int TOK_PARAM_ENTITY_REF
Represents a parameter entity reference in the prolog.

 o TOK_PROLOG_S
 public static final int TOK_PROLOG_S
Represents whitespace in the prolog. The token contains one or more whitespace characters.

 o TOK_DECL_OPEN
 public static final int TOK_DECL_OPEN
Represents <!NAME in the prolog.

 o TOK_DECL_CLOSE
 public static final int TOK_DECL_CLOSE
Represents > in the prolog.

 o TOK_NAME
 public static final int TOK_NAME
Represents a name in the prolog.

 o TOK_NMTOKEN
 public static final int TOK_NMTOKEN
Represents a name token in the prolog that is not a name.

 o TOK_POUND_NAME
 public static final int TOK_POUND_NAME
Represents #NAME in the prolog.

 o TOK_OR
 public static final int TOK_OR
Represents | in the prolog.

 o TOK_PERCENT
 public static final int TOK_PERCENT
Represents a % in the prolog that does not start a parameter entity reference. This can occur in an entity declaration.

 o TOK_OPEN_PAREN
 public static final int TOK_OPEN_PAREN
Represents a ( in the prolog.

 o TOK_CLOSE_PAREN
 public static final int TOK_CLOSE_PAREN
Represents a ) in the prolog that is not followed immediately by any of *, + or ?.

 o TOK_OPEN_BRACKET
 public static final int TOK_OPEN_BRACKET
Represents [ in the prolog.

 o TOK_CLOSE_BRACKET
 public static final int TOK_CLOSE_BRACKET
Represents ] in the prolog.

 o TOK_LITERAL
 public static final int TOK_LITERAL
Represents a literal (EntityValue, AttValue, SystemLiteral or PubidLiteral).

 o TOK_NAME_QUESTION
 public static final int TOK_NAME_QUESTION
Represents a name followed immediately by ?.

 o TOK_NAME_ASTERISK
 public static final int TOK_NAME_ASTERISK
Represents a name followed immediately by *.

 o TOK_NAME_PLUS
 public static final int TOK_NAME_PLUS
Represents a name followed immediately by +.

 o TOK_COND_SECT_OPEN
 public static final int TOK_COND_SECT_OPEN
Represents <![ in the prolog.

 o TOK_COND_SECT_CLOSE
 public static final int TOK_COND_SECT_CLOSE
Represents ]]> in the prolog.

 o TOK_CLOSE_PAREN_QUESTION
 public static final int TOK_CLOSE_PAREN_QUESTION
Represents )? in the prolog.

 o TOK_CLOSE_PAREN_ASTERISK
 public static final int TOK_CLOSE_PAREN_ASTERISK
Represents )* in the prolog.

 o TOK_CLOSE_PAREN_PLUS
 public static final int TOK_CLOSE_PAREN_PLUS
Represents )+ in the prolog.

 o TOK_COMMA
 public static final int TOK_COMMA
Represents , in the prolog.

Methods

 o convert
 public abstract int convert(byte sourceBuf[],
                             int sourceStart,
                             int sourceEnd,
                             char targetBuf[],
                             int targetStart)
Convert bytes to characters. The bytes in sourceBuf between sourceStart and sourceEnd are converted to characters and stored in targetBuf starting at targetStart. (targetBuf.length - targetStart) * getMinBytesPerChar() must be greater than or equal to sourceEnd - sourceStart. If getFixedBytesPerChar returns a value greater than 0, then the return value will be equal to (sourceEnd - sourceStart)/getFixedBytesPerChar().

Returns:
the number of characters stored into targetBuf
See Also:
getFixedBytesPerChar
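
The sizing requirement for convert can be checked before the call. The sketch below only restates the inequality from the description as arithmetic; ConvertSizing and its methods are hypothetical, and minBytesPerChar stands for the value that getMinBytesPerChar() would return.

```java
// Hypothetical helper, not part of com.jclark.xml.tok.
final class ConvertSizing {
    // Smallest number of free chars needed in targetBuf so that
    // freeChars * minBytesPerChar >= sourceEnd - sourceStart.
    static int requiredChars(int sourceStart, int sourceEnd, int minBytesPerChar) {
        int byteCount = sourceEnd - sourceStart;
        return (byteCount + minBytesPerChar - 1) / minBytesPerChar; // ceiling division
    }

    // True if a target buffer of targetLength chars has enough room
    // from targetStart onward to satisfy the documented precondition.
    static boolean fits(int targetLength, int targetStart,
                        int sourceStart, int sourceEnd, int minBytesPerChar) {
        return (long) (targetLength - targetStart) * minBytesPerChar
                >= sourceEnd - sourceStart;
    }
}
```

With a minimum of 2 bytes per character, 10 source bytes need at least 5 free chars in the target buffer.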
 o getFixedBytesPerChar
 public abstract int getFixedBytesPerChar()
Returns the number of bytes required to represent each char, or zero if different chars are represented by different numbers of bytes. The value returned will be 0, 1, 2, or 4.

 o movePosition
 public abstract void movePosition(byte buf[],
                                   int off,
                                   int end,
                                   Position pos)
Moves a position forward. On entry, pos gives the position of the byte at index off in buf. On exit, pos will give the position of the byte at index end, which must be greater than or equal to off. The bytes between off and end must encode one or more complete characters. A carriage return followed by a line feed will be treated as a single line delimiter provided that they are given to movePosition together.
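
The effect of passing a carriage return and line feed together or separately can be illustrated with a small model. This is a sketch only, not the implementation of movePosition: PositionSketch is a hypothetical stand-in for Position, it assumes one byte per character, and its numbering (lines from 1, columns from 0) is an assumption.

```java
// Hypothetical model of position tracking, not part of com.jclark.xml.tok.
final class PositionSketch {
    int line = 1;    // assumed 1-based line number
    int column = 0;  // assumed 0-based column number

    // Advances over buf[off..end), assuming one byte per character.
    void advance(byte[] buf, int off, int end) {
        for (int i = off; i < end; i++) {
            byte b = buf[i];
            if (b == '\r') {
                // CR LF inside the same call counts as one line delimiter.
                if (i + 1 < end && buf[i + 1] == '\n') {
                    i++;
                }
                line++;
                column = 0;
            } else if (b == '\n') {
                line++;
                column = 0;
            } else {
                column++;
            }
        }
    }
}
```

In this model, advancing over "ab\r\ncd" in one call ends on line 2, while splitting the call between the carriage return and the line feed counts two line delimiters and ends on line 3.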

 o tokenizeCdataSection
 public final int tokenizeCdataSection(byte buf[],
                                       int off,
                                       int end,
                                       Token token) throws EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException
Scans the first token of a byte subarray that starts with the content of a CDATA section. Returns one of the following integers according to the type of token that the subarray starts with:

Information about the token is stored in token.

After TOK_CDATA_SECT_CLOSE is returned, the application should use tokenizeContent.

Throws: EmptyTokenException
if the subarray is empty
Throws: PartialTokenException
if the subarray contains only part of a legal token
Throws: InvalidTokenException
if the subarray does not start with a legal token or part of one
Throws: ExtensibleTokenException
if the subarray encodes just a carriage return ('\r')
See Also:
TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_CDATA_SECT_CLOSE, Token, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException, tokenizeContent
 o tokenizeContent
 public final int tokenizeContent(byte buf[],
                                  int off,
                                  int end,
                                  ContentToken token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException
Scans the first token of a byte subarray that contains content. Returns one of the following integers according to the type of token that the subarray starts with:

Information about the token is stored in token.

When TOK_CDATA_SECT_OPEN is returned, tokenizeCdataSection should be called until it returns TOK_CDATA_SECT_CLOSE.

Throws: EmptyTokenException
if the subarray is empty
Throws: PartialTokenException
if the subarray contains only part of a legal token
Throws: InvalidTokenException
if the subarray does not start with a legal token or part of one
Throws: ExtensibleTokenException
if the subarray encodes just a carriage return ('\r')
See Also:
TOK_START_TAG_NO_ATTS, TOK_START_TAG_WITH_ATTS, TOK_EMPTY_ELEMENT_NO_ATTS, TOK_EMPTY_ELEMENT_WITH_ATTS, TOK_END_TAG, TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_CDATA_SECT_OPEN, TOK_ENTITY_REF, TOK_MAGIC_ENTITY_REF, TOK_CHAR_REF, TOK_CHAR_PAIR_REF, TOK_PI, TOK_XML_DECL, TOK_COMMENT, ContentToken, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException, tokenizeCdataSection
 o getInitialEncoding
 public static final Encoding getInitialEncoding(byte buf[],
                                                 int off,
                                                 int end,
                                                 Token token)
Returns an encoding object to be used to start parsing an external entity. The encoding is chosen based on the initial 4 bytes of the entity.

Parameters:
buf - the byte array containing the initial bytes of the entity
off - the index in buf of the first byte of the entity
end - the index in buf following the last available byte of the entity; end - off must be greater than or equal to 4 unless the entity has fewer than 4 bytes, in which case it must be equal to the length of the entity
token - receives information about the presence of a byte order mark; if the entity starts with a byte order mark then token.getTokenEnd() will return off + 2, otherwise it will return off
See Also:
TextDecl, XmlDecl, TOK_XML_DECL, getEncoding, getInternalEncoding
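
Choosing an encoding from the first bytes of an entity can be sketched as follows. This is not the logic of getInitialEncoding itself; it is a simplified version of the autodetection heuristics described in the XML 1.0 Recommendation, limited to UTF-8 and UTF-16 and ignoring 4-byte encodings. InitialEncodingGuess and its methods are hypothetical names.

```java
// Hypothetical sketch, not part of com.jclark.xml.tok.
final class InitialEncodingGuess {
    // Returns a guessed encoding name for the entity starting at buf[off..end).
    static String guess(byte[] buf, int off, int end) {
        if (end - off >= 2) {
            int b0 = buf[off] & 0xFF;
            int b1 = buf[off + 1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE"; // byte order mark
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE"; // byte order mark
            if (b0 == 0x00 && b1 == 0x3C) return "UTF-16BE"; // '<' without a mark
            if (b0 == 0x3C && b1 == 0x00) return "UTF-16LE"; // '<' without a mark
        }
        return "UTF-8"; // default when nothing else matches
    }

    // Length in bytes of a UTF-16 byte order mark at off, or 0 if there is none.
    static int bomLength(byte[] buf, int off, int end) {
        if (end - off < 2) return 0;
        int b0 = buf[off] & 0xFF;
        int b1 = buf[off + 1] & 0xFF;
        boolean hasMark = (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
        return hasMark ? 2 : 0;
    }
}
```

The bomLength result plays the role that token.getTokenEnd() - off plays in the description above: 2 when a byte order mark is present, 0 otherwise.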
 o getEncoding
 public final Encoding getEncoding(String name)
Returns an Encoding corresponding to the specified IANA character set name. Returns this Encoding if the name is null. Returns null if the specified encoding is not supported. Note that there are two distinct Encoding objects associated with the name UTF-16, one for each possible byte order; if this Encoding is UTF-16 with little-endian byte ordering, then getEncoding("UTF-16") will return this, otherwise it will return an Encoding for UTF-16 with big-endian byte ordering.

Parameters:
name - a string specifying the IANA name of the encoding; this is case insensitive
 o getSingleByteEncoding
 public final Encoding getSingleByteEncoding(String map)
Returns an Encoding for entities encoded with a single-byte encoding (an encoding in which each byte represents exactly one character).

Parameters:
map - a string specifying the character represented by each byte; the string must have a length of 256; map.charAt(b) specifies the character encoded by byte b; bytes that do not represent any character should be mapped to ?
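
A map string of the required shape can be built programmatically. The sketch below uses ISO-8859-1 as the example because every byte value b corresponds to the code point b, so no byte is left unmapped; SingleByteMapDemo and its methods are hypothetical names.

```java
// Hypothetical helper, not part of com.jclark.xml.tok.
final class SingleByteMapDemo {
    // Builds a 256-char map where map.charAt(b) is the character for byte b.
    // ISO-8859-1 is used because each byte value b maps to code point b.
    static String latin1Map() {
        StringBuilder map = new StringBuilder(256);
        for (int b = 0; b < 256; b++) {
            map.append((char) b);
        }
        return map.toString();
    }

    // Looks up the character for a Java byte, which is signed,
    // so it is masked to the range 0..255 before indexing.
    static char decode(String map, byte b) {
        return map.charAt(b & 0xFF);
    }
}
```

The resulting string is what would be passed as the map argument, for example getSingleByteEncoding(SingleByteMapDemo.latin1Map()).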
 o getInternalEncoding
 public static final Encoding getInternalEncoding()
Returns an Encoding object for use with internal entities. This is a UTF-16 big endian encoding, except that newlines are assumed to have been normalized into line feed, so carriage return is treated like a space.

 o tokenizeProlog
 public final int tokenizeProlog(byte buf[],
                                 int off,
                                 int end,
                                 Token token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException, EndOfPrologException
Scans the first token of a byte subarray that contains part of a prolog. Returns one of the following integers according to the type of token that the subarray starts with:

Throws: EmptyTokenException
if the subarray is empty
Throws: PartialTokenException
if the subarray contains only part of a legal token
Throws: InvalidTokenException
if the subarray does not start with a legal token or part of one
Throws: EndOfPrologException
if the subarray starts with the document element; tokenizeContent should be used on the remainder of the entity
Throws: ExtensibleTokenException
if the subarray is a legal token but subsequent bytes in the same entity could be part of the token
See Also:
TOK_PI, TOK_XML_DECL, TOK_COMMENT, TOK_PARAM_ENTITY_REF, TOK_PROLOG_S, TOK_DECL_OPEN, TOK_DECL_CLOSE, TOK_NAME, TOK_NMTOKEN, TOK_POUND_NAME, TOK_OR, TOK_PERCENT, TOK_OPEN_PAREN, TOK_CLOSE_PAREN, TOK_OPEN_BRACKET, TOK_CLOSE_BRACKET, TOK_LITERAL, TOK_NAME_QUESTION, TOK_NAME_ASTERISK, TOK_NAME_PLUS, TOK_COND_SECT_OPEN, TOK_COND_SECT_CLOSE, TOK_CLOSE_PAREN_QUESTION, TOK_CLOSE_PAREN_ASTERISK, TOK_CLOSE_PAREN_PLUS, TOK_COMMA, ContentToken, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException, EndOfPrologException
 o tokenizeAttributeValue
 public final int tokenizeAttributeValue(byte buf[],
                                         int off,
                                         int end,
                                         Token token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException
Scans the first token of a byte subarray that contains part of a literal attribute value. The opening and closing delimiters are not included in the subarray. Returns one of the following integers according to the type of token that the subarray starts with:

Throws: EmptyTokenException
if the subarray is empty
Throws: PartialTokenException
if the subarray contains only part of a legal token
Throws: InvalidTokenException
if the subarray does not start with a legal token or part of one
Throws: ExtensibleTokenException
if the subarray encodes just a carriage return ('\r')
See Also:
TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_ATTRIBUTE_VALUE_S, TOK_MAGIC_ENTITY_REF, TOK_ENTITY_REF, TOK_CHAR_REF, TOK_CHAR_PAIR_REF, Token, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException
 o tokenizeEntityValue
 public final int tokenizeEntityValue(byte buf[],
                                      int off,
                                      int end,
                                      Token token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException
Scans the first token of a byte subarray that contains part of a literal entity value. The opening and closing delimiters are not included in the subarray. Returns one of the following integers according to the type of token that the subarray starts with:

Throws: EmptyTokenException
if the subarray is empty
Throws: PartialTokenException
if the subarray contains only part of a legal token
Throws: InvalidTokenException
if the subarray does not start with a legal token or part of one
Throws: ExtensibleTokenException
if the subarray encodes just a carriage return ('\r')
See Also:
TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_MAGIC_ENTITY_REF, TOK_ENTITY_REF, TOK_PARAM_ENTITY_REF, TOK_CHAR_REF, TOK_CHAR_PAIR_REF, Token, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException
 o skipIgnoreSect
 public final int skipIgnoreSect(byte buf[],
                                 int off,
                                 int end) throws PartialTokenException, InvalidTokenException
Skips over an ignored conditional section. The subarray starts following the <![ IGNORE [.

Returns:
the index of the character following the closing ]]>
Throws: PartialTokenException
if the subarray does not contain the complete ignored conditional section
Throws: InvalidTokenException
if the ignored conditional section contains illegal characters
 o getPublicId
 public final String getPublicId(byte buf[],
                                 int off,
                                 int end) throws InvalidTokenException
Checks that a literal contained in the specified byte subarray is a legal public identifier and returns a string with the normalized content of the public id. The subarray includes the opening and closing quotes.

Throws: InvalidTokenException
if it is not a legal public identifier
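
What "normalized content" might look like can be sketched with the public identifier normalization rule from the XML 1.0 Recommendation: runs of white space collapse to a single space and leading and trailing white space is removed. This is an assumption about what getPublicId does, not a description of its code; the sketch works on a String rather than a byte subarray, skips the legality check, and PublicIdSketch is a hypothetical name.

```java
// Hypothetical sketch, not part of com.jclark.xml.tok.
final class PublicIdSketch {
    // literal includes its opening and closing quote characters.
    static String normalize(String literal) {
        String content = literal.substring(1, literal.length() - 1);
        StringBuilder out = new StringBuilder();
        boolean pendingSpace = false;
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c == ' ' || c == '\r' || c == '\n') {
                // Remember a separator, but never before the first character.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                }
                pendingSpace = false;
                out.append(c);
            }
        }
        return out.toString(); // a trailing separator is never flushed
    }
}
```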
 o matchesXMLString
 public final boolean matchesXMLString(byte buf[],
                                       int off,
                                       int end,
                                       String str)
Returns true if the specified byte subarray is equal to the string. The string must contain only XML significant characters.

 o skipS
 public final int skipS(byte buf[],
                        int off,
                        int end)
Skips over XML whitespace characters at the start of the specified subarray.

Returns:
the index of the first non-whitespace character, or end if the subarray is all whitespace
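
The return-value convention of skipS can be illustrated with a stand-alone version. This sketch assumes a single-byte, ASCII-compatible encoding and treats space, tab, carriage return and line feed as XML whitespace; SkipWhitespaceSketch is a hypothetical name, and the real method works for whatever encoding the Encoding object represents.

```java
// Hypothetical sketch, not part of com.jclark.xml.tok.
final class SkipWhitespaceSketch {
    // Returns the index of the first non-whitespace byte in buf[off..end),
    // or end if every byte in the subarray is whitespace.
    static int skipS(byte[] buf, int off, int end) {
        int i = off;
        while (i < end) {
            byte b = buf[i];
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                break;
            }
            i++;
        }
        return i;
    }
}
```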
 o getMinBytesPerChar
 public final int getMinBytesPerChar()
Returns the minimum number of bytes required to represent a single character in this encoding. The value will be 1, 2 or 4.

