Ionflux::Tools::Utf8Tokenizer Class Reference
[String tokenizer]

Tokenizer with UTF-8 support.

#include <Utf8Tokenizer.hpp>

Public Member Functions

 Utf8Tokenizer ()
 Constructor.
 Utf8Tokenizer (const std::string &initInput)
 Constructor.
 Utf8Tokenizer (const std::vector< Utf8TokenType > &initTokenTypes, const std::string &initInput="")
 Constructor.
virtual ~Utf8Tokenizer ()
 Destructor.
virtual void reset ()
 Reset.
virtual void clearTokenTypes ()
 Clear token types.
virtual void useDefaultTokenTypes ()
 Use default token types.
virtual void addDefaultTokenType ()
 Add default token type.
virtual void setTokenTypes (const std::vector< Utf8TokenType > &newTokenTypes)
 Set token types.
virtual void addTokenTypes (const std::vector< Utf8TokenType > &newTokenTypes)
 Add token types.
virtual void addTokenType (const Utf8TokenType &newTokenType)
 Add token type.
virtual void setInput (const std::string &newInput)
 Set input.
virtual void setInput (const std::vector< unsigned int > &newInput)
 Set input.
virtual Utf8Token getNextToken (Utf8TokenTypeMap *otherTypeMap=0)
 Get next token.
virtual Utf8Token getCurrentToken ()
 Get current token.
virtual int getCurrentTokenType ()
 Get current token type.
virtual unsigned int getCurrentPos ()
 Get current position.
virtual unsigned int getCurrentTokenPos ()
 Get current token position.
virtual unsigned int getQuoteChar ()
 Get quote character.
virtual void setExtractQuoted (bool newExtractQuoted)
 Set extract quoted strings flag.
virtual bool getExtractQuoted () const
 Get extract quoted strings flag.
virtual void setExtractEscaped (bool newExtractEscaped)
 Set extract escaped characters flag.
virtual bool getExtractEscaped () const
 Get extract escaped characters flag.

Static Public Member Functions

static bool isValid (const Utf8Token &checkToken)
 Validate token.

Static Public Attributes

static const Utf8TokenType TT_INVALID
 Token type: invalid (special).
static const Utf8TokenType TT_NONE
 Token type: none (special).
static const Utf8TokenType TT_DEFAULT = {1, "", 0}
 Token type: default (special).
static const Utf8TokenType TT_QUOTED = {2, "", 0}
 Token type: quoted (special).
static const Utf8TokenType TT_ESCAPED = {3, "", 0}
 Token type: escaped (special).
static const Utf8TokenType TT_LINEAR_WHITESPACE = {4, " \t", 0}
 Token type: linear whitespace.
static const Utf8TokenType TT_LINETERM = {5, "\n\r", 1}
 Token type: line terminator.
static const Utf8TokenType TT_IDENTIFIER
 Token type: identifier.
static const Utf8TokenType TT_NUMBER = {7, "0123456789", 0}
 Token type: number.
static const Utf8TokenType TT_ALPHA
 Token type: latin alphabet.
static const Utf8TokenType TT_DEFAULT_SEP = {9, "_-.", 0}
 Token type: default separators.
static const Utf8TokenType TT_LATIN
 Token type: lots of latin characters.
static const Utf8Token TOK_INVALID
 Token: invalid (special).
static const Utf8Token TOK_NONE
 Token: none (special).
static const std::string QUOTE_CHARS = "'\""
 Quote characters.
static const unsigned int ESCAPE_CHAR = '\\'
 Escape character.
static const Utf8TokenizerClassInfo utf8TokenizerClassInfo
 Class information instance.
static const Ionflux::Tools::ClassInfo * CLASS_INFO
 Class information.

Protected Attributes

std::vector< unsigned int > theInput
 Input characters to be tokenized.
std::vector< unsigned int > quoteChars
 Quote characters.
unsigned int currentPos
 Current position in the input character string.
unsigned int currentTokenPos
 Position of the current token in the input character string.
unsigned int currentQuoteChar
 The current quote character.
Utf8TokenTypeMap * typeMap
 Token type map.
Utf8Token currentToken
 Current token.
bool extractQuoted
 Extract quoted strings flag.
bool extractEscaped
 Extract escaped characters flag.

Detailed Description

Tokenizer with UTF-8 support.

A generic tokenizer for parsing UTF-8 strings. To set up a tokenizer, first create a Utf8Tokenizer object. It is initialized with the default token types Utf8Tokenizer::TT_LINEAR_WHITESPACE, Utf8Tokenizer::TT_LINETERM and Utf8Tokenizer::TT_IDENTIFIER. You may then add your own custom token types and optionally call Utf8Tokenizer::addDefaultTokenType() to set up the Utf8Tokenizer::TT_DEFAULT token type, which is returned for anything not recognized as one of the other token types. To enable extraction of quoted strings, call Utf8Tokenizer::setExtractQuoted() with true as an argument; to enable extraction of escaped characters, call Utf8Tokenizer::setExtractEscaped() with true.
To get a token from the token stream, call Utf8Tokenizer::getNextToken(). Make sure your code handles the Utf8Tokenizer::TT_NONE and Utf8Tokenizer::TT_INVALID special token types, which cannot be disabled. Utf8Tokenizer::getNextToken() always returns a token of type Utf8Tokenizer::TT_NONE at the end of the token stream and a token of type Utf8Tokenizer::TT_INVALID if an invalid token is encountered.
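
The token loop described above can be illustrated with a small self-contained stand-in. This is a sketch under stated assumptions, not the iftools implementation: SketchTokenType, SketchToken, SketchTokenizer and sketchTokenTypeIDs are hypothetical names, the IDs used for the NONE and INVALID types are invented, the third field of a token type is assumed to be a maximum token length (0 meaning unlimited), and plain bytes are used instead of decoded Unicode characters. Only the type IDs 4, 5 and 6 and their character sets are taken from the attribute listing.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in types for illustration only; NOT the iftools declarations.
struct SketchTokenType {
    int typeID;              // numeric type ID
    std::string validChars;  // characters accepted by this type
    std::size_t maxChars;    // assumed meaning: max token length, 0 = unlimited
};

struct SketchToken {
    int typeID;
    std::string value;
};

// Assumed IDs for the two special types; the real values are not shown here.
const int SKETCH_TT_INVALID = -1;
const int SKETCH_TT_NONE = 0;

class SketchTokenizer {
public:
    SketchTokenizer(const std::vector<SketchTokenType>& types,
                    const std::string& input)
        : types_(types), input_(input), pos_(0) {}

    // Return the next token, a NONE token at the end of the input, or an
    // INVALID token if no registered type accepts the current character.
    SketchToken getNextToken() {
        if (pos_ >= input_.size())
            return SketchToken{SKETCH_TT_NONE, ""};
        for (const SketchTokenType& t : types_) {
            std::size_t len = 0;
            while (pos_ + len < input_.size()
                   && t.validChars.find(input_[pos_ + len]) != std::string::npos
                   && (t.maxChars == 0 || len < t.maxChars))
                ++len;
            if (len > 0) {
                SketchToken tok{t.typeID, input_.substr(pos_, len)};
                pos_ += len;
                return tok;
            }
        }
        return SketchToken{SKETCH_TT_INVALID, ""};
    }

private:
    std::vector<SketchTokenType> types_;
    std::string input_;
    std::size_t pos_;
};

// Collect the type IDs produced for an input, stopping at NONE or INVALID.
std::vector<int> sketchTokenTypeIDs(const std::string& input) {
    const std::vector<SketchTokenType> types = {
        {4, " \t", 0},   // same ID and characters as TT_LINEAR_WHITESPACE
        {5, "\n\r", 1},  // same ID and characters as TT_LINETERM
        {6, "abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_", 0}  // as TT_IDENTIFIER
    };
    SketchTokenizer tokenizer(types, input);
    std::vector<int> ids;
    for (;;) {
        SketchToken tok = tokenizer.getNextToken();
        ids.push_back(tok.typeID);
        // Both special types end the loop, as the description recommends.
        if (tok.typeID == SKETCH_TT_NONE || tok.typeID == SKETCH_TT_INVALID)
            break;
    }
    return ids;
}
```

With the real class, the equivalent loop would call Utf8Tokenizer::getNextToken() repeatedly and stop once the returned token has the type TT_NONE or TT_INVALID.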


Constructor & Destructor Documentation

Ionflux::Tools::Utf8Tokenizer::Utf8Tokenizer ()
 

Constructor.

Construct new Utf8Tokenizer object.

Ionflux::Tools::Utf8Tokenizer::Utf8Tokenizer (const std::string &initInput)
 

Constructor.

Construct new Utf8Tokenizer object.

Parameters:
initInput UTF-8 input string.

Ionflux::Tools::Utf8Tokenizer::Utf8Tokenizer (const std::vector< Utf8TokenType > &initTokenTypes, const std::string &initInput="")
 

Constructor.

Construct new Utf8Tokenizer object.

Parameters:
initTokenTypes Token types.
initInput UTF-8 input string.

Ionflux::Tools::Utf8Tokenizer::~Utf8Tokenizer () [virtual]
 

Destructor.

Destruct Utf8Tokenizer object.


Member Function Documentation

void Ionflux::Tools::Utf8Tokenizer::addDefaultTokenType () [virtual]
 

Add default token type.

Add a special token type TT_DEFAULT which will be returned if a token is not recognized.
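
The fallback behaviour can be sketched as a first-match lookup over per-type character sets. This is a minimal illustration, not the library's code: SketchType and sketchClassify are hypothetical, the invalid ID of -1 is an assumption, and only the ID 1 is taken from the listed value TT_DEFAULT = {1, "", 0}.

```cpp
#include <string>
#include <vector>

// Stand-in for a token type: an ID plus the characters it accepts.
struct SketchType {
    int typeID;
    std::string validChars;
};

const int SKETCH_ID_INVALID = -1;  // assumed value; the real ID is not shown
const int SKETCH_ID_DEFAULT = 1;   // matches the ID in TT_DEFAULT = {1, "", 0}

// Return the ID of the first type that accepts c. If none does, fall back to
// the default ID when a default type was added, otherwise report invalid.
int sketchClassify(char c, const std::vector<SketchType>& types,
                   bool hasDefaultType) {
    for (const SketchType& t : types) {
        if (t.validChars.find(c) != std::string::npos)
            return t.typeID;
    }
    return hasDefaultType ? SKETCH_ID_DEFAULT : SKETCH_ID_INVALID;
}
```

In this sketch the hasDefaultType flag plays the role of a prior call to addDefaultTokenType(): without it, unrecognized characters are reported as invalid.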

void Ionflux::Tools::Utf8Tokenizer::addTokenType (const Utf8TokenType &newTokenType) [virtual]
 

Add token type.

Add the specified token type.

Parameters:
newTokenType Token type.

void Ionflux::Tools::Utf8Tokenizer::addTokenTypes (const std::vector< Utf8TokenType > &newTokenTypes) [virtual]
 

Add token types.

Add the specified token types.

Parameters:
newTokenTypes Token types.

void Ionflux::Tools::Utf8Tokenizer::clearTokenTypes () [virtual]
 

Clear token types.

Remove all token types.

unsigned int Ionflux::Tools::Utf8Tokenizer::getCurrentPos () [virtual]
 

Get current position.

Get the current position in the input string.

Returns:
Current position.

Utf8Token Ionflux::Tools::Utf8Tokenizer::getCurrentToken () [virtual]
 

Get current token.

Get the current token from the input string.

Returns:
Current token.

unsigned int Ionflux::Tools::Utf8Tokenizer::getCurrentTokenPos () [virtual]
 

Get current token position.

Get the position of the current token in the input string.

Returns:
Current token position.

int Ionflux::Tools::Utf8Tokenizer::getCurrentTokenType () [virtual]
 

Get current token type.

Get the type ID of the current token.

Returns:
Type ID of current token.

bool Ionflux::Tools::Utf8Tokenizer::getExtractEscaped () const [virtual]
 

Get extract escaped characters flag.

Returns:
Current value of extract escaped characters flag.

bool Ionflux::Tools::Utf8Tokenizer::getExtractQuoted () const [virtual]
 

Get extract quoted strings flag.

Returns:
Current value of extract quoted strings flag.

Utf8Token Ionflux::Tools::Utf8Tokenizer::getNextToken (Utf8TokenTypeMap *otherTypeMap=0) [virtual]
 

Get next token.

Get the next token from the input string. If the optional otherTypeMap is set, the specified token type map will be used instead of the default token type map.

Parameters:
otherTypeMap Token type map.
Returns:
Next token.

unsigned int Ionflux::Tools::Utf8Tokenizer::getQuoteChar () [virtual]
 

Get quote character.

Get the quote character for the current token.

Returns:
Current quote character.

bool Ionflux::Tools::Utf8Tokenizer::isValid (const Utf8Token &checkToken) [static]
 

Validate token.

Check whether the specified token is valid (i.e. it is not invalid or empty).

Parameters:
checkToken Token to be checked.
Returns:
true if the specified token is valid, false otherwise.

void Ionflux::Tools::Utf8Tokenizer::reset () [virtual]
 

Reset.

Reset the tokenizer.

void Ionflux::Tools::Utf8Tokenizer::setExtractEscaped (bool newExtractEscaped) [virtual]
 

Set extract escaped characters flag.

Set new value of extract escaped characters flag.

Parameters:
newExtractEscaped New value of extract escaped characters flag.

void Ionflux::Tools::Utf8Tokenizer::setExtractQuoted (bool newExtractQuoted) [virtual]
 

Set extract quoted strings flag.

Set new value of extract quoted strings flag.

Parameters:
newExtractQuoted New value of extract quoted strings flag.
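
What quoted-string extraction might look like can be sketched as follows. The reference does not specify the exact rules, so this is an assumption-laden illustration only: SketchQuoted and sketchExtractQuoted are hypothetical, a backslash is assumed to make the following character literal, and an unterminated string is reported as a failure. The quote and escape characters are the ones listed for QUOTE_CHARS and ESCAPE_CHAR.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type, not an iftools declaration.
struct SketchQuoted {
    bool ok;            // true if a complete quoted string was found
    char quoteChar;     // quote character that delimited the string
    std::string value;  // contents without the surrounding quotes
    std::size_t next;   // index just past the closing quote (pos on failure)
};

// Extract a quoted string that starts at input[pos].
// Assumption: a backslash makes the following character literal.
SketchQuoted sketchExtractQuoted(const std::string& input, std::size_t pos) {
    const std::string quoteChars = "'\"";  // same characters as QUOTE_CHARS
    const char escapeChar = '\\';          // same character as ESCAPE_CHAR
    SketchQuoted result = {false, '\0', "", pos};
    if (pos >= input.size()
        || quoteChars.find(input[pos]) == std::string::npos)
        return result;  // no opening quote at this position
    result.quoteChar = input[pos];
    for (std::size_t i = pos + 1; i < input.size(); ++i) {
        char c = input[i];
        if (c == escapeChar && i + 1 < input.size()) {
            result.value += input[++i];  // keep the escaped character as-is
        } else if (c == result.quoteChar) {
            result.ok = true;
            result.next = i + 1;
            return result;
        } else {
            result.value += c;
        }
    }
    result.value.clear();  // unterminated string: report failure
    return result;
}
```

The quoteChar field corresponds to the kind of information getQuoteChar() reports for the current token: which of the two quote characters delimited the string.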

void Ionflux::Tools::Utf8Tokenizer::setInput (const std::vector< unsigned int > &newInput) [virtual]
 

Set input.

Set the Unicode input characters.

Parameters:
newInput Unicode characters.

void Ionflux::Tools::Utf8Tokenizer::setInput (const std::string &newInput) [virtual]
 

Set input.

Set the UTF-8 encoded input string.

Parameters:
newInput UTF-8 input string.
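
The two setInput() overloads, together with the std::vector< unsigned int > type of theInput, suggest that UTF-8 input is decoded into Unicode code points before tokenizing. The following is a minimal sketch of such a decoding step, not the library's implementation. sketchDecodeUtf8 is a hypothetical helper; it rejects bad lead bytes, truncated sequences and bad continuation bytes, but does not check for overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of iftools: decode UTF-8 bytes into Unicode
// code points. Returns false on malformed input; in that case `out` holds
// only the code points decoded before the error.
bool sketchDecodeUtf8(const std::string& bytes, std::vector<unsigned int>& out) {
    out.clear();
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        unsigned int cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            cp = lead;                       // 0xxxxxxx: one byte
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1;     // 110xxxxx: two bytes
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2;     // 1110xxxx: three bytes
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3;     // 11110xxx: four bytes
        } else {
            return false;                    // stray continuation or bad lead
        }
        if (extra > bytes.size() - i - 1)
            return false;                    // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;                // not a 10xxxxxx continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

A vector produced this way has the same element type as the argument of the setInput() overload that takes Unicode characters.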

void Ionflux::Tools::Utf8Tokenizer::setTokenTypes (const std::vector< Utf8TokenType > &newTokenTypes) [virtual]
 

Set token types.

Set the token types for the tokenizer.

Parameters:
newTokenTypes Token types.

void Ionflux::Tools::Utf8Tokenizer::useDefaultTokenTypes () [virtual]
 

Use default token types.

Use default token types (TT_LINEAR_WHITESPACE, TT_IDENTIFIER, TT_LINETERM).


Member Data Documentation

const ClassInfo * Ionflux::Tools::Utf8Tokenizer::CLASS_INFO [static]
 

Initial value:

Class information.

Reimplemented from Ionflux::Tools::ManagedObject.

unsigned int Ionflux::Tools::Utf8Tokenizer::currentPos [protected]
 

Current position in the input character string.

unsigned int Ionflux::Tools::Utf8Tokenizer::currentQuoteChar [protected]
 

The current quote character.

Utf8Token Ionflux::Tools::Utf8Tokenizer::currentToken [protected]
 

Current token.

unsigned int Ionflux::Tools::Utf8Tokenizer::currentTokenPos [protected]
 

Position of the current token in the input character string.

const unsigned int Ionflux::Tools::Utf8Tokenizer::ESCAPE_CHAR = '\\' [static]
 

Escape character.

bool Ionflux::Tools::Utf8Tokenizer::extractEscaped [protected]
 

Extract escaped characters flag.

bool Ionflux::Tools::Utf8Tokenizer::extractQuoted [protected]
 

Extract quoted strings flag.

const std::string Ionflux::Tools::Utf8Tokenizer::QUOTE_CHARS = "'\"" [static]
 

Quote characters.

std::vector<unsigned int> Ionflux::Tools::Utf8Tokenizer::quoteChars [protected]
 

Quote characters.

std::vector<unsigned int> Ionflux::Tools::Utf8Tokenizer::theInput [protected]
 

Input characters to be tokenized.

const Utf8Token Ionflux::Tools::Utf8Tokenizer::TOK_INVALID [static]
 

Initial value:

Token: invalid (special).

const Utf8Token Ionflux::Tools::Utf8Tokenizer::TOK_NONE [static]
 

Initial value:

Token: none (special).

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_ALPHA [static]
 

Initial value:

 {8, 
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0}
Token type: latin alphabet.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_DEFAULT = {1, "", 0} [static]
 

Token type: default (special).

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_DEFAULT_SEP = {9, "_-.", 0} [static]
 

Token type: default separators.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_ESCAPED = {3, "", 0} [static]
 

Token type: escaped (special).

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_IDENTIFIER [static]
 

Initial value:

 {6, 
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_", 0}
Token type: identifier.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_INVALID [static]
 

Initial value:

Token type: invalid (special).

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_LATIN [static]
 

Initial value:

 {10, 
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíî"
        "ïðñòóôõöøùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜ"
        "ĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌ"
        "ōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ", 0}
Token type: lots of latin characters.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_LINEAR_WHITESPACE = {4, " \t", 0} [static]
 

Token type: linear whitespace.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_LINETERM = {5, "\n\r", 1} [static]
 

Token type: line terminator.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_NONE [static]
 

Initial value:

Token type: none (special).

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_NUMBER = {7, "0123456789", 0} [static]
 

Token type: number.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_QUOTED = {2, "", 0} [static]
 

Token type: quoted (special).

Utf8TokenTypeMap* Ionflux::Tools::Utf8Tokenizer::typeMap [protected]
 

Token type map.

const Utf8TokenizerClassInfo Ionflux::Tools::Utf8Tokenizer::utf8TokenizerClassInfo [static]
 

Class information instance.


The documentation for this class was generated from the following files:
Generated on Tue Mar 14 21:11:52 2006 for Ionflux Tools Class Library (iftools) by  doxygen 1.4.6