Tokenizer
class Tokenizer
The HTML5 tokenizer.
The tokenizer reads data from the scanner and groups it into semantic units (tokens). It emits those tokens to an event handler, which may (for example) build a DOM tree.
The HTML5 specification explains tokenizing HTML5 in detail. We follow that specification as closely as we can. If you find an undocumented discrepancy, please file a bug and/or submit a patch.
This tokenizer is implemented as a recursive descent parser.
Within the API documentation, you may see references to the specific section of the HTML5 spec that the code attempts to reproduce. Example: 8.2.4.1. This refers to section 8.2.4.1 of the HTML5 CR specification.
Constants
WHITE
Methods
parse(): Begin parsing.
setTextMode(integer $textmode, string $untilTag = null): Set the text mode for the character data reader.
Details
at line 61
__construct(
Scanner $scanner,
EventHandler $eventHandler)
Create a new tokenizer.
Typically, parsing a document involves creating a new tokenizer, giving it a scanner (input) and an event handler (output), and then calling the Tokenizer::parse() method.
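The workflow described here (scanner in, event handler out, then parse()) can be sketched with small stand-in classes. Everything prefixed with Sketch, along with RecordingHandler and its method names, is hypothetical and written only for illustration; these are not the library's classes, and the toy tokenizer handles only bare tags and text. The sketch assumes PHP 8.

```php
<?php
// Hypothetical stand-ins that mimic the scanner -> tokenizer -> handler shape.
interface SketchEventHandler
{
    public function text(string $data): void;
    public function startTag(string $name): void;
    public function endTag(string $name): void;
}

final class SketchScanner
{
    private int $pos = 0;

    public function __construct(private string $data) {}

    // Current character, or null at end of input.
    public function current(): ?string
    {
        return $this->pos < strlen($this->data) ? $this->data[$this->pos] : null;
    }

    public function next(): void
    {
        $this->pos++;
    }
}

final class SketchTokenizer
{
    public function __construct(
        private SketchScanner $scanner,
        private SketchEventHandler $events
    ) {}

    public function parse(): void
    {
        $text = '';
        while (($c = $this->scanner->current()) !== null) {
            if ($c !== '<') {
                $text .= $c;
                $this->scanner->next();
                continue;
            }
            if ($text !== '') {
                $this->events->text($text); // flush pending text first
                $text = '';
            }
            $this->scanner->next(); // consume '<'
            $tag = '';
            while (($c = $this->scanner->current()) !== null && $c !== '>') {
                $tag .= $c;
                $this->scanner->next();
            }
            $this->scanner->next(); // consume '>'
            if (str_starts_with($tag, '/')) {
                $this->events->endTag(substr($tag, 1));
            } else {
                $this->events->startTag($tag);
            }
        }
        if ($text !== '') {
            $this->events->text($text);
        }
    }
}

// An event handler that simply records what it is told.
final class RecordingHandler implements SketchEventHandler
{
    public array $events = [];

    public function text(string $data): void { $this->events[] = "text:$data"; }
    public function startTag(string $name): void { $this->events[] = "start:$name"; }
    public function endTag(string $name): void { $this->events[] = "end:$name"; }
}

$handler = new RecordingHandler();
$tokenizer = new SketchTokenizer(new SketchScanner('<p>Hi</p>'), $handler);
$tokenizer->parse();
print_r($handler->events);
```

The handler decides what the tokens become: a recording handler like this one is handy for tests, while a tree-building handler would create nodes instead.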
at line 78
parse()
Begin parsing.
This will begin scanning the document, tokenizing as it goes. Tokens are emitted into the event handler.
Tokenizing will continue until the document is completely read. Errors are emitted into the event handler, but the parser will attempt to continue parsing until the entire input stream is read.
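The "report the error and keep going" behavior can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the library's algorithm: the function name tokenizeLeniently, the two callbacks, and the error messages are all hypothetical, and the tag-name rule is deliberately simplistic.

```php
<?php
/**
 * Toy tokenizer loop: problems are reported through $onError,
 * and reading continues until the input is exhausted.
 */
function tokenizeLeniently(string $html, callable $onToken, callable $onError): void
{
    $len = strlen($html);
    $i = 0;
    while ($i < $len) {
        if ($html[$i] !== '<') {
            // Plain text runs up to the next '<' or the end of input.
            $n = strcspn($html, '<', $i);
            $onToken('text', substr($html, $i, $n));
            $i += $n;
            continue;
        }
        $close = strpos($html, '>', $i);
        if ($close === false) {
            // Unterminated tag: report it, emit the remainder as text, finish.
            $onError("Unterminated tag at offset $i");
            $onToken('text', substr($html, $i));
            return;
        }
        $name = substr($html, $i + 1, $close - $i - 1);
        if (!preg_match('/^\/?[A-Za-z][A-Za-z0-9]*$/', $name)) {
            // Bad tag name: report it, treat '<' as text, and carry on.
            $onError("Invalid tag name at offset $i");
            $onToken('text', '<');
            $i++;
            continue;
        }
        $onToken('tag', $name);
        $i = $close + 1;
    }
}

$tokens = [];
$errors = [];
tokenizeLeniently(
    'a <1 b <p>c',
    function (string $type, string $value) use (&$tokens) { $tokens[] = [$type, $value]; },
    function (string $message) use (&$errors) { $errors[] = $message; }
);
print_r($tokens);
print_r($errors);
```

The malformed `<1` produces one error, yet the later `<p>` tag and the trailing text are still tokenized.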
at line 109
setTextMode(
integer $textmode,
string $untilTag = null)
Set the text mode for the character data reader.
HTML5 defines three different modes for reading text:
- Normal: Read until a tag is encountered.
- RCDATA: Read until a tag is encountered, but skip a few otherwise-special characters.
- Raw: Read until a special closing tag is encountered (viz. pre, script)
This method selects which of those modes is active.
Normally, setting is done by the event handler via a special return code on startTag(), but it can also be set manually using this function.
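How a return code from startTag() can switch the reading mode is sketched below. The constants TEXT_NORMAL, TEXT_RCDATA and TEXT_RAW, their values, and the helper readText() are hypothetical names chosen for this illustration; the library's real constants and internals may differ. The RCDATA branch follows the HTML5 specification (read to the matching end tag, decoding character references), and the mapping of textarea and title to RCDATA is likewise taken from the specification rather than from this page. The sketch assumes PHP 8.

```php
<?php
// Hypothetical mode constants; the library's real values may differ.
const TEXT_NORMAL = 0;
const TEXT_RCDATA = 1;
const TEXT_RAW    = 2;

/**
 * Read character data starting at $offset.
 * Returns [text, offset of the first unread byte].
 */
function readText(string $input, int $offset, int $mode, ?string $untilTag = null): array
{
    if ($mode === TEXT_NORMAL) {
        // Stop at the first '<', whatever tag it starts.
        $end = strpos($input, '<', $offset);
    } else {
        // Stop only at the named closing tag, e.g. "</script".
        $end = stripos($input, '</' . $untilTag, $offset);
    }
    if ($end === false) {
        $end = strlen($input); // no terminator: read to end of input
    }
    $text = substr($input, $offset, $end - $offset);
    if ($mode === TEXT_RCDATA) {
        // Assumption: RCDATA decodes character references, per the HTML5 spec.
        $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }
    return [$text, $end];
}

// Stand-in for an event handler's startTag(): the return value picks the mode.
function startTag(string $name): int
{
    return match ($name) {
        'script', 'pre'     => TEXT_RAW,
        'textarea', 'title' => TEXT_RCDATA,
        default             => TEXT_NORMAL,
    };
}

$input = 'if (a < b) { run(); }</script><p>';

// Raw mode: the '<' inside the script body is just text.
[$raw, $next] = readText($input, 0, startTag('script'), 'script');

// Normal mode on the same input stops at the first '<'.
[$normal] = readText($input, 0, TEXT_NORMAL);

// RCDATA mode: read to the closing tag, decoding character references.
[$title] = readText('Tom &amp; Jerry</title>', 0, startTag('title'), 'title');

var_dump($raw, $normal, $title);
```

Setting the mode manually corresponds to calling readText() with an explicit mode instead of asking startTag() for one.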