Lexing and parsing AsciiDoc files

Lexing and parsing input files is the starting point both for highlighting the code in the editor and for building auto-completion and refactoring functionality.

Lexing of input files

Lexing chops the input file into a stream of tokens. Each token has a type and the snippet of characters it covers.

The standard tool for writing lexers in IntelliJ is JFlex.

The heart of the lexer is asciidoc.flex. It defines multiple lexer states and uses a fair amount of custom code and tweaking to lex AsciiDoc. Developers can add new token types in AsciiDocTokenTypes. Make sure to update the list TOKENS_TO_MERGE if consecutive tokens of the same type should be merged into one token. If the content of a token type should be spell-checked, add it to the list of tokens in AsciiDocSpellcheckingStrategy.
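A sketch of what this might look like, assuming the existing constants in AsciiDocTokenTypes follow this pattern (the class name AsciiDocTokenType, the token name NEW_MARKER, and the type of TOKENS_TO_MERGE are illustrative; copy the style of the surrounding declarations):

import com.intellij.psi.tree.IElementType;
import com.intellij.psi.tree.TokenSet;

// a new token type, declared next to the existing constants:
IElementType TEXT = new AsciiDocTokenType("TEXT");
IElementType NEW_MARKER = new AsciiDocTokenType("NEW_MARKER");

// consecutive tokens of these types are merged into one token:
TokenSet TOKENS_TO_MERGE = TokenSet.create(TEXT, NEW_MARKER);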

Once you have changed asciidoc.flex, run gradlew compileJava to re-generate the lexer code.

A test suite for the lexer is in AsciiDocLexerTest. I recommend running it from the IDE. Each test case calls doTest(...) to run the lexer on one snippet of AsciiDoc and compare the resulting token stream to an expected “known good” result.
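A sketch of what a test case might look like (the token names and the exact format of the expected string are illustrative; copy the format from an existing test case):

public void testBoldSimple() {
  doTest("*bold*",
      "AsciiDoc:BOLD_START ('*')\n" +
      "AsciiDoc:BOLD ('bold')\n" +
      "AsciiDoc:BOLD_END ('*')");
}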

A typical developer workflow for enhancing the lexer looks like this:

  1. Change asciidoc.flex in the IDE, adding new entries to AsciiDocTokenTypes as needed.

  2. Run gradlew compileJava on the command line.

  3. Add a test case to AsciiDocLexerTest and run it from the IDE.

  4. If lexing does not yet work as expected, repeat from step 1.

  5. If lexing returns the expected result, update the expected parameter in the test.

There is a ready-to-use run configuration in IntelliJ called “AsciiDocLexerTest” that will re-generate the lexer and run the tests in one go.

Things to consider when parsing AsciiDoc with JFlex:

  • JFlex was originally designed to lex Java code. AsciiDoc is different.

  • There are no invalid characters in AsciiDoc. If you get the syntax wrong, the converter usually prints the broken syntax “as is”. Only a matching pair of, for example, asterisks (*) produces bold text.

Here are some JFlex rules for AsciiDoc together with an explanation:

Look-ahead rules

Look-ahead rules are considered slow in JFlex, but they provide the power to recognize a token only when there is a matching closing token.

A slash (/) separates the matching pattern from the look-ahead.

Example of parsing typographic quotes
{TYPOGRAPHIC_QUOTE_START} / [^\n \t] {WORD}* {TYPOGRAPHIC_QUOTE_END}
End of line and end of file parsing

JFlex supports $ to describe the end of a line as look-ahead, but this doesn’t work at the end of a file. To match the end of a file, the lexer uses the fact that JFlex matches the longest rule first, including any look-ahead. So the rules below first match the delimiter at the end of a line, then the delimiter followed by more content on the same line, and finally the delimiter at the end of the file (the shortest match).

Example: matching a delimiter at the end of the file
// delimiter at end of line
{LISTING_BLOCK_DELIMITER} $ { /* ... */ }

// delimiter not at end of line
{LISTING_BLOCK_DELIMITER} / [^\-\n \t] { /* ... */ }

// delimiter at end of file (shorter match than the two above)
{LISTING_BLOCK_DELIMITER} { /* ... */ }
Stateful lexer

To lex bold, italic and monospace text, which can be nested, a set of boolean variables memorizes the current text style. They are reset at the end of a block, just as formatting does not continue across blocks in a regular AsciiDoc file. The function textFormat() uses these flags to determine the current token type.
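A simplified sketch of how such a method might combine the flags (the flag and token names are illustrative; the real textFormat() lives in asciidoc.flex and knows more combinations):

private boolean bold;
private boolean italic;
private boolean mono;

private IElementType textFormat() {
  if (bold && italic) return AsciiDocTokenTypes.BOLDITALIC; // nested styles
  if (bold) return AsciiDocTokenTypes.BOLD;
  if (italic) return AsciiDocTokenTypes.ITALIC;
  if (mono) return AsciiDocTokenTypes.MONO;
  return AsciiDocTokenTypes.TEXT; // no style is active
}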

Other variables memorize the length of a block delimiter line to find the matching closing delimiter.

Qualifying matches, push back and state change

After a match, the Java code checks additional conditions, such as whether this is an unconstrained position in the stream. If the code decides to discard the match, the following strategies (among others) are available:

  1. Push back all but the first character, and return the token type for the single character. As an example: when a double asterisk occurs but no bold text is to end here, see {DOUBLEMONO} in the lexer (and the sketch after this list).

  2. Push back the complete text and continue with a different state using yybegin() (for example when matching a {HEADING_OLDSTYLE} in the MULTILINE state).

  3. Some expressions can be prefixed with a backslash (\) to escape them. Use isEscaped() to check whether a match has been escaped.
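As an illustration of strategies 1 and 3, the Java code inside a JFlex action might look like this (isEscaped() and isUnconstrainedEnd() stand in for the actual helper methods, and BOLD_END is an illustrative token name):

// inside a JFlex action for a closing double asterisk (sketch):
if (isEscaped() || !isUnconstrainedEnd()) {
  yypushback(yylength() - 1); // keep only the first character
  return textFormat();        // lex it with the current text style
}
return AsciiDocTokenTypes.BOLD_END;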

Unfortunately, the lexer can’t continue with other matches in the same state once a rule has matched. To work around this issue, blocks are parsed first in state MULTILINE, then in state SINGLELINE, and finally in INSIDE_LINE, to implement a hierarchy and some ordering of matches.

Auto-completion

The rules described above match an expression only once its closing syntax is complete, which is essential for correct highlighting. To support auto-completion, the matching must also handle an expression where only the left part exists.

There is a special case in the lexer to support auto-completion: IntelliJ inserts a special string when it parses the content for auto-completion (matched by the {AUTOCOMPLETE} rule in our lexer).

In the case of references (<<ref>>) there are two rules: one for regular parsing and highlighting, and one for auto-completion:

// regular
{REFSTART} / [^>\n]+ {REFEND} { yybegin(REF); return AsciiDocTokenTypes.REFSTART; }
// auto-complete
{REFSTART} / [^>\n ]* {AUTOCOMPLETE} { yybegin(REFAUTO); return AsciiDocTokenTypes.REFSTART; }

Highlighting

Highlighting colors the text in the editor.

The file AsciiDocSyntaxHighlighter assigns one TextAttributesKey to each entry in AsciiDocTokenTypes produced during lexing. Currently, several tokens share the same highlighting ASCIIDOC_MARKER, so users get the same color for the pointy brackets around references (<<ref>>) and the markers for bold text (*bold*).

Once you add a new TextAttributesKey, either:

  1. Reference an existing color (like ASCIIDOC_COMMENT references DefaultLanguageHighlighterColors.LINE_COMMENT; see the sketch after this list).

  2. Add a color to the AsciiDoc themes AsciidocDefault.xml and AsciidocDarcula.xml.
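A sketch of option 1, using the standard IntelliJ API (this mirrors how ASCIIDOC_COMMENT could be declared; check the actual file for the exact form):

import com.intellij.openapi.editor.DefaultLanguageHighlighterColors;
import com.intellij.openapi.editor.colors.TextAttributesKey;

public static final TextAttributesKey ASCIIDOC_COMMENT =
    TextAttributesKey.createTextAttributesKey("ASCIIDOC_COMMENT",
        DefaultLanguageHighlighterColors.LINE_COMMENT);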

Once you add a new token, you will need to add it to AsciiDocColorSettingsPage so users can customize the colors of their theme. This class also references SampleDocument.adoc and AsciiDocBundle.properties, so you’ll probably need to change these two files as well.

Parsing

Why

Parsing gives a hierarchical structure and meaning to the tokens created in the lexing phase.

It can define PsiElements inside the tree to allow interactions with the user, like renaming elements and auto-completion. The structure is also the foundation of the structure outline view and the folding capabilities.

How

The AsciiDocParserDefinition separates whitespace and comments from functional tokens. It also serves as a factory for all PsiElements, like AsciiDocSection for sections and AsciiDocBlock for blocks.

AsciiDocParserImpl encodes the logic of how to group the tokens into a tree. To do this, it uses several strategies. This outline summarizes the most distinctive ones:

References

Once the parser sees the start token REFSTART (usually two opening pointy brackets, like <<), it sets a marker. Then it reads all tokens that are valid inside a reference. Once there are no more valid tokens for a reference, it marks this range as an AsciiDocElementTypes.REF.
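Expressed with IntelliJ’s PsiBuilder API, this strategy might look like the following simplified sketch (isValidInsideRef() is a hypothetical helper; the real code lives in AsciiDocParserImpl):

import com.intellij.lang.PsiBuilder;

private void parseRef(PsiBuilder builder) {
  PsiBuilder.Marker marker = builder.mark(); // remember the position of REFSTART
  builder.advanceLexer();                    // consume REFSTART
  while (isValidInsideRef(builder.getTokenType())) {
    builder.advanceLexer();                  // consume tokens valid inside a reference
  }
  marker.done(AsciiDocElementTypes.REF);     // turn the marked range into a REF element
}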

Blocks

A block starts, for example, with a LISTING_BLOCK_DELIMITER (usually four dashes on a line of their own, like ----). The block then continues up to the point where the same delimiter occurs again.

But the block can be preceded, for example, by a title (a dot, followed by the title itself, like .Title). This title is part of the block. To support this, TITLE and several other elements call markPreBlock() to memorize the first token that is part of a following block. The marker is stored in the variable myPreBlockMarker.

When parsing of the block starts and myPreBlockMarker is set, the parser uses this marker. If the marker is not set, it creates a new marker at the start of the block delimiter. When no block starts on one of the following lines, dropPreBlock() drops the marker.
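A simplified sketch of this logic (BLOCK is an assumed element type; see AsciiDocElementTypes for the actual constants):

PsiBuilder.Marker marker;
if (myPreBlockMarker != null) {
  marker = myPreBlockMarker; // the title or another pre-block element opens the block
  myPreBlockMarker = null;
} else {
  marker = builder.mark();   // no pre-block element: the block starts at the delimiter
}
// ... consume tokens up to and including the closing delimiter ...
marker.done(AsciiDocElementTypes.BLOCK);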

Sections

Sections build on top of blocks. They can have pre-block elements as well.

In addition to standard blocks, they build a hierarchy: each section has a level determined by the number of equal signs at its start (or, if it is an old-style heading, by the character underlining the heading).

Whenever a section with the same level as the previous one starts, the previous section needs to be closed. Whenever a section starts that is higher up in the hierarchy (let’s assume two equal signs at the start, like ==), all open sections that are deeper in the hierarchy (in this case with three or more equal signs at the start) must be closed. This logic is encapsulated in closeSections(); a sketch follows below. It is also called at the end of the document to close all sections that are still open.
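A sketch of this logic, using a stack of currently open sections (the names are illustrative; the real implementation lives in AsciiDocParserImpl):

import java.util.ArrayDeque;
import java.util.Deque;

private static class OpenSection {
  final int level;
  final PsiBuilder.Marker marker;
  OpenSection(int level, PsiBuilder.Marker marker) {
    this.level = level;
    this.marker = marker;
  }
}

private final Deque<OpenSection> mySections = new ArrayDeque<>();

private void closeSections(int newLevel) {
  // close every open section at the same level or deeper;
  // newLevel == 0 closes all remaining sections at the end of the document
  while (!mySections.isEmpty() && mySections.peek().level >= newLevel) {
    mySections.pop().marker.done(AsciiDocElementTypes.SECTION);
  }
}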

Debugging

To analyze the structure interactively, install the PsiViewer plugin. The plugin is pre-installed in the sandbox IDE you start using the runIde Gradle task.

You can also install it in the IDE you develop in, but this is optional.

Right-click in the AsciiDoc editor and choose PsiViewer → View PSI for entire file to browse the tree. There is also a keyboard shortcut for this.

Testing

There are unit tests for the parser. You can run them from your IDE. The tests come in two variants:

AsciiDocPsiTest

This test parses a minimal snippet of AsciiDoc, creates the PSI tree, and allows for assertions like in normal unit tests.

Use this to write specific tests. Consider a given/when/then structure to write tests that are comprehensible for other developers; a sketch follows below. As you test only specific elements of the created tree, your tests will not break when parts of the tree change that are irrelevant to the tested functionality.
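A sketch of such a test (configureByAsciiDoc() stands in for the helper the test class uses to parse a snippet, and getTitle() is an assumed accessor on AsciiDocSection):

import com.intellij.psi.PsiFile;
import com.intellij.psi.util.PsiTreeUtil;

public void testSectionWithTitle() {
  // given: a minimal document containing one section
  PsiFile file = configureByAsciiDoc("== Section Title\n\nSome text.\n");
  // when: looking up the section in the PSI tree
  AsciiDocSection section = PsiTreeUtil.findChildOfType(file, AsciiDocSection.class);
  // then: the section exists and carries the expected title
  assertNotNull(section);
  assertEquals("Section Title", section.getTitle());
}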

AsciiDocParserTest

This test acts on example files in /testData/parser together with a known good file.

To write a new test, create a new method in the class (like testSectionsWithPreBlock()). Then put a matching AsciiDoc file into the example file directory (like sectionsWithPreBlock.adoc). When you run the test for the first time, it will create a known good file (like sectionsWithPreBlock.txt). Check the contents of the known good file to verify that the result matches your expectations.

On consecutive runs, the test will compare the parser result with the contents of the known good file. If the content matches, the test will pass. If there are differences, the test will fail. If you expected these differences, for example, because you changed the parser or lexer, copy the result shown in your IDE to the known good file.

Remember to check the known good file into the Git repository!

So why are there two types of tests? Each has its own strengths!

The known good approach will trigger even on minor changes to the output and gives you the chance to approve or reject the changes. The downside is that these tests also fail on unrelated changes because they check too many things. For a known good test, it is also hard to see which parts of the known good file are relevant for the expected behavior and must not change.

The test with single assertions will be most specific to the described functionality and will leave out parts that are unrelated to the test. Therefore, it will not break on unrelated changes. Meaningful assertions allow fellow developers to understand the expected functionality. Writing such a test is often slower, as it requires more code and skill, but it pays off by breaking less often due to unrelated changes.

Interacting with PsiElements

References and renaming

All PsiElements that reference files (like, for example, an include::[]) or IDs (like, for example, an <<id>>) return references. Examples of this are AsciiDocBlockMacro and AsciiDocRef. They all need to provide a Manipulator that IntelliJ calls when the user renames such a reference; a sketch follows below. To make the “Find Usages” functionality work, the tokens that contain the IDs need to be part of the identifier token set in AsciiDocWordsScanner.
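A sketch of such a manipulator (AbstractElementManipulator is the standard IntelliJ base class; how the new content is written back to the element is illustrative):

import com.intellij.openapi.util.TextRange;
import com.intellij.psi.AbstractElementManipulator;
import org.jetbrains.annotations.NotNull;

public class AsciiDocRefManipulator extends AbstractElementManipulator<AsciiDocRef> {
  @Override
  public AsciiDocRef handleContentChange(@NotNull AsciiDocRef element,
                                         @NotNull TextRange range,
                                         String newContent) {
    // replace the ID inside the element's text;
    // setName() is an assumed helper that rewrites the element
    String newText = range.replace(element.getText(), newContent);
    return (AsciiDocRef) element.setName(newText);
  }
}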

TODO: refactoring, folding, autocompletion