Lexing and parsing AsciiDoc files
Lexing and parsing input files is the foundation of both highlighting the code in the editor and building auto-completion and refactoring functionality.
Lexing of input files
Lexing chops the input file into a stream of tokens. Each token has a type and a snippet of characters.
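As a simplified illustration, a short line of AsciiDoc might be chopped into tokens like this (the token names here are illustrative; see AsciiDocTokenTypes for the real ones):

    Input:  some *bold* text
    Tokens: TEXT("some ") BOLD_START("*") BOLD("bold") BOLD_END("*") TEXT(" text")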
The standard lexing tool for IntelliJ plugins is JFlex.
The heart of lexing is asciidoc.flex.
It defines multiple lexer states and uses a fair amount of functionality and tweaking to parse AsciiDoc.
Developers can add new token types in AsciiDocTokenTypes.
Make sure to update the list TOKENS_TO_MERGE if consecutive tokens of the identical type should be merged into one token.
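As a sketch of how this merging might be wired up (assuming the generated JFlex lexer is wrapped in IntelliJ's MergingLexerAdapter, which merges consecutive tokens whose types are listed in the given TokenSet; the class name _AsciiDocLexer stands for the generated lexer):

    import com.intellij.lexer.FlexAdapter;
    import com.intellij.lexer.MergingLexerAdapter;

    // Sketch: wrap the generated lexer so that consecutive tokens of the
    // types listed in TOKENS_TO_MERGE are merged into a single token.
    public class AsciiDocLexer extends MergingLexerAdapter {
      public AsciiDocLexer() {
        super(new FlexAdapter(new _AsciiDocLexer(null)),
            AsciiDocTokenTypes.TOKENS_TO_MERGE);
      }
    }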
If the content of a token type should be spell-checked, add it to the list of tokens in AsciiDocSpellcheckingStrategy.
Once you have changed asciidoc.flex, run gradlew compileJava to re-generate the lexer code.
A test suite for the lexer is in AsciiDocLexerTest.
I recommend running it from the IDE.
Each test case calls doTest(...), which lexes one snippet of AsciiDoc and compares the result to an expected “known good” result.
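A test case might look like this (a sketch; the token names and the exact format of the expected string are illustrative, so let a failing test show you the real output first):

    public void testBoldWord() {
      // Lex a snippet and compare the printed token stream
      // to the expected "known good" string.
      doTest("*bold*",
          "AsciiDoc:BOLD_START ('*')\n" +
          "AsciiDoc:BOLD ('bold')\n" +
          "AsciiDoc:BOLD_END ('*')");
    }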
A typical developer workflow for enhancing the lexer looks like this:
1. Change asciidoc.flex in the IDE, adding new entries to AsciiDocTokenTypes as needed.
2. Run gradlew compileJava on the command line.
3. Add a test case to AsciiDocLexerTest and run it from the IDE.
4. If lexing does not yet work as expected, repeat from step 1.
5. If lexing returns the expected result, update the expected parameter in the test.
There is a ready-to-use run configuration in IntelliJ called “AsciiDocLexerTest” that will re-generate the lexer and run the tests in one go.
Here are some things to consider when lexing AsciiDoc with JFlex, together with example rules and explanations:
- Look-ahead rules
  Look-ahead rules are considered slow in JFlex, but they give the power to recognize tokens only when there is a matching closing token. A slash (/) separates the matching pattern from the look-ahead.
  Example of parsing typographic quotes:

      {TYPOGRAPHIC_QUOTE_START} / [^\n \t] {WORD}* {TYPOGRAPHIC_QUOTE_END}
- End of line and end of file parsing
  JFlex supports $ to describe the end of a line as look-ahead, but this doesn't work at the end of a file. To match the end of a file, the lexer uses the fact that JFlex matches the longest rule first, including any look-ahead. So there are three rules: one that matches the delimiter at the end of a line, one that matches it when more content follows on the same line, and a plain rule without look-ahead that, as the shortest match, only wins at the end of the file.
  Example: matching a delimiter at the end of the file:

      // delimiter at end of line
      {LISTING_BLOCK_DELIMITER} $ { /* ... */ }
      // delimiter not at end of line
      {LISTING_BLOCK_DELIMITER} / [^\-\n \t] { /* ... */ }
      // delimiter at end of file (shorter match than the two above)
      {LISTING_BLOCK_DELIMITER} { /* ... */ }
- Stateful parser
  To parse bold, italic and monospace text, which can be nested, a set of boolean variables memorizes the current text style. They are reset at the end of a block, as in a regular AsciiDoc file. The function textFormat() uses them to determine the current token type from the combination of these flags (see the first sketch after this list).
  Other states memorize the length of a block separator line to find the matching closing separator.
- Qualifying matches, push back and state change
  After a match, the Java code checks additional conditions, for example whether this is an unconstrained position in the stream. If the code decides to discard the match, two possible strategies out of several are:
  - Push back all but the first character, and return the token type for the single character. As an example, when a double-asterisk occurs but no bold text is to end here, see {DOUBLEMONO} in the lexer.
  - Push back the complete text and continue in a different state using yybegin() (for example when matching a {HEADING_OLDSTYLE} in the MULTILINE state).
  Some expressions can be prefixed with a backslash (\) to escape them. Use isEscaped() to check whether a match has been escaped.
  Unfortunately, the lexer can't continue with other matches in the same state. To work around this, blocks are parsed first in state MULTILINE, then in state SINGLELINE, and finally in INSIDE_LINE, to implement a hierarchy and some ordering of matches. A sketch of the push-back strategy follows after this list.
- Auto-completion
  The expressions described above match only once their closing syntax is complete, which is essential for correct highlighting. To support auto-completion, the matching must also handle an expression where only the left part exists.
  There is a special case in the lexer to support auto-completion, as IntelliJ inserts a special string when parsing the content for auto-completion (named auto-complete in our parser).
  In the case of references (<<ref>>) there are two rules, one for regular parsing and highlighting, and one for auto-completion:

      // regular
      {REFSTART} / [^>\n]+ {REFEND} { yybegin(REF); return AsciiDocTokenTypes.REFSTART; }
      // auto-complete
      {REFSTART} / [^>\n ]* {AUTOCOMPLETE} { yybegin(REFAUTO); return AsciiDocTokenTypes.REFSTART; }
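The stateful approach from above might look like this in the Java code section of asciidoc.flex (a minimal sketch; the flag and token names are illustrative):

    // Flags memorizing which text styles are currently open;
    // they are reset at the end of a block.
    private boolean bold;
    private boolean italic;
    private boolean mono;

    // Derive the token type for plain text from the combination of flags.
    private IElementType textFormat() {
      if (bold && mono) { return AsciiDocTokenTypes.MONOBOLD; }
      if (bold) { return AsciiDocTokenTypes.BOLD; }
      if (italic) { return AsciiDocTokenTypes.ITALIC; }
      if (mono) { return AsciiDocTokenTypes.MONO; }
      return AsciiDocTokenTypes.TEXT;
    }

And here is a sketch of the push-back strategy in a JFlex action (the rule, the qualifying check, and the token name are illustrative; yypushback() and yylength() are standard JFlex API):

    // "**" matched, but no bold text is allowed to end here:
    // keep only the first "*" and continue lexing after it.
    {DOUBLEBOLD} {
      if (!isUnconstrainedEnd()) {     // illustrative qualifying check
        yypushback(yylength() - 1);    // push back all but the first char
        return textFormat();           // single "*" keeps the current style
      }
      bold = false;
      return AsciiDocTokenTypes.BOLDEND; // illustrative token name
    }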
Highlighting
Highlighting is coloring the text in the editor.
The file AsciiDocSyntaxHighlighter maps each entry in AsciiDocTokenTypes produced during lexing to a TextAttributesKey.
Currently, several tokens share the same highlighting ASCIIDOC_MARKER, so users get the same color for the pointy brackets around references (<<ref>>) and the markers for bold text (*bold*).
Once you add a new TextAttributesKey, either:
- reference an existing color (like ASCIIDOC_COMMENT references DefaultLanguageHighlighterColors.LINE_COMMENT), or
- add a color to the AsciiDoc themes AsciidocDefault.xml and AsciidocDarcula.xml.
Once you add a new token, you will need to add it to AsciiDocColorSettingsPage so users can customize the colors of their theme.
This class also references SampleDocument.adoc and AsciiDocBundle.properties, so you will probably need to change these two files as well.
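A sketch of how such a key and its mapping typically look in a SyntaxHighlighterBase subclass (the token name LINE_COMMENT is illustrative; see AsciiDocSyntaxHighlighter for the real code):

    import com.intellij.openapi.editor.DefaultLanguageHighlighterColors;
    import com.intellij.openapi.editor.colors.TextAttributesKey;
    import com.intellij.psi.tree.IElementType;

    // A key that falls back to the platform's line-comment color.
    public static final TextAttributesKey ASCIIDOC_COMMENT =
        TextAttributesKey.createTextAttributesKey(
            "ASCIIDOC_COMMENT", DefaultLanguageHighlighterColors.LINE_COMMENT);

    // Map a token type from the lexer to its highlighting key(s).
    @Override
    public TextAttributesKey[] getTokenHighlights(IElementType tokenType) {
      if (tokenType == AsciiDocTokenTypes.LINE_COMMENT) { // illustrative token
        return pack(ASCIIDOC_COMMENT); // pack() comes from SyntaxHighlighterBase
      }
      return TextAttributesKey.EMPTY_ARRAY;
    }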
Parsing
Why
Parsing gives a hierarchical structure and meaning to the tokens created in the lexing phase.
It can define PsiElements inside the tree to allow interactions with the user like renaming of elements and autocompletion.
The structure is the foundation of the structure outline view and the folding capabilities.
How
The AsciiDocParserDefinition separates spaces and comments from functional tokens.
It also serves as a factory for all PsiElements like AsciiDocSection for sections and AsciiDocBlock for blocks.
AsciiDocParserImpl encodes the logic of how to group the tokens into a tree.
To do this, it has several strategies.
This outline summarizes the most distinct ones:
- References
  Once the parser sees the start token REFSTART (usually two opening pointy brackets, like <<), it sets a marker. Then it reads all tokens that are valid inside a reference. Once there are no more valid tokens for a reference, it marks this block as an AsciiDocElementTypes.REF.
- Blocks
  A block starts, for example, with a LISTING_BLOCK_DELIMITER (usually four dashes in a line, like ----). Then the block continues up to the point where the same delimiter occurs again.
  But the block can be preceded, for example, by a title (it starts with a dot, followed by the title itself, like .Title). This title is part of the block. To support this, TITLE and several other elements call markPreBlock() to memorize the first token that is part of a following block. It is stored in the variable myPreBlockMarker.
  When parsing of the block starts and myPreBlockMarker is set, the parser uses this marker. If the marker is not set, it creates a new marker at the start of the block delimiter. When the block doesn't start on one of the following lines, dropPreBlock() drops the marker.
- Sections
  Sections build on top of blocks. They can have pre-block elements as well.
  In addition to standard blocks, they build a hierarchy: each section has a level determined by the number of equal signs at the start (or, if it is an old-style heading, by the character underlining the heading).
  Whenever a section with the same level as the one before starts, the previous section needs to be closed. Whenever a section of a higher order starts (let's assume two equal signs at the start, like ==), all open sections with a lower order (in this case with three or more equal signs at the start) must be closed. This logic is encapsulated in closeSections(). It is also called at the end of the document to close all open sections.
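A sketch of the marker pattern using IntelliJ's PsiBuilder API, here for references (simplified; AsciiDocParserImpl checks more conditions, and the helper isValidInsideRef() is illustrative):

    import com.intellij.lang.PsiBuilder;

    // Simplified: group a run of tokens into a REF node of the tree.
    private void parseRef(PsiBuilder builder) {
      PsiBuilder.Marker marker = builder.mark(); // remember the start
      builder.advanceLexer();                    // consume REFSTART
      while (isValidInsideRef(builder.getTokenType())) {
        builder.advanceLexer();                  // consume tokens inside the reference
      }
      marker.done(AsciiDocElementTypes.REF);     // close the node
    }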
Debugging
To analyze the structure interactively, install the PsiViewer plugin.
The plugin is pre-installed in the sandbox IDE you start using the runIde Gradle task.
You can also install it in the IDE you develop in, but this is optional.
Right-click in the AsciiDoc editor and choose the PsiViewer entry to browse the tree. There is also a keyboard shortcut for this.
Testing
There are unit tests for the parser. You can run them from your IDE. The tests come in two variants:
- AsciiDocPsiTest
  This test parses a minimal snippet of AsciiDoc, creates the PSI tree, and allows for assertions like in normal unit tests.
  Use this to write specific tests. Consider a given/when/then structure to write tests that are comprehensible for other developers. As you test only specific elements of the created tree, your tests will not break when parts of the tree change that are irrelevant to the tested functionality. A sketch of such a test follows at the end of this section.
- AsciiDocParserTest
  This test acts on example files in /testData/parser together with a known good file.
  To write a new test, create a new method in the class (like testSectionsWithPreBlock()). Then put a matching AsciiDoc file into the example file directory (like sectionsWithPreBlock.adoc). When you run the test for the first time, it will create a known good file (like sectionsWithPreBlock.txt). Check the contents of the known good file to see whether the result matches your expectations.
  On consecutive runs, the test will compare the parser result with the contents of the known good file. If the content matches, the test will pass. If there are differences, the test will fail. If you expected these differences, for example because you changed the parser or lexer, copy the result shown in your IDE to the known good file.
  Please check in the known good file to the Git repository!
So why are there two types of tests? Each has its own strengths!
The known good approach will trigger even on minor changes to the output and gives you the chance to approve or reject the changes. The downside is that these tests will fail on unrelated changes because they check too many things. For a known good test, it is also hard to see which parts of the known good file are relevant for the expected behavior and must not change.
The test with single assertions will be most specific to the described functionality, and will leave out parts that are unrelated to the test. Therefore, it will not break for unrelated changes. Meaningful assertions allow fellow developers to understand the expected functionality. Writing such a test is often slower as it requires more code and skill, but it will pay off as it will break less often due to unrelated changes.
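A sketch of what a test in AsciiDocPsiTest might look like (the helper configureByText(), the accessor getTitle(), and the snippet are illustrative; PsiTreeUtil is IntelliJ's standard tree-walking utility):

    public void testSectionWithTitle() {
      // given: a minimal AsciiDoc snippet parsed into a PSI tree
      PsiFile file = configureByText("== My Section\n\nSome content.");

      // when: looking up the section element in the tree
      AsciiDocSection section = PsiTreeUtil.findChildOfType(file, AsciiDocSection.class);

      // then: the section exists and carries the expected title
      assertNotNull(section);
      assertEquals("My Section", section.getTitle());
    }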
Interacting with PsiElements
References and renaming
All PsiElements that reference files (like, for example, an include::[]) or IDs (like, for example, an <<id>>) return references.
Examples of this are AsciiDocBlockMacro and AsciiDocRef.
They all need to provide a Manipulator that IntelliJ calls when the user renames such a reference.
To make the “Find References” functionality work, the tokens that contain the IDs need to be part of the identifier token set in AsciiDocWordsScanner.
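A minimal sketch of such a manipulator (assuming AsciiDocRef offers an updateText()-style helper that re-creates the element; the real implementation lives next to the element classes):

    import com.intellij.openapi.util.TextRange;
    import com.intellij.psi.AbstractElementManipulator;

    // IntelliJ calls this when the user renames the target of a reference.
    public class AsciiDocRefManipulator extends AbstractElementManipulator<AsciiDocRef> {
      @Override
      public AsciiDocRef handleContentChange(AsciiDocRef element, TextRange range, String newContent) {
        // Splice the new ID into the element's text; updateText() is an
        // illustrative helper that replaces the element in the tree.
        String newText = range.replace(element.getText(), newContent);
        return element.updateText(newText);
      }
    }

Manipulators are registered in plugin.xml via the lang.elementManipulator extension point.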
TODO: refactoring, folding, autocompletion