License: Public Domain
Owner: Stou S.

How to write a Pygments lexer

A short tutorial on adding more builtins to your favorite language.

The article title is somewhat deceptive, since what you will actually be doing is "hacking" an existing lexer.

1   The Basics

A lexer is a state machine that parses the program text and marks specific sequences of characters with predefined tokens. If you were writing a compiler, the tokens would probably be used to build an Abstract Syntax Tree (a tree representation of the program), which is then used to generate the compiler output. In the Pygments case, [from what I understand] the tokens are used directly by the formatter to generate the highlighted output.
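To see what these tokens look like in practice, you can run a small snippet through the stock PythonLexer and print the stream it produces (a quick sketch; the snippet and its output are just an illustration, and any bundled lexer works the same way):

```python
from pygments.lexers import PythonLexer

# get_tokens() yields (token_type, value) pairs for the whole input.
code = "print(42)"
for token, value in PythonLexer().get_tokens(code):
    print(token, repr(value))
```

Each token type (Name.Builtin, Punctuation, and so on) is what the formatter later maps to a color or CSS class.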

In general the lexer you are planning to write will fit in one of the following three categories:

  1. More builtins for an existing language
This would be useful for embedded languages such as Blender's Python.
  2. Custom extensions to a standard language
For lack of a better example, this type of lexer would be useful for something like J++, the polluted Java that Microsoft released in the late 1990s.
  3. A new language
Everything else.

2   Writing a Lexer

Because the Pygments lexer development page is a very good resource, for now I will only cover adding builtins, below. If you need to do something more complicated, refer to the Pygments documentation.
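If you really are in the "new language" camp, the usual starting point in Pygments is the RegexLexer base class: you declare named states, each containing (regex, token) rules. As a rough sketch (the MiniConf language and all of its rules are made up here for illustration):

```python
from pygments.lexer import RegexLexer
from pygments.token import Comment, Keyword, Name, Operator, Text

class MiniConfLexer(RegexLexer):
    """Hypothetical lexer for a toy 'key = value' config language."""
    name = 'MiniConf'
    aliases = ['miniconf']
    filenames = ['*.mconf']

    tokens = {
        # The lexer starts in the 'root' state and tries each rule in order.
        'root': [
            (r'#[^\n]*', Comment.Single),       # line comments
            (r'\s+', Text),                     # whitespace
            (r'=', Operator),
            (r'[A-Za-z_][A-Za-z0-9_]*', Name),  # keys and bare values
        ],
    }
```

Any text not matched by a rule is emitted as an Error token, which makes gaps in your rules easy to spot while testing.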

2.1   Adding builtins

Adding some extra builtins to your favorite language is by far the easiest thing you can do; it is basically a copy/paste job. It is done by subclassing the parent language's Lexer and marking the correct 'keywords' as keyword tokens.

The following example from the Pygments documentation illustrates the point very well:

    from pygments.lexers.agile import PythonLexer
    from pygments.token import Name, Keyword

    class MyPythonLexer(PythonLexer):
        EXTRA_KEYWORDS = ['foo', 'bar', 'foobar', 'barfoo', 'spam', 'eggs']

        def get_tokens_unprocessed(self, text):
            for index, token, value in PythonLexer.get_tokens_unprocessed(self, text):
                if token is Name and value in self.EXTRA_KEYWORDS:
                    yield index, Keyword.Pseudo, value
                else:
                    yield index, token, value

Just add your keywords to the EXTRA_KEYWORDS list and you are set.
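For completeness, here is how the subclass might be used end to end with pygments.highlight(); the class body is repeated from the example above so this sketch is self-contained, and the spam/eggs input is purely illustrative:

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers.agile import PythonLexer
from pygments.token import Name, Keyword

class MyPythonLexer(PythonLexer):
    EXTRA_KEYWORDS = ['spam', 'eggs']

    def get_tokens_unprocessed(self, text):
        # Re-tag plain Name tokens that match our extra builtins.
        for index, token, value in PythonLexer.get_tokens_unprocessed(self, text):
            if token is Name and value in self.EXTRA_KEYWORDS:
                yield index, Keyword.Pseudo, value
            else:
                yield index, token, value

# Pass an instance of the subclass wherever a lexer is expected.
html = highlight("spam(eggs)", MyPythonLexer(), HtmlFormatter())
print(html)
```

In the generated HTML, 'spam' and 'eggs' come out with the Keyword.Pseudo CSS class instead of the plain Name styling.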