Iterable Parser Combinators for fast parsing in pure Julia

JuliaCon 2020


I introduce the CombinedParsers package for writing complex recursive parsers efficiently in a composable functional style. The package API will be demonstrated by example of an CombinedParser for regular expressions which generates compiled regular expression parsers in pure julia.

Jul 31, 2020 1:00 PM — 3:00 PM
JuliaCon 2020

Far more expressive than regular expressions, parser combinators allow for arbitrary transformations and higher-order parsers depending on the parsing state (exemplified with a very short html parser).

Parsing data from strings recurrently is at the beginning of scientific computing and thus regular expressions are a familiar part of standard tooling. CombinedParsers constructors will be presented side-by-side with the equivalent regex syntax. The regex parser provided with the package can be used as a pure julia plug-in replacement for the current julia Regex type.

Benchmarks and compliance with PCRE syntax will be reported based on the extensive unit tests of the PCRE library. Leveraging julia compiler optimizations for multiple dispatch and parametric types, CombinedParsers performance can for many patterns compete with the PCRE C library that currently is used by julia base Regex.

Arbitrary transformations can be defined as part of the grammar definition, with convenient syntax for extracting data as named tuples. For optimized performance, parsing and transformation are decoupled, and parsing memoization can be used optionally.

Parser combinators straightforwardly generalize from strings to parsing any iterator type. Logging and human-readable error messages help debugging complex parsers.

CombinedParsers supports the iterate interface to lazily generate all valid parsings, and the TextParse interface to include CombinedParsers e.g. in CSV.jl. Preliminary packages for parsing wikitext and orgmode markup with ParserIterators are available.

Other parsing packages (Automa.jl, ParserCombinator.jl) will be acknowledged. Current limitations and considerations for further optimization will be discussed.