To learn more about both Lua and Zig, I've started writing a Lua 5.1 implementation in Zig, beginning with the lexer (i.e. separating a source file into tokens). Lua's lexer, however, does not have any test cases, and its lexing API is not easily separated from the Lua parser, so when writing test cases for a new, compatible implementation, it's hard to be certain that you've covered all the edge cases. This is especially true for Lua, since any 8-bit value is technically a valid character in `.lua` source files (embedded `NUL` characters, other control characters, you name it).

After writing some obvious test cases based on my reading of the Lua lexer's source code, I decided to try using fuzz testing to find any edge cases I wasn't accounting for, and I ended up with both some surprising and not-so-surprising results.
libFuzzer uses various techniques to generate a random set of inputs that maximizes the code coverage of the fuzz-tested code (in this case, the Lua lexer). After fuzzing for a while, I took those generated inputs and ran them back through the Lua lexer, generating corresponding output files that consist of the list of lexed tokens and the resulting error, if any. For example:
Input:

```lua
local hello = "world"
```

Output:

```
local <name> = <string> <eof>
```

Input:

```lua
local hello = "world
```

Output:

```
local <name> =
[string "fuzz"]:1: unfinished string near '"world'
```
With these pairs of input/output files, I can simply make my lexer implementation generate similar output, and then compare its output with Lua's for each input file. Any discrepancy (different tokens, different errors, errors at different locations, etc.) is then an opportunity to figure out what's happening, write a minimal reproduction test case, and fix it. Once all of the discrepancies are ironed out, we can be reasonably sure that the implementation is compatible with the reference implementation.
See squeek502/fuzzing-lua for the full Lua lexer fuzzing implementation.
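To give an idea of what the fuzzing side looks like, here is a minimal libFuzzer entry point that hands each generated input to Lua's frontend. This is just a sketch using only Lua's public `luaL_loadbuffer` API (which drives the lexer as part of parsing); the actual harness in squeek502/fuzzing-lua additionally dumps the lexed tokens in the format shown above:

```c
#include <stdint.h>
#include <stddef.h>
#include <lua.h>
#include <lauxlib.h>

/* Minimal libFuzzer entry point: luaL_loadbuffer lexes and parses the chunk
   without running it, and the chunk name "fuzz" matches the error messages
   shown above. Build with clang -fsanitize=fuzzer and link against Lua 5.1. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    lua_State *L = luaL_newstate();
    if (L == NULL)
        return 0;
    luaL_loadbuffer(L, (const char *)data, size, "fuzz");
    lua_close(L);
    return 0;
}
```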
The above changes could have been caught without fuzzing (given enough scrutiny/time), but there was one additional edge case that likely would not have been caught without fuzz testing, due to how rarely it affects normal source code. It turns out that the Lua 5.1 lexer has a bug in its `check_next` function: when it is looking ahead at the next character and checking that it is within some set of expected characters, it accidentally accepts `'\0'`/`NUL` characters as well (due to its unchecked use of `strchr`; this has been fixed for Lua 5.2+).
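For reference, `check_next` in Lua 5.1's llex.c looks roughly like this (comments mine):

```c
/* Paraphrased from Lua 5.1's llex.c: advance only if the current character
   is one of the expected characters in `set`. */
static int check_next (LexState *ls, const char *set) {
  /* Bug: when ls->current is '\0', strchr(set, '\0') matches set's
     terminating NUL and returns a non-NULL pointer, so an embedded NUL
     is accepted as if it were one of the expected characters. */
  if (!strchr(set, ls->current))
    return 0;
  save_and_next(ls);  /* consume the character and add it to the token buffer */
  return 1;
}
```

Lua 5.2+ fixes this by ensuring that a `'\0'` current character can never match.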
Luckily, `check_next` is only used in a few places in the Lua 5.1 lexer:
- `.` characters in concat (`..`) and ellipsis (`...`) tokens
- `e`/`E` exponent markers in number tokens
- `-`/`+` to denote exponent signed-ness in number tokens

This means that Lua's lexer will tokenize the following (where `\0` is the NUL character):
- `.\0` will lex to `..`
- `..\0` will lex to `...`
- `1.5\0-1` will lex to `1.5` (internally, the lexer's state will 'think' the token is `1.5e-1`, but the finished token's string will be treated as NUL-terminated when converting from string to double)

This behavior can be verified by doing the following:
```
$ printf 'print("con".\0"cat")' > concat.lua
$ lua51 concat.lua
concat
$ lua53 concat.lua
lua53: concat.lua:1: ')' expected near '.'
```
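The `strchr` quirk at the root of this is easy to demonstrate in isolation: the C standard considers the terminating NUL to be part of the string being searched, so searching for `'\0'` always succeeds. A small standalone example:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* strchr treats the terminating NUL as part of the searched string,
       so looking for '\0' always returns a non-NULL pointer (to the
       terminator itself), which is exactly what check_next trips over. */
    const char *set = "Ee";
    printf("%d\n", strchr(set, '\0') != NULL);  /* prints 1 */
    printf("%d\n", strchr(set, 'x') != NULL);   /* prints 0 */
    return 0;
}
```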
Because `.lua` source files rarely actually have embedded NULs (especially outside of string literals), very few people have likely ever run into this particular edge case, but if absolute compatibility with a reference implementation is a goal, then such edge cases have to be taken into account. That's not a goal for my project, but it's still illustrative of the depth of the test cases that fuzzing can bubble up, and it has allowed me to make `check_next`-bug compatibility an option in my implementation.
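As a rough illustration of what such an option can look like, here is a hypothetical C-style sketch (the helper name and `lua51_compat` flag are invented for this example; my actual implementation is in Zig):

```c
/* Hypothetical sketch: check_next with an opt-in Lua 5.1 compatibility
   flag. With lua51_compat set, an embedded NUL is accepted via strchr's
   terminating-NUL match, reproducing the 5.1 bug; otherwise it is
   rejected up front, matching Lua 5.2+ behavior. */
static int check_next_compat (LexState *ls, const char *set, int lua51_compat) {
  if (ls->current == '\0' && !lua51_compat)
    return 0;
  if (!strchr(set, ls->current))
    return 0;
  save_and_next(ls);
  return 1;
}
```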