Description
Nim UTF-8 Issue
The validateUtf8 procedure in Nim's unicode module accepts invalid byte sequences as valid UTF-8.
Issues:
- it allows code points above U+10FFFF
- it allows surrogates (U+D800 to U+DFFF)
- it allows overlong sequences
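To make the three checks concrete, here is a minimal sketch of a stricter validator. The name strictValidateUtf8 is hypothetical and this is not how the standard library is implemented; it only mirrors the validateUtf8 convention of returning -1 for valid input, and it returns the index of the first byte of the offending sequence otherwise.

```nim
# Hypothetical sketch, not the standard library's code.
# Returns -1 when s is well-formed UTF-8, otherwise the index of the
# first byte of the first ill-formed sequence.
proc strictValidateUtf8(s: string): int =
  # Smallest code point that legitimately needs 1, 2, 3, or 4 bytes
  # (index 0 is unused).
  const minForLen = [0, 0x0, 0x80, 0x800, 0x10000]
  var i = 0
  while i < s.len:
    let b = ord(s[i])
    var n: int    # total number of bytes in this sequence
    var cp: int   # code point being decoded
    if b < 0x80:
      n = 1
      cp = b
    elif (b and 0xE0) == 0xC0:
      n = 2
      cp = b and 0x1F
    elif (b and 0xF0) == 0xE0:
      n = 3
      cp = b and 0x0F
    elif (b and 0xF8) == 0xF0:
      n = 4
      cp = b and 0x07
    else:
      return i                    # stray continuation byte or F8..FF
    if i + n > s.len:
      return i                    # sequence is cut off by end of string
    for k in 1 ..< n:
      let c = ord(s[i + k])
      if (c and 0xC0) != 0x80:
        return i                  # expected a continuation byte
      cp = (cp shl 6) or (c and 0x3F)
    if cp < minForLen[n]:
      return i                    # overlong encoding
    if cp > 0x10FFFF:
      return i                    # beyond the Unicode range
    if cp >= 0xD800 and cp <= 0xDFFF:
      return i                    # surrogate
    i += n
  return -1

doAssert strictValidateUtf8("\xf7\xbf\xbf\xbf") == 0   # too big
doAssert strictValidateUtf8("\xe0\x80\xaf") == 0       # overlong
doAssert strictValidateUtf8("\xed\xa0\x80") == 0       # surrogate
doAssert strictValidateUtf8("\xe2\x82\xac") == -1      # U+20AC, valid
```

The sketch decodes first and then range-checks the code point, so each of the three issues maps to one explicit test at the end of the loop.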
Prompted by this issue, I created a Nim project to test UTF-8 decoders. It's on GitHub: https://github.com/flenniken/utf8tests
If you clone it and run the following command, you will see the tests that fail:
bin/utf8tests -e=utf8tests.txt -s=artifacts/utf8.skip.nim.1.4.8.txt | grep invalid | awk -F: '{print $1}'
Tests that fail:
too big:
6.0
6.2
6.4
6.5
overlong:
22.3
22.4
23.1
23.2
37.1
37.2
surrogate:
24.0
24.0.1
24.0.2
24.2
24.3
24.4
24.5
24.6
24.7
25.0
25.1
25.2
25.3
25.4
25.5
25.6
25.7
Below are three sample test cases. They are commented out in the test_unicodes.nim file.
# too big U+001FFFFF, <F7 BF BF BF>
# 6.0:invalid hex:F7 BF BF BF:nothing:EFBFBD EFBFBD EFBFBD EFBFBD
check validateUtf8("\xf7\xbf\xbf\xbf") == 0
# overlong solidus <e0 80 af>
# 22.3:invalid hex:e0 80 af:nothing:EFBFBD EFBFBD EFBFBD
check validateUtf8("\xe0\x80\xaf") == 0
# 1 surrogate U+D800, <ed a0 80>
# 24.0:invalid hex:ed a0 80:nothing:EFBFBD EFBFBD EFBFBD
check validateUtf8("\xed\xa0\x80") == 0
And here is the result of running the tests:
nim c --gc:orc --verbosity:0 --hint[Performance]:off --hint[XCannotRaiseY]:off -d:test -r -p:src --out:bin/test_unicodes.bin tests/test_unicodes.nim [OSError] validateUtf8
............................
[Suite] unicodes.nim
/Users/steve/code/utf8tests/tests/test_unicodes.nim(157, 43): Check failed: validateUtf8("????") == 0
validateUtf8("????") was -1
/Users/steve/code/utf8tests/tests/test_unicodes.nim(161, 39): Check failed: validateUtf8("???") == 0
validateUtf8("???") was -1
/Users/steve/code/utf8tests/tests/test_unicodes.nim(165, 39): Check failed: validateUtf8("???") == 0
validateUtf8("???") was -1
[FAILED] validateUtf8
Error: execution of an external program failed: '/Users/steve/code/utf8tests/bin/test_unicodes.bin '[OSError]' validateUtf8'
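The overlong solidus issue is a potential security problem: a filter that scans for the "/" byte (0x2F) will not see <e0 80 af>, yet a lenient decoder further down the line decodes it to "/".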
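The arithmetic behind the overlong solidus can be sketched as follows. The helper naiveDecode3 is hypothetical and stands in for any decoder that masks bits without range checks.

```nim
import std/strutils

# Hypothetical helper: decode a three-byte sequence by bit masking only,
# with no check that the result really needs three bytes.
proc naiveDecode3(b0, b1, b2: int): int =
  ((b0 and 0x0F) shl 12) or ((b1 and 0x3F) shl 6) or (b2 and 0x3F)

let cp = naiveDecode3(0xE0, 0x80, 0xAF)
doAssert cp == 0x2F          # the same code point as the one-byte "/"
doAssert chr(cp) == '/'

# A byte-level scan for "/" finds nothing in the overlong form.
doAssert not "\xe0\x80\xaf".contains('/')
```

A validator that rejects any three-byte sequence decoding to a value below U+0800 closes this gap.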
If you want to fix it that way, you can use my code in the unicodes.nim module; it is MIT licensed.
I suggest you consider adding the sanitizeUtf8 procedure to Nim's standard library. With validateUtf8 alone, you cannot replace invalid sequences correctly, following best practices, without writing your own decoder.
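For illustration, here is a rough sketch of what such a sanitizing pass could look like. It is not the sanitizeUtf8 from unicodes.nim; the names scanSeq and sanitizeSketch are hypothetical. It assumes that "best practices" means the Unicode Standard's recommendation to replace each maximal ill-formed subpart with one U+FFFD, which agrees with the expected replacements listed in the three sample test cases.

```nim
import std/strutils

const replacement = "\xEF\xBF\xBD"   # U+FFFD encoded as UTF-8

# Hypothetical helper. When ok is true, size is the length of the
# well-formed sequence starting at i. When ok is false, size is the number
# of bytes to replace with a single U+FFFD (always at least 1).
proc scanSeq(s: string, i: int): tuple[ok: bool, size: int] =
  let b = ord(s[i])
  if b < 0x80:
    return (true, 1)
  var n: int        # expected total length of the sequence
  var lo = 0x80     # allowed range for the second byte
  var hi = 0xBF
  case b
  of 0xC2..0xDF:
    n = 2
  of 0xE0:
    n = 3
    lo = 0xA0       # excludes overlong three-byte forms
  of 0xE1..0xEC, 0xEE..0xEF:
    n = 3
  of 0xED:
    n = 3
    hi = 0x9F       # excludes surrogates
  of 0xF0:
    n = 4
    lo = 0x90       # excludes overlong four-byte forms
  of 0xF1..0xF3:
    n = 4
  of 0xF4:
    n = 4
    hi = 0x8F       # excludes code points above U+10FFFF
  else:
    return (false, 1)   # C0, C1, F5..FF, or a stray continuation byte
  for k in 1 ..< n:
    if i + k >= s.len:
      return (false, k)           # truncated at end of input
    let c = ord(s[i + k])
    let minC = if k == 1: lo else: 0x80
    let maxC = if k == 1: hi else: 0xBF
    if c < minC or c > maxC:
      return (false, k)           # the k bytes before this one are replaced
  return (true, n)

proc sanitizeSketch(s: string): string =
  var i = 0
  while i < s.len:
    let (ok, size) = scanSeq(s, i)
    if ok:
      result.add s[i ..< i + size]
    else:
      result.add replacement
    i += size

doAssert sanitizeSketch("\xf7\xbf\xbf\xbf") == replacement.repeat(4)
doAssert sanitizeSketch("\xe0\x80\xaf") == replacement.repeat(3)
doAssert sanitizeSketch("\xed\xa0\x80") == replacement.repeat(3)
doAssert sanitizeSketch("a\xe2\x82z") == "a" & replacement & "z"
```

The last assertion shows why a plain validity check is not enough: the truncated two-byte prefix becomes one replacement character, not two, and deciding that requires knowing how far the decoder got before it failed.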
utf8CharString is another useful procedure you should consider.
Thanks,
Steve Flenniken