
The validateUtf8 procedure allows invalid byte sequences. #19333

Open
@flenniken

Nim UTF-8 Issue

The validateUtf8 procedure allows invalid byte sequences.

Issues:

  • it allows code points above U+10FFFF
  • it allows surrogates (U+D800 to U+DFFF)
  • it allows overlong sequences
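
To make the three checks concrete, here is a minimal sketch of a stricter validator. The name strictValidateUtf8 is hypothetical (it is not in Nim's standard library and is not the code from my unicodes.nim module), and it assumes the same return convention as validateUtf8: -1 for a valid string, otherwise the index where the first invalid sequence starts.

```nim
# Hypothetical helper, not part of Nim's standard library.
# Returns -1 when `s` is well-formed UTF-8, otherwise the index of the
# lead byte of the first invalid sequence.
proc strictValidateUtf8(s: string): int =
  var i = 0
  while i < s.len:
    let b0 = s[i].ord
    var need = 0        # number of continuation bytes expected
    var lo = 0x80       # allowed range for the first continuation byte
    var hi = 0xBF
    if b0 <= 0x7F:
      inc i             # ASCII
      continue
    elif b0 in 0xC2..0xDF:
      need = 1
    elif b0 == 0xE0:
      need = 2
      lo = 0xA0         # rejects overlong 3-byte forms
    elif b0 == 0xED:
      need = 2
      hi = 0x9F         # rejects surrogates U+D800..U+DFFF
    elif b0 in 0xE1..0xEF:
      need = 2
    elif b0 == 0xF0:
      need = 3
      lo = 0x90         # rejects overlong 4-byte forms
    elif b0 == 0xF4:
      need = 3
      hi = 0x8F         # rejects code points above U+10FFFF
    elif b0 in 0xF1..0xF3:
      need = 3
    else:
      return i          # 80..C1 and F5..FF never start a valid sequence
    if i + need >= s.len:
      return i          # sequence is truncated
    for k in 1..need:
      let b = s[i + k].ord
      if k == 1:
        if b < lo or b > hi: return i
      elif b < 0x80 or b > 0xBF:
        return i
    i += need + 1
  result = -1

doAssert strictValidateUtf8("\xf7\xbf\xbf\xbf") == 0  # too big
doAssert strictValidateUtf8("\xe0\x80\xaf") == 0      # overlong
doAssert strictValidateUtf8("\xed\xa0\x80") == 0      # surrogate
```

Restricting only the first continuation byte is enough here, because the lead byte and the first continuation byte together determine whether the code point falls in an overlong, surrogate, or out-of-range region.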

Prompted by this issue, I created a Nim project to test UTF-8 decoders. It's on GitHub: https://github.com/flenniken/utf8tests

If you pull it down and run the following you will see the tests that fail:

bin/utf8tests -e=utf8tests.txt -s=artifacts/utf8.skip.nim.1.4.8.txt | grep invalid | awk -F: '{print $1}'

Tests that fail:

too big:
6.0 
6.2
6.4
6.5

overlong:
22.3
22.4
23.1
23.2
37.1
37.2

surrogate:
24.0
24.0.1
24.0.2
24.2
24.3
24.4
24.5
24.6
24.7
25.0
25.1
25.2
25.3
25.4
25.5
25.6
25.7

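Below are three sample test cases. They are commented out in the test_unicodes.nim file. Each check expects 0, the index of the first invalid byte; validateUtf8 returns -1 when it considers the string valid.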

    # too big U+001FFFFF, <F7 BF BF BF>
    # 6.0:invalid hex:F7 BF BF BF:nothing:EFBFBD  EFBFBD  EFBFBD  EFBFBD
    check validateUtf8("\xf7\xbf\xbf\xbf") == 0

    # overlong solidus <e0 80 af>
    # 22.3:invalid hex:e0 80 af:nothing:EFBFBD EFBFBD EFBFBD
    check validateUtf8("\xe0\x80\xaf") == 0

    # 1 surrogate U+D800, <ed a0 80>
    # 24.0:invalid hex:ed a0 80:nothing:EFBFBD EFBFBD EFBFBD
    check validateUtf8("\xed\xa0\x80") == 0
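
To show why these three byte sequences are labeled the way they are, the sketch below decodes them the way a decoder with no range checks would: strip the marker bits and concatenate the payload bits. The lenientDecode name is hypothetical and exists only for this illustration; it assumes its argument is a single sequence of 1 to 4 bytes.

```nim
# Hypothetical lenient decoder for one sequence of 1 to 4 bytes.
# It performs no validation at all, which is exactly the point.
proc lenientDecode(s: string): int =
  # Payload mask of the lead byte for 1-, 2-, 3- and 4-byte sequences.
  const leadMask = [0x7F, 0x1F, 0x0F, 0x07]
  result = s[0].ord and leadMask[s.len - 1]
  for k in 1 ..< s.len:
    # Each continuation byte contributes its low 6 bits.
    result = (result shl 6) or (s[k].ord and 0x3F)

doAssert lenientDecode("\xf7\xbf\xbf\xbf") == 0x1FFFFF  # above U+10FFFF
doAssert lenientDecode("\xe0\x80\xaf") == 0x2F          # "/" in 3 bytes
doAssert lenientDecode("\xed\xa0\x80") == 0xD800        # a surrogate
```

All three results are values that well-formed UTF-8 can never produce from those byte lengths, so a validator should reject the sequences at index 0.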

And here is the result of running the tests:

nim c --gc:orc --verbosity:0 --hint[Performance]:off --hint[XCannotRaiseY]:off -d:test  -r -p:src --out:bin/test_unicodes.bin tests/test_unicodes.nim [OSError] validateUtf8
............................
[Suite] unicodes.nim
    /Users/steve/code/utf8tests/tests/test_unicodes.nim(157, 43): Check failed: validateUtf8("????") == 0
    validateUtf8("????") was -1
    /Users/steve/code/utf8tests/tests/test_unicodes.nim(161, 39): Check failed: validateUtf8("???") == 0
    validateUtf8("???") was -1
    /Users/steve/code/utf8tests/tests/test_unicodes.nim(165, 39): Check failed: validateUtf8("???") == 0
    validateUtf8("???") was -1
  [FAILED] validateUtf8

Error: execution of an external program failed: '/Users/steve/code/utf8tests/bin/test_unicodes.bin '[OSError]' validateUtf8'

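The overlong solidus issue is a potential security problem: <e0 80 af> is an overlong encoding of "/" (U+002F), so a check that scans for the byte 0x2F misses it, while a lenient decoder further down the line still produces "/".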

If you want to fix it that way, you can use my code in the unicodes.nim module; it is MIT licensed.

I suggest you consider adding the sanitizeUtf8 procedure to Nim's runtime. With validateUtf8 alone, you cannot replace invalid sequences correctly, following best practices, without writing your own decoder.
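
As a rough illustration of what such a procedure has to do, here is a minimal sketch that replaces each invalid byte, or each truncated prefix of an otherwise valid sequence, with one U+FFFD. The name sanitizeUtf8Sketch is hypothetical and this is not the sanitizeUtf8 implementation from unicodes.nim; I am assuming the replacement policy that matches the expected output recorded in the test file (for example, four EFBFBD groups for F7 BF BF BF).

```nim
const replacement = "\xEF\xBF\xBD"   # U+FFFD encoded as UTF-8

# Hypothetical sketch, not the implementation from unicodes.nim.
proc sanitizeUtf8Sketch(s: string): string =
  var i = 0
  while i < s.len:
    let b0 = s[i].ord
    var need = -1       # continuation bytes expected; -1 = invalid lead
    var lo = 0x80       # allowed range for the first continuation byte
    var hi = 0xBF
    if b0 <= 0x7F:
      need = 0
    elif b0 in 0xC2..0xDF:
      need = 1
    elif b0 == 0xE0:
      need = 2
      lo = 0xA0         # no overlong 3-byte forms
    elif b0 == 0xED:
      need = 2
      hi = 0x9F         # no surrogates
    elif b0 in 0xE1..0xEF:
      need = 2
    elif b0 == 0xF0:
      need = 3
      lo = 0x90         # no overlong 4-byte forms
    elif b0 == 0xF4:
      need = 3
      hi = 0x8F         # nothing above U+10FFFF
    elif b0 in 0xF1..0xF3:
      need = 3
    if need < 0:
      result.add replacement       # byte can never start a sequence
      inc i
      continue
    # k ends up as the offset of the first bad or missing byte.
    var k = 1
    while k <= need and i + k < s.len:
      let b = s[i + k].ord
      let minB = if k == 1: lo else: 0x80
      let maxB = if k == 1: hi else: 0xBF
      if b < minB or b > maxB: break
      inc k
    if k > need:
      result.add s[i .. i + need]  # complete, well-formed sequence
    else:
      result.add replacement       # one U+FFFD for the bad prefix
    i += k

doAssert sanitizeUtf8Sketch("a\xe0\x80\xafb") ==
  "a" & replacement & replacement & replacement & "b"
```

The validation logic has to live inside the loop because the caller needs to know how many bytes each invalid prefix covers, which a single index returned by validateUtf8 does not provide.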

utf8CharString is a nice procedure you should consider too.

Thanks,

Steve Flenniken
