The validateUtf8 procedure allows invalid byte sequences.

Nim UTF-8 Issue

Issues:

* it allows code points above U+10FFFF
* it allows surrogates (U+D800 to U+DFFF)
* it allows overlong sequences
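To make the three categories concrete, here is a minimal sketch of a strict check. The name `strictValidateUtf8` is hypothetical; this is not the standard library code and not the code from my unicodes.nim module. It assumes the same return convention as validateUtf8 (-1 when valid), except that it reports the index where the offending sequence starts.

```nim
# Hypothetical strict validator (illustration only).
# Returns -1 when s is well-formed UTF-8, otherwise the index of the
# first byte of the first invalid sequence.
proc strictValidateUtf8(s: string): int =
  var i = 0
  while i < s.len:
    let b0 = s[i].ord
    var need = 0      # continuation bytes expected after the lead byte
    var lo = 0x80     # allowed range for the first continuation byte
    var hi = 0xBF
    if b0 <= 0x7F:
      need = 0
    elif b0 >= 0xC2 and b0 <= 0xDF:
      need = 1
    elif b0 == 0xE0:
      need = 2
      lo = 0xA0       # 80..9F here would be an overlong encoding
    elif b0 == 0xED:
      need = 2
      hi = 0x9F       # A0..BF here would encode a surrogate
    elif b0 >= 0xE1 and b0 <= 0xEF:
      need = 2
    elif b0 == 0xF0:
      need = 3
      lo = 0x90       # 80..8F here would be an overlong encoding
    elif b0 == 0xF4:
      need = 3
      hi = 0x8F       # 90..BF here would exceed U+10FFFF
    elif b0 >= 0xF1 and b0 <= 0xF3:
      need = 3
    else:
      return i        # C0, C1, F5..FF, or a stray continuation byte
    if i + need >= s.len:
      return i        # sequence is truncated
    for k in 1..need:
      let c = s[i + k].ord
      if k == 1:
        if c < lo or c > hi:
          return i
      elif c < 0x80 or c > 0xBF:
        return i
    i += need + 1
  return -1

# One sample per category from this report.
doAssert strictValidateUtf8("\xf7\xbf\xbf\xbf") == 0  # too big
doAssert strictValidateUtf8("\xe0\x80\xaf") == 0      # overlong
doAssert strictValidateUtf8("\xed\xa0\x80") == 0      # surrogate
doAssert strictValidateUtf8("\xe2\x82\xac") == -1     # U+20AC, valid
```

Restricting only the first continuation byte is enough because the overlong, surrogate, and too-big ranges are all decided by the lead byte together with the byte that follows it.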

I created a Nim project to test UTF-8 decoders, prompted by this issue. It's on GitHub: https://github.com/flenniken/utf8tests

If you clone it and run the following command, you will see the tests that fail:

~~~
bin/utf8tests -e=utf8tests.txt -s=artifacts/utf8.skip.nim.1.4.8.txt | grep invalid | awk -F: '{print $1}'
~~~

Tests that fail:

~~~
too big:
6.0
6.2
6.4
6.5

overlong:
22.3
22.4
23.1
23.2
37.1
37.2

surrogate:
24.0
24.0.1
24.0.2
24.2
24.3
24.4
24.5
24.6
24.7
25.0
25.1
25.2
25.3
25.4
25.5
25.6
25.7
~~~

Below are three sample test cases, one for each category. They are commented out in the test_unicodes.nim file.

~~~
    # too big U+001FFFFF, <F7 BF BF BF>
    # 6.0:invalid hex:F7 BF BF BF:nothing:EFBFBD  EFBFBD  EFBFBD  EFBFBD
    check validateUtf8("\xf7\xbf\xbf\xbf") == 0

    # overlong solidus <e0 80 af>
    # 22.3:invalid hex:e0 80 af:nothing:EFBFBD EFBFBD EFBFBD
    check validateUtf8("\xe0\x80\xaf") == 0

    # 1 surrogate U+D800, <ed a0 80>
    # 24.0:invalid hex:ed a0 80:nothing:EFBFBD EFBFBD EFBFBD
    check validateUtf8("\xed\xa0\x80") == 0
~~~
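To show why each sample is invalid, the sketch below decodes the bytes by bit pattern alone, with no range checks. The helper `naiveCodePoint` is hypothetical and exists only for illustration; it assumes the string starts with a lead byte and holds the complete sequence.

```nim
# Hypothetical helper: decodes one sequence purely by bit pattern,
# the way a lenient decoder might, to show which code point the bytes
# would represent.
proc naiveCodePoint(s: string): int =
  let b0 = s[0].ord
  var extra = 0                     # number of continuation bytes
  if b0 < 0x80:
    result = b0
  elif (b0 and 0xE0) == 0xC0:
    result = b0 and 0x1F
    extra = 1
  elif (b0 and 0xF0) == 0xE0:
    result = b0 and 0x0F
    extra = 2
  else:
    result = b0 and 0x07
    extra = 3
  for k in 1..extra:
    # each continuation byte contributes its low six bits
    result = (result shl 6) or (s[k].ord and 0x3F)

doAssert naiveCodePoint("\xf7\xbf\xbf\xbf") == 0x1FFFFF  # above U+10FFFF
doAssert naiveCodePoint("\xe0\x80\xaf") == 0x2F          # "/" in three bytes
doAssert naiveCodePoint("\xed\xa0\x80") == 0xD800        # a surrogate
```

A strict validator has to reject all three even though the bit patterns look structurally fine.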

And here is the result of running the tests. validateUtf8 returns -1 for each sequence, which means it reports all three as valid:

~~~
nim c --gc:orc --verbosity:0 --hint[Performance]:off --hint[XCannotRaiseY]:off -d:test  -r -p:src --out:bin/test_unicodes.bin tests/test_unicodes.nim [OSError] validateUtf8
............................
[Suite] unicodes.nim
    /Users/steve/code/utf8tests/tests/test_unicodes.nim(157, 43): Check failed: validateUtf8("????") == 0
    validateUtf8("????") was -1
    /Users/steve/code/utf8tests/tests/test_unicodes.nim(161, 39): Check failed: validateUtf8("???") == 0
    validateUtf8("???") was -1
    /Users/steve/code/utf8tests/tests/test_unicodes.nim(165, 39): Check failed: validateUtf8("???") == 0
    validateUtf8("???") was -1
  [FAILED] validateUtf8

Error: execution of an external program failed: '/Users/steve/code/utf8tests/bin/test_unicodes.bin '[OSError]' validateUtf8'
~~~

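The overlong solidus issue is a potential security problem: <e0 80 af> is an overlong encoding of "/" (U+002F), so a check that scans for the byte 2F, such as a path filter, can miss it while a lenient decoder further along still produces a slash.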

If you want to fix it that way, you can use my code in the unicodes.nim module; it is MIT licensed.

I suggest you consider adding the sanitizeUtf8 procedure to Nim's runtime. With validateUtf8 alone, you cannot replace invalid sequences correctly, following best practices, without writing your own decoder.
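As a rough illustration of what such a procedure has to do, here is a minimal sketch. The name `sanitizeSketch` is hypothetical and this is not the sanitizeUtf8 code from my module. It assumes the replacement policy that substitutes one U+FFFD for each maximal well-formed prefix of an invalid sequence, which gives the EFBFBD counts listed in the sample test cases above.

```nim
const replacement = "\xEF\xBF\xBD"  # U+FFFD encoded as UTF-8

# Hypothetical sanitizer (illustration only): copies well-formed
# sequences and replaces everything else with U+FFFD.
proc sanitizeSketch(s: string): string =
  var i = 0
  while i < s.len:
    let b0 = s[i].ord
    var need = -1     # continuation bytes expected; -1 = bad lead byte
    var lo = 0x80     # allowed range for the first continuation byte
    var hi = 0xBF
    if b0 <= 0x7F:
      need = 0
    elif b0 >= 0xC2 and b0 <= 0xDF:
      need = 1
    elif b0 == 0xE0:
      need = 2
      lo = 0xA0       # rejects overlong forms
    elif b0 == 0xED:
      need = 2
      hi = 0x9F       # rejects surrogates
    elif b0 >= 0xE1 and b0 <= 0xEF:
      need = 2
    elif b0 == 0xF0:
      need = 3
      lo = 0x90       # rejects overlong forms
    elif b0 == 0xF4:
      need = 3
      hi = 0x8F       # rejects code points above U+10FFFF
    elif b0 >= 0xF1 and b0 <= 0xF3:
      need = 3
    # Count how many bytes of a well-formed sequence are present.
    var good = 0
    if need >= 0:
      good = 1
      while good <= need and i + good < s.len:
        let c = s[i + good].ord
        let minC = if good == 1: lo else: 0x80
        let maxC = if good == 1: hi else: 0xBF
        if c < minC or c > maxC:
          break
        inc good
    if need >= 0 and good == need + 1:
      result.add s[i ..< i + good]   # complete, well-formed sequence
      i += good
    else:
      result.add replacement         # one U+FFFD per invalid prefix
      i += max(good, 1)

doAssert sanitizeSketch("\xe0\x80\xaf") == replacement & replacement & replacement
doAssert sanitizeSketch("\xe2\x82" & "a") == replacement & "a"
```

The second assertion shows the part that validateUtf8 cannot help with: a truncated sequence is replaced as a unit, so the caller needs to know how many bytes to skip, not just where the first bad byte is.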

utf8CharString is a nice procedure you should consider too. 

Thanks,

Steve Flenniken

The validateUtf8 procedure allows invalid byte sequences. #19333