doc clarification: confusing match behavior for non-existent ASCII character classes #1234

Open
@dawnofmidnight

Crate version: 1.11.0
Example code: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=c4b4cfe18c2e6413444e53315de33b27 (used for snippets below and extra checks)

The crate's behavior is confusing when the ASCII character class syntax [[:foo:]] is used with a class name that does not exist. A friend was trying to use [[:XID_Start:]] to check whether _ (underscore/low line) is included in the XID_Start character class (it's not), and was confused when is_match returned true.

let expr = regex::Regex::new(r"[[:XID_Start:]]").unwrap();
dbg!(expr.is_match("_")); // true

The correct syntax, \p{XID_Start}, behaves as expected:

let correct = regex::Regex::new(r"\p{XID_Start}").unwrap();
dbg!(correct.is_match("a")); // true
dbg!(correct.is_match("1")); // false
dbg!(correct.is_match("_")); // false

It seems that when the name between [: and :] is not a valid ASCII character class (regex § ASCII character classes), the pattern falls back to matching any character that appears literally within the brackets:

dbg!(expr.is_match(":")); // true
dbg!(expr.is_match("X")); // true
dbg!(expr.is_match("x")); // false
dbg!(expr.is_match("a")); // true
dbg!(expr.is_match("b")); // false
dbg!(expr.is_match("[")); // false
dbg!(expr.is_match("]")); // false
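
The results above are consistent with one simple model, sketched here under an assumption I have not confirmed against the crate's parser: the inner [:XID_Start:] is read as a plain set of the literal characters between the brackets. The names literal_set and model_is_match are hypothetical, and the sketch uses only the standard library, not the regex crate itself.

```rust
use std::collections::HashSet;

/// Hypothetical model, not the regex crate's implementation: treat the text
/// between the inner brackets as a plain set of literal characters.
fn literal_set(inner: &str) -> HashSet<char> {
    inner.chars().collect()
}

/// Returns true if any character of `haystack` is in the literal set,
/// mirroring an unanchored `is_match` against a single character class.
fn model_is_match(inner: &str, haystack: &str) -> bool {
    let set = literal_set(inner);
    haystack.chars().any(|c| set.contains(&c))
}
```

With inner set to ":XID_Start:", this model accepts ":", "X", "a", and "_" and rejects "x", "b", "[", and "]", which lines up with every dbg! result listed above.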

I'm not entirely sure what regex is actually interpreting this sequence as. Assuming the behavior is intentional, I think it is worth documenting in the aforementioned section on ASCII character classes in the docs, since it is not immediately intuitive.
