Bug report
Bug description:
With the unicodedata module, it is almost possible to determine whether a Unicode character is a valid identifier start or identifier continuation character, but the approach fails in a few cases.
The method is to look at unicodedata.category(c).
A start character has a category in "Lu Ll Lt Lm Lo Nl Pc".split().
A continuation character has a category in "Lu Ll Lt Lm Lo Mn Mc Nd Nl Pc".split().
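A minimal sketch of the category-based check described above. The helper names category_says_start and category_says_continue are made up for illustration; only unicodedata.category is an existing API.

```python
import unicodedata

# Category sets exactly as given in the report.
START_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Nl Pc".split())
CONTINUE_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Mn Mc Nd Nl Pc".split())


def category_says_start(c: str) -> bool:
    """Guess from the general category alone whether c can start an identifier."""
    return unicodedata.category(c) in START_CATEGORIES


def category_says_continue(c: str) -> bool:
    """Guess from the general category alone whether c can continue an identifier."""
    return unicodedata.category(c) in CONTINUE_CATEGORIES


if __name__ == "__main__":
    for c in "a", "1", "\u00b7", "\u2118":
        print(f"{ord(c):04x}", unicodedata.category(c),
              category_says_start(c), category_says_continue(c))
```

This is only the heuristic; it does not consult the XID_Start or XID_Continue properties at all, which is why the exceptions arise.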
However, several code points break these rules: some have a listed category but are not that type of character, and others are that type of character despite having an unlisted category.
Here is a complete list of the exceptions, on Python 3.13 and Unicode version 16.0:
Should be XID_Start but are not:
005f Pc True LOW LINE
037a Lm True GREEK YPOGEGRAMMENI
0e33 Lo True THAI CHARACTER SARA AM
0eb3 Lo True LAO VOWEL SIGN AM
203f Pc True UNDERTIE
2040 Pc True CHARACTER TIE
2054 Pc True INVERTED UNDERTIE
2e2f Lm True VERTICAL TILDE
fc5e Lo True ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
fc5f Lo True ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
fc60 Lo True ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
fc61 Lo True ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
fc62 Lo True ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
fc63 Lo True ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
fdfa Lo True ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
fdfb Lo True ARABIC LIGATURE JALLAJALALOUHOU
fe33 Pc True PRESENTATION FORM FOR VERTICAL LOW LINE
fe34 Pc True PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
fe4d Pc True DASHED LOW LINE
fe4e Pc True CENTRELINE LOW LINE
fe4f Pc True WAVY LOW LINE
fe70 Lo True ARABIC FATHATAN ISOLATED FORM
fe72 Lo True ARABIC DAMMATAN ISOLATED FORM
fe74 Lo True ARABIC KASRATAN ISOLATED FORM
fe76 Lo True ARABIC FATHA ISOLATED FORM
fe78 Lo True ARABIC DAMMA ISOLATED FORM
fe7a Lo True ARABIC KASRA ISOLATED FORM
fe7c Lo True ARABIC SHADDA ISOLATED FORM
fe7e Lo True ARABIC SUKUN ISOLATED FORM
ff3f Pc True FULLWIDTH LOW LINE
ff9e Lm True HALFWIDTH KATAKANA VOICED SOUND MARK
ff9f Lm True HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
Should not be XID_Start but are:
1885 Mn False MONGOLIAN LETTER ALI GALI BALUDA
1886 Mn False MONGOLIAN LETTER ALI GALI THREE BALUDA
2118 Sm False SCRIPT CAPITAL P
212e So False ESTIMATED SYMBOL
Should be XID_Continue but are not:
037a Lm True GREEK YPOGEGRAMMENI
2e2f Lm True VERTICAL TILDE
fc5e Lo True ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
fc5f Lo True ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
fc60 Lo True ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
fc61 Lo True ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
fc62 Lo True ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
fc63 Lo True ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
fdfa Lo True ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
fdfb Lo True ARABIC LIGATURE JALLAJALALOUHOU
fe70 Lo True ARABIC FATHATAN ISOLATED FORM
fe72 Lo True ARABIC DAMMATAN ISOLATED FORM
fe74 Lo True ARABIC KASRATAN ISOLATED FORM
fe76 Lo True ARABIC FATHA ISOLATED FORM
fe78 Lo True ARABIC DAMMA ISOLATED FORM
fe7a Lo True ARABIC KASRA ISOLATED FORM
fe7c Lo True ARABIC SHADDA ISOLATED FORM
fe7e Lo True ARABIC SUKUN ISOLATED FORM
Should not be XID_Continue but are:
00b7 Po False MIDDLE DOT
0387 Po False GREEK ANO TELEIA
1369 No False ETHIOPIC DIGIT ONE
136a No False ETHIOPIC DIGIT TWO
136b No False ETHIOPIC DIGIT THREE
136c No False ETHIOPIC DIGIT FOUR
136d No False ETHIOPIC DIGIT FIVE
136e No False ETHIOPIC DIGIT SIX
136f No False ETHIOPIC DIGIT SEVEN
1370 No False ETHIOPIC DIGIT EIGHT
1371 No False ETHIOPIC DIGIT NINE
19da No False NEW TAI LUE THAM DIGIT ONE
200c Cf False ZERO WIDTH NON-JOINER
200d Cf False ZERO WIDTH JOINER
2118 Sm False SCRIPT CAPITAL P
212e So False ESTIMATED SYMBOL
30fb Po False KATAKANA MIDDLE DOT
ff65 Po False HALFWIDTH KATAKANA MIDDLE DOT
Many of these exceptions are specified in UAX #31, Section 5.1, NFKC Modifications.
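A comparison of this kind can be approximated without the Unicode data files. The sketch below is not necessarily how the lists above were produced: it assumes that str.isidentifier() on a single character is an acceptable stand-in for XID_Start. Because Python also accepts the underscore as an identifier start, U+005F will not show up as a mismatch here. The function name start_mismatches is hypothetical.

```python
import sys
import unicodedata

START_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Nl Pc".split())


def start_mismatches(codepoints):
    """Yield code points where the category guess and the interpreter disagree."""
    for cp in codepoints:
        c = chr(cp)
        category = unicodedata.category(c)
        predicted = category in START_CATEGORIES
        # Assumption: a one-character string is an identifier exactly when
        # the character is XID_Start (or the underscore).
        actual = c.isidentifier()
        if predicted != actual:
            yield cp, category, predicted, unicodedata.name(c, "<unnamed>")


if __name__ == "__main__":
    for cp, category, predicted, name in start_mismatches(range(sys.maxunicode + 1)):
        print(f"{cp:04x} {category} {predicted} {name}")
```

The continuation case can be checked the same way by testing ("a" + c).isidentifier() against the continuation category set. The exact output depends on the Unicode version bundled with the interpreter.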
Proposal
I suggest adding two functions to the module, unicodedata.isidstart(chr) and unicodedata.isidcontinue(chr). These return True if chr appears in the DerivedCoreProperties.txt file as XID_Start or XID_Continue, respectively.
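As a rough pure-Python illustration of the proposed behaviour (not a suggested implementation for CPython), the two predicates could be built from the lines of DerivedCoreProperties.txt. The sketch assumes the usual Unicode Character Database line layout of "range ; property # comment"; parse_property_ranges, in_ranges and make_predicates are hypothetical helpers, and SAMPLE is a tiny set of lines in that layout, not the full file.

```python
from bisect import bisect_right


def parse_property_ranges(lines, prop):
    """Collect sorted (first, last) code point ranges listed for prop."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # a single code point has no ".."
        ranges.append((lo, hi))
    ranges.sort()
    return ranges


def in_ranges(ranges, cp):
    """Binary-search the sorted ranges for the code point cp."""
    i = bisect_right(ranges, (cp, float("inf"))) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]


def make_predicates(lines):
    """Return (isidstart, isidcontinue) built from the given file lines."""
    lines = list(lines)
    start = parse_property_ranges(lines, "XID_Start")
    cont = parse_property_ranges(lines, "XID_Continue")
    return (lambda ch: in_ranges(start, ord(ch)),
            lambda ch: in_ranges(cont, ord(ch)))


# Illustrative lines in the assumed layout; the real file is far longer.
SAMPLE = """\
0041..005A    ; XID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0030..0039    ; XID_Continue # Nd  [10] DIGIT ZERO..DIGIT NINE
0041..005A    ; XID_Continue # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
005F          ; XID_Continue # Pc       LOW LINE
""".splitlines()

isidstart, isidcontinue = make_predicates(SAMPLE)
```

With the full file passed in instead of SAMPLE, the predicates would reflect whichever Unicode version that file belongs to, which may differ from the version used by the running interpreter.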
CPython versions tested on:
3.13
Operating systems tested on:
Windows