Bug Report
While processing pybind modules, stubgen inspects the docstring to determine the possible function signatures (there may be many if the function is overloaded). During inspection, tokenize.tokenize is invoked and TokenError is suppressed (lines 349 to 358 in 55d4c17).
However, some tokenization errors prevent detection of the remaining signatures. For example, an unterminated string literal is perfectly valid text inside a docstring, but it causes tokenization to stop.
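The stopping behavior can be demonstrated with tokenize alone; here is a small standalone sketch (not stubgen code) using text similar to the docstring below:

import io
import tokenize

# Text resembling the problem docstring: an unterminated string literal on one line.
docstr = (
    "1. thing(x: int) -> None\n"
    'This is a valid docstring. "We do not need to terminate this literal.\n'
    "2. thing(x: int, y: int) -> str\n"
)

try:
    for tok in tokenize.tokenize(io.BytesIO(docstr.encode("utf-8")).readline):
        print(tok.type, tok.string)
except tokenize.TokenError as exc:
    # On Python 3.12 this reports "unterminated string literal"; the line with
    # overload 2 is never tokenized.
    print("TokenError:", exc)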
To Reproduce
The following docstring should trigger this behavior:
def thing():
    """
    thing(*args, **kwargs)
    Overloaded function.
    1. thing(x: int) -> None
    This is a valid docstring. "We do not need to terminate this string literal on this line.
    2. thing(x: int, y: int) -> str
    This signature will never get parsed due to TokenError.
    """
Tokenization of the example above terminates with "unterminated string literal" before overload 2 is reached, resulting in a missing signature. The signatures produced by infer_sig_from_docstring are:
[FunctionSig(name='thing', args=[ArgSig(name='x', type='int', default=False)], ret_type='None', type_args=''),
FunctionSig(name='thing', args=[ArgSig(name='*args', type=None, default=False), ArgSig(name='**kwargs', type=None, default=False)], ret_type='Any', type_args='')]
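For reference, the list above can be obtained by calling the helper directly on the docstring, along these lines (the exact call site inside stubgen differs):

from mypy.stubdoc import infer_sig_from_docstring

# thing is the function defined in the reproduction above.
sigs = infer_sig_from_docstring(thing.__doc__, "thing")
for sig in sigs or []:
    print(sig)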
Alternatively, an RST math block in the docstring will also cause this behavior:
def thing():
    """
    thing(*args, **kwargs)
    Overloaded function.
    1. thing(x: int) -> None
    .. math::
        \mathbf{x} = 3 \cdot \mathbf{y}
    2. thing(x: int, y: int) -> str
    This signature will never get parsed due to TokenError.
    """
The second signature is never parsed due to "unexpected character after line continuation character".
Expected Behavior
Ideally, all signatures would be detected. It is understandable that parsing can fail, since the content that can appear in a docstring is essentially arbitrary.
Actual Behavior
The first signature is extracted, but subsequent signatures are not detected. My guess is that this happens because of the tokenization error triggered by the text following the first overload in the docstring.
Your Environment
- Mypy version used: 1.14
- Mypy command-line flags: --package --output
- Mypy configuration options from mypy.ini (and other config files): None
- Python version used: 3.12
Possible Fix
The following might be a viable fix. I tried changing the logic to resume tokenization after errors (provided there is data remaining):
# Keep tokenizing after an error. If `TokenError` is encountered, tokenize() will
# stop. We check the remaining bytes in bytes_io and resume tokenizing on the next
# loop iteration.
encoded_docstr = docstr.encode("utf-8")
bytes_io = io.BytesIO(encoded_docstr)
while bytes_io.tell() < len(encoded_docstr):
    # Return all found signatures, even if there is a parse error after some are found.
    with contextlib.suppress(tokenize.TokenError):
        try:
            tokens = tokenize.tokenize(bytes_io.readline)
            for token in tokens:
                state.add_token(token)
        except IndentationError:
            return None
On both of my examples above, this produces the correct number of signatures; infer_sig_from_docstring returns:
[FunctionSig(name='thing', args=[ArgSig(name='x', type='int', default=False)], ret_type='None', type_args=''),
FunctionSig(name='thing', args=[ArgSig(name='x', type='int', default=False), ArgSig(name='y', type='int', default=False)], ret_type='str', type_args=''),
FunctionSig(name='thing', args=[ArgSig(name='*args', type=None, default=False), ArgSig(name='**kwargs', type=None, default=False)], ret_type='Any', type_args='')]
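A regression test for the first example could look roughly like this (the test name and the inlined docstring are only a sketch, not existing mypy test code):

from mypy.stubdoc import infer_sig_from_docstring

def test_overloads_parsed_despite_token_error() -> None:
    docstr = (
        "thing(*args, **kwargs)\n"
        "Overloaded function.\n"
        "1. thing(x: int) -> None\n"
        'This is a valid docstring. "Unterminated string literal.\n'
        "2. thing(x: int, y: int) -> str\n"
    )
    sigs = infer_sig_from_docstring(docstr, "thing")
    assert sigs is not None
    # Two overloads plus the *args/**kwargs header, matching the output above.
    assert len(sigs) == 3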
If you are amenable to this solution, I can open a PR.