Bug Report
While processing pybind modules, stubgen inspects the docstring to determine the possible function signatures (there may be many if the function is overloaded). During inspection, tokenize.tokenize is invoked and TokenError is suppressed (lines 349 to 358 in 55d4c17).
However, some tokenization errors prevent detection of the remaining signatures. For example, an unterminated string literal is perfectly valid text inside a docstring, but it causes tokenization to stop.
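The stopping behavior can be demonstrated with tokenize alone; here is a small standalone sketch (not stubgen code) using text similar to the docstring below:

import io
import tokenize

# Text resembling the problem docstring: an unterminated string literal on one line.
docstr = (
    "1. thing(x: int) -> None\n"
    'This is a valid docstring. "We do not need to terminate this literal.\n'
    "2. thing(x: int, y: int) -> str\n"
)

try:
    for tok in tokenize.tokenize(io.BytesIO(docstr.encode("utf-8")).readline):
        print(tok.type, tok.string)
except tokenize.TokenError as exc:
    # On Python 3.12 this reports "unterminated string literal"; the line with
    # overload 2 is never tokenized.
    print("TokenError:", exc)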
To Reproduce
The following docstring should trigger this behavior:
def thing():
    """
    thing(*args, **kwargs)
    Overloaded function.
    1. thing(x: int) -> None
    This is a valid docstring. "We do not need to terminate this string literal on this line.
    2. thing(x: int, y: int) -> str
    This signature will never get parsed due to TokenError.
    """
Tokenization of the example above terminates with "unterminated string literal" before overload 2 is reached, resulting in a missing signature. The signatures produced by infer_sig_from_docstring are:
[FunctionSig(name='thing', args=[ArgSig(name='x', type='int', default=False)], ret_type='None', type_args=''),
FunctionSig(name='thing', args=[ArgSig(name='*args', type=None, default=False), ArgSig(name='**kwargs', type=None, default=False)], ret_type='Any', type_args='')]
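For reference, the list above can be obtained by calling the helper directly on the docstring, along these lines (the exact call site inside stubgen differs):

from mypy.stubdoc import infer_sig_from_docstring

# thing is the function defined in the reproduction above.
sigs = infer_sig_from_docstring(thing.__doc__, "thing")
for sig in sigs or []:
    print(sig)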
Alternatively, an RST math block in the docstring will also cause this behavior:
def thing():
    """
    thing(*args, **kwargs)
    Overloaded function.
    1. thing(x: int) -> None
    .. math::
        \mathbf{x} = 3 \cdot \mathbf{y}
    2. thing(x: int, y: int) -> str
    This signature will never get parsed due to TokenError.
    """
The second signature is never parsed due to "unexpected character after line continuation character".
Expected Behavior
Ideally, all signatures would be detected. It is understandable that parsing can fail, since the content that can appear in a docstring is essentially arbitrary.
Actual Behavior
The first signature is extracted, but subsequent signatures are not detected. My guess is that this happens because of the tokenization error triggered by the text following the first overload in the docstring.
Your Environment
- Mypy version used: 1.14
- Mypy command-line flags: --package --output
- Mypy configuration options from mypy.ini (and other config files): None
- Python version used: 3.12
Possible Fix
The following might be a viable fix. I tried changing the logic to resume tokenization after errors (provided there is data remaining):
# Keep tokenizing after an error. If `TokenError` is encountered, tokenize() will
# stop. We check the remaining bytes in bytes_io and resume tokenizing on the next
# loop iteration.
encoded_docstr = docstr.encode("utf-8")
bytes_io = io.BytesIO(encoded_docstr)
while bytes_io.tell() < len(encoded_docstr):
    # Return all found signatures, even if there is a parse error after some are found.
    with contextlib.suppress(tokenize.TokenError):
        try:
            tokens = tokenize.tokenize(bytes_io.readline)
            for token in tokens:
                state.add_token(token)
        except IndentationError:
            return None
On both of my examples above, this produces the correct number of signatures; infer_sig_from_docstring returns:
[FunctionSig(name='thing', args=[ArgSig(name='x', type='int', default=False)], ret_type='None', type_args=''),
FunctionSig(name='thing', args=[ArgSig(name='x', type='int', default=False), ArgSig(name='y', type='int', default=False)], ret_type='str', type_args=''),
FunctionSig(name='thing', args=[ArgSig(name='*args', type=None, default=False), ArgSig(name='**kwargs', type=None, default=False)], ret_type='Any', type_args='')]
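A regression test for the first example could look roughly like this (the test name and the inlined docstring are only a sketch, not existing mypy test code):

from mypy.stubdoc import infer_sig_from_docstring

def test_overloads_parsed_despite_token_error() -> None:
    docstr = (
        "thing(*args, **kwargs)\n"
        "Overloaded function.\n"
        "1. thing(x: int) -> None\n"
        'This is a valid docstring. "Unterminated string literal.\n'
        "2. thing(x: int, y: int) -> str\n"
    )
    sigs = infer_sig_from_docstring(docstr, "thing")
    assert sigs is not None
    # Two overloads plus the *args/**kwargs header, matching the output above.
    assert len(sigs) == 3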
If you are amenable to this solution, I can open a PR.