Description
Question
I am following the example in https://ai.pydantic.dev/examples/rag/#example-code
I am also using the latest version of vLLM (0.8.3) with tool-calling support enabled via --guided_decoding_backend.
The RAG example doesn't work with any of the models listed in https://docs.vllm.ai/en/stable/features/tool_calling.html#llama-models-llama3-json. Some models (llama3.2) correctly return the tool call, but the agent is unable to parse it and invoke the tool.
With llama3.1 (https://docs.vllm.ai/en/latest/features/tool_calling.html#llama-models-llama3-json), the tool call happens repeatedly (never ending), so the token limit is exceeded and the run crashes.
However, using an OpenAI model works perfectly. Even Ollama works (without streaming).
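
For reference, a minimal sketch of how I am wiring pydantic-ai to the vLLM server. The model name, port, API key, and launch flags below are placeholders, not exact values from my setup, and on the pinned 0.0.43 the base URL may need to be passed to OpenAIModel directly rather than via a provider:

```python
# vLLM is assumed to be launched roughly as the tool-calling docs describe, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --enable-auto-tool-choice --tool-call-parser llama3_json
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

# Point pydantic-ai at the vLLM OpenAI-compatible endpoint.
# vLLM ignores the API key, but the OpenAI client requires one.
model = OpenAIModel(
    'meta-llama/Llama-3.1-8B-Instruct',  # assumed model name
    provider=OpenAIProvider(base_url='http://localhost:8000/v1', api_key='dummy'),
)

agent = Agent(model)


@agent.tool_plain
def retrieve(search_query: str) -> str:
    """Stand-in for the retrieval tool from the RAG example."""
    return f'stub results for {search_query!r}'


result = agent.run_sync('What is pydantic-ai?')
print(result.data)  # `.data` on 0.0.43; renamed `.output` in later releases
```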
Additional Context
Pydantic AI version - pydantic-ai==0.0.43
Python - 3.11.0
vllm - 0.8.3 - vllm-project/vllm#13483 (this doesn't fix the issue I am facing)