agent: split code from openai example

This commit is contained in:
ochafik 2024-03-29 16:17:59 +00:00
parent 253b68d9a7
commit e874565a13
18 changed files with 1010 additions and 608 deletions

175
examples/agent/README.md Normal file
View file

@ -0,0 +1,175 @@
# examples.agent: Interactive agent that can use Python tools!
Have any LLM use local (sandboxed) tools, with a simple CLI.
```bash
python -m examples.agent \
--model ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
--tools examples/agent/tools/example_math_tools.py \
--goal "What is the sum of 2535 squared and 32222000403 then multiplied by one and a half. What's a third of the result?"
```
<details>
<summary>Show output</summary>
```bash
💭 First, I will calculate the square of 2535, then add it to 32222000403. After that, I will multiply the result by 1.5 and finally, I will divide the result by 3.
⚙️ pow(value=2535, power=2) -> 6426225.0
💭 Now that I have calculated the square of 2535, I will calculate the sum of 6426225 and 32222000403.
⚙️ add(a=6426225, b=32222000403) -> 32228426628
💭 Now that I have calculated the sum, I will multiply it by 1.5.
⚙️ multiply(a=32228426628, b=1.5) -> 48342639942.0
💭 Now that I have calculated the product, I will divide it by 3.
⚙️ divide(a=48342639942.0, b=3) -> 16114213314.0
➡️ "\nThe result of the calculation is 16114213314.0."
```
</details>
```bash
python -m examples.agent \
--tools examples/agent/tools/example_weather_tools.py \
--goal "What is the weather going to be like in San Francisco and Glasgow over the next 4 days."
```
<details>
<summary>Show output</summary>
```bash
```
</details>
```bash
python -m examples.agent \
--model ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
--std_tools \
--goal "Wait 10sec then say Hi out loud"
```
<details>
<summary>Show output</summary>
```bash
```
</details>
## Prerequisites
Note: To get conda, just install Miniforge (it's OSS): https://github.com/conda-forge/miniforge
```bash
conda create -n agent python=3.11
conda activate agent
pip install -r examples/agent/requirements.txt
pip install -r examples/openai/requirements.txt
```
## Components
This example relies on the new [OpenAI compatibility server](../openai).
```
agent.py → examples.openai → server.cpp
→ safe_tools.py
→ ( run_sandboxed_tools.sh : Docker → fastify.py ) → unsafe_tools.py → code interpreter, etc...
```
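A tool module in this diagram is just plain Python: any top-level, type-annotated function whose name doesn't start with an underscore gets collected (by the agent directly, or by `fastify.py` when sandboxed) and advertised to the model, with its docstring used as the tool description. A minimal hypothetical sketch (this file and function are illustrative only, not part of this commit; the real examples live under [tools/](./tools)):
```py
# hypothetical_tools.py -- illustrative only, not part of this commit
import json
import urllib.request

def fetch_json(url: str, timeout_seconds: float = 10.0) -> dict:
    """
    Fetch a URL and parse its body as JSON.
    The type annotations and this docstring are all the agent needs
    to derive the tool's schema and description.
    """
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))
```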
The agent can use tools written in Python, or (soon) tools exposed under OpenAPI endpoints. It only has standard Python deps (e.g. no langchain).
- Can call into any OpenAI-compatible endpoint that supports tool calling, and spawns a local one if `--endpoint` isn't specified
(can pass all llama.cpp params)
- [Standard tools](./tools/std_tools.py) include "safe" TTS, wait for/until helpers, and *requesting user input*.
- Tools are often "unsafe" (e.g. [Python execution functions](./tools/unsafe_python_tools.py)),
so we provide a script to run them in a Docker-sandboxed environment, exposed as an OpenAPI server:
```bash
examples/agent/run_sandboxed_tools.sh \
examples/agent/tools/unsafe_python_tools.py 6666 &
python -m examples.agent \
--model ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
--tools http://localhost:6666 \
--goal "What's cos(123) / 23 * 12.6 ?"
```
- [fastify.py](./fastify.py) turns a Python module into an OpenAPI endpoint using FastAPI
- [run_sandboxed_tools.sh](./run_sandboxed_tools.sh) builds and runs a Docker environment with fastify inside it, and exposes its port locally
- Beyond just "tools", the output format can be constrained using JSON schemas or Pydantic types (see the sketch after the example below):
```bash
python -m examples.agent \
--model ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
--tools examples/agent/tools/example_summaries.py \
--format PyramidalSummary \
--goal "Create a pyramidal summary of Mankind's recent advancements"
```
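Under the hood, `--format` resolves the name to a Python / Pydantic type and sends its JSON schema as the `response_format` of the chat completion request; the final answer is then validated back into that type. A rough sketch of that mechanism, reusing the `PyramidalSummary` model from [example_summaries.py](./tools/example_summaries.py) (the request wiring is simplified here):
```py
# Rough sketch of what --format PyramidalSummary amounts to (simplified from agent.py).
from typing import Annotated, List, Optional
from annotated_types import MinLen
from pydantic import BaseModel, TypeAdapter

class QAPair(BaseModel):
    question: str
    concise_answer: str
    justification: str

class PyramidalSummary(BaseModel):
    title: str
    summary: str
    question_answers: Annotated[List[QAPair], MinLen(2)]
    sub_sections: Optional[Annotated[List['PyramidalSummary'], MinLen(2)]]

type_adapter = TypeAdapter(PyramidalSummary)

# Sent alongside the messages & tools in the chat completion request:
response_format = {"type": "json_object", "schema": type_adapter.json_schema()}

# ...and used to validate the model's final (non-tool-call) message:
# summary = type_adapter.validate_json(final_message_content)
```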
## Launch parts separately
If you'd like to debug each binary separately (rather than have the agent spawn an OAI-compat proxy that in turn spawns a C++ server), you can run these commands:
```bash
# C++ server
make -j server
./server --model mixtral.gguf --port 8081
# OpenAI compatibility layer
python -m examples.openai \
--port 8080 \
--endpoint http://localhost:8081 \
--template_hf_model_id_fallback mistralai/Mixtral-8x7B-Instruct-v0.1
# Or have the OpenAI compatibility layer spawn the C++ server under the hood:
# python -m examples.openai --model mixtral.gguf
# Agent itself:
python -m examples.agent --endpoint http://localhost:8080
```
## Use existing tools (WIP)
```bash
git clone https://github.com/NousResearch/Hermes-Function-Calling examples/openai/hermes_function_calling
```
Then edit `examples/openai/hermes_function_calling/utils.py`:
```py
log_folder = os.environ.get('LOG_FOLDER', os.path.join(script_dir, "inference_logs"))
```
Then run tools in a sandbox:
```bash
REQUIREMENTS_FILE=<( cat examples/openai/hermes_function_calling/requirements.txt | grep -vE "bitsandbytes|flash-attn" ) \
examples/agent/run_sandboxed_tools.sh \
examples/openai/hermes_function_calling/functions.py \
-e LOG_FOLDER=/data/inference_logs
```
## TODO
- Add model URL / HF loading support
- Add Embedding endpoint + storage / retrieval tools (Faiss? ScaNN?), or spontaneous RAG
- Auto discover tools exposed by an OpenAPI endpoint
- Add a Python notebook tool example
- Update `run_sandboxed_tools.sh` to support dev mode (`uvicorn fastify:app --reload`)
- Follow-ups (depending on the vibe)
- Remove OAI support from server
- Remove non-Python json schema to grammar converters

View file

@ -0,0 +1,6 @@
import typer
from examples.agent.agent import main
if __name__ == "__main__":
typer.run(main)

243
examples/agent/agent.py Normal file
View file

@ -0,0 +1,243 @@
import atexit
from pathlib import Path
import subprocess
import sys
from time import sleep
import typer
from pydantic import Json, TypeAdapter
from typing import Annotated, Callable, List, Union, Optional, Type
import json, requests
from examples.json_schema_to_grammar import SchemaConverter
from examples.agent.tools.std_tools import StandardTools
from examples.openai.api import ChatCompletionRequest, ChatCompletionResponse, Message, Tool, ToolFunction
from examples.agent.utils import collect_functions, load_module
def _get_params_schema(fn: Callable, verbose):
converter = SchemaConverter(prop_order={}, allow_fetch=False, dotall=False, raw_pattern=False)
schema = TypeAdapter(fn).json_schema()
# Do NOT call converter.resolve_refs(schema) here. Let the server resolve local refs.
if verbose:
sys.stderr.write(f'# PARAMS SCHEMA: {json.dumps(schema, indent=2)}\n')
return schema
def completion_with_tool_usage(
*,
response_model: Optional[Union[Json, Type]]=None,
max_tool_iterations: Optional[int]=None,
tools: List[Callable],
endpoint: str,
messages: List[Message],
auth: Optional[str],
verbose: bool,
**kwargs):
'''
Creates a chat completion using an OpenAI-compatible endpoint w/ JSON schema support
(llama.cpp server, llama-cpp-python, Anyscale / Together...)
The response_model param takes a type (+ supports Pydantic) and behaves just as w/ Instructor (see below)
'''
response_format = None
type_adapter = None
if response_model:
if isinstance(response_model, dict):
schema = response_model
else:
type_adapter = TypeAdapter(response_model)
schema = type_adapter.json_schema()
response_format={"type": "json_object", "schema": schema }
tool_map = {fn.__name__: fn for fn in tools}
tools_schemas = [
Tool(
type="function",
function=ToolFunction(
name=fn.__name__,
description=fn.__doc__,
parameters=_get_params_schema(fn, verbose=verbose)
)
)
for fn in tools
]
i = 0
while (max_tool_iterations is None or i < max_tool_iterations):
request = ChatCompletionRequest(
messages=messages,
response_format=response_format,
tools=tools_schemas,
**kwargs,
)
if verbose:
sys.stderr.write(f'# REQUEST: {request.model_dump_json(indent=2)}\n')
headers = {
"Content-Type": "application/json",
}
if auth:
headers["Authorization"] = auth
response = requests.post(
endpoint,
headers=headers,
json=request.model_dump(),
)
if response.status_code != 200:
raise Exception(f"Request failed ({response.status_code}): {response.text}")
response = ChatCompletionResponse(**response.json())
if verbose:
sys.stderr.write(f'# RESPONSE: {response.model_dump_json(indent=2)}\n')
if response.error:
raise Exception(f'Inference failed: {response.error.message}')
assert len(response.choices) == 1
choice = response.choices[0]
content = choice.message.content
if choice.finish_reason == "tool_calls":
messages.append(choice.message)
for tool_call in choice.message.tool_calls:
if content:
print(f'💭 {content}')
pretty_call = f'{tool_call.function.name}({", ".join(f"{k}={v}" for k, v in tool_call.function.arguments.items())})'
sys.stdout.write(f'⚙️ {pretty_call}')
tool_result = tool_map[tool_call.function.name](**tool_call.function.arguments)
sys.stdout.write(f" -> {tool_result}\n")
messages.append(Message(
tool_call_id=tool_call.id,
role="tool",
name=tool_call.function.name,
# content=f'{tool_result}',
content=f'{pretty_call} = {tool_result}',
))
else:
assert content
result = type_adapter.validate_json(content) if type_adapter else content
return result
i += 1
if max_tool_iterations is not None:
raise Exception(f"Failed to get a valid response after {max_tool_iterations} tool calls")
def main(
goal: Annotated[str, typer.Option()],
tools: Optional[List[str]] = None,
format: Annotated[Optional[str], typer.Option(help="The output format: either a Python type (e.g. 'float' or a Pydantic model defined in one of the tool files), or a JSON schema, e.g. '{\"format\": \"date\"}'")] = None,
max_iterations: Optional[int] = 10,
std_tools: Optional[bool] = False,
auth: Optional[str] = None,
verbose: bool = False,
model: Annotated[Optional[Path], typer.Option("--model", "-m")] = "models/7B/ggml-model-f16.gguf",
endpoint: Optional[str] = None,
context_length: Optional[int] = None,
# endpoint: str = 'http://localhost:8080/v1/chat/completions',
n_predict: Optional[int] = 1000,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
min_p: Optional[float] = None,
tfs_z: Optional[float] = None,
typical_p: Optional[float] = None,
temperature: Optional[float] = 0,
dynatemp_range: Optional[float] = None,
dynatemp_exponent: Optional[float] = None,
repeat_last_n: Optional[int] = None,
repeat_penalty: Optional[float] = None,
frequency_penalty: Optional[float] = None,
presense_penalty: Optional[float] = None,
mirostat: Optional[bool] = None,
mirostat_tau: Optional[float] = None,
mirostat_eta: Optional[float] = None,
penalize_nl: Optional[bool] = None,
n_keep: Optional[int] = None,
seed: Optional[int] = None,
n_probs: Optional[int] = None,
min_keep: Optional[int] = None,
):
if not endpoint:
server_port = 8080
server_host = 'localhost'
endpoint: str = f'http://{server_host}:{server_port}/v1/chat/completions'
if verbose:
sys.stderr.write(f"# Starting C++ server with model {model} on {endpoint}\n")
cmd = [
"python", "-m", "examples.openai.server",
"--model", model,
*(['--verbose'] if verbose else []),
*([f'--context_length={context_length}'] if context_length else []),
]
print(cmd)
server_process = subprocess.Popen(cmd, stdout=sys.stderr)
atexit.register(server_process.kill)
sleep(5)
tool_functions = []
types = {}
for f in (tools or []):
module = load_module(f)
tool_functions.extend(collect_functions(module))
types.update({
k: v
for k, v in module.__dict__.items()
if isinstance(v, type)
})
if std_tools:
tool_functions.extend(collect_functions(StandardTools))
response_model = None
if format:
if format in types:
response_model = types[format]
elif format == 'json':
response_model = {}
else:
try:
response_model = json.loads(format)
except:
response_model = eval(format)
result = completion_with_tool_usage(
model="...",
endpoint=endpoint,
response_model=response_model,
max_tool_iterations=max_iterations,
tools=tool_functions,
auth=auth,
verbose=verbose,
n_predict=n_predict,
top_k=top_k,
top_p=top_p,
min_p=min_p,
tfs_z=tfs_z,
typical_p=typical_p,
temperature=temperature,
dynatemp_range=dynatemp_range,
dynatemp_exponent=dynatemp_exponent,
repeat_last_n=repeat_last_n,
repeat_penalty=repeat_penalty,
frequency_penalty=frequency_penalty,
presense_penalty=presense_penalty,
mirostat=mirostat,
mirostat_tau=mirostat_tau,
mirostat_eta=mirostat_eta,
penalize_nl=penalize_nl,
n_keep=n_keep,
seed=seed,
n_probs=n_probs,
min_keep=min_keep,
messages=[{
"role": "user",
"content": goal,
}]
)
print(result if response_model else f'➡️ {result}')
if __name__ == '__main__':
typer.run(main)

View file

@ -3,21 +3,11 @@
This is useful in combination w/ the examples/agent/run_sandboxed_tools.sh
'''
import os, sys, typing, importlib.util
from anyio import Path
import fastapi, uvicorn
import typer
from typing import Type, List
def load_source_as_module(source):
i = 0
while (module_name := f'mod_{i}') in sys.modules:
i += 1
spec = importlib.util.spec_from_file_location(module_name, source)
module = importlib.util.module_from_spec(spec)
sys.modules[module_name] = module
spec.loader.exec_module(module)
return module
from examples.agent.utils import load_module
def bind_functions(app, module):
for k in dir(module):
@ -26,7 +16,7 @@ def bind_functions(app, module):
if k == k.capitalize():
continue
v = getattr(module, k)
if not callable(v) or isinstance(v, typing.Type):
if not callable(v) or isinstance(v, Type):
continue
if not hasattr(v, '__annotations__'):
continue
@ -41,18 +31,11 @@ def bind_functions(app, module):
except Exception as e:
print(f'WARNING: Failed to bind /{k}\n\t{e}')
def main(files: typing.List[str], host: str = '0.0.0.0', port: int = 8000):
def main(files: List[str], host: str = '0.0.0.0', port: int = 8000):
app = fastapi.FastAPI()
for f in files:
if f.endswith('.py'):
sys.path.insert(0, str(Path(f).parent))
module = load_source_as_module(f)
else:
module = importlib.import_module(f)
bind_functions(app, module)
bind_functions(app, load_module(f))
uvicorn.run(app, host=host, port=port)

View file

@ -35,23 +35,16 @@ echo "INFO: using DATA_DIR: $DATA_DIR"
cp \
"$SCRIPT_DIR/fastify-requirements.txt" \
"$SCRIPT_DIR/fastify.py" \
"$SCRIPT_DIR/utils.py" \
"$BUILD_DIR"
mkdir -p "$DATA_DIR"
PORT=${PORT:-8088}
# BASE_IMAGE=pytorch/pytorch:latest
# BASE_IMAGE=python:3.10-slim
BASE_IMAGE=python:3.11-slim
# torch
# FROM nvidia/cuda:12.1.1-runtime-ubuntu20.04
# RUN apt-get update && \
# apt-get install -y python3-pip python3-dev && \
# rm -rf /var/lib/apt/lists/*
readonly PORT=${PORT:-8088}
readonly LLAMA_IMAGE_NAME=llama.cpp/tools-base
echo "
FROM $BASE_IMAGE
FROM ${BASE_IMAGE:-python:3.11-slim}
RUN apt-get update
RUN apt-get install -y gcc python3-dev git cmake
RUN pip install --upgrade pip
@ -63,12 +56,11 @@ echo "
RUN pip install -r /root/fastify-requirements.txt
COPY script-requirements.txt /root
RUN pip install -r /root/script-requirements.txt
COPY fastify.py /root
COPY fastify.py utils.py /root
WORKDIR /data
# ENTRYPOINT uvicorn fastify:app --reload
ENTRYPOINT PYTHONPATH=/src python /root/fastify.py --port=$PORT '/src/$( basename "$script" )'
" | docker build "$BUILD_DIR" -f - -t llama.cpp/tools-base
" | docker build "$BUILD_DIR" -f - -t "$LLAMA_IMAGE_NAME"
echo "#"
echo "# Binding $script to http://localhost:$PORT/"
@ -79,4 +71,4 @@ docker run \
--mount "type=bind,source=$( realpath "$script_folder" ),target=/src,readonly" \
--mount "type=bind,source=$( realpath "$DATA_DIR" ),target=/data" \
-p "$PORT:$PORT" \
-it llama.cpp/tools-base
-it "$LLAMA_IMAGE_NAME"

View file

@ -0,0 +1,23 @@
import math
def add(a: float, b: float) -> float:
"""
Add a and b reliably.
Don't use this tool to compute the square of a number (use multiply or pow instead)
"""
return a + b
def multiply(a: float, b: float) -> float:
"""Multiply a with b reliably"""
return a * b
def divide(a: float, b: float) -> float:
"""Divide a by b reliably"""
return a / b
def pow(value: float, power: float) -> float:
"""
Raise a value to a power (exponent) reliably.
The square of x is pow(x, 2), its cube is pow(x, 3), etc.
"""
return math.pow(value, power)

View file

@ -0,0 +1,8 @@
import math
def eval_python_expression(expr: str) -> float:
"""
Evaluate a Python expression reliably.
This can be used to compute complex nested mathematical expressions, or any python, really.
"""
return eval(expr)

View file

@ -0,0 +1,16 @@
from typing import Annotated, List, Optional
from annotated_types import MinLen
from pydantic import BaseModel
class QAPair(BaseModel):
question: str
concise_answer: str
justification: str
class PyramidalSummary(BaseModel):
title: str
summary: str
question_answers: Annotated[List[QAPair], MinLen(2)]
sub_sections: Optional[Annotated[List['PyramidalSummary'], MinLen(2)]]

View file

@ -0,0 +1,36 @@
import random
from typing import Literal
def _weather(w: str, temp: float, format: str):
return f'{w}, {temp}C' if format == 'celsius' \
else f'{w}, {(temp * 9/5) + 32}F'
def get_current_weather(location: str, format: Literal["celsius", "fahrenheit"]) -> str:
'''
Get the current weather
Args:
location: The city and state, e.g. San Francisco, CA
format: The temperature unit to use. Infer this from the user's location.
'''
return _weather('Sunny', 31, format)
def get_n_day_weather_forecast(location: str, format: Literal["celsius", "fahrenheit"], num_days: int) -> str:
'''
Get an N-day weather forecast
Args:
location: The city and state, e.g. San Francisco, CA
format: The temperature unit to use. Infer this from the user's location.
num_days: The number of days to forecast
'''
random.seed(123)
return '\n'.join([
f'{num_days}-day forecast for {location}:',
*(
f'- in {i} day{"s" if i > 1 else ""}: {_weather("Sunny" if i % 2 == 0 else "Cloudy", random.randrange(15, 35), format)}'
for i in range(1, num_days + 1)
)
])

View file

@ -0,0 +1,78 @@
import atexit
from datetime import date
import datetime
import subprocess
import sys
from time import sleep
import time
import typer
from pydantic import BaseModel, Json, TypeAdapter
from annotated_types import MinLen
from typing import Annotated, Callable, List, Union, Literal, Optional, Type, get_args, get_origin
import json, requests
class Duration(BaseModel):
seconds: Optional[int] = None
minutes: Optional[int] = None
hours: Optional[int] = None
days: Optional[int] = None
months: Optional[int] = None
years: Optional[int] = None
@property
def get_total_seconds(self) -> int:
return sum([
self.seconds or 0,
(self.minutes or 0)*60,
(self.hours or 0)*3600,
(self.days or 0)*86400,
(self.months or 0)*2592000,
(self.years or 0)*31536000,
])
class WaitForDuration(BaseModel):
duration: Duration
def __call__(self):
sys.stderr.write(f"Waiting for {self.duration.get_total_seconds} seconds...\n")
time.sleep(self.duration.get_total_seconds)
class WaitForDate(BaseModel):
until: date
def __call__(self):
# Get the current date
current_date = datetime.date.today()
if self.until < current_date:
raise ValueError("Target date cannot be in the past.")
time_diff = datetime.datetime.combine(self.until, datetime.time.min) - datetime.datetime.combine(current_date, datetime.time.min)
days, seconds = time_diff.days, time_diff.seconds
sys.stderr.write(f"Waiting for {days} days and {seconds} seconds until {d}...\n")
time.sleep(days * 86400 + seconds)
sys.stderr.write(f"Reached the target date: {self.until}\n")
class StandardTools:
@staticmethod
def ask_user(question: str) -> str:
'''
Ask the user a question and return the answer.
This allows getting additional information, requesting disambiguation, etc.
'''
return typer.prompt(question)
@staticmethod
def wait(_for: Union[WaitForDuration, WaitForDate]) -> None:
'''
Wait for a certain amount of time before continuing.
This can be used to wait for a specific duration or until a specific date.
'''
return _for()
@staticmethod
def say_out_loud(something: str) -> str:
"""
Just says something. Used to say each thought out loud
"""
return subprocess.check_call(["say", something])

41
examples/agent/utils.py Normal file
View file

@ -0,0 +1,41 @@
from pathlib import Path
import sys
import importlib.util
from typing import Type
def load_source_as_module(source):
i = 0
while (module_name := f'mod_{i}') in sys.modules:
i += 1
spec = importlib.util.spec_from_file_location(module_name, source)
module = importlib.util.module_from_spec(spec)
sys.modules[module_name] = module
spec.loader.exec_module(module)
return module
def load_module(f: str):
if f.endswith('.py'):
sys.path.insert(0, str(Path(f).parent))
return load_source_as_module(f)
else:
return importlib.import_module(f)
def collect_functions(module):
for k in dir(module):
if k.startswith('_'):
continue
if k == k.capitalize():
continue
v = getattr(module, k)
if not callable(v) or isinstance(v, Type):
continue
if not hasattr(v, '__annotations__'):
continue
vt = type(v)
if vt.__module__ == 'langchain_core.tools' and vt.__name__.endswith('Tool') and hasattr(v, 'func') and callable(v.func):
v = v.func
yield v

View file

@ -1,87 +1,189 @@
# examples.openai: OpenAI API-compatible server + agent / tools examples
# examples.agent: Interactive agent that can use Python tools!
A simple Python server that sits above the C++ [server](../server) and offers improved OAI compatibility.
## Usage
Run a simple test:
New Python OpenAI API compatibility server, which calls into the C++ server under the hood:
```bash
# Spawns a Python server (which spawns a C++ Server) then hits it w/ a tool-calling request
examples/openai/test.sh
python -m examples.openai.server --model model.gguf
```
To simply run the Python server (+ C++ server under the hood):
## Prerequisites
Note: To get conda, just install Miniforge (it's OSS): https://github.com/conda-forge/miniforge
```bash
python -m examples.openai
conda create -n agent python=3.11
conda activate agent
pip install -r examples/openai/requirements.txt
```
## Tools usage (WIP)
```bash
git clone https://github.com/NousResearch/Hermes-Function-Calling examples/openai/hermes_function_calling
```
Then edit `examples/agents/hermes_function_calling/utils.py`:
```py
log_folder = os.environ.get('LOG_FOLDER', os.path.join(script_dir, "inference_logs"))
```
Then run tools in a sandbox:
```bash
REQUIREMENTS_FILE=<( cat examples/agents/hermes_function_calling/requirements.txt | grep -vE "bitsandbytes|flash-attn" ) \
examples/agents/run_sandboxed_tools.sh \
examples/agents/hermes_function_calling/functions.py \
-e LOG_FOLDER=/data/inference_logs
```
TODO: reactor that reads OpenAPI definitions and does the tool calling
## Features
The new examples/openai/server.py:
The new [examples/openai/server.py](./server.py):
- Uses llama.cpp C++ server as a backend (spawns it or connects to existing)
- Supports grammar-constrained tool calling for **all** models (incl. Mixtral 8x7B)
- Uses actual jinja2 chat templates read from the models
- Optimised support for Functionary & Nous Hermes, easy to extend to other tool-calling schemes
- Supports grammar-constrained output for both JSON response format and tool calls
- Generic support w/ a JSON schema that guides the model towards tool usage (at the cost of extra tokens; see the worked sketch at the end of this section):
- Tool calling “works” w/ all models (even non-specialized ones like Mixtral 8x7B)
```ts
{
// original_thought: string,
thought_about_next_step_only: string,
next_step: {tool_calls: {name: string, arguments: any}} | {result: T}
}
// Where T is the output JSON schema, or 'any'
```
- Option to publicise schemas to models as TypeScript signatures (as for Functionary) or JSON schema.
- Optimised support for Functionary & Nous Hermes, easy to extend to other tool-calling fine-tunes
- Supports models that require user/assistant alternation (like Mixtral Instruct) by merging system messages into user messages.
- Spawns the C++ [llama.cpp server](../server) under the hood (unless passed `--endpoint`), but only uses its non-chat endpoint
(depending on the prompting strategy, we weave the tool & output schema along with the chat template into the raw model grammar constraints)
- Uses the actual Jinja2 templates stored in the GGUF models
- Will eventually also spawn `whisper.cpp` and another server subprocess for the embeddings endpoint
Rationale: the C++ server lacks some OpenAI compatibility features (and can't realistically keep up with prompt templates w/o bringing in too many dependencies), so this new layer lets the C++ server focus on serving efficiency and delegates OAI compliance to a layer that is easier to maintain.
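To make the generic scheme above concrete, here is a rough sketch of the two kinds of constrained outputs a model ends up producing under that schema, and how a client would dispatch on them. Field names follow the TypeScript snippet above; the weather tool and all values are purely illustrative:
```py
# Hypothetical instances of the generic "next_step" schema (illustrative values only).
tool_step = {
    "thought_about_next_step_only": "I need the San Francisco forecast first.",
    "next_step": {
        "tool_calls": [{
            "name": "get_n_day_weather_forecast",
            "arguments": {"location": "San Francisco, CA", "format": "celsius", "num_days": 4},
        }],
    },
}

final_step = {
    "thought_about_next_step_only": "I have both forecasts, time to answer.",
    "next_step": {"result": "San Francisco: sunny; Glasgow: cloudy..."},
}

def dispatch(step: dict):
    """Either surface the tool calls to execute, or return the final result."""
    next_step = step["next_step"]
    if "tool_calls" in next_step:
        return [(tc["name"], tc["arguments"]) for tc in next_step["tool_calls"]]
    return next_step["result"]
```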
## Test
If you want to see tools in action, look at the [agent example](../agent). Otherwise:
Start the server in Terminal 1:
```bash
python -m examples.openai --model ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
```
Query it in Terminal 2 (or use it from any framework that makes use of tools; note that tool calls are guaranteed to comply with the schema, so retries are likely not necessary!):
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"tools": [{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"format": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "The temperature unit to use. Infer this from the users location."
}
},
"required": ["location", "format"]
}
}
}, {
"type": "function",
"function": {
"name": "get_n_day_weather_forecast",
"description": "Get an N-day weather forecast",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"format": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "The temperature unit to use. Infer this from the users location."
},
"num_days": {
"type": "integer",
"description": "The number of days to forecast"
}
},
"required": ["location", "format", "num_days"]
}
}
}],
"messages": [
{"role": "system", "content": "Do not make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous."},
{"role": "user", "content": "what is the weather going to be like in San Francisco and Glasgow over the next 4 days"}
]
}'
```
<details>
<summary>Show output</summary>
```json
{
"id": "chatcmpl-3095057176",
"object": "chat.completion",
"created": 1711726921,
"model": "gpt-3.5-turbo",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"name": null,
"tool_call_id": null,
"content": "In order to provide the required information, I need to call the get_n_day_weather_forecast function twice, once for San Francisco and once for Glasgow.",
"tool_calls": [
{
"id": "call_970977",
"type": "function",
"function": {
"name": "get_n_day_weather_forecast",
"arguments": {
"location": "San Francisco, CA",
"format": "celsius",
"num_days": 4
}
}
}
]
},
"logprobs": null,
"finish_reason": "tool_calls"
}
],
"usage": {
"prompt_tokens": 546,
"completion_tokens": 118,
"total_tokens": 664
},
"system_fingerprint": "...",
"error": null
}
```
</details>
## TODO
- Support tool result messages
- Reactor /
- Embedding endpoint w/ distinct server subprocess
- Automatic/manual session caching
- Evaluate options for session caching
- Spawns the main C++ CLI under the hood
- Pass session id & store / read from file?
- Support parent session ids for trees of thought?
- Support precaching long prompts from CLI
- Instant incremental inference in long threads
- Improve examples/agent:
- Interactive agent CLI that auto-discovers tools from OpenAPI endpoints
- Script that wraps any Python source as a container-sandboxed OpenAPI endpoint (allowing running ~unsafe code w/ tools)
- Basic memory / RAG / python interpreter tools
- Support precaching long prompts from CLI / read session files?
- Follow-ups
- Remove OAI support from server
- Remove non-Python json schema to grammar converters
- Remove non-Python json-schema-to-grammar versions
- Reach out to frameworks to advertise the new option.

View file

@ -1,8 +1,7 @@
from jsonargparse import CLI
import typer
from examples.openai.server import main
if __name__ == "__main__":
CLI(main)
typer.run(main)

View file

@ -1,3 +1,4 @@
from abc import ABC
from typing import Any, Dict, Literal, Optional, Union
from pydantic import BaseModel, Json, TypeAdapter
@ -10,8 +11,6 @@ class ToolCall(BaseModel):
type: Literal["function"] = "function"
function: FunctionCall
ToolCallsTypeAdapter = TypeAdapter(list[ToolCall])
class Message(BaseModel):
role: str
name: Optional[str] = None
@ -32,15 +31,7 @@ class ResponseFormat(BaseModel):
type: str
json_schema: Optional[Any] = None
class ChatCompletionRequest(BaseModel):
model: str
tools: Optional[list[Tool]] = None
messages: list[Message] = None
prompt: Optional[str] = None
response_format: Optional[ResponseFormat] = None
stream: bool = False
cache_prompt: Optional[bool] = None
class LlamaCppParams(BaseModel):
n_predict: Optional[int] = None
top_k: Optional[int] = None
top_p: Optional[float] = None
@ -63,6 +54,16 @@ class ChatCompletionRequest(BaseModel):
n_probs: Optional[int] = None
min_keep: Optional[int] = None
class ChatCompletionRequest(LlamaCppParams):
model: str
tools: Optional[list[Tool]] = None
messages: list[Message] = None
prompt: Optional[str] = None
response_format: Optional[ResponseFormat] = None
stream: bool = False
cache_prompt: Optional[bool] = None
class Choice(BaseModel):
index: int
message: Message
@ -74,6 +75,10 @@ class Usage(BaseModel):
completion_tokens: int
total_tokens: int
class CompletionError(BaseModel):
message: str
# code: int
class ChatCompletionResponse(BaseModel):
id: str
object: Literal["chat.completion"]
@ -81,4 +86,5 @@ class ChatCompletionResponse(BaseModel):
model: str
choices: list[Choice]
usage: Usage
system_fingerprint: str
system_fingerprint: str
error: Optional[CompletionError] = None

View file

@ -9,130 +9,13 @@ import re
import sys
from typing import Any, Dict, Literal, Optional, Tuple, Callable, Union
from pydantic import BaseModel
from typeguard import typechecked
# from typeguard import typechecked
from examples.json_schema_to_grammar import SchemaConverter
from examples.openai.api import Tool, Message, FunctionCall, ToolCall
from examples.openai.gguf_kvs import GGUFKeyValues, Keys
from examples.openai.ts_converter import SchemaToTypeScriptConverter
@typechecked
def raise_exception(msg: str):
raise Exception(msg)
@typechecked
class ChatTemplate(BaseModel):
template: str
@property
def tool_style(self) -> 'ToolsPromptStyle':
return self._tool_style
def __init__(self, template: str, eos_token: str, bos_token: str):
super().__init__(template=template
)
env = jinja2.Environment(loader=jinja2.BaseLoader(), trim_blocks=True, lstrip_blocks=True)
self._template = env.from_string(template)
self._eos_token = eos_token
self._bos_token = bos_token
self._strict_user_assistant_alternation = "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception" in template
if "<|recipient|>' + tool_call['function']['name']" in template:
self._tool_style = ToolsPromptStyle.TYPESCRIPT_FUNCTIONARY_V2
else:
# self._tool_style = ToolsPromptStyle.TOOLS_BESPOKE
self._tool_style = ToolsPromptStyle.TOOLS_LONG
# self._tool_style = ToolsPromptStyle.TOOLS_MISTRAL
# TODO: Test whether the template supports formatting tool_calls
delimiter = '<%$[SAMPLE]$%>'
user_msg = Message(role="user", content="Hey")
empty_prompt = self.render([user_msg], add_generation_prompt=True).strip()
planted_prompt = self.render([user_msg, Message(role="assistant", content=delimiter)], add_generation_prompt=False).strip()
assert planted_prompt.startswith(empty_prompt), f"Planted prompt does not start with empty prompt: {planted_prompt} vs {empty_prompt}"
[prefix, suffix] = planted_prompt[len(empty_prompt):].split(delimiter)
sys.stderr.write(f"\n# prefix={prefix}\n# suffix={suffix}\n\n")
self._prefix = prefix
self._suffix = suffix
def strip_suffix(self, s: str) -> str:
if s.endswith(self._suffix):
return s[:-len(self._suffix)]
else:
sys.stderr.write(f"Expected suffix ({self._suffix}) not found: {s}\n")
return s
def __str__(self):
return f"ChatTemplate(template={self.template}, eos_token={self._eos_token}, bos_token={self._bos_token})"
def add_system_prompt(self, messages: list[Message], system_prompt: Message) -> list[Message]:
assert system_prompt.role == "system"
# TODO: add to last system message, or create a new one just before the last user message
system_message = next(((i, m) for i, m in enumerate(messages) if m.role == "system"), None)
if system_message is not None:
(i, m) = system_message
return messages[:i] + [Message(role="system", content=system_prompt.content + '\n' + m.content)] + messages[i+1:]
else:
return [system_prompt] + messages
@staticmethod
def from_gguf(metadata: GGUFKeyValues):
tokens = metadata[Keys.Tokenizer.LIST]
return ChatTemplate(
template = metadata[Keys.Tokenizer.CHAT_TEMPLATE],
bos_token = tokens[metadata[Keys.Tokenizer.BOS_ID]],
eos_token = tokens[metadata[Keys.Tokenizer.EOS_ID]])
def render(self, messages: list[Message], add_generation_prompt: bool, omit_bos: bool = False):
sys.stderr.write(f'# strict_user_assistant_alternation={self._strict_user_assistant_alternation}\n')
sys.stderr.write(f'# messages=' + "\n".join(json.dumps(m.model_dump(), indent=2) for m in messages) + '\n')
if self._strict_user_assistant_alternation and any(m.role not in ('user', 'assistant') for m in messages):
new_messages=[]
i = 0
n = len(messages)
while i < n:
if messages[i].role == 'system':
assert messages[i+1].role == 'user'
new_messages.append(Message(
role="user",
content=f'[SYS]{messages[i].content}[/SYS]\n{messages[i+1].content}'
))
i += 2
elif messages[i].role == 'assistant' and messages[i].tool_calls and messages[i].content:
tc = '\n'.join(f'<tool_call>{json.dumps(tc.model_dump())}</tool_call>' for tc in messages[i].tool_calls)
new_messages.append(Message(
role="assistant",
content=f'{messages[i].content}\n{tc}'
))
i += 1
elif messages[i].role == 'tool':
new_messages.append(Message(
role="user",
content=f'TOOL(name={messages[i].name}, id={messages[i].tool_call_id}): {messages[i].content}',
))
i += 1
else:
new_messages.append(messages[i])
i += 1
# print(f'new_messages={json.dumps(new_messages, indent=2)}')
messages = new_messages
# print(f'messages={messages}')
result = self._template.render(
messages=messages,
eos_token=self._eos_token,
bos_token='' if omit_bos else self._bos_token,
raise_exception=raise_exception,
add_generation_prompt=add_generation_prompt,
)
sys.stderr.write(f'\n# RENDERED:\n\n{result}\n\n')
return result
# While the API will be usable with a generic tools usage like OpenAI,
# (see https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models),
# each model may need specific prompting (and/or constrained output,
@ -163,6 +46,133 @@ class ToolsPromptStyle(Enum):
# Note: see this prior attempt to support Functionary: https://github.com/ggerganov/llama.cpp/pull/5695
TYPESCRIPT_FUNCTIONARY_V2 = 6
def raise_exception(msg: str):
raise Exception(msg)
class ChatTemplate(BaseModel):
template: str
@property
def tool_style(self) -> 'ToolsPromptStyle':
return self._tool_style
def __init__(self, template: str, eos_token: str, bos_token: str):
super().__init__(template=template
)
env = jinja2.Environment(loader=jinja2.BaseLoader(), trim_blocks=True, lstrip_blocks=True)
self._template = env.from_string(template)
self._eos_token = eos_token
self._bos_token = bos_token
self._strict_user_assistant_alternation = "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception" in template
if "<|recipient|>' + tool_call['function']['name']" in template:
self._tool_style = ToolsPromptStyle.TYPESCRIPT_FUNCTIONARY_V2
else:
self._tool_style = ToolsPromptStyle.TOOLS_BESPOKE
# self._tool_style = ToolsPromptStyle.TOOLS_LONG
# self._tool_style = ToolsPromptStyle.TOOLS_HERMES_2_PRO
# self._tool_style = ToolsPromptStyle.TOOLS_MISTRAL
# TODO: Test whether the template supports formatting tool_calls
delimiter = '<%$[SAMPLE]$%>'
user_msg = Message(role="user", content="Hey")
empty_prompt = self.render([user_msg], add_generation_prompt=True).strip()
planted_prompt = self.render([user_msg, Message(role="assistant", content=delimiter)], add_generation_prompt=False).strip()
assert planted_prompt.startswith(empty_prompt), f"Planted prompt does not start with empty prompt: {planted_prompt} vs {empty_prompt}"
[prefix, suffix] = planted_prompt[len(empty_prompt):].split(delimiter)
# sys.stderr.write(f"\n# prefix={prefix}\n# suffix={suffix}\n\n")
self._prefix = prefix
self._suffix = suffix
def strip_suffix(self, s: str) -> str:
if s.endswith(self._suffix):
return s[:-len(self._suffix)]
else:
sys.stderr.write(f"Expected suffix ({self._suffix}) not found: {s}\n")
return s
def __str__(self):
return f"ChatTemplate(template={self.template}, eos_token={self._eos_token}, bos_token={self._bos_token})"
def add_system_prompt(self, messages: list[Message], system_prompt: Message) -> list[Message]:
assert system_prompt.role == "system"
# TODO: add to last system message, or create a new one just before the last user message
system_message = next(((i, m) for i, m in enumerate(messages) if m.role == "system"), None)
if system_message is not None:
(i, m) = system_message
return messages[:i] + [Message(role="system", content=system_prompt.content + '\n' + m.content)] + messages[i+1:]
else:
return [system_prompt] + messages
@staticmethod
def from_gguf(metadata: GGUFKeyValues):
if Keys.Tokenizer.CHAT_TEMPLATE not in metadata:
raise NotImplementedError(f'Only supporting models with {Keys.Tokenizer.CHAT_TEMPLATE} entry in their GGUF key-values (TODO: add default template, maybe pick llama2\'s?)')
tokens = metadata[Keys.Tokenizer.LIST]
return ChatTemplate(
template = metadata[Keys.Tokenizer.CHAT_TEMPLATE],
bos_token = tokens[metadata[Keys.Tokenizer.BOS_ID]],
eos_token = tokens[metadata[Keys.Tokenizer.EOS_ID]])
@staticmethod
def from_huggingface(model_id: str):
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(model_id)
return ChatTemplate(
template = tokenizer.chat_template or tokenizer.default_chat_template,
bos_token = tokenizer.bos_token,
eos_token = tokenizer.eos_token)
def render(self, messages: list[Message], add_generation_prompt: bool, omit_bos: bool = False):
# sys.stderr.write(f'# strict_user_assistant_alternation={self._strict_user_assistant_alternation}\n')
# sys.stderr.write(f'# messages=' + "\n".join(json.dumps(m.model_dump(), indent=2) for m in messages) + '\n')
if self._strict_user_assistant_alternation and any(m.role not in ('user', 'assistant') for m in messages):
new_messages=[]
i = 0
n = len(messages)
while i < n:
if messages[i].role == 'system':
assert messages[i+1].role == 'user'
new_messages.append(Message(
role="user",
content=f'[SYS]{messages[i].content}[/SYS]\n{messages[i+1].content}'
))
i += 2
elif messages[i].role == 'assistant' and messages[i].tool_calls and messages[i].content:
tc = '\n'.join(f'<tool_call>{json.dumps(tc.model_dump())}</tool_call>' for tc in messages[i].tool_calls)
new_messages.append(Message(
role="assistant",
content=f'{messages[i].content}\n{tc}'
))
i += 1
elif messages[i].role == 'tool':
new_messages.append(Message(
role="user",
content=f'TOOL RESULT(name={messages[i].name}, id={messages[i].tool_call_id}): {messages[i].content}',
))
i += 1
else:
new_messages.append(messages[i])
i += 1
# print(f'new_messages={json.dumps(new_messages, indent=2)}')
messages = new_messages
# print(f'messages={messages}')
result = self._template.render(
messages=messages,
eos_token=self._eos_token,
bos_token='' if omit_bos else self._bos_token,
raise_exception=raise_exception,
add_generation_prompt=add_generation_prompt,
)
# sys.stderr.write(f'\n# RENDERED:\n\n{result}\n\n')
return result
class ChatHandlerArgs(BaseModel):
chat_template: ChatTemplate
response_schema: Optional[dict] = None
@ -189,12 +199,14 @@ class NoToolsChatHandler(ChatHandler):
content=_please_respond_with_schema(args.response_schema)
)
converter = SchemaConverter(prop_order={}, allow_fetch=False, dotall=False, raw_pattern=False)
self.grammar = converter.visit(args.response_schema, '')
schema = converter.resolve_refs(args.response_schema, 'response')
converter.visit(schema, '')
self.grammar = converter.format_grammar()
else:
self.output_format_prompt = None
self.grammar = None
@typechecked
# @typechecked
def parse(self, s: str) -> Optional[Message]:
return Message(role="assistant", content=s)
@ -203,21 +215,24 @@ class ToolCallTagsChatHandler(ChatHandler):
super().__init__(args)
converter = SchemaConverter(prop_order={}, allow_fetch=False, dotall=False, raw_pattern=False)
tool_rules = [
converter.visit(
tool_rules = []
for tool in self.args.tools:
parameters_schema = tool.function.parameters
parameters_schema = converter.resolve_refs(parameters_schema, tool.function.name)
tool_rules.append(converter.visit(
dict(
type="object",
properties=dict(
name=dict(type="string", pattern='^' + tool.function.name.replace('_', f'\\?_') + '$') if escapes_underscores \
else dict(const=tool.function.name),
arguments=tool.function.parameters,
arguments=parameters_schema,
),
required=['name', 'arguments']
),
f'{tool.function.name}-tool-call'
)
for tool in self.args.tools
]
))
def format_literal(s: str) -> str:
if escapes_underscores:
@ -253,7 +268,7 @@ class ToolCallTagsChatHandler(ChatHandler):
# ") " + converter._format_literal("</tool_call>") +
# ")") # + converter._format_literal(suffix))
@typechecked
# @typechecked
def parse(self, s: str) -> Optional[Message]:
s = self.args.chat_template.strip_suffix(s)
@ -386,7 +401,7 @@ class FunctionaryToolsChatHandler(ChatHandler):
# ") " +
# ")") # + converter._format_literal(suffix))
@typechecked
# @typechecked
def parse(self, s: str) -> Optional[Message]:
s = self.args.chat_template.strip_suffix(s)
@ -422,7 +437,7 @@ def _make_bespoke_schema(response_schema, tool_call_schema, allow_parallel_calls
return {
"type": "object",
"properties": {
"original_goal": {"title": "Original Goal", "type": "string"},
# "original_goal": {"title": "Original Goal", "type": "string"},
"thought_about_next_step_only": {
"title": "Thought about next step",
# "title": "Thought about how the next step brings us closer to achieving the original goal",
@ -455,6 +470,7 @@ def _make_bespoke_schema(response_schema, tool_call_schema, allow_parallel_calls
},
},
"required": ["original_goal", "thought_about_next_step_only", "next_step"]
# "required": ["next_step"]
}
class BespokeToolsChatHandler(ChatHandler):
@ -513,7 +529,7 @@ class BespokeToolsChatHandler(ChatHandler):
])
)
@typechecked
# @typechecked
def parse(self, s: str) -> Optional[Message]:
s = self.args.chat_template.strip_suffix(s)
try:
@ -527,7 +543,7 @@ class BespokeToolsChatHandler(ChatHandler):
elif 'tool_calls' in next_step:
return Message(
role="assistant",
content=data["thought_about_next_step_only"],
content=data["thought_about_next_step_only"] if "thought_about_next_step_only" in data else None,
tool_calls=[
ToolCall(id=gen_callid(), function=FunctionCall(**tc))
for tc in next_step['tool_calls']
@ -545,7 +561,8 @@ _SHORT_TEMPLATE='\n'.join([
_LONG_TEMPLATE='\n'.join([
# '''You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.''',
'You may call one or more functions to assist with the user query. Don\'t make assumptions about what values to plug into functions. Here are the available tools:',
# 'You may call one or more functions to assist with the user query. Don\'t make assumptions about what values to plug into functions. Here are the available tools:',
'Call one or more functions to assist with the user query, every time this is possible. Don\'t make assumptions about what values to plug into functions. Here are the available tools:',
'<tools>',
'{tools}',
'</tools>',
@ -564,7 +581,7 @@ def get_chat_handler(args: ChatHandlerArgs, allow_parallel_calls=False) -> ChatH
if not args.tools:
return NoToolsChatHandler(args)
elif args.chat_template.tool_style == ToolsPromptStyle.TYPESCRIPT_FUNCTIONARY_V2:
return FunctionaryToolsChatHandler(args)
return FunctionaryToolsChatHandler(args, allow_parallel_calls=False)
elif args.chat_template.tool_style == ToolsPromptStyle.TOOLS_SHORT:
return TemplatedToolsChatHandler(args, _SHORT_TEMPLATE, allow_parallel_calls=allow_parallel_calls)
elif args.chat_template.tool_style == ToolsPromptStyle.TOOLS_LONG:

View file

@ -1,344 +0,0 @@
# Usage:
#! ./server -m some-model.gguf &
#! pip install pydantic
#! python examples/json-schema-pydantic-example.py
#
# TODO:
# - https://github.com/NousResearch/Hermes-Function-Calling
#
# <|im_start|>system
# You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags
# You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
# <tools> {'type': 'function', 'function': {'name': 'get_stock_fundamentals',
# 'description': 'get_stock_fundamentals(symbol: str) -> dict - Get fundamental data for a given stock symbol using yfinance API.\n\n Args:\n symbol (str): The stock symbol.\n\n Returns:\n dict: A dictionary containing fundamental data.', 'parameters': {'type': 'object', 'properties': {'symbol': {'type': 'string'}}, 'required': ['symbol']}}}
# </tools> Use the following pydantic model json schema for each tool call you will make: {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']} For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
# <tool_call>
# {'arguments': <args-dict>, 'name': <function-name>}
# </tool_call><|im_end|>
from dataclasses import dataclass
import subprocess
import sys
from pydantic import BaseModel, TypeAdapter
from annotated_types import MinLen
from typing import Annotated, Callable, List, Union, Literal, Optional, Type, get_args, get_origin
import json, requests
from examples.openai.api import ToolCallsTypeAdapter
def type_to_str(t):
origin = get_origin(t)
if origin is None:
return t.__name__
args = get_args(t)
return origin.__name__ + (
f'[{", ".join(type_to_str(a) for a in args)}]' if args else ''
)
def build_union_type_adapter(*types):
src = '\n'.join([
'from pydantic import TypeAdapter',
'from typing import Union',
f'_out = TypeAdapter(Union[{", ".join(type_to_str(t) for t in types)}])',
])
globs = {
**globals(),
**{t.__name__: t for t in types},
}
exec(src, globs)
return globs['_out']
class Thought(BaseModel):
thought: str
def build_tool_call_adapter2(final_output_type, *tools):
lines = [
'from pydantic import BaseModel, TypeAdapter',
'from typing import Literal, Union',
]
globs = {
**globals(),
**locals(),
final_output_type.__name__: final_output_type,
}
tool_calls = []
for fn in tools:
# TODO: escape fn.__doc__ and fn.__doc__ to avoid comment or metadata injection!
fn_name = fn.__name__
fn_doc = fn.__doc__.replace('"""', "'''") if fn.__doc__ else None
name = fn_name.replace('_', ' ').title().replace(' ', '')
lines += [
f'class {name}ToolArgs(BaseModel):',
*(f' {k}: {type_to_str(v)}' for k, v in fn.__annotations__.items() if k != 'return'),
f'class {name}ToolCall(BaseModel):',
*([f' """{fn_doc}"""'] if fn_doc else []),
f' name: Literal["{fn_name}"]',
f' arguments: {name}ToolArgs',
f'class {name}Tool(BaseModel):',
# *([f' """{fn_doc}"""'] if fn_doc else []),
f' id: str',
f' type: Literal["function"]',
f' function: {name}ToolCall',
f' def __call__(self) -> {type_to_str(fn.__annotations__.get("return"))}:',
f' return {fn_name}(**self.function.arguments.dict())',
]
tool_calls.append(f'{name}Tool')
lines += [
# 'class FinalResult(BaseModel):',
# f' result: {type_to_str(final_output_type)}',
# 'class Response(BaseModel):',
# f' """A response that starts with a thought about whether we need tools or not, the plan about tool usage (maybe a sequence of tool calls), and then either a final result (of type {final_output_type.__name__}) or a first tool call"""',
# f' original_goal: str',
# f' thought_process: str',
# # f' thought: str',
# f' next_step: Union[FinalResult, {", ".join(tool_calls)}]',
# f'response_adapter = TypeAdapter(Response)'
f'response_adapter = TypeAdapter(Union[{", ".join(tool_calls)}])',
]
exec('\n'.join(lines), globs)
return globs['response_adapter']
def create_completion2(*, response_model=None, max_tool_iterations=None, tools=[], endpoint="http://localhost:8080/v1/chat/completions", messages, **kwargs):
'''
Creates a chat completion using an OpenAI-compatible endpoint w/ JSON schema support
(llama.cpp server, llama-cpp-python, Anyscale / Together...)
The response_model param takes a type (+ supports Pydantic) and behaves just as w/ Instructor (see below)
'''
if response_model:
type_adapter = TypeAdapter(response_model)
schema = type_adapter.json_schema()
# messages = [{
# "role": "system",
# "content": f"Respond in JSON format with the following schema: {json.dumps(schema, indent=2)}"
# }] + messages
# print("Completion: ", json.dumps(messages, indent=2))
# print("SCHEMA: " + json.dumps(schema, indent=2))
response_format={"type": "json_object", "schema": schema }
tool_call_adapter = build_tool_call_adapter2(response_model, *tools)
tool_adapters = [(fn, TypeAdapter(fn)) for fn in tools]
tools_schemas = [{
"type": "function",
"function": {
"name": fn.__name__,
"description": fn.__doc__,
"parameters": ta.json_schema()
}
} for (fn, ta) in tool_adapters]
# messages = [{
# "role": "system",
# "content": '\n'.join([
# # "You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.",
# # "You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:",
# # f'<tools>{json.dumps(tools_schemas)}</tools>',
# 'Before calling each tool, you think clearly and briefly about why and how you are using the tool.',
# f"Respond in JSON format with the following schema: {json.dumps(schema, indent=2)}" if schema else "",
# ])
# }] + messages
i = 0
while (max_tool_iterations is None or i < max_tool_iterations):
body=dict(
messages=messages,
response_format=response_format,
tools=tools_schemas,
**kwargs
)
# sys.stderr.write(f'# REQUEST: {json.dumps(body, indent=2)}\n')
response = requests.post(
endpoint,
headers={"Content-Type": "application/json"},
json=body,
)
if response.status_code != 200:
raise Exception(f"Request failed ({response.status_code}): {response.text}")
# sys.stderr.write(f"\n# RESPONSE:\n\n<<<{response.text}>>>\n\n")
data = response.json()
if 'error' in data:
raise Exception(data['error']['message'])
# sys.stderr.write(f"\n# RESPONSE DATA:\n\n{json.dumps(data, indent=2)}\n\n")
# print(json.dumps(data, indent=2))
choice = data["choices"][0]
content = choice["message"].get("content")
if choice.get("finish_reason") == "tool_calls":
# sys.stderr.write(f'\n# TOOL CALLS:\n{json.dumps(choice["message"]["tool_calls"], indent=2)}\n\n')
# tool_calls =ToolCallsTypeAdapter.validate_json(json.dumps(choice["tool_calls"]))
messages.append(choice["message"])
for tool_call in choice["message"]["tool_calls"]:
# id = tool_call.get("id")
# if id:
# del tool_call["id"]
if content:
print(f'💭 {content}')
tc = tool_call_adapter.validate_json(json.dumps(tool_call))
pretty_call = f'{tc.function.name}({", ".join(f"{k}={v}" for k, v in tc.function.arguments.model_dump().items())})'
sys.stdout.write(f'⚙️ {pretty_call}')
result = tc()
sys.stdout.write(f" -> {result}\n")
messages.append({
"tool_call_id": tc.id,
"role": "tool",
"name": tc.function.name,
# "content": f'{result}',
"content": f'{pretty_call} = {result}',
})
else:
assert content
# print(content)
# print(json.dumps(json.loads(content), indent=2))
result = type_adapter.validate_json(content) if type_adapter else content
# if isinstance(result, Thought):
# print(f'💭 {result.thought}')
# messages.append({
# "role": "assistant",
# "content": json.dumps(result.model_dump(), indent=2),
# })
# else:
return result
i += 1
if max_tool_iterations is not None:
raise Exception(f"Failed to get a valid response after {max_tool_iterations} tool calls")
if __name__ == '__main__':
class QAPair(BaseModel):
question: str
concise_answer: str
justification: str
class PyramidalSummary(BaseModel):
title: str
summary: str
question_answers: Annotated[List[QAPair], MinLen(2)]
sub_sections: Optional[Annotated[List['PyramidalSummary'], MinLen(2)]]
# print("# Summary\n", create_completion(
# model="...",
# response_model=PyramidalSummary,
# messages=[{
# "role": "user",
# "content": f"""
# You are a highly efficient corporate document summarizer.
# Create a pyramidal summary of an imaginary internal document about our company processes
# (starting high-level, going down to each sub sections).
# Keep questions short, and answers even shorter (trivia / quizz style).
# """
# }]))
import math
def eval_python_expression(expr: str) -> float:
"""
Evaluate a Python expression reliably.
This can be used to compute complex nested mathematical expressions, or any python, really.
"""
print("# Evaluating expression: ", expr)
return "0.0"
def add(a: float, b: float) -> float:
"""
Add a and b reliably.
Don't use this tool to compute the square of a number (use multiply or pow instead)
"""
return a + b
# def say(something: str) -> str:
# """
# Just says something. Used to say each thought out loud
# """
# return subprocess.check_call(["say", something])
def multiply(a: float, b: float) -> float:
"""Multiply a with b reliably"""
return a * b
def divide(a: float, b: float) -> float:
"""Divide a by b reliably"""
return a / b
def pow(value: float, power: float) -> float:
"""
Raise a value to a power (exponent) reliably.
The square of x is pow(x, 2), its cube is pow(x, 3), etc.
"""
return math.pow(value, power)
result = create_completion2(
model="...",
response_model=str,
tools=[add, multiply, divide, pow], #, say],#, eval_python_expression],
# tools=[eval_python_expression],
temperature=0.0,
# repetition_penalty=1.0,
n_predict=1000,
top_k=1,
top_p=0.0,
# logit_bias={
# i: 10.0
# for i in range(1, 259)
# },
messages=[{
# "role": "system",
# "content": f"""
# You are a reliable assistant. You think step by step and think before using tools
# """
# }, {
"role": "user",
# "content": f"""
# What is 10 squared?
# """
"content": f"""
What is the sum of 2535 squared and 32222000403 then multiplied by one and a half. What's a third of the result?
Keep your goal in mind at every step.
"""
# Think step by step, start expressing the problem as an arithmetic expression
}])
# result = create_completion(
# model="...",
# response_model=float,
# tools=[add, multiply, divide, pow], #, say],#, eval_python_expression],
# temperature=0.0,
# # logit_bias={
# # i: 10.0
# # for i in range(1, 259)
# # },
# messages=[{
# "role": "user",
# # "content": f"""
# # What is 10 squared?
# # """
# "content": f"""
# What is the sum of 2535 squared and 32222000403 then multiplied by one and a half. What's a third of the result?
# """
# # Think step by step, start expressing the problem as an arithmetic expression
# }])
# 💭 First, I need to square the number 2535. For this, I will use the 'pow' tool.
# ⚙️ pow(args={'value': 2535.0, 'power': 2.0})-> 6426225.0
# 💭 Now that I have the square of 2535, I need to add it to 32222000403.0 and store the result.
# ⚙️ add(args={'a': 6426225.0, 'b': 32222000403.0})-> 32228426628.0
# 💭 Now that I have the sum of 2535 squared and 32222000403, I need to multiply it by 1.5.
# ⚙️ pow(args={'value': 32228426628.0, 'power': 1.5})-> 5785736571757004.0
# 💭 Now that I have the result of the sum multiplied by 1.5, I need to divide it by 3 to get a third of the result.
# ⚙️ divide(args={'a': 5785736571757004.0, 'b': 3.0})-> 1928578857252334.8
# 💭 I have now calculated a third of the result, which is 1928578857252334.8. I can now share this as the final answer.
# Result: 1928578857252334.8
expected_result = (2535 ** 2 + 32222000403) * 1.5 / 3.0
print("➡️", result)
assert math.fabs(result - expected_result) < 0.0001, f"Expected {expected_result}, got {result}"

View file

@ -21,39 +21,56 @@ import random
from starlette.responses import StreamingResponse
from typing import Annotated, Optional
import typer
from typeguard import typechecked
def generate_id(prefix):
return f"{prefix}{random.randint(0, 1 << 32)}"
def main(
model: Annotated[Optional[Path], typer.Option("--model", "-m")] = "models/7B/ggml-model-f16.gguf",
# model: Path = Path("/Users/ochafik/AI/Models/Hermes-2-Pro-Mistral-7B.Q8_0.gguf"),
template_hf_model_id_fallback: Annotated[Optional[str], typer.Option(help="If the GGUF model does not contain a chat template, get it from this HuggingFace tokenizer")] = 'meta-llama/Llama-2-7b-chat-hf',
# model_url: Annotated[Optional[str], typer.Option("--model-url", "-mu")] = None,
host: str = "localhost",
port: int = 8080,
cpp_server_endpoint: Optional[str] = None,
cpp_server_host: str = "localhost",
cpp_server_port: Optional[int] = 8081,
auth: Optional[str] = None,
verbose: bool = False,
context_length: Optional[int] = None,
endpoint: Optional[str] = None,
server_host: str = "localhost",
server_port: Optional[int] = 8081,
):
import uvicorn
metadata = GGUFKeyValues(model)
context_length = metadata[Keys.LLM.CONTEXT_LENGTH]
chat_template = ChatTemplate.from_gguf(metadata)
# print(chat_template)
if endpoint:
sys.stderr.write(f"# WARNING: Unsure which model we're talking to, fetching its chat template from HuggingFace tokenizer of {template_hf_model_id_fallback}\n")
chat_template = ChatTemplate.from_huggingface(template_hf_model_id_fallback)
else:
metadata = GGUFKeyValues(model)
if not cpp_server_endpoint:
sys.stderr.write(f"# Starting C++ server with model {model} on {cpp_server_host}:{cpp_server_port}\n")
if not context_length:
context_length = metadata[Keys.LLM.CONTEXT_LENGTH]
if Keys.Tokenizer.CHAT_TEMPLATE in metadata:
chat_template = ChatTemplate.from_gguf(metadata)
else:
sys.stderr.write(f"# WARNING: Model does not contain a chat template, fetching it from HuggingFace tokenizer of {template_hf_model_id_fallback}\n")
chat_template = ChatTemplate.from_huggingface(template_hf_model_id_fallback)
if verbose:
sys.stderr.write(f"# CHAT TEMPLATE:\n\n{chat_template}\n\n")
if verbose:
sys.stderr.write(f"# Starting C++ server with model {model} on {server_host}:{server_port}\n")
server_process = subprocess.Popen([
"./server", "-m", model,
"--host", cpp_server_host, "--port", f'{cpp_server_port}',
"--host", server_host, "--port", f'{server_port}',
# TODO: pass these from JSON / BaseSettings?
'-ctk', 'q4_0', '-ctv', 'f16',
"-c", f"{2*8192}",
# "-c", f"{context_length}",
"-c", f"{context_length}",
*([] if verbose else ["--log-disable"]),
], stdout=sys.stderr)
atexit.register(server_process.kill)
cpp_server_endpoint = f"http://{cpp_server_host}:{cpp_server_port}"
endpoint = f"http://{server_host}:{server_port}/completions"
app = FastAPI()
@ -62,8 +79,8 @@ def main(
headers = {
"Content-Type": "application/json",
}
if (auth := request.headers.get("Authorization")):
headers["Authorization"] = auth
if (auth_value := request.headers.get("Authorization", auth)):
headers["Authorization"] = auth_value
if chat_request.response_format is not None:
assert chat_request.response_format.type == "json_object", f"Unsupported response format: {chat_request.response_format.type}"
@ -79,9 +96,12 @@ def main(
prompt = chat_template.render(messages, add_generation_prompt=True)
sys.stderr.write(f'\n# MESSAGES:\n\n{TypeAdapter(list[Message]).dump_json(messages)}\n\n')
sys.stderr.write(f'\n# PROMPT:\n\n{prompt}\n\n')
sys.stderr.write(f'\n# GRAMMAR:\n\n{chat_handler.grammar}\n\n')
if verbose:
sys.stderr.write(f'\n# REQUEST:\n\n{chat_request.model_dump_json(indent=2)}\n\n')
# sys.stderr.write(f'\n# MESSAGES:\n\n{TypeAdapter(list[Message]).dump_json(messages)}\n\n')
sys.stderr.write(f'\n# PROMPT:\n\n{prompt}\n\n')
sys.stderr.write(f'\n# GRAMMAR:\n\n{chat_handler.grammar}\n\n')
data = LlamaCppServerCompletionRequest(
**{
@ -101,7 +121,7 @@ def main(
async with httpx.AsyncClient() as client:
response = await client.post(
f"{cpp_server_endpoint}/completions",
f"{endpoint}",
json=data,
headers=headers,
timeout=None)
@ -112,7 +132,8 @@ def main(
return StreamingResponse(generate_chunks(response), media_type="text/event-stream")
else:
result = response.json()
sys.stderr.write("# RESULT:\n\n" + json.dumps(result, indent=2) + "\n\n")
if verbose:
sys.stderr.write("# RESULT:\n\n" + json.dumps(result, indent=2) + "\n\n")
if 'content' not in result:
# print(json.dumps(result, indent=2))
return JSONResponse(result)