agent: split code from openai example

This commit is contained in:
ochafik 2024-03-29 16:17:59 +00:00
parent 253b68d9a7
commit e874565a13
18 changed files with 1010 additions and 608 deletions

175
examples/agent/README.md Normal file
View file

@ -0,0 +1,175 @@
# examples.agent: Interactive agent that can use Python tools!
Have any LLM use local (sandboxed) tools, with a simple CLI.
```bash
python -m examples.agent \
--model ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
--tools examples/agent/tools/example_math_tools.py \
--goal "What is the sum of 2535 squared and 32222000403 then multiplied by one and a half. What's a third of the result?"
```
<details>
<summary>Show output</summary>
```bash
💭 First, I will calculate the square of 2535, then add it to 32222000403. After that, I will multiply the result by 1.5 and finally, I will divide the result by 3.
⚙️ pow(value=2535, power=2) -> 6426225.0
💭 Now that I have calculated the square of 2535, I will calculate the sum of 6426225 and 32222000403.
⚙️ add(a=6426225, b=32222000403) -> 32228426628
💭 Now that I have calculated the sum, I will multiply it by 1.5.
⚙️ multiply(a=32228426628, b=1.5) -> 48342639942.0
💭 Now that I have calculated the product, I will divide it by 3.
⚙️ divide(a=48342639942.0, b=3) -> 16114213314.0
➡️ "\nThe result of the calculation is 16114213314.0."
```
</details>
```bash
python -m examples.agent \
--tools examples/agent/tools/example_weather_tools.py \
--goal "What is the weather going to be like in San Francisco and Glasgow over the next 4 days."
```
<details>
<summary>Show output</summary>
```bash
```
</details>
```bash
python -m examples.agent \
--model ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
--std_tools \
--goal "Wait 10sec then say Hi out loud"
```
<details>
<summary>Show output</summary>
```bash
```
</details>
## Prerequisites
Note: To get conda, just install Miniforge (it's OSS): https://github.com/conda-forge/miniforge
```bash
conda create -n agent python=3.11
conda activate agent
pip install -r examples/agent/requirements.txt
pip install -r examples/openai/requirements.txt
```
## Components
This example relies on the new [OpenAI compatibility server](../openai).
```
agent.py → examples.openai → server.cpp
→ safe_tools.py
→ ( run_sandboxed_tools.sh : Docker → fastify.py ) → unsafe_tools.py → code interpreter, etc...
```
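A tool module in this diagram is just plain Python: any top-level, type-annotated function whose name doesn't start with an underscore gets collected (by the agent directly, or by `fastify.py` when sandboxed) and advertised to the model, with its docstring used as the tool description. A minimal hypothetical sketch (this file and function are illustrative only, not part of this commit; the real examples live under [tools/](./tools)):
```py
# hypothetical_tools.py -- illustrative only, not part of this commit
import json
import urllib.request

def fetch_json(url: str, timeout_seconds: float = 10.0) -> dict:
    """
    Fetch a URL and parse its body as JSON.
    The type annotations and this docstring are all the agent needs
    to derive the tool's schema and description.
    """
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))
```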
The agent can use tools written in Python, or (soon) tools exposed under OpenAPI endpoints. It only has standard Python deps (e.g. no langchain).
- Can call into any OpenAI-compatible endpoint that supports tool calling, and spawns a local one if `--endpoint` isn't specified
(can pass all llama.cpp params)
- [Standard tools](./tools/std_tools.py) include "safe" TTS, wait for/until helpers, and *requesting user input*.
- Tools are often "unsafe" (e.g. [Python execution functions](./tools/unsafe_python_tools.py)),
so we provide a script to run them in a Docker-sandboxed environment, exposed as an OpenAPI server:
```bash
examples/agent/run_sandboxed_tools.sh \
examples/agent/tools/unsafe_python_tools.py 6666 &
python -m examples.agent \
--model ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
--tools http://localhost:6666 \
--goal "What's cos(123) / 23 * 12.6 ?"
```
- [fastify.py](./fastify.py) turns a Python module into an OpenAPI endpoint using FastAPI
- [run_sandboxed_tools.sh](./run_sandboxed_tools.sh) builds and runs a Docker environment with fastify inside it, and exposes its port locally
- Beyond just "tools", the output format can be constrained using JSON schemas or Pydantic types (see the sketch after the example below):
```bash
python -m examples.agent \
--model ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
--tools examples/agent/tools/example_summaries.py \
--format PyramidalSummary \
--goal "Create a pyramidal summary of Mankind's recent advancements"
```
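Under the hood, `--format` resolves the name to a Python / Pydantic type and sends its JSON schema as the `response_format` of the chat completion request; the final answer is then validated back into that type. A rough sketch of that mechanism, reusing the `PyramidalSummary` model from [example_summaries.py](./tools/example_summaries.py) (the request wiring is simplified here):
```py
# Rough sketch of what --format PyramidalSummary amounts to (simplified from agent.py).
from typing import Annotated, List, Optional
from annotated_types import MinLen
from pydantic import BaseModel, TypeAdapter

class QAPair(BaseModel):
    question: str
    concise_answer: str
    justification: str

class PyramidalSummary(BaseModel):
    title: str
    summary: str
    question_answers: Annotated[List[QAPair], MinLen(2)]
    sub_sections: Optional[Annotated[List['PyramidalSummary'], MinLen(2)]]

type_adapter = TypeAdapter(PyramidalSummary)

# Sent alongside the messages & tools in the chat completion request:
response_format = {"type": "json_object", "schema": type_adapter.json_schema()}

# ...and used to validate the model's final (non-tool-call) message:
# summary = type_adapter.validate_json(final_message_content)
```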
## Launch parts separately
If you'd like to debug each binary separately (rather than have the agent spawn an OAI-compat proxy that in turn spawns a C++ server), you can run these commands:
```bash
# C++ server
make -j server
./server --model mixtral.gguf --port 8081
# OpenAI compatibility layer
python -m examples.openai \
--port 8080 \
--endpoint http://localhost:8081 \
--template_hf_model_id_fallback mistralai/Mixtral-8x7B-Instruct-v0.1
# Or have the OpenAI compatibility layer spawn the C++ server under the hood:
# python -m examples.openai --model mixtral.gguf
# Agent itself:
python -m examples.agent --endpoint http://localhost:8080
```
## Use existing tools (WIP)
```bash
git clone https://github.com/NousResearch/Hermes-Function-Calling examples/openai/hermes_function_calling
```
Then edit `examples/openai/hermes_function_calling/utils.py`:
```py
log_folder = os.environ.get('LOG_FOLDER', os.path.join(script_dir, "inference_logs"))
```
Then run tools in a sandbox:
```bash
REQUIREMENTS_FILE=<( cat examples/openai/hermes_function_calling/requirements.txt | grep -vE "bitsandbytes|flash-attn" ) \
examples/agent/run_sandboxed_tools.sh \
examples/openai/hermes_function_calling/functions.py \
-e LOG_FOLDER=/data/inference_logs
```
## TODO
- Add model URL / HF loading support
- Add Embedding endpoint + storage / retrieval tools (Faiss? ScaNN?), or spontaneous RAG
- Auto discover tools exposed by an OpenAPI endpoint
- Add a Python notebook tool example
- Update `run_sandboxed_tools.sh` to support dev mode (`uvicorn fastify:app --reload`)
- Follow-ups (depending on the vibe)
- Remove OAI support from server
- Remove non-Python json schema to grammar converters

View file

@ -0,0 +1,6 @@
import typer
from examples.agent.agent import main
if __name__ == "__main__":
typer.run(main)

243
examples/agent/agent.py Normal file
View file

@ -0,0 +1,243 @@
import atexit
from pathlib import Path
import subprocess
import sys
from time import sleep
import typer
from pydantic import Json, TypeAdapter
from typing import Annotated, Callable, List, Union, Optional, Type
import json, requests
from examples.json_schema_to_grammar import SchemaConverter
from examples.agent.tools.std_tools import StandardTools
from examples.openai.api import ChatCompletionRequest, ChatCompletionResponse, Message, Tool, ToolFunction
from examples.agent.utils import collect_functions, load_module
def _get_params_schema(fn: Callable, verbose):
converter = SchemaConverter(prop_order={}, allow_fetch=False, dotall=False, raw_pattern=False)
schema = TypeAdapter(fn).json_schema()
# Do NOT call converter.resolve_refs(schema) here. Let the server resolve local refs.
if verbose:
sys.stderr.write(f'# PARAMS SCHEMA: {json.dumps(schema, indent=2)}\n')
return schema
def completion_with_tool_usage(
*,
response_model: Optional[Union[Json, Type]]=None,
max_tool_iterations: Optional[int]=None,
tools: List[Callable],
endpoint: str,
messages: List[Message],
auth: Optional[str],
verbose: bool,
**kwargs):
'''
Creates a chat completion using an OpenAI-compatible endpoint w/ JSON schema support
(llama.cpp server, llama-cpp-python, Anyscale / Together...)
The response_model param takes a type (+ supports Pydantic) and behaves just as w/ Instructor (see below)
'''
response_format = None
type_adapter = None
if response_model:
if isinstance(response_model, dict):
schema = response_model
else:
type_adapter = TypeAdapter(response_model)
schema = type_adapter.json_schema()
response_format={"type": "json_object", "schema": schema }
tool_map = {fn.__name__: fn for fn in tools}
tools_schemas = [
Tool(
type="function",
function=ToolFunction(
name=fn.__name__,
description=fn.__doc__,
parameters=_get_params_schema(fn, verbose=verbose)
)
)
for fn in tools
]
i = 0
while (max_tool_iterations is None or i < max_tool_iterations):
request = ChatCompletionRequest(
messages=messages,
response_format=response_format,
tools=tools_schemas,
**kwargs,
)
if verbose:
sys.stderr.write(f'# REQUEST: {request.model_dump_json(indent=2)}\n')
headers = {
"Content-Type": "application/json",
}
if auth:
headers["Authorization"] = auth
response = requests.post(
endpoint,
headers=headers,
json=request.model_dump(),
)
if response.status_code != 200:
raise Exception(f"Request failed ({response.status_code}): {response.text}")
response = ChatCompletionResponse(**response.json())
if verbose:
sys.stderr.write(f'# RESPONSE: {response.model_dump_json(indent=2)}\n')
if response.error:
raise Exception(f'Inference failed: {response.error.message}')
assert len(response.choices) == 1
choice = response.choices[0]
content = choice.message.content
if choice.finish_reason == "tool_calls":
messages.append(choice.message)
for tool_call in choice.message.tool_calls:
if content:
print(f'💭 {content}')
pretty_call = f'{tool_call.function.name}({", ".join(f"{k}={v}" for k, v in tool_call.function.arguments.items())})'
sys.stdout.write(f'⚙️ {pretty_call}')
tool_result = tool_map[tool_call.function.name](**tool_call.function.arguments)
sys.stdout.write(f" -> {tool_result}\n")
messages.append(Message(
tool_call_id=tool_call.id,
role="tool",
name=tool_call.function.name,
# content=f'{tool_result}',
content=f'{pretty_call} = {tool_result}',
))
else:
assert content
result = type_adapter.validate_json(content) if type_adapter else content
return result
i += 1
if max_tool_iterations is not None:
raise Exception(f"Failed to get a valid response after {max_tool_iterations} tool calls")
def main(
goal: Annotated[str, typer.Option()],
tools: Optional[List[str]] = None,
format: Annotated[Optional[str], typer.Option(help="The output format: either a Python type (e.g. 'float' or a Pydantic model defined in one of the tool files), or a JSON schema, e.g. '{\"format\": \"date\"}'")] = None,
max_iterations: Optional[int] = 10,
std_tools: Optional[bool] = False,
auth: Optional[str] = None,
verbose: bool = False,
model: Annotated[Optional[Path], typer.Option("--model", "-m")] = "models/7B/ggml-model-f16.gguf",
endpoint: Optional[str] = None,
context_length: Optional[int] = None,
# endpoint: str = 'http://localhost:8080/v1/chat/completions',
n_predict: Optional[int] = 1000,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
min_p: Optional[float] = None,
tfs_z: Optional[float] = None,
typical_p: Optional[float] = None,
temperature: Optional[float] = 0,
dynatemp_range: Optional[float] = None,
dynatemp_exponent: Optional[float] = None,
repeat_last_n: Optional[int] = None,
repeat_penalty: Optional[float] = None,
frequency_penalty: Optional[float] = None,
presense_penalty: Optional[float] = None,
mirostat: Optional[bool] = None,
mirostat_tau: Optional[float] = None,
mirostat_eta: Optional[float] = None,
penalize_nl: Optional[bool] = None,
n_keep: Optional[int] = None,
seed: Optional[int] = None,
n_probs: Optional[int] = None,
min_keep: Optional[int] = None,
):
if not endpoint:
server_port = 8080
server_host = 'localhost'
endpoint: str = f'http://{server_host}:{server_port}/v1/chat/completions'
if verbose:
sys.stderr.write(f"# Starting C++ server with model {model} on {endpoint}\n")
cmd = [
"python", "-m", "examples.openai.server",
"--model", model,
*(['--verbose'] if verbose else []),
*([f'--context_length={context_length}'] if context_length else []),
]
print(cmd)
server_process = subprocess.Popen(cmd, stdout=sys.stderr)
atexit.register(server_process.kill)
sleep(5)
tool_functions = []
types = {}
for f in (tools or []):
module = load_module(f)
tool_functions.extend(collect_functions(module))
types.update({
k: v
for k, v in module.__dict__.items()
if isinstance(v, type)
})
if std_tools:
tool_functions.extend(collect_functions(StandardTools))
response_model = None
if format:
if format in types:
response_model = types[format]
elif format == 'json':
response_model = {}
else:
try:
response_model = json.loads(format)
except:
response_model = eval(format)
result = completion_with_tool_usage(
model="...",
endpoint=endpoint,
response_model=response_model,
max_tool_iterations=max_iterations,
tools=tool_functions,
auth=auth,
verbose=verbose,
n_predict=n_predict,
top_k=top_k,
top_p=top_p,
min_p=min_p,
tfs_z=tfs_z,
typical_p=typical_p,
temperature=temperature,
dynatemp_range=dynatemp_range,
dynatemp_exponent=dynatemp_exponent,
repeat_last_n=repeat_last_n,
repeat_penalty=repeat_penalty,
frequency_penalty=frequency_penalty,
presense_penalty=presense_penalty,
mirostat=mirostat,
mirostat_tau=mirostat_tau,
mirostat_eta=mirostat_eta,
penalize_nl=penalize_nl,
n_keep=n_keep,
seed=seed,
n_probs=n_probs,
min_keep=min_keep,
messages=[{
"role": "user",
"content": goal,
}]
)
print(result if response_model else f'➡️ {result}')
if __name__ == '__main__':
typer.run(main)

View file

@ -3,21 +3,11 @@
This is useful in combination w/ the examples/agent/run_sandboxed_tools.sh
'''
import os, sys, typing, importlib.util
from anyio import Path
import fastapi, uvicorn
import typer
from typing import Type, List
def load_source_as_module(source):
i = 0
while (module_name := f'mod_{i}') in sys.modules:
i += 1
spec = importlib.util.spec_from_file_location(module_name, source)
module = importlib.util.module_from_spec(spec)
sys.modules[module_name] = module
spec.loader.exec_module(module)
return module
from examples.agent.utils import load_module
def bind_functions(app, module):
for k in dir(module):
@ -26,7 +16,7 @@ def bind_functions(app, module):
if k == k.capitalize():
continue
v = getattr(module, k)
if not callable(v) or isinstance(v, typing.Type):
if not callable(v) or isinstance(v, Type):
continue
if not hasattr(v, '__annotations__'):
continue
@ -41,18 +31,11 @@ def bind_functions(app, module):
except Exception as e:
print(f'WARNING: Failed to bind /{k}\n\t{e}')
def main(files: typing.List[str], host: str = '0.0.0.0', port: int = 8000):
def main(files: List[str], host: str = '0.0.0.0', port: int = 8000):
app = fastapi.FastAPI()
for f in files:
if f.endswith('.py'):
sys.path.insert(0, str(Path(f).parent))
module = load_source_as_module(f)
else:
module = importlib.import_module(f)
bind_functions(app, module)
bind_functions(app, load_module(f))
uvicorn.run(app, host=host, port=port)

View file

@ -35,23 +35,16 @@ echo "INFO: using DATA_DIR: $DATA_DIR"
cp \
"$SCRIPT_DIR/fastify-requirements.txt" \
"$SCRIPT_DIR/fastify.py" \
"$SCRIPT_DIR/utils.py" \
"$BUILD_DIR"
mkdir -p "$DATA_DIR"
PORT=${PORT:-8088}
# BASE_IMAGE=pytorch/pytorch:latest
# BASE_IMAGE=python:3.10-slim
BASE_IMAGE=python:3.11-slim
# torch
# FROM nvidia/cuda:12.1.1-runtime-ubuntu20.04
# RUN apt-get update && \
# apt-get install -y python3-pip python3-dev && \
# rm -rf /var/lib/apt/lists/*
readonly PORT=${PORT:-8088}
readonly LLAMA_IMAGE_NAME=llama.cpp/tools-base
echo "
FROM $BASE_IMAGE
FROM ${BASE_IMAGE:-python:3.11-slim}
RUN apt-get update
RUN apt-get install -y gcc python3-dev git cmake
RUN pip install --upgrade pip
@ -63,12 +56,11 @@ echo "
RUN pip install -r /root/fastify-requirements.txt
COPY script-requirements.txt /root
RUN pip install -r /root/script-requirements.txt
COPY fastify.py /root
COPY fastify.py utils.py /root
WORKDIR /data
# ENTRYPOINT uvicorn fastify:app --reload
ENTRYPOINT PYTHONPATH=/src python /root/fastify.py --port=$PORT '/src/$( basename "$script" )'
" | docker build "$BUILD_DIR" -f - -t llama.cpp/tools-base
" | docker build "$BUILD_DIR" -f - -t "$LLAMA_IMAGE_NAME"
echo "#"
echo "# Binding $script to http://localhost:$PORT/"
@ -79,4 +71,4 @@ docker run \
--mount "type=bind,source=$( realpath "$script_folder" ),target=/src,readonly" \
--mount "type=bind,source=$( realpath "$DATA_DIR" ),target=/data" \
-p "$PORT:$PORT" \
-it llama.cpp/tools-base
-it "$LLAMA_IMAGE_NAME"

View file

@ -0,0 +1,23 @@
import math
def add(a: float, b: float) -> float:
"""
Add a and b reliably.
Don't use this tool to compute the square of a number (use multiply or pow instead)
"""
return a + b
def multiply(a: float, b: float) -> float:
"""Multiply a with b reliably"""
return a * b
def divide(a: float, b: float) -> float:
"""Divide a by b reliably"""
return a / b
def pow(value: float, power: float) -> float:
"""
Raise a value to a power (exponent) reliably.
The square of x is pow(x, 2), its cube is pow(x, 3), etc.
"""
return math.pow(value, power)

View file

@ -0,0 +1,8 @@
import math
def eval_python_expression(expr: str) -> float:
"""
Evaluate a Python expression reliably.
This can be used to compute complex nested mathematical expressions, or any python, really.
"""
return eval(expr)

View file

@ -0,0 +1,16 @@
from typing import Annotated, List, Optional
from annotated_types import MinLen
from pydantic import BaseModel
class QAPair(BaseModel):
question: str
concise_answer: str
justification: str
class PyramidalSummary(BaseModel):
title: str
summary: str
question_answers: Annotated[List[QAPair], MinLen(2)]
sub_sections: Optional[Annotated[List['PyramidalSummary'], MinLen(2)]]

View file

@ -0,0 +1,36 @@
import random
from typing import Literal
def _weather(w: str, temp: float, format: str):
return f'{w}, {temp}C' if format == 'celsius' \
else f'{w}, {(temp * 9/5) + 32}F'
def get_current_weather(location: str, format: Literal["celsius", "fahrenheit"]) -> str:
'''
Get the current weather
Args:
location: The city and state, e.g. San Francisco, CA
format: The temperature unit to use. Infer this from the user's location.
'''
return _weather('Sunny', 31, format)
def get_n_day_weather_forecast(location: str, format: Literal["celsius", "fahrenheit"], num_days: int) -> str:
'''
Get an N-day weather forecast
Args:
location: The city and state, e.g. San Francisco, CA
format: The temperature unit to use. Infer this from the user's location.
num_days: The number of days to forecast
'''
random.seed(123)
return '\n'.join([
f'{num_days}-day forecast for {location}:',
*(
f'- in {i} day{"s" if i > 1 else ""}: {_weather("Sunny" if i % 2 == 0 else "Cloudy", random.randrange(15, 35), format)}'
for i in range(1, num_days + 1)
)
])

View file

@ -0,0 +1,78 @@
import atexit
from datetime import date
import datetime
import subprocess
import sys
from time import sleep
import time
import typer
from pydantic import BaseModel, Json, TypeAdapter
from annotated_types import MinLen
from typing import Annotated, Callable, List, Union, Literal, Optional, Type, get_args, get_origin
import json, requests
class Duration(BaseModel):
seconds: Optional[int] = None
minutes: Optional[int] = None
hours: Optional[int] = None
days: Optional[int] = None
months: Optional[int] = None
years: Optional[int] = None
@property
def get_total_seconds(self) -> int:
return sum([
self.seconds or 0,
(self.minutes or 0)*60,
(self.hours or 0)*3600,
(self.days or 0)*86400,
(self.months or 0)*2592000,
(self.years or 0)*31536000,
])
class WaitForDuration(BaseModel):
duration: Duration
def __call__(self):
sys.stderr.write(f"Waiting for {self.duration.get_total_seconds} seconds...\n")
time.sleep(self.duration.get_total_seconds)
class WaitForDate(BaseModel):
until: date
def __call__(self):
# Get the current date
current_date = datetime.date.today()
if self.until < current_date:
raise ValueError("Target date cannot be in the past.")
time_diff = datetime.datetime.combine(self.until, datetime.time.min) - datetime.datetime.combine(current_date, datetime.time.min)
days, seconds = time_diff.days, time_diff.seconds
sys.stderr.write(f"Waiting for {days} days and {seconds} seconds until {d}...\n")
time.sleep(days * 86400 + seconds)
sys.stderr.write(f"Reached the target date: {self.until}\n")
class StandardTools:
@staticmethod
def ask_user(question: str) -> str:
'''
Ask the user a question and return the answer.
This allows getting additional information, requesting disambiguation, etc.
'''
return typer.prompt(question)
@staticmethod
def wait(_for: Union[WaitForDuration, WaitForDate]) -> None:
'''
Wait for a certain amount of time before continuing.
This can be used to wait for a specific duration or until a specific date.
'''
return _for()
@staticmethod
def say_out_loud(something: str) -> str:
"""
Just says something. Used to say each thought out loud
"""
return subprocess.check_call(["say", something])

41
examples/agent/utils.py Normal file
View file

@ -0,0 +1,41 @@
from pathlib import Path
import sys
import importlib.util
from typing import Type
def load_source_as_module(source):
i = 0
while (module_name := f'mod_{i}') in sys.modules:
i += 1
spec = importlib.util.spec_from_file_location(module_name, source)
module = importlib.util.module_from_spec(spec)
sys.modules[module_name] = module
spec.loader.exec_module(module)
return module
def load_module(f: str):
if f.endswith('.py'):
sys.path.insert(0, str(Path(f).parent))
return load_source_as_module(f)
else:
return importlib.import_module(f)
def collect_functions(module):
for k in dir(module):
if k.startswith('_'):
continue
if k == k.capitalize():
continue
v = getattr(module, k)
if not callable(v) or isinstance(v, Type):
continue
if not hasattr(v, '__annotations__'):
continue
vt = type(v)
if vt.__module__ == 'langchain_core.tools' and vt.__name__.endswith('Tool') and hasattr(v, 'func') and callable(v.func):
v = v.func
yield v

View file

@ -1,87 +1,189 @@
# examples.openai: OpenAI API-compatible server + agent / tools examples
# examples.agent: Interactive agent that can use Python tools!
A simple Python server that sits above the C++ [server](../server) and offers improved OAI compatibility.
## Usage
Run a simple test:
New Python OpenAI API compatibility server, which calls into the C++ server under the hood:
```bash
# Spawns a Python server (which spawns a C++ Server) then hits it w/ a tool-calling request
examples/openai/test.sh
python -m examples.openai.server --model model.gguf
```
To simply run the Python server (+ C++ server under the hood):
## Prerequisites
Note: To get conda, just install Miniforge (it's OSS): https://github.com/conda-forge/miniforge
```bash
python -m examples.openai
conda create -n agent python=3.11
conda activate agent
pip install -r examples/openai/requirements.txt
```
## Tools usage (WIP)
```bash
git clone https://github.com/NousResearch/Hermes-Function-Calling examples/openai/hermes_function_calling
```
Then edit `examples/agents/hermes_function_calling/utils.py`:
```py
log_folder = os.environ.get('LOG_FOLDER', os.path.join(script_dir, "inference_logs"))
```
Then run tools in a sandbox:
```bash
REQUIREMENTS_FILE=<( cat examples/agents/hermes_function_calling/requirements.txt | grep -vE "bitsandbytes|flash-attn" ) \
examples/agents/run_sandboxed_tools.sh \
examples/agents/hermes_function_calling/functions.py \
-e LOG_FOLDER=/data/inference_logs
```
TODO: reactor that reads OpenAPI definitions and does the tool calling
## Features
The new examples/openai/server.py:
The new [examples/openai/server.py](./server.py):
- Uses llama.cpp C++ server as a backend (spawns it or connects to existing)
- Supports grammar-constrained tool calling for **all** models (incl. Mixtral 8x7B)
- Uses actual jinja2 chat templates read from the models
- Optimised support for Functionary & Nous Hermes, easy to extend to other tool-calling schemes
- Supports grammar-constrained output for both JSON response format and tool calls
- Generic support w/ a JSON schema that guides the model towards tool usage (at the cost of extra tokens; see the worked sketch at the end of this section):
- Tool calling “works” w/ all models (even non-specialized ones like Mixtral 8x7B)
```ts
{
// original_thought: string,
thought_about_next_step_only: string,
next_step: {tool_calls: {name: string, arguments: any}} | {result: T}
}
// Where T is the output JSON schema, or 'any'
```
- Option to publicise schemas to models as TypeScript signatures (as for Functionary) or JSON schema.
- Optimised support for Functionary & Nous Hermes, easy to extend to other tool-calling fine-tunes
- Supports models that require user/assistant alternation (like Mixtral Instruct) by merging system messages into user messages.
- Spawns the C++ [llama.cpp server](../server) under the hood (unless passed `--endpoint`), but only uses its non-chat endpoint
(depending on the prompting strategy, we weave the tool & output schema along with the chat template into the raw model grammar constraints)
- Uses the actual Jinja2 templates stored in the GGUF models
- Will eventually also spawn `whisper.cpp` and another server subprocess for the embeddings endpoint
Rationale: the C++ server lacks some OpenAI compatibility features (and can't realistically keep up with prompt templates w/o bringing in too many dependencies), so this new layer lets the C++ server focus on serving efficiency and delegates OAI compliance to a layer that is easier to maintain.
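To make the generic scheme above concrete, here is a rough sketch of the two kinds of constrained outputs a model ends up producing under that schema, and how a client would dispatch on them. Field names follow the TypeScript snippet above; the weather tool and all values are purely illustrative:
```py
# Hypothetical instances of the generic "next_step" schema (illustrative values only).
tool_step = {
    "thought_about_next_step_only": "I need the San Francisco forecast first.",
    "next_step": {
        "tool_calls": [{
            "name": "get_n_day_weather_forecast",
            "arguments": {"location": "San Francisco, CA", "format": "celsius", "num_days": 4},
        }],
    },
}

final_step = {
    "thought_about_next_step_only": "I have both forecasts, time to answer.",
    "next_step": {"result": "San Francisco: sunny; Glasgow: cloudy..."},
}

def dispatch(step: dict):
    """Either surface the tool calls to execute, or return the final result."""
    next_step = step["next_step"]
    if "tool_calls" in next_step:
        return [(tc["name"], tc["arguments"]) for tc in next_step["tool_calls"]]
    return next_step["result"]
```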
## Test
If you want to see tools in action, look at the [agent example](../agent). Otherwise:
Start the server in Terminal 1:
```bash
python -m examples.openai --model ~/AI/Models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
```
Query it in Terminal 2 (or use it from any framework that makes use of tools; note that tool calls are guaranteed to comply with the schema, so retries are likely not necessary!):
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"tools": [{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"format": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "The temperature unit to use. Infer this from the users location."
}
},
"required": ["location", "format"]
}
}
}, {
"type": "function",
"function": {
"name": "get_n_day_weather_forecast",
"description": "Get an N-day weather forecast",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"format": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "The temperature unit to use. Infer this from the users location."
},
"num_days": {
"type": "integer",
"description": "The number of days to forecast"
}
},
"required": ["location", "format", "num_days"]
}
}
}],
"messages": [
{"role": "system", "content": "Do not make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous."},
{"role": "user", "content": "what is the weather going to be like in San Francisco and Glasgow over the next 4 days"}
]
}'
```
<details>
<summary>Show output</summary>
```json
{
"id": "chatcmpl-3095057176",
"object": "chat.completion",
"created": 1711726921,
"model": "gpt-3.5-turbo",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"name": null,
"tool_call_id": null,
"content": "In order to provide the required information, I need to call the get_n_day_weather_forecast function twice, once for San Francisco and once for Glasgow.",
"tool_calls": [
{
"id": "call_970977",
"type": "function",
"function": {
"name": "get_n_day_weather_forecast",
"arguments": {
"location": "San Francisco, CA",
"format": "celsius",
"num_days": 4
}
}
}
]
},
"logprobs": null,
"finish_reason": "tool_calls"
}
],
"usage": {
"prompt_tokens": 546,
"completion_tokens": 118,
"total_tokens": 664
},
"system_fingerprint": "...",
"error": null
}
```
</details>
## TODO
- Support tool result messages
- Reactor /
- Embedding endpoint w/ distinct server subprocess
- Automatic/manual session caching
- Evaluate options for session caching
- Spawns the main C++ CLI under the hood
- Pass session id & store / read from file?
- Support parent session ids for trees of thought?
- Support precaching long prompts from CLI
- Instant incremental inference in long threads
- Improve examples/agent:
- Interactive agent CLI that auto-discovers tools from OpenAPI endpoints
- Script that wraps any Python source as a container-sandboxed OpenAPI endpoint (allowing running ~unsafe code w/ tools)
- Basic memory / RAG / python interpreter tools
- Support precaching long prompts from CLI / read session files?
- Follow-ups
- Remove OAI support from server
- Remove non-Python json schema to grammar converters
- Remove non-Python json-schema-to-grammar versions
- Reach out to frameworks to advertise the new option.

View file

@ -1,8 +1,7 @@
from jsonargparse import CLI
import typer
from examples.openai.server import main
if __name__ == "__main__":
CLI(main)
typer.run(main)

View file

@ -1,3 +1,4 @@
from abc import ABC
from typing import Any, Dict, Literal, Optional, Union
from pydantic import BaseModel, Json, TypeAdapter
@ -10,8 +11,6 @@ class ToolCall(BaseModel):
type: Literal["function"] = "function"
function: FunctionCall
ToolCallsTypeAdapter = TypeAdapter(list[ToolCall])
class Message(BaseModel):
role: str
name: Optional[str] = None
@ -32,15 +31,7 @@ class ResponseFormat(BaseModel):
type: str
json_schema: Optional[Any] = None
class ChatCompletionRequest(BaseModel):
model: str
tools: Optional[list[Tool]] = None
messages: list[Message] = None
prompt: Optional[str] = None
response_format: Optional[ResponseFormat] = None
stream: bool = False
cache_prompt: Optional[bool] = None
class LlamaCppParams(BaseModel):
n_predict: Optional[int] = None
top_k: Optional[int] = None
top_p: Optional[float] = None
@ -63,6 +54,16 @@ class ChatCompletionRequest(BaseModel):
n_probs: Optional[int] = None
min_keep: Optional[int] = None
class ChatCompletionRequest(LlamaCppParams):
model: str
tools: Optional[list[Tool]] = None
messages: list[Message] = None
prompt: Optional[str] = None
response_format: Optional[ResponseFormat] = None
stream: bool = False
cache_prompt: Optional[bool] = None
class Choice(BaseModel):
index: int
message: Message
@ -74,6 +75,10 @@ class Usage(BaseModel):
completion_tokens: int
total_tokens: int
class CompletionError(BaseModel):
message: str
# code: int
class ChatCompletionResponse(BaseModel):
id: str
object: Literal["chat.completion"]
@ -81,4 +86,5 @@ class ChatCompletionResponse(BaseModel):
model: str
choices: list[Choice]
usage: Usage
system_fingerprint: str
system_fingerprint: str
error: Optional[CompletionError] = None

View file

@ -9,130 +9,13 @@ import re
import sys
from typing import Any, Dict, Literal, Optional, Tuple, Callable, Union
from pydantic import BaseModel
from typeguard import typechecked
# from typeguard import typechecked
from examples.json_schema_to_grammar import SchemaConverter
from examples.openai.api import Tool, Message, FunctionCall, ToolCall
from examples.openai.gguf_kvs import GGUFKeyValues, Keys
from examples.openai.ts_converter import SchemaToTypeScriptConverter
@typechecked
def raise_exception(msg: str):
raise Exception(msg)
@typechecked
class ChatTemplate(BaseModel):
template: str
@property
def tool_style(self) -> 'ToolsPromptStyle':
return self._tool_style
def __init__(self, template: str, eos_token: str, bos_token: str):
super().__init__(template=template
)
env = jinja2.Environment(loader=jinja2.BaseLoader(), trim_blocks=True, lstrip_blocks=True)
self._template = env.from_string(template)
self._eos_token = eos_token
self._bos_token = bos_token
self._strict_user_assistant_alternation = "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception" in template
if "<|recipient|>' + tool_call['function']['name']" in template:
self._tool_style = ToolsPromptStyle.TYPESCRIPT_FUNCTIONARY_V2
else:
# self._tool_style = ToolsPromptStyle.TOOLS_BESPOKE
self._tool_style = ToolsPromptStyle.TOOLS_LONG
# self._tool_style = ToolsPromptStyle.TOOLS_MISTRAL
# TODO: Test whether the template supports formatting tool_calls
delimiter = '<%$[SAMPLE]$%>'
user_msg = Message(role="user", content="Hey")
empty_prompt = self.render([user_msg], add_generation_prompt=True).strip()
planted_prompt = self.render([user_msg, Message(role="assistant", content=delimiter)], add_generation_prompt=False).strip()
assert planted_prompt.startswith(empty_prompt), f"Planted prompt does not start with empty prompt: {planted_prompt} vs {empty_prompt}"
[prefix, suffix] = planted_prompt[len(empty_prompt):].split(delimiter)
sys.stderr.write(f"\n# prefix={prefix}\n# suffix={suffix}\n\n")
self._prefix = prefix
self._suffix = suffix
def strip_suffix(self, s: str) -> str:
if s.endswith(self._suffix):
return s[:-len(self._suffix)]
else:
sys.stderr.write(f"Expected suffix ({self._suffix}) not found: {s}\n")
return s
def __str__(self):
return f"ChatTemplate(template={self.template}, eos_token={self._eos_token}, bos_token={self._bos_token})"
def add_system_prompt(self, messages: list[Message], system_prompt: Message) -> list[Message]:
assert system_prompt.role == "system"
# TODO: add to last system message, or create a new one just before the last user message
system_message = next(((i, m) for i, m in enumerate(messages) if m.role == "system"), None)
if system_message is not None:
(i, m) = system_message
return messages[:i] + [Message(role="system", content=system_prompt.content + '\n' + m.content)] + messages[i+1:]
else:
return [system_prompt] + messages
@staticmethod
def from_gguf(metadata: GGUFKeyValues):
tokens = metadata[Keys.Tokenizer.LIST]
return ChatTemplate(
template = metadata[Keys.Tokenizer.CHAT_TEMPLATE],
bos_token = tokens[metadata[Keys.Tokenizer.BOS_ID]],
eos_token = tokens[metadata[Keys.Tokenizer.EOS_ID]])
def render(self, messages: list[Message], add_generation_prompt: bool, omit_bos: bool = False):
sys.stderr.write(f'# strict_user_assistant_alternation={self._strict_user_assistant_alternation}\n')
sys.stderr.write(f'# messages=' + "\n".join(json.dumps(m.model_dump(), indent=2) for m in messages) + '\n')
if self._strict_user_assistant_alternation and any(m.role not in ('user', 'assistant') for m in messages):
new_messages=[]
i = 0
n = len(messages)
while i < n:
if messages[i].role == 'system':
assert messages[i+1].role == 'user'
new_messages.append(Message(
role="user",
content=f'[SYS]{messages[i].content}[/SYS]\n{messages[i+1].content}'
))
i += 2
elif messages[i].role == 'assistant' and messages[i].tool_calls and messages[i].content:
tc = '\n'.join(f'<tool_call>{json.dumps(tc.model_dump())}</tool_call>' for tc in messages[i].tool_calls)
new_messages.append(Message(
role="assistant",
content=f'{messages[i].content}\n{tc}'
))
i += 1
elif messages[i].role == 'tool':
new_messages.append(Message(
role="user",
content=f'TOOL(name={messages[i].name}, id={messages[i].tool_call_id}): {messages[i].content}',
))
i += 1
else:
new_messages.append(messages[i])
i += 1
# print(f'new_messages={json.dumps(new_messages, indent=2)}')
messages = new_messages
# print(f'messages={messages}')
result = self._template.render(
messages=messages,
eos_token=self._eos_token,
bos_token='' if omit_bos else self._bos_token,
raise_exception=raise_exception,
add_generation_prompt=add_generation_prompt,
)
sys.stderr.write(f'\n# RENDERED:\n\n{result}\n\n')
return result
# While the API will be usable with a generic tools usage like OpenAI,
# (see https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models),
# each model may need specific prompting (and/or constrained output,
@ -163,6 +46,133 @@ class ToolsPromptStyle(Enum):
# Note: see this prior attempt to support Functionary: https://github.com/ggerganov/llama.cpp/pull/5695
TYPESCRIPT_FUNCTIONARY_V2 = 6
def raise_exception(msg: str):
raise Exception(msg)
class ChatTemplate(BaseModel):
template: str
@property
def tool_style(self) -> 'ToolsPromptStyle':
return self._tool_style
def __init__(self, template: str, eos_token: str, bos_token: str):
super().__init__(template=template
)
env = jinja2.Environment(loader=jinja2.BaseLoader(), trim_blocks=True, lstrip_blocks=True)
self._template = env.from_string(template)
self._eos_token = eos_token
self._bos_token = bos_token
self._strict_user_assistant_alternation = "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception" in template
if "<|recipient|>' + tool_call['function']['name']" in template:
self._tool_style = ToolsPromptStyle.TYPESCRIPT_FUNCTIONARY_V2
else:
self._tool_style = ToolsPromptStyle.TOOLS_BESPOKE
# self._tool_style = ToolsPromptStyle.TOOLS_LONG
# self._tool_style = ToolsPromptStyle.TOOLS_HERMES_2_PRO
# self._tool_style = ToolsPromptStyle.TOOLS_MISTRAL
# TODO: Test whether the template supports formatting tool_calls
delimiter = '<%$[SAMPLE]$%>'
user_msg = Message(role="user", content="Hey")
empty_prompt = self.render([user_msg], add_generation_prompt=True).strip()
planted_prompt = self.render([user_msg, Message(role="assistant", content=delimiter)], add_generation_prompt=False).strip()
assert planted_prompt.startswith(empty_prompt), f"Planted prompt does not start with empty prompt: {planted_prompt} vs {empty_prompt}"
[prefix, suffix] = planted_prompt[len(empty_prompt):].split(delimiter)
# sys.stderr.write(f"\n# prefix={prefix}\n# suffix={suffix}\n\n")
self._prefix = prefix
self._suffix = suffix
def strip_suffix(self, s: str) -> str:
if s.endswith(self._suffix):
return s[:-len(self._suffix)]
else:
sys.stderr.write(f"Expected suffix ({self._suffix}) not found: {s}\n")
return s
def __str__(self):
return f"ChatTemplate(template={self.template}, eos_token={self._eos_token}, bos_token={self._bos_token})"
def add_system_prompt(self, messages: list[Message], system_prompt: Message) -> list[Message]:
assert system_prompt.role == "system"
# TODO: add to last system message, or create a new one just before the last user message
system_message = next(((i, m) for i, m in enumerate(messages) if m.role == "system"), None)
if system_message is not None:
(i, m) = system_message
return messages[:i] + [Message(role="system", content=system_prompt.content + '\n' + m.content)] + messages[i+1:]
else:
return [system_prompt] + messages
@staticmethod
def from_gguf(metadata: GGUFKeyValues):
if Keys.Tokenizer.CHAT_TEMPLATE not in metadata:
raise NotImplementedError(f'Only supporting models with {Keys.Tokenizer.CHAT_TEMPLATE} entry in their GGUF key-values (TODO: add default template, maybe pick llama2\'s?)')
tokens = metadata[Keys.Tokenizer.LIST]
return ChatTemplate(
template = metadata[Keys.Tokenizer.CHAT_TEMPLATE],
bos_token = tokens[metadata[Keys.Tokenizer.BOS_ID]],
eos_token = tokens[metadata[Keys.Tokenizer.EOS_ID]])
@staticmethod
def from_huggingface(model_id: str):
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(model_id)
return ChatTemplate(
template = tokenizer.chat_template or tokenizer.default_chat_template,
bos_token = tokenizer.bos_token,
eos_token = tokenizer.eos_token)
def render(self, messages: list[Message], add_generation_prompt: bool, omit_bos: bool = False):
# sys.stderr.write(f'# strict_user_assistant_alternation={self._strict_user_assistant_alternation}\n')
# sys.stderr.write(f'# messages=' + "\n".join(json.dumps(m.model_dump(), indent=2) for m in messages) + '\n')
if self._strict_user_assistant_alternation and any(m.role not in ('user', 'assistant') for m in messages):
new_messages=[]
i = 0
n = len(messages)
while i < n:
if messages[i].role == 'system':
assert messages[i+1].role == 'user'
new_messages.append(Message(
role="user",
content=f'[SYS]{messages[i].content}[/SYS]\n{messages[i+1].content}'
))
i += 2
elif messages[i].role == 'assistant' and messages[i].tool_calls and messages[i].content:
tc = '\n'.join(f'<tool_call>{json.dumps(tc.model_dump())}</tool_call>' for tc in messages[i].tool_calls)
new_messages.append(Message(
role="assistant",
content=f'{messages[i].content}\n{tc}'
))
i += 1
elif messages[i].role == 'tool':
new_messages.append(Message(
role="user",
content=f'TOOL RESULT(name={messages[i].name}, id={messages[i].tool_call_id}): {messages[i].content}',
))
i += 1
else:
new_messages.append(messages[i])
i += 1
# print(f'new_messages={json.dumps(new_messages, indent=2)}')
messages = new_messages
# print(f'messages={messages}')
result = self._template.render(
messages=messages,
eos_token=self._eos_token,
bos_token='' if omit_bos else self._bos_token,
raise_exception=raise_exception,
add_generation_prompt=add_generation_prompt,
)
# sys.stderr.write(f'\n# RENDERED:\n\n{result}\n\n')
return result
class ChatHandlerArgs(BaseModel):
chat_template: ChatTemplate
response_schema: Optional[dict] = None
@ -189,12 +199,14 @@ class NoToolsChatHandler(ChatHandler):
content=_please_respond_with_schema(args.response_schema)
)
converter = SchemaConverter(prop_order={}, allow_fetch=False, dotall=False, raw_pattern=False)
self.grammar = converter.visit(args.response_schema, '')
schema = converter.resolve_refs(args.response_schema, 'response')
converter.visit(schema, '')
self.grammar = converter.format_grammar()
else:
self.output_format_prompt = None
self.grammar = None
@typechecked
# @typechecked
def parse(self, s: str) -> Optional[Message]:
return Message(role="assistant", content=s)
@ -203,21 +215,24 @@ class ToolCallTagsChatHandler(ChatHandler):
super().__init__(args)
converter = SchemaConverter(prop_order={}, allow_fetch=False, dotall=False, raw_pattern=False)
tool_rules = [
converter.visit(
tool_rules = []
for tool in self.args.tools:
parameters_schema = tool.function.parameters
parameters_schema = converter.resolve_refs(parameters_schema, tool.function.name)
tool_rules.append(converter.visit(
dict(
type="object",
properties=dict(
name=dict(type="string", pattern='^' + tool.function.name.replace('_', f'\\?_') + '$') if escapes_underscores \
else dict(const=tool.function.name),
arguments=tool.function.parameters,
arguments=parameters_schema,
),
required=['name', 'arguments']
),
f'{tool.function.name}-tool-call'
)
for tool in self.args.tools
]
))
def format_literal(s: str) -> str:
if escapes_underscores:
@ -253,7 +268,7 @@ class ToolCallTagsChatHandler(ChatHandler):
# ") " + converter._format_literal("</tool_call>") +
# ")") # + converter._format_literal(suffix))
@typechecked
# @typechecked
def parse(self, s: str) -> Optional[Message]:
s = self.args.chat_template.strip_suffix(s)
@ -386,7 +401,7 @@ class FunctionaryToolsChatHandler(ChatHandler):
# ") " +
# ")") # + converter._format_literal(suffix))
@typechecked
# @typechecked
def parse(self, s: str) -> Optional[Message]:
s = self.args.chat_template.strip_suffix(s)
@ -422,7 +437,7 @@ def _make_bespoke_schema(response_schema, tool_call_schema, allow_parallel_calls
return {
"type": "object",
"properties": {
"original_goal": {"title": "Original Goal", "type": "string"},
# "original_goal": {"title": "Original Goal", "type": "string"},
"thought_about_next_step_only": {
"title": "Thought about next step",
# "title": "Thought about how the next step brings us closer to achieving the original goal",
@ -455,6 +470,7 @@ def _make_bespoke_schema(response_schema, tool_call_schema, allow_parallel_calls
},
},
"required": ["original_goal", "thought_about_next_step_only", "next_step"]
# "required": ["next_step"]
}
class BespokeToolsChatHandler(ChatHandler):
@ -513,7 +529,7 @@ class BespokeToolsChatHandler(ChatHandler):
])
)
@typechecked
# @typechecked
def parse(self, s: str) -> Optional[Message]:
s = self.args.chat_template.strip_suffix(s)
try:
@ -527,7 +543,7 @@ class BespokeToolsChatHandler(ChatHandler):
elif 'tool_calls' in next_step:
return Message(
role="assistant",
content=data["thought_about_next_step_only"],
content=data["thought_about_next_step_only"] if "thought_about_next_step_only" in data else None,
tool_calls=[
ToolCall(id=gen_callid(), function=FunctionCall(**tc))
for tc in next_step['tool_calls']
@ -545,7 +561,8 @@ _SHORT_TEMPLATE='\n'.join([
_LONG_TEMPLATE='\n'.join([
# '''You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.''',
'You may call one or more functions to assist with the user query. Don\'t make assumptions about what values to plug into functions. Here are the available tools:',
# 'You may call one or more functions to assist with the user query. Don\'t make assumptions about what values to plug into functions. Here are the available tools:',
'Call one or more functions to assist with the user query, every time this is possible. Don\'t make assumptions about what values to plug into functions. Here are the available tools:',
'<tools>',
'{tools}',
'</tools>',
@ -564,7 +581,7 @@ def get_chat_handler(args: ChatHandlerArgs, allow_parallel_calls=False) -> ChatH
if not args.tools:
return NoToolsChatHandler(args)
elif args.chat_template.tool_style == ToolsPromptStyle.TYPESCRIPT_FUNCTIONARY_V2:
return FunctionaryToolsChatHandler(args)
return FunctionaryToolsChatHandler(args, allow_parallel_calls=False)
elif args.chat_template.tool_style == ToolsPromptStyle.TOOLS_SHORT:
return TemplatedToolsChatHandler(args, _SHORT_TEMPLATE, allow_parallel_calls=allow_parallel_calls)
elif args.chat_template.tool_style == ToolsPromptStyle.TOOLS_LONG:

View file

@ -1,344 +0,0 @@
# Usage:
#! ./server -m some-model.gguf &
#! pip install pydantic
#! python examples/json-schema-pydantic-example.py
#
# TODO:
# - https://github.com/NousResearch/Hermes-Function-Calling
#
# <|im_start|>system
# You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags
# You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
# <tools> {'type': 'function', 'function': {'name': 'get_stock_fundamentals',
# 'description': 'get_stock_fundamentals(symbol: str) -> dict - Get fundamental data for a given stock symbol using yfinance API.\n\n Args:\n symbol (str): The stock symbol.\n\n Returns:\n dict: A dictionary containing fundamental data.', 'parameters': {'type': 'object', 'properties': {'symbol': {'type': 'string'}}, 'required': ['symbol']}}}
# </tools> Use the following pydantic model json schema for each tool call you will make: {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']} For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
# <tool_call>
# {'arguments': <args-dict>, 'name': <function-name>}
# </tool_call><|im_end|>
from dataclasses import dataclass
import subprocess
import sys
from pydantic import BaseModel, TypeAdapter
from annotated_types import MinLen
from typing import Annotated, Callable, List, Union, Literal, Optional, Type, get_args, get_origin
import json, requests
from examples.openai.api import ToolCallsTypeAdapter
def type_to_str(t):
origin = get_origin(t)
if origin is None:
return t.__name__
args = get_args(t)
return origin.__name__ + (
f'[{", ".join(type_to_str(a) for a in args)}]' if args else ''
)
def build_union_type_adapter(*types):
src = '\n'.join([
'from pydantic import TypeAdapter',
'from typing import Union',
f'_out = TypeAdapter(Union[{", ".join(type_to_str(t) for t in types)}])',
])
globs = {
**globals(),
**{t.__name__: t for t in types},
}
exec(src, globs)
return globs['_out']
class Thought(BaseModel):
thought: str
def build_tool_call_adapter2(final_output_type, *tools):
lines = [
'from pydantic import BaseModel, TypeAdapter',
'from typing import Literal, Union',
]
globs = {
**globals(),
**locals(),
final_output_type.__name__: final_output_type,
}
tool_calls = []
for fn in tools:
# TODO: escape fn.__doc__ and fn.__doc__ to avoid comment or metadata injection!
fn_name = fn.__name__
fn_doc = fn.__doc__.replace('"""', "'''") if fn.__doc__ else None
name = fn_name.replace('_', ' ').title().replace(' ', '')
lines += [
f'class {name}ToolArgs(BaseModel):',
*(f' {k}: {type_to_str(v)}' for k, v in fn.__annotations__.items() if k != 'return'),
f'class {name}ToolCall(BaseModel):',
*([f' """{fn_doc}"""'] if fn_doc else []),
f' name: Literal["{fn_name}"]',
f' arguments: {name}ToolArgs',
f'class {name}Tool(BaseModel):',
# *([f' """{fn_doc}"""'] if fn_doc else []),
f' id: str',
f' type: Literal["function"]',
f' function: {name}ToolCall',
f' def __call__(self) -> {type_to_str(fn.__annotations__.get("return"))}:',
f' return {fn_name}(**self.function.arguments.dict())',
]
tool_calls.append(f'{name}Tool')
lines += [
# 'class FinalResult(BaseModel):',
# f' result: {type_to_str(final_output_type)}',
# 'class Response(BaseModel):',
# f' """A response that starts with a thought about whether we need tools or not, the plan about tool usage (maybe a sequence of tool calls), and then either a final result (of type {final_output_type.__name__}) or a first tool call"""',
# f' original_goal: str',
# f' thought_process: str',
# # f' thought: str',
# f' next_step: Union[FinalResult, {", ".join(tool_calls)}]',
# f'response_adapter = TypeAdapter(Response)'
f'response_adapter = TypeAdapter(Union[{", ".join(tool_calls)}])',
]
exec('\n'.join(lines), globs)
return globs['response_adapter']
def create_completion2(*, response_model=None, max_tool_iterations=None, tools=[], endpoint="http://localhost:8080/v1/chat/completions", messages, **kwargs):
'''
Creates a chat completion using an OpenAI-compatible endpoint w/ JSON schema support
(llama.cpp server, llama-cpp-python, Anyscale / Together...)
The response_model param takes a type (+ supports Pydantic) and behaves just as w/ Instructor (see below)
'''
if response_model:
type_adapter = TypeAdapter(response_model)
schema = type_adapter.json_schema()
# messages = [{
# "role": "system",
# "content": f"Respond in JSON format with the following schema: {json.dumps(schema, indent=2)}"
# }] + messages
# print("Completion: ", json.dumps(messages, indent=2))
# print("SCHEMA: " + json.dumps(schema, indent=2))
response_format={"type": "json_object", "schema": schema }
tool_call_adapter = build_tool_call_adapter2(response_model, *tools)
tool_adapters = [(fn, TypeAdapter(fn)) for fn in tools]
tools_schemas = [{
"type": "function",
"function": {
"name": fn.__name__,
"description": fn.__doc__,
"parameters": ta.json_schema()
}
} for (fn, ta) in tool_adapters]
# messages = [{
# "role": "system",
# "content": '\n'.join([
# # "You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.",
# # "You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:",
# # f'<tools>{json.dumps(tools_schemas)}</tools>',
# 'Before calling each tool, you think clearly and briefly about why and how you are using the tool.',
# f"Respond in JSON format with the following schema: {json.dumps(schema, indent=2)}" if schema else "",
# ])
# }] + messages
i = 0
while (max_tool_iterations is None or i < max_tool_iterations):
body=dict(
messages=messages,
response_format=response_format,
tools=tools_schemas,
**kwargs
)
# sys.stderr.write(f'# REQUEST: {json.dumps(body, indent=2)}\n')
response = requests.post(
endpoint,
headers={"Content-Type": "application/json"},
json=body,
)
if response.status_code != 200:
raise Exception(f"Request failed ({response.status_code}): {response.text}")
# sys.stderr.write(f"\n# RESPONSE:\n\n<<<{response.text}>>>\n\n")
data = response.json()
if 'error' in data:
raise Exception(data['error']['message'])
# sys.stderr.write(f"\n# RESPONSE DATA:\n\n{json.dumps(data, indent=2)}\n\n")
# print(json.dumps(data, indent=2))
choice = data["choices"][0]
content = choice["message"].get("content")
if choice.get("finish_reason") == "tool_calls":
# sys.stderr.write(f'\n# TOOL CALLS:\n{json.dumps(choice["message"]["tool_calls"], indent=2)}\n\n')
# tool_calls =ToolCallsTypeAdapter.validate_json(json.dumps(choice["tool_calls"]))
messages.append(choice["message"])
for tool_call in choice["message"]["tool_calls"]:
# id = tool_call.get("id")
# if id:
# del tool_call["id"]
if content:
print(f'💭 {content}')
tc = tool_call_adapter.validate_json(json.dumps(tool_call))
pretty_call = f'{tc.function.name}({", ".join(f"{k}={v}" for k, v in tc.function.arguments.model_dump().items())})'
sys.stdout.write(f'⚙️ {pretty_call}')
result = tc()
sys.stdout.write(f" -> {result}\n")
messages.append({
"tool_call_id": tc.id,
"role": "tool",
"name": tc.function.name,
# "content": f'{result}',
"content": f'{pretty_call} = {result}',
})
else:
assert content
# print(content)
# print(json.dumps(json.loads(content), indent=2))
result = type_adapter.validate_json(content) if type_adapter else content
# if isinstance(result, Thought):
# print(f'💭 {result.thought}')
# messages.append({
# "role": "assistant",
# "content": json.dumps(result.model_dump(), indent=2),
# })
# else:
return result
i += 1
if max_tool_iterations is not None:
raise Exception(f"Failed to get a valid response after {max_tool_iterations} tool calls")
if __name__ == '__main__':
class QAPair(BaseModel):
question: str
concise_answer: str
justification: str
class PyramidalSummary(BaseModel):
title: str
summary: str
question_answers: Annotated[List[QAPair], MinLen(2)]
sub_sections: Optional[Annotated[List['PyramidalSummary'], MinLen(2)]]
# print("# Summary\n", create_completion(
# model="...",
# response_model=PyramidalSummary,
# messages=[{
# "role": "user",
# "content": f"""
# You are a highly efficient corporate document summarizer.
# Create a pyramidal summary of an imaginary internal document about our company processes
# (starting high-level, going down to each sub sections).
# Keep questions short, and answers even shorter (trivia / quizz style).
# """
# }]))
import math
def eval_python_expression(expr: str) -> float:
"""
Evaluate a Python expression reliably.
This can be used to compute complex nested mathematical expressions, or any python, really.
"""
print("# Evaluating expression: ", expr)
return "0.0"
def add(a: float, b: float) -> float:
"""
Add a and b reliably.
Don't use this tool to compute the square of a number (use multiply or pow instead)
"""
return a + b
# def say(something: str) -> str:
# """
# Just says something. Used to say each thought out loud
# """
# return subprocess.check_call(["say", something])
def multiply(a: float, b: float) -> float:
"""Multiply a with b reliably"""
return a * b
def divide(a: float, b: float) -> float:
"""Divide a by b reliably"""
return a / b
def pow(value: float, power: float) -> float:
"""
Raise a value to a power (exponent) reliably.
The square of x is pow(x, 2), its cube is pow(x, 3), etc.
"""
return math.pow(value, power)
result = create_completion2(
model="...",
response_model=str,
tools=[add, multiply, divide, pow], #, say],#, eval_python_expression],
# tools=[eval_python_expression],
temperature=0.0,
# repetition_penalty=1.0,
n_predict=1000,
top_k=1,
top_p=0.0,
# logit_bias={
# i: 10.0
# for i in range(1, 259)
# },
messages=[{
# "role": "system",
# "content": f"""
# You are a reliable assistant. You think step by step and think before using tools
# """
# }, {
"role": "user",
# "content": f"""
# What is 10 squared?
# """
"content": f"""
What is the sum of 2535 squared and 32222000403 then multiplied by one and a half. What's a third of the result?
Keep your goal in mind at every step.
"""
# Think step by step, start expressing the problem as an arithmetic expression
}])
# result = create_completion(
# model="...",
# response_model=float,
# tools=[add, multiply, divide, pow], #, say],#, eval_python_expression],
# temperature=0.0,
# # logit_bias={
# # i: 10.0
# # for i in range(1, 259)
# # },
# messages=[{
# "role": "user",
# # "content": f"""
# # What is 10 squared?
# # """
# "content": f"""
# What is the sum of 2535 squared and 32222000403 then multiplied by one and a half. What's a third of the result?
# """
# # Think step by step, start expressing the problem as an arithmetic expression
# }])
# 💭 First, I need to square the number 2535. For this, I will use the 'pow' tool.
# ⚙️ pow(args={'value': 2535.0, 'power': 2.0})-> 6426225.0
# 💭 Now that I have the square of 2535, I need to add it to 32222000403.0 and store the result.
# ⚙️ add(args={'a': 6426225.0, 'b': 32222000403.0})-> 32228426628.0
# 💭 Now that I have the sum of 2535 squared and 32222000403, I need to multiply it by 1.5.
# ⚙️ pow(args={'value': 32228426628.0, 'power': 1.5})-> 5785736571757004.0
# 💭 Now that I have the result of the sum multiplied by 1.5, I need to divide it by 3 to get a third of the result.
# ⚙️ divide(args={'a': 5785736571757004.0, 'b': 3.0})-> 1928578857252334.8
# 💭 I have now calculated a third of the result, which is 1928578857252334.8. I can now share this as the final answer.
# Result: 1928578857252334.8
expected_result = (2535 ** 2 + 32222000403) * 1.5 / 3.0
print("➡️", result)
assert math.fabs(result - expected_result) < 0.0001, f"Expected {expected_result}, got {result}"

View file

@ -21,39 +21,56 @@ import random
from starlette.responses import StreamingResponse
from typing import Annotated, Optional
import typer
from typeguard import typechecked
def generate_id(prefix):
return f"{prefix}{random.randint(0, 1 << 32)}"
def main(
model: Annotated[Optional[Path], typer.Option("--model", "-m")] = "models/7B/ggml-model-f16.gguf",
# model: Path = Path("/Users/ochafik/AI/Models/Hermes-2-Pro-Mistral-7B.Q8_0.gguf"),
template_hf_model_id_fallback: Annotated[Optional[str], typer.Option(help="If the GGUF model does not contain a chat template, get it from this HuggingFace tokenizer")] = 'meta-llama/Llama-2-7b-chat-hf',
# model_url: Annotated[Optional[str], typer.Option("--model-url", "-mu")] = None,
host: str = "localhost",
port: int = 8080,
cpp_server_endpoint: Optional[str] = None,
cpp_server_host: str = "localhost",
cpp_server_port: Optional[int] = 8081,
auth: Optional[str] = None,
verbose: bool = False,
context_length: Optional[int] = None,
endpoint: Optional[str] = None,
server_host: str = "localhost",
server_port: Optional[int] = 8081,
):
import uvicorn
metadata = GGUFKeyValues(model)
context_length = metadata[Keys.LLM.CONTEXT_LENGTH]
chat_template = ChatTemplate.from_gguf(metadata)
# print(chat_template)
if endpoint:
sys.stderr.write(f"# WARNING: Unsure which model we're talking to, fetching its chat template from HuggingFace tokenizer of {template_hf_model_id_fallback}\n")
chat_template = ChatTemplate.from_huggingface(template_hf_model_id_fallback)
else:
metadata = GGUFKeyValues(model)
if not cpp_server_endpoint:
sys.stderr.write(f"# Starting C++ server with model {model} on {cpp_server_host}:{cpp_server_port}\n")
if not context_length:
context_length = metadata[Keys.LLM.CONTEXT_LENGTH]
if Keys.Tokenizer.CHAT_TEMPLATE in metadata:
chat_template = ChatTemplate.from_gguf(metadata)
else:
sys.stderr.write(f"# WARNING: Model does not contain a chat template, fetching it from HuggingFace tokenizer of {template_hf_model_id_fallback}\n")
chat_template = ChatTemplate.from_huggingface(template_hf_model_id_fallback)
if verbose:
sys.stderr.write(f"# CHAT TEMPLATE:\n\n{chat_template}\n\n")
if verbose:
sys.stderr.write(f"# Starting C++ server with model {model} on {server_host}:{server_port}\n")
server_process = subprocess.Popen([
"./server", "-m", model,
"--host", cpp_server_host, "--port", f'{cpp_server_port}',
"--host", server_host, "--port", f'{server_port}',
# TODO: pass these from JSON / BaseSettings?
'-ctk', 'q4_0', '-ctv', 'f16',
"-c", f"{2*8192}",
# "-c", f"{context_length}",
"-c", f"{context_length}",
*([] if verbose else ["--log-disable"]),
], stdout=sys.stderr)
atexit.register(server_process.kill)
cpp_server_endpoint = f"http://{cpp_server_host}:{cpp_server_port}"
endpoint = f"http://{server_host}:{server_port}/completions"
app = FastAPI()
@ -62,8 +79,8 @@ def main(
headers = {
"Content-Type": "application/json",
}
if (auth := request.headers.get("Authorization")):
headers["Authorization"] = auth
if (auth_value := request.headers.get("Authorization", auth)):
headers["Authorization"] = auth_value
if chat_request.response_format is not None:
assert chat_request.response_format.type == "json_object", f"Unsupported response format: {chat_request.response_format.type}"
@ -79,9 +96,12 @@ def main(
prompt = chat_template.render(messages, add_generation_prompt=True)
sys.stderr.write(f'\n# MESSAGES:\n\n{TypeAdapter(list[Message]).dump_json(messages)}\n\n')
sys.stderr.write(f'\n# PROMPT:\n\n{prompt}\n\n')
sys.stderr.write(f'\n# GRAMMAR:\n\n{chat_handler.grammar}\n\n')
if verbose:
sys.stderr.write(f'\n# REQUEST:\n\n{chat_request.model_dump_json(indent=2)}\n\n')
# sys.stderr.write(f'\n# MESSAGES:\n\n{TypeAdapter(list[Message]).dump_json(messages)}\n\n')
sys.stderr.write(f'\n# PROMPT:\n\n{prompt}\n\n')
sys.stderr.write(f'\n# GRAMMAR:\n\n{chat_handler.grammar}\n\n')
data = LlamaCppServerCompletionRequest(
**{
@ -101,7 +121,7 @@ def main(
async with httpx.AsyncClient() as client:
response = await client.post(
f"{cpp_server_endpoint}/completions",
f"{endpoint}",
json=data,
headers=headers,
timeout=None)
@ -112,7 +132,8 @@ def main(
return StreamingResponse(generate_chunks(response), media_type="text/event-stream")
else:
result = response.json()
sys.stderr.write("# RESULT:\n\n" + json.dumps(result, indent=2) + "\n\n")
if verbose:
sys.stderr.write("# RESULT:\n\n" + json.dumps(result, indent=2) + "\n\n")
if 'content' not in result:
# print(json.dumps(result, indent=2))
return JSONResponse(result)