When you call the SambaNova Chat Completions API, the platform applies the model’s default Jinja-based chat template server-side, formatting your messages into the raw prompt the model receives. For most use cases this is the right behavior. However, some scenarios require you to take control of prompt formatting and output parsing on the client side. This page explains when and why to use the Completions API with custom templates, and how to implement custom output parsers. For a complete interactive walkthrough, see the Custom Chat Templates AI Starter Kit.

When to use custom chat templates

Use the Completions API with a custom chat template instead of the Chat Completions API when:
  • You need full control over prompt structure. Some workflows require injecting custom variables, special tokens, or instructions that are not exposed through the Chat Completions API parameters. For standard models available on SambaCloud with no customization, the Chat Completions API with built-in function calling is the recommended approach. See Function calling and JSON mode.
  • You are using a BYOC (Bring Your Own Checkpoint) model. Fine-tuned checkpoints deployed on SambaStack may use a different chat template than the base model. Letting the server apply the base model’s default template produces incorrect prompts for these checkpoints.
  • Your model uses a non-standard tool-call output format. Fine-tuned models may emit tool calls in a format the default parsers do not handle, for example XML markers instead of JSON.

How it works

Instead of calling /v1/chat/completions, you render the prompt string yourself and send it directly to /v1/completions. The server receives a raw string and continues generation from it, applying no template of its own. The workflow has four steps:
  1. Load a chat template. Either pull the Jinja template from a Hugging Face tokenizer or write a custom one.
  2. Render the prompt. Apply the template to your messages and tool definitions to produce a raw prompt string.
  3. Call the Completions API. Send the rendered string to /v1/completions.
  4. Parse the output. Convert the raw text response into a structured assistant message with tool calls.
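
As a rough illustration of how the two endpoints differ, the sketch below builds both request bodies side by side without sending them. The helper names `chat_request` and `completion_request` and the hand-written prompt string are illustrative assumptions; in practice the prompt string comes from rendering a chat template.

```python
import json


def chat_request(model: str, messages: list[dict]) -> dict:
    """Body for /v1/chat/completions: the server applies the chat template."""
    return {"model": model, "messages": messages}


def completion_request(model: str, prompt: str) -> dict:
    """Body for /v1/completions: the client sends an already-rendered string."""
    return {"model": model, "prompt": prompt}


messages = [{"role": "user", "content": "What is the population of Bogota?"}]

# Hypothetical rendered prompt; the real string depends on the model's template.
rendered_prompt = "<|User|>What is the population of Bogota?<|Assistant|>"

chat_body = chat_request("Meta-Llama-3.1-8B-Instruct", messages)
completion_body = completion_request("Meta-Llama-3.1-8B-Instruct", rendered_prompt)

print(json.dumps(chat_body, indent=2))
print(json.dumps(completion_body, indent=2))
```

The only structural difference is `messages` versus `prompt`: with `prompt`, every special token and role marker is your responsibility.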

Load a chat template

From a Hugging Face model

Use the transformers library to load the tokenizer for your base model and extract its built-in chat template.
from transformers import AutoTokenizer
 
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    token="<your-hf-token>"
)
chat_template = tokenizer.chat_template
A Hugging Face token is required for gated models such as the Llama family. See Hugging Face token settings.

Define a custom Jinja template

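If your checkpoint uses a different template than the base model, write a Jinja2 template directly. A production template should handle the messages, tools, and add_generation_prompt variables at minimum; the short example below covers only messages and add_generation_prompt to keep the syntax check in focus.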
from jinja2 import Environment, TemplateSyntaxError
 
# The "-" in "{%-" and "{{-" strips the whitespace before each tag, so the
# indentation used for readability does not leak into the rendered prompt.
custom_template = """
{{- bos_token }}
{%- for message in messages %}
    {%- if message['role'] == 'user' %}
        {{- '<|User|>' + message['content'] }}
    {%- elif message['role'] == 'assistant' %}
        {{- '<|Assistant|>' + message['content'] + eos_token }}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}{{ '<|Assistant|>' }}{% endif %}
"""
 
# Validate syntax before use
try:
    Environment().parse(custom_template)
except TemplateSyntaxError as e:
    raise ValueError(f"Invalid Jinja template at line {e.lineno}: {e.message}") from e
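
A template that must expose tool definitions to the model could look like the sketch below. The `<|Tool|>`, `<|User|>`, and `<|Assistant|>` markers and the one-line-per-tool layout are hypothetical placeholders; use whatever markers and layout your checkpoint was fine-tuned with. The `TOOL_TEMPLATE` name and the `<s>` / `</s>` token values are also assumptions for illustration.

```python
from jinja2 import Environment

# Hypothetical layout: one "<|Tool|>name: description" entry per tool,
# followed by the conversation turns.
TOOL_TEMPLATE = """
{{- bos_token }}
{%- if tools %}
    {%- for tool in tools %}
        {{- '<|Tool|>' + tool['function']['name'] + ': ' + tool['function']['description'] }}
    {%- endfor %}
{%- endif %}
{%- for message in messages %}
    {%- if message['role'] == 'user' %}
        {{- '<|User|>' + message['content'] }}
    {%- elif message['role'] == 'assistant' %}
        {{- '<|Assistant|>' + message['content'] + eos_token }}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}{{ '<|Assistant|>' }}{% endif %}
"""

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_population",
            "description": "Returns the population of a city.",
        },
    }
]
messages = [{"role": "user", "content": "What is the population of Bogota?"}]

prompt = Environment().from_string(TOOL_TEMPLATE).render(
    messages=messages,
    tools=tools,
    add_generation_prompt=True,
    bos_token="<s>",   # placeholder token values
    eos_token="</s>",
)
print(prompt)
```

Passing `tools=None` or an empty list skips the tool section, so the same template serves plain chat turns.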

Render the prompt

Apply the template to your messages and tool definitions using Jinja2. Pass tokenizer attributes such as bos_token and eos_token as context variables when using a template loaded from a tokenizer. For custom templates, supply these values explicitly.
from jinja2 import Template
from datetime import datetime
 
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the population of Bogota?"}
]
 
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_population",
            "description": "Returns the population of a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "Name of the city"}
                },
                "required": ["city"]
            }
        }
    }
]
 
context = {
    "messages": messages,
    "tools": tools,
    "add_generation_prompt": True,
    "bos_token": "<|begin_of_text|>",
    "eos_token": "<|eot_id|>",
    "date_string": datetime.now().strftime("%d %b %Y"),
}
 
rendered_prompt = Template(chat_template).render(**context).strip()
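
Templates pulled from Hugging Face tokenizers are often written for the environment that transformers itself uses to render them. The sketch below approximates that environment under stated assumptions: that transformers enables `trim_blocks` and `lstrip_blocks`, exposes `raise_exception` and `strftime_now` helpers, and replaces Jinja's HTML-safe `tojson` filter with plain `json.dumps`. Verify these against the transformers version you use. `build_env` and `demo_template` are hypothetical names.

```python
import json
from datetime import datetime

from jinja2.exceptions import TemplateError
from jinja2.sandbox import ImmutableSandboxedEnvironment


def raise_exception(message: str):
    """Lets a template abort rendering with a readable error."""
    raise TemplateError(message)


def strftime_now(fmt: str) -> str:
    """Current date/time formatted by the template."""
    return datetime.now().strftime(fmt)


def tojson(value, ensure_ascii=False, indent=None, separators=None, sort_keys=False) -> str:
    # Plain json.dumps; Jinja's built-in filter escapes <, >, & and ' for HTML.
    return json.dumps(value, ensure_ascii=ensure_ascii, indent=indent,
                      separators=separators, sort_keys=sort_keys)


def build_env() -> ImmutableSandboxedEnvironment:
    env = ImmutableSandboxedEnvironment(trim_blocks=True, lstrip_blocks=True)
    env.filters["tojson"] = tojson
    env.globals["raise_exception"] = raise_exception
    env.globals["strftime_now"] = strftime_now
    return env


# Tiny stand-in template that uses two of the helpers.
demo_template = (
    "{% if messages[0]['role'] != 'user' %}\n"
    "{{ raise_exception('First message must come from the user') }}\n"
    "{% endif %}\n"
    "{{ tools | tojson }}"
)
demo_tools = [{"name": "get_population", "description": "Returns a city's population"}]

rendered = build_env().from_string(demo_template).render(
    messages=[{"role": "user", "content": "hi"}], tools=demo_tools
)
print(rendered)
```

To render a tokenizer template this way, replace `Template(chat_template)` with `build_env().from_string(chat_template)` and keep the same context dictionary.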

Call the Completions API

Send the rendered prompt string to the /v1/completions endpoint. This endpoint accepts a raw string and returns raw text; no template is applied server-side.
from sambanova import SambaNova
 
client = SambaNova(
    api_key="<your-sambanova-api-key>",
    base_url="https://api.sambanova.ai/v1"
)
 
response = client.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    prompt=rendered_prompt,
    max_tokens=2048,
    temperature=0.0,
)
 
raw_output = response.choices[0].text
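
Because the response is raw text, it may still contain end-of-turn markers or text generated past them. This page does not say whether the endpoint strips such tokens, so the sketch below trims them on the client as a precaution. `truncate_at_stop` and `LLAMA_STOP_TOKENS` are hypothetical names, and the token list is an assumption based on the Llama 3.1 template; substitute the markers your template uses.

```python
def truncate_at_stop(text: str, stop_tokens: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop token."""
    cut = len(text)
    for token in stop_tokens:
        idx = text.find(token)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


# Assumed Llama 3.1 end-of-turn markers; replace with your template's tokens.
LLAMA_STOP_TOKENS = ["<|eot_id|>", "<|eom_id|>"]

raw_output = '{"name": "get_population", "parameters": {"city": "Bogota"}}<|eot_id|>extra'
clean_output = truncate_at_stop(raw_output, LLAMA_STOP_TOKENS)
print(clean_output)
```

Running the parsers on `clean_output` rather than `raw_output` keeps trailing tokens out of the assistant message content.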

Parse model output

The raw text response must be parsed into a structured assistant message. The correct parser depends on the tool-call format your model emits.

JSON format (Llama-style)

Llama instruction-tuned models emit tool calls as JSON objects in the response text.
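import json

def parse_llama_output(response: str) -> list[dict]:
    """Extract JSON tool calls from a Llama-style response."""
    tool_calls = []
    brace_count, start = 0, None

    # Scan for balanced top-level {...} blocks. Braces inside JSON strings
    # are counted too, so unbalanced braces in string values can confuse it.
    for i, ch in enumerate(response):
        if ch == "{":
            if brace_count == 0:
                start = i
            brace_count += 1
        elif ch == "}" and start is not None:
            brace_count -= 1
            if brace_count == 0:
                block = response[start:i + 1]
                start = None
                try:
                    obj = json.loads(block)
                except json.JSONDecodeError:
                    continue  # balanced braces, but not valid JSON
                # Keep only objects shaped like a tool call
                if "name" in obj and "parameters" in obj:
                    tool_calls.append({
                        "type": "function",
                        "function": {
                            "name": obj["name"],
                            "arguments": json.dumps(obj["parameters"])
                        }
                    })
    return tool_calls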
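
Brace counting does not know about JSON strings, so a `}` inside a string value can end a block early. One alternative, sketched here, is to let the standard library's `json.JSONDecoder.raw_decode` find where each object ends. The function names `extract_json_objects` and `to_tool_calls` are hypothetical, and the `{"name": ..., "parameters": {...}}` shape is assumed from the parser above.

```python
import json


def extract_json_objects(text: str) -> list[dict]:
    """Return every JSON object that starts at a '{' somewhere in text."""
    decoder = json.JSONDecoder()
    objects, pos = [], 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON from here; keep scanning
            continue
        objects.append(obj)
        pos = end  # resume after the decoded object
    return objects


def to_tool_calls(objects: list[dict]) -> list[dict]:
    """Keep objects shaped like Llama-style tool calls."""
    calls = []
    for obj in objects:
        if isinstance(obj.get("name"), str) and isinstance(obj.get("parameters"), dict):
            calls.append({
                "type": "function",
                "function": {
                    "name": obj["name"],
                    "arguments": json.dumps(obj["parameters"]),
                },
            })
    return calls


sample = (
    'Use {braces} freely. {"name": "get_population", '
    '"parameters": {"city": "Bogota", "note": "a } inside a string"}}'
)
tool_calls = to_tool_calls(extract_json_objects(sample))
print(tool_calls)
```

The decoder skips `{braces}` because it is not valid JSON and reads the second object in full, including the `}` inside the string value.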

XML format (DeepSeek-style)

DeepSeek models delimit tool calls with XML-like special-token markers.
import re
import json
 
def parse_deepseek_output(response: str) -> list[dict]:
    """Extract XML-delimited tool calls from a DeepSeek-style response."""
    tool_calls = []
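    # "|" is the regex alternation operator, so match it inside a character class.
    # The class also accepts the full-width "｜" used in DeepSeek's special tokens.
    pattern = r"<[|｜]tool▁call▁begin[|｜]>(.*?)<[|｜]tool▁sep[|｜]>(.*?)<[|｜]tool▁call▁end[|｜]>"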
 
    for name, args in re.findall(pattern, response, re.DOTALL):
        tool_calls.append({
            "type": "function",
            "function": {
                "name": name.strip(),
                "arguments": args.strip()
            }
        })
    return tool_calls
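
Marker-delimited output often mixes free text with tool calls. The sketch below separates the two and builds the pattern with `re.escape`, so the `|` characters in the markers are matched literally instead of acting as regex alternation. The marker spellings are copied from this page as an assumption; check your tokenizer's special tokens for the exact strings, since DeepSeek tokenizers spell them with a full-width `｜`. `split_content_and_calls` is a hypothetical helper name.

```python
import re

# Assumed marker spellings; copy the exact strings from your tokenizer.
CALL_BEGIN = "<|tool▁call▁begin|>"
CALL_SEP = "<|tool▁sep|>"
CALL_END = "<|tool▁call▁end|>"

# re.escape keeps "|" literal; unescaped, it would mean "or" in a regex.
CALL_PATTERN = re.compile(
    re.escape(CALL_BEGIN) + r"(.*?)" + re.escape(CALL_SEP) + r"(.*?)" + re.escape(CALL_END),
    re.DOTALL,
)


def split_content_and_calls(response: str) -> tuple[str, list[dict]]:
    """Return (text before the first tool call, list of tool-call dicts)."""
    calls = [
        {"type": "function", "function": {"name": name.strip(), "arguments": args.strip()}}
        for name, args in CALL_PATTERN.findall(response)
    ]
    first = CALL_PATTERN.search(response)
    content = response[: first.start()] if first else response
    return content.strip(), calls


sample = (
    "Let me look that up."
    "<|tool▁call▁begin|>get_population<|tool▁sep|>"
    '{"city": "Bogota"}'
    "<|tool▁call▁end|>"
)
content, calls = split_content_and_calls(sample)
print(content)
print(calls)
```

When the response contains no markers, `calls` is empty and `content` is the whole response.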

Build the assistant message

Once tool calls are extracted, assemble the final assistant message in OpenAI-compatible format.
def build_assistant_message(response: str, tool_calls: list) -> dict:
    if tool_calls:
        return {"role": "assistant", "content": None, "tool_calls": tool_calls}
    return {"role": "assistant", "content": response.strip(), "tool_calls": []}
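
Once the assistant message exists, a typical next step is to run the requested tools and append their results to the conversation before rendering the next prompt. The sketch below shows one way to do that. `TOOL_REGISTRY`, `run_tool_calls`, the stub `get_population`, and the generated `call_<n>` ids are all hypothetical, and your chat template must know how to render `tool` messages for the follow-up turn to work.

```python
import json


def get_population(city: str) -> str:
    """Stub tool: returns a placeholder payload instead of real data."""
    return json.dumps({"city": city, "population": None})


# Maps tool names emitted by the model to local Python callables.
TOOL_REGISTRY = {"get_population": get_population}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each tool call and return OpenAI-style tool messages."""
    results = []
    for index, call in enumerate(assistant_message.get("tool_calls") or []):
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])
        func = TOOL_REGISTRY.get(name)
        content = func(**args) if func else f"Unknown tool: {name}"
        results.append({
            "role": "tool",
            "tool_call_id": call.get("id", f"call_{index}"),
            "name": name,
            "content": content,
        })
    return results


messages = [{"role": "user", "content": "What is the population of Bogota?"}]
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "type": "function",
        "function": {"name": "get_population", "arguments": '{"city": "Bogota"}'},
    }],
}

messages.append(assistant_message)
messages.extend(run_tool_calls(assistant_message))
print(messages[-1])
```

Rendering `messages` again and sending the new prompt to /v1/completions lets the model write its final answer from the tool result.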

Custom parsers

If your model emits tool calls in a format that neither parser above handles, implement a custom parser. Your parser must accept the raw response string and return a list of tool-call dicts in OpenAI-compatible format.
def parse(response: str) -> list[dict]:
    """
    Custom parser template.
    Returns a list of tool-call dicts:
    [{"type": "function", "function": {"name": str, "arguments": str}}, ...]
    """
    # Implement your extraction logic here
    return []
Note: Custom parsers execute user-supplied code. Only run code you trust. This is not a sandboxed execution environment.
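
As one possible filling of that template, the sketch below assumes the model wraps each call in `<tool_call>...</tool_call>` tags around a JSON object with `name` and `arguments` keys, a convention some open-weight fine-tunes use. Both the tag name and the key names are assumptions; adjust them to match what your checkpoint emits.

```python
import json
import re

# Assumed wrapper tags; change them to match your model's output.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse(response: str) -> list[dict]:
    """Extract <tool_call>-wrapped JSON objects as OpenAI-style tool calls."""
    tool_calls = []
    for block in TOOL_CALL_RE.findall(response):
        try:
            obj = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole parse
        if not isinstance(obj, dict) or "name" not in obj:
            continue
        tool_calls.append({
            "type": "function",
            "function": {
                "name": obj["name"],
                # Arguments are serialized back to a JSON string
                "arguments": json.dumps(obj.get("arguments", {})),
            },
        })
    return tool_calls


sample = (
    "I will check.\n"
    '<tool_call>\n{"name": "get_population", "arguments": {"city": "Bogota"}}\n</tool_call>\n'
    "<tool_call>not json</tool_call>"
)
parsed = parse(sample)
print(parsed)
```

Malformed blocks are skipped silently here; logging them instead can help when debugging a newly fine-tuned checkpoint.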

Next steps