This guide covers different aspects of text generation, including types of generation, model selection, creating prompts, and managing multi-turn conversations.
Use the following code to perform text generation with the SambaNova or OpenAI Python client in a non-streaming manner.
```python
from sambanova import SambaNova

client = SambaNova(
    base_url="your-sambanova-base-url",
    api_key="your-sambanova-api-key",
)

completion = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "Answer the question in a couple sentences."},
        {"role": "user", "content": "Share a happy story with me"}
    ]
)

print(completion.choices[0].message.content)
```
Use the n parameter to generate multiple independent completions for a single prompt. The response includes each completion in choices[0] through choices[n-1].
| Parameter | Type | Default | Valid range |
| --- | --- | --- | --- |
| `n` | integer | 1 | 1–8 |
Set temperature greater than 0 to get varied outputs across completions. With temperature=0, all completions are identical.
Setting n greater than 1 is not supported with function calling or tools; combining them returns a 400 error.
```python
from sambanova import SambaNova

client = SambaNova(
    base_url="your-sambanova-base-url",
    api_key="your-sambanova-api-key",
)

completion = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Write a one-sentence tagline for a coffee shop."}
    ],
    n=3,
    temperature=0.7
)

for i, choice in enumerate(completion.choices):
    print(f"Completion {i + 1}: {choice.message.content}")
```
Prompt engineering is the practice of designing and refining prompts to optimize responses from large language models (LLMs). This process is iterative and requires experimentation to achieve the best possible outcomes.
A basic prompt can be as simple as a few words to elicit a response from the LLM. However, for more complex use cases, you may need additional elements:
| Element | Description |
| --- | --- |
| Defining a persona | Assigning a specific role to the model (e.g., “You are a financial advisor”). |
| Providing context | Supplying background information to guide the model’s response. |
| Specifying output format | Instructing the model to respond in a particular style (e.g., JSON, bullet points, structured text). |
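The three elements above can be layered into a single system message. The sketch below shows one way to do this; the helper name `build_prompt_messages` is hypothetical and not part of any SDK — the resulting list is what you would pass as `messages` to the chat completions call.

```python
# Hypothetical helper: combines persona, context, and output-format
# instructions into a chat message list. Not part of any SDK.
def build_prompt_messages(persona, context, task, output_format):
    system_content = (
        f"{persona} "
        f"Context: {context} "
        f"Respond as {output_format}."
    )
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": task},
    ]

messages = build_prompt_messages(
    persona="You are a financial advisor.",
    context="The user is saving for retirement over a 30-year horizon.",
    task="Suggest three asset classes to consider.",
    output_format="a JSON array of strings",
)
```

Keeping persona, context, and format in the system message leaves the user turn free to carry only the actual task, which makes the prompt easier to iterate on.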
To maintain context across multiple exchanges, messages in a conversational AI system are typically stored as a list of dictionaries, each with keys specifying the sender’s role and the message content. This structure lets the system track context across multiple turns in a conversation. Below is an example of how a multi-turn conversation is structured using the Meta-Llama-3.3-70B-Instruct model:
Structuring multi-turn conversations using Meta-Llama-3.3-70B-Instruct
```python
completion = client.chat.completions.create(
    model="Meta-Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Hi! My name is Peter and I am 31 years old. What is 1+1?"},
        {"role": "assistant", "content": "Nice to meet you, Peter. 1 + 1 is equal to 2"},
        {"role": "user", "content": "What is my age?"}
    ],
    stream=True
)

for chunk in completion:
    # The final streamed chunk may carry no content, so guard against None.
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
After running the program, you should see an output similar to the following.
Example output
You told me earlier, Peter. You're 31 years old.
By structuring conversations this way, the model can maintain context, recall prior user inputs, and provide more coherent responses.
When engaging in long conversations with LLMs, certain factors such as token limits and memory constraints must be considered to ensure accuracy and coherence.
- Token limits - LLMs have a fixed context window, limiting the number of tokens they can process in a single request. If the input exceeds this limit, the system might truncate it, leading to incomplete or incoherent responses.
- Memory constraints - The model does not retain context beyond its input window. To preserve context, past messages should be re-included in prompts.
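One common way to handle both constraints is to re-send recent history while dropping the oldest turns once an estimated token budget is exceeded. The sketch below uses a rough 4-characters-per-token heuristic rather than a real tokenizer, and the function name `trim_history` is illustrative only; production code should count tokens with the model’s actual tokenizer.

```python
# Illustrative sketch: keep only the most recent messages that fit a
# token budget. The 4-chars-per-token ratio is a rough heuristic, not
# an exact tokenizer.
def estimate_tokens(text):
    return max(1, len(text) // 4)

def trim_history(messages, max_tokens):
    """Keep the newest messages whose combined estimate fits the budget."""
    kept = []
    total = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order

history = [
    {"role": "user", "content": "Hi! My name is Peter."},
    {"role": "assistant", "content": "Nice to meet you, Peter."},
    {"role": "user", "content": "What is my name?"},
]
trimmed = trim_history(history, max_tokens=12)
```

The trimmed list is what you would pass as `messages` on the next request. Note that dropping old turns also drops the facts they contained (here, the user’s name), which is exactly the memory trade-off described above; summarizing older turns is a common refinement.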
By structuring prompts effectively and managing conversation history, you can optimize interactions with LLMs for better accuracy and coherence.