Skip to main content

Documentation Index

Fetch the complete documentation index at: https://sambanova-systems.mintlify.dev/docs/llms.txt

Use this file to discover all available pages before exploring further.

SambaStack supports two deployment configurations for supported models: high-interactivity and high-throughput. Both use the same model weights and the same API — they differ only in how the system handles requests. Use high-interactivity for low-latency, user-facing applications. Use high-throughput for batch or high-concurrency workloads where aggregate output matters more than per-user latency. The rest of this page covers the trade-offs, the PEF configurations for each, and how to build a bundle.
High-throughput and high-interactivity configurations require dedicated systems. Models deployed in either configuration cannot be bundled with other models. If you are unfamiliar with bundles or bundle templates, see Deploying model bundles.

Deployment configurations

ConfigurationProfileBest when
High-throughputAggregate token throughput across concurrent requestsBatch processing, asynchronous workloads, or large user volumes where total system output matters more than per-user latency
High-interactivityPer-request latency and time-to-first-tokenReal-time, user-facing applications
Both configurations use the same model name in API calls. The same request works against either configuration:
{ "model": "DeepSeek-R1", "messages": [...] }
The configuration controls request handling on the server side; no client-side changes are required.

When to use each configuration

High-throughput

Use the high-throughput configuration when:
  • You are serving many concurrent users and aggregate throughput matters more than per-user latency
  • Your workload is asynchronous or batch-oriented (for example, document processing or offline inference pipelines)
  • End-to-end latency per request is not a constraint

High-interactivity

Use the high-interactivity configuration when:
  • You are building real-time, user-facing applications
  • Per-user time-to-first-token and tokens-per-second are the primary metrics
  • Your deployment has fewer nodes, or your users have tight latency budgets

Supported models

Both configurations are available for the following models:
  • DeepSeek-R1
  • DeepSeek-V3-0324
  • DeepSeek-V3.1
  • DeepSeek-V3.1-Terminus
  • DeepSeek-V3.2

Architecture

The high-throughput configuration uses continuous batching, separating the prefill and decode phases into a dedicated pipeline. Two modes are available:
  • Aggregated (ACB): Prefill and decode run collocated on the same nodes.
  • Disaggregated (DCB): Prefill and decode run on separate dedicated nodes, so each phase can be sized independently. The recommended node split is more prefill nodes than decode nodes — for example, three prefill nodes and one decode node.
DCB has not yet been internally validated on SambaStack. Use ACB for SambaStack deployments until DCB validation is published.
  • Prefill nodes process the input prompt
  • Decode nodes generate output tokens
The high-throughput configuration requires a minimum of 4 nodes in disaggregated mode. For single-node or small deployments, use the high-interactivity configuration instead.

Requirements and limitations

ConstraintDetails
Minimum nodes (high-throughput)4 nodes in a multinode setup
Dedicated systemsHigh-throughput and high-interactivity configurations require dedicated systems — bundling with other models is not supported
Node configuration (disaggregated mode)More prefill nodes than decode nodes required (for example, 3 prefill : 1 decode)
Checkpoint versionUse the latest checkpoint version listed in the Model CR — older versions are not compatible with high-throughput PEFs

PEF configurations

Use the following PEF CR identifiers when building your bundles. See Custom Bundle Deployment for the full bundle-building procedure.
Custom Resource (CR): A Kubernetes extension object. Model CRs and PEF CRs define model and PEF configurations in the cluster.
When referencing a PEF CR in a BundleTemplate, append the version number: for example, deepseek-ss8192-bs1:1. Use version 1 unless kubectl describe pef <pef-name> shows a higher stable version is available.

High-throughput PEFs

PEF CRSequence lengthBatch size
deepseek-ss32768-bs1-cb2-6432768 (32K)64
deepseek-ss16384-bs1-cb2-12816384 (16K)128
deepseek-ss8192-bs1-cb2-2568192 (8K)256
Picking a high-throughput PEF:
  • Choose the sequence length (ss) that fits your longest prompt plus expected output tokens.
  • Higher batch sizes serve more concurrent decode requests per node but require more RDU memory. The table lists the supported combinations.

High-interactivity PEFs

PEF CRSequence lengthBatch size
deepseek-ss4096-bs14096 (4K)1
deepseek-ss4096-bs44096 (4K)4
deepseek-ss8192-bs18192 (8K)1
deepseek-ss8192-bs48192 (8K)4
deepseek-ss16384-bs116384 (16K)1
deepseek-ss32768-bs132768 (32K)1
deepseek-ss131072-bs1131072 (128K)1
Picking a high-interactivity PEF:
  • Match the sequence length to your prompt plus expected output budget.
  • bs1 minimizes per-user latency. bs4 trades a small latency increase for higher per-node throughput when you have multiple concurrent users.

Build a bundle

No prebuilt bundles ship for these configurations — you create a custom bundle using the PEF CRs listed above. Follow the Custom Bundle Deployment guide and reference the relevant PEF CR when defining your BundleTemplate. When using a high-throughput PEF in your BundleTemplate, set continuous_batching: true in the expert definition:
DeepSeek-V3-0324:
  experts:
    8k:
      configs:
      - continuous_batching: true
        pef: deepseek-ss8192-bs1-cb2-256:1
Then configure your BundleDeployment for the appropriate mode. Aggregated mode (ACB):
groups:
  - name: default
    continuous_batching:
      mode: aggregate
    minReplicas: 1
    qosList:
    - web
    - free
Disaggregated mode (DCB):
groups:
  - name: default
    continuous_batching:
      use_mpi: true
      prefill:
        minReplicas: 3
      decode:
        minReplicas: 1
    minReplicas: 1
    qosList:
    - web
    - free

Verify your deployment

After deploying the bundle, confirm the configuration is active:
kubectl describe bundledeployment <bundle-deployment-name>
In the output, look for continuous_batching.mode set to aggregate (ACB) or disaggregate (DCB), and confirm the replica counts under prefill and decode match what you configured.

Switch between configurations

To switch between high-throughput and high-interactivity, redeploy the bundle with the appropriate PEF and BundleTemplate settings. When switching to high-interactivity, remove continuous_batching: true from the expert and remove the continuous_batching block from the BundleDeployment. The model name in API calls does not change.

Monitor your deployment

The SambaStack logging system emits per-request metrics relevant to these deployments:
MetricLog keyWhat it tells you
Decode queue timedecode_queue_timeTime spent waiting in continuous batching queues — high values indicate decode saturation
Time to first tokentime_to_first_tokenPrefill latency per request — key indicator for high-interactivity deployments
Completion tokens/seccompletion_tokens_per_secAggregate throughput — key indicator for high-throughput deployments
See Logs for the full list of available metrics and example queries.