SambaStack lets you use draft models, with either SambaNova-provided or user-provided (custom) checkpoints, to improve inference performance in certain deployment scenarios. Draft models enable speculative decoding, which can reduce latency and increase throughput when configured appropriately. This page explains how to deploy models using speculative decoding: when the technique is appropriate, how to validate compatibility between draft and target models, and how to configure custom draft checkpoints.
If you are unfamiliar with Models vs. Experts, Bundles vs. Bundle Templates, Model Manifests, or model deployment, see the Model deployment page.

What is speculative decoding?

Speculative decoding is an inference acceleration technique that pairs a fast, lightweight “draft” model with a larger, higher-quality “target” model:
  • The draft model proposes multiple next tokens
  • The target model quickly verifies or rejects them in parallel
  • This approach significantly speeds up inference decode time, especially when the draft model is closely “aligned” to the target model
Key benefit: Speculative decoding accelerates inference without affecting the output distributions (and thereby accuracy) of the target model.

When NOT to use speculative decoding

Speculative decoding is most effective when the draft model can predict a meaningful portion of the target model’s next tokens. In some situations, however, it can provide little benefit—or even reduce overall performance:
| Scenario | Why it hurts performance |
| --- | --- |
| Long inputs with short outputs | Both the draft and target models must process the entire input before generation begins. The added overhead of running the draft model can outweigh any speedup. |
| Very large gaps between model sizes | When the draft model is much smaller than (or otherwise poorly aligned with) the target model, acceptance rates tend to drop. Low acceptance rates mean more rejected tokens, more corrective forward passes on the target model, and ultimately decreased performance. |
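As a rough illustration of why low acceptance rates hurt (this uses the standard expected-throughput analysis from the speculative decoding literature, not a SambaStack-specific formula): if the draft model proposes k tokens per round and each is accepted with probability a, the target model emits about (1 − a^(k+1)) / (1 − a) tokens per forward pass. With k = 5, an acceptance rate of a = 0.8 yields roughly 3.7 tokens per target pass, while a = 0.3 yields only about 1.4, which is barely better than decoding without a draft model once the draft's own overhead is counted.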

Using custom draft checkpoints for speculative decoding

Some model experts in SambaStack are configured to use speculative decoding. When speculative decoding is enabled for a model expert, the draft and target models must be deployed together, and the checkpoints for both models can be customized. This section walks through the full workflow for deploying a custom draft checkpoint, including:
  • How to identify which deployable Bundles use speculative decoding
  • How to validate draft–target compatibility
  • How to update your Bundle

Workflow overview

Deploying a custom draft checkpoint for speculative decoding follows these high-level steps:
| Step | Action |
| --- | --- |
| 1 | Confirm your deployment uses speculative decoding by checking the Bundle Template |
| 2 | Convert your custom draft checkpoint into the SambaNova-compatible format |
| 3 | Validate compatibility between your draft and target checkpoints (recommended) |
| 4 | Upload the converted draft checkpoint to your GCS bucket |
| 5 | Register your draft checkpoint by creating a Model Manifest |
| 6 | Update your Bundle configuration to reference the custom draft checkpoint |

Steps to deploy a custom draft checkpoint

Step 1: Confirm your deployment uses speculative decoding

To deploy a custom draft checkpoint, first confirm that the model experts in the Bundle you want to deploy are configured for speculative decoding. This information is found in the Bundle Template definition. In the example below, the Meta-Llama-3.3-70B-Instruct model has an expert (8k) whose configuration includes a spec_decoding block, which identifies both the draft expert and the draft model:
apiVersion: sambanova.ai/v1alpha1
kind: BundleTemplate
metadata:
  name: 70b-ss-4-8k-tk
spec:
  models:
    Meta-Llama-3.3-70B-Instruct:
      experts:
        8k:
          configs:
          - batch_size: 1
            num_tokens_at_a_time: 1
            pef: LLAMA3_70B_8K_PEF_BS1
            resubmit_to: Meta-Llama-3.3-70B-Instruct-16k
            spec_decoding:
              draft_expert: 8k
              draft_model: Meta-Llama-3.2-1B-Instruct
              k: 5
          ...
      ...
    ...
    Meta-Llama-3.2-1B-Instruct:
      experts:
        8k:
          configs:
          - batch_size: 1
            num_tokens_at_a_time: 20
            pef: LLAMA3D2_1B_8K_PEF_BS1
          - batch_size: 2
            ...
In this example:
  • Meta-Llama-3.3-70B-Instruct is the target model
  • Meta-Llama-3.2-1B-Instruct is the draft model used for speculative decoding
  • Both Model Templates appear in the Bundle Template because speculative decoding requires both models to be part of the deployment
  • Both Model Template names must appear in your eventual Bundle
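If you want to inspect a running cluster rather than a YAML file, you can query the Bundle Template like any other custom resource. A minimal sketch, assuming your installation registers the BundleTemplate kind under the resource name bundletemplates (verify with kubectl api-resources):

# List available Bundle Templates
kubectl get bundletemplates

# Print one template and search for speculative decoding configuration
kubectl get bundletemplate 70b-ss-4-8k-tk -o yaml | grep -n -B 1 -A 3 "spec_decoding"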
Step 2: Convert your custom draft checkpoint

If you are replacing the draft model’s checkpoint with your own fine-tuned or custom version, convert it into the SambaNova-compatible format using the Checkpoint Conversion Tool.
  1. If you have not already set up the tool, follow the Download and set up instructions on the Checkpoint Conversion Tool documentation page.
  2. Then convert the draft checkpoint using the steps in Convert and validate checkpoint.
SambaStack-provided checkpoints are already in the correct format. If you are using a SambaStack-provided draft checkpoint, skip this step.
Step 3: Validate draft–target checkpoint compatibility

When deploying custom checkpoints with speculative decoding, we recommend verifying that the draft checkpoint is compatible with the target checkpoint. This helps prevent unexpected errors during inference. The Checkpoint Conversion Tool provides a built-in utility that checks draft–target compatibility. Use the validation utility to confirm:
  • The draft and target models align structurally
  • Their tokenizers are compatible
  • Speculative decoding can run safely with this pair

Command template

docker run -v $HOST_WORKING_DIR:$DOCKER_WORKING_DIR --rm -it \
    --platform linux/amd64 \
    $IMAGE_NAME \
    validate-sd \
    --target_model $TARGET_MODEL \
    --target_checkpoint_path "$DOCKER_WORKING_DIR/$TARGET_CHECKPOINT_DIR" \
    --draft_model $DRAFT_MODEL \
    --draft_checkpoint_path "$DOCKER_WORKING_DIR/$DRAFT_CHECKPOINT_DIR" \
    --server $SERVER \
    --cache_location $CACHE_LOCATION

Parameters

These are the primary input flags to the validate-sd command:
| Flag / Variable | Type | Description |
| --- | --- | --- |
| --target_model / TARGET_MODEL | str | The target model family to validate against (e.g., "llama3-70b"). This should match the model family used in your target deployment. |
| --target_checkpoint_path | str | Path inside the container to the directory containing the converted target checkpoint. Typically $DOCKER_WORKING_DIR/<subdir> mounted from HOST_WORKING_DIR. |
| --draft_model / DRAFT_MODEL | str | The draft model family to validate (e.g., "llama3-1b"). This should match the model family used as the draft model in your speculative decoding configuration. |
| --draft_checkpoint_path | str | Path inside the container to the directory containing the converted draft checkpoint. Typically $DOCKER_WORKING_DIR/<subdir> mounted from HOST_WORKING_DIR. |
| --server / SERVER | str | Source of serving metadata. Can be embedded, a local path, or a remote URL such as https://api.sambanova.ai/. Typically this is the base endpoint URL for your SambaStack instance. |
| --cache_location / CACHE_LOCATION | str | Location (inside the container) where serving metadata/configs are stored. Usually a subdirectory of $DOCKER_WORKING_DIR and visible on the host under $HOST_WORKING_DIR/$CACHE_LOCATION. |

Host-level variables

In addition to the flags above, you will typically set the following environment variables for the Docker command:
| Variable | Description |
| --- | --- |
| HOST_WORKING_DIR | Directory on the host that contains your converted target and draft checkpoints. Must be writable. |
| DOCKER_WORKING_DIR | Directory inside the container where HOST_WORKING_DIR is mounted (matches the right side of -v). |
| IMAGE_NAME | Full image name of the Checkpoint Conversion Tool container. |
| TARGET_CHECKPOINT_DIR | Subdirectory under HOST_WORKING_DIR containing the target checkpoint directory (mirrored under DOCKER_WORKING_DIR). |
| DRAFT_CHECKPOINT_DIR | Subdirectory under HOST_WORKING_DIR containing the draft checkpoint directory (mirrored under DOCKER_WORKING_DIR). |
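For reference, here is one way these variables might be set before running the command template above. The paths, directory names, and image name are illustrative placeholders, not defaults:

# Host directory containing both converted checkpoints (must be writable)
export HOST_WORKING_DIR=/home/user/sn-checkpoints
# Mount point inside the container (right-hand side of -v)
export DOCKER_WORKING_DIR=/work
# Checkpoint Conversion Tool image from your registry
export IMAGE_NAME=<your-registry>/<checkpoint-conversion-tool>:<tag>
# Subdirectories under HOST_WORKING_DIR holding the converted checkpoints
export TARGET_CHECKPOINT_DIR=llama3-70b-target-converted
export DRAFT_CHECKPOINT_DIR=llama3-1b-draft-converted
# Model families to validate (see the parameter table above)
export TARGET_MODEL=llama3-70b
export DRAFT_MODEL=llama3-1b
# Serving metadata source and cache directory inside the container
export SERVER=https://api.sambanova.ai/
export CACHE_LOCATION=cache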
Important: Both draft and target checkpoints must be converted before validation. If you are using a SambaStack-provided target checkpoint, you do not need to convert it—all SambaStack-provided checkpoints are already in the correct format.
This step is optional but strongly recommended to avoid runtime errors.
Step 4: Upload your custom draft checkpoint

Upload the converted draft checkpoint directory to your GCS bucket and make sure it is readable by your SambaStack service account.
gcloud storage cp -r <LOCAL_DRAFT_CHECKPOINT_DIR> gs://<BUCKET_NAME>/<PATH>/
For detailed upload instructions and GCS permission configuration, see Deploying custom checkpoints → Upload your custom checkpoint.
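If the service account cannot yet read the bucket, one common approach (assuming you know the service account email; your organization may manage bucket permissions differently) is:

# Verify the upload landed where you expect
gcloud storage ls gs://<BUCKET_NAME>/<PATH>/

# Grant the SambaStack service account read access to the bucket
gcloud storage buckets add-iam-policy-binding gs://<BUCKET_NAME> \
    --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
    --role="roles/storage.objectViewer"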
Step 5: Register your draft checkpoint

Similar to Step 3 in the workflow to deploy a custom checkpoint, register your draft checkpoint by creating a Model Manifest. The Model Manifest stores relevant information about your checkpoint, such as:
  • The name to use in API requests or other YAMLs
  • Supported languages
  • Description of the checkpoint

Example Model Manifest

apiVersion: sambanova.ai/v1alpha1
kind: Model
metadata:
  name: My-Custom-Llama3.1-8B
spec:
  aliases:
  - my-custom-llama3.1-8b
  - My-Custom-Llama3.1-8B
  metadata:
    architecture: Llama 3.1
    category:
    - General
    - Instruct
    github_link: https://huggingface.co/path/to/My-Custom-Llama3.1-8B
    hf_link: 'https://huggingface.co/path/to/My-Custom-Llama3.1-8B'
    languages:
    - English
    - German
    - French
    - Italian
    - Portuguese
    - Hindi
    - Spanish
    - Thai
    license: llama3.1
    name: Salesforce/My-Custom-Llama3.1-8B
    overview: A description or overview of My-Custom-Llama3.1-8B. Typically this is the description found in a modelcard.
    status: active
    vocabulary_size: 128256
  name: My-Custom-Llama3.1-8B
  owner: jane@doe.ai
  price:
    input_tokens: 10
    output_tokens: 20
  public: true
  tokenizer:
    endpointUrl: ''
    path: ./Meta-Llama-3.1-8B-Instruct_tokenizer
The name field (My-Custom-Llama3.1-8B in the example above) is the name you will reference in the subsequent steps.

Select tokenizer fields

The tokenizer field in the Model Manifest can be set to the base model used for your custom checkpoint. For instance, if your custom checkpoint is fine-tuned from Meta-Llama-3.1-70B-Instruct, then the tokenizer path can be set as follows:
tokenizer:
  endpointUrl: ''
  path: ./Meta-Llama-3.1-70B-Instruct_tokenizer
The tokenizer field in the Model Manifest is used only for input checks that calculate sequence-length requirements before generation.

Apply the Model Manifest

After you have created your Model Manifest, apply it using:
kubectl apply -f your_model_manifest_filename.yaml
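You can then confirm the manifest was registered, for example:

# The new model should appear in the list
kubectl get models

# Inspect the registered manifest
kubectl describe model My-Custom-Llama3.1-8B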
Step 6: Update your Bundle configuration

Once your draft checkpoint is uploaded and registered, update your Bundle configuration the same way you would for any custom checkpoint.

Before: Using SambaNova-provided draft

Below is a Bundle before introducing a custom draft checkpoint—only the target checkpoint has been customized:
apiVersion: sambanova.ai/v1alpha1
kind: Bundle
metadata:
  name: 70b-3dot3-ss-4-8-64-128k
spec:
  checkpoints:
    LLAMA3D2_1B_CKPT:
      source: gs://service-account-gcs-bucket/.../ckpts/meta-llama3-1b-instruct
      toolSupport: true
    CUSTOM_70B_TARGET_CKPT:
      source: gs://your-gcs-bucket/path/to/custom/ckpt/dir
      toolSupport: false
  models:
    Meta-Llama-3.2-1B-Instruct:
      checkpoint: LLAMA3D2_1B_CKPT
      template: Meta-Llama-3.2-1B-Instruct
    Custom-Target-Llama-3.3-70B:
      checkpoint: CUSTOM_70B_TARGET_CKPT
      template: Meta-Llama-3.3-70B-Instruct
...

After: Using custom draft

To replace the draft model checkpoint, update the Bundle with your custom draft checkpoint just as you would for the target model by editing both the checkpoints and models sections:
apiVersion: sambanova.ai/v1alpha1
kind: Bundle
metadata:
  name: 70b-3dot3-ss-4-8-64-128k
spec:
  checkpoints:
    CUSTOM_1B_DRAFT_CKPT:
      source: gs://your-gcs-bucket/path/to/custom/draft/ckpt/dir
      toolSupport: true
    CUSTOM_70B_TARGET_CKPT:
      source: gs://your-gcs-bucket/path/to/custom/ckpt/dir
      toolSupport: false
  models:
    Custom-Draft-Llama-3.2-1B:
      checkpoint: CUSTOM_1B_DRAFT_CKPT
      template: Meta-Llama-3.2-1B-Instruct
    Custom-Target-Llama-3.3-70B:
      checkpoint: CUSTOM_70B_TARGET_CKPT
      template: Meta-Llama-3.3-70B-Instruct
...

Apply the Bundle

kubectl apply -f your_bundle_filename.yaml

Important notes

  • You do not need to call the draft model directly in your inference API requests
  • When you send requests to the target model (e.g., Custom-Target-Llama-3.3-70B), SambaStack automatically runs speculative decoding using the corresponding draft checkpoint
  • Both checkpoints must be present in the Bundle for speculative decoding to work
  • The steps for modifying the deployment to point to your custom draft checkpoint are similar to the steps for custom checkpoints. See Reference the checkpoint in your deployment yaml in the Deploying custom checkpoints page for details.

Verify deployment

After applying the Bundle, verify your deployment:
# Check Bundle status
kubectl get bundles

# Check Bundle details
kubectl describe bundle <your-bundle-name>

# Verify models are registered
kubectl get models
Test by sending an inference request to your target model. SambaStack automatically uses the draft model for speculative decoding.
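For example, here is a minimal request sketch, assuming your SambaStack instance exposes an OpenAI-compatible chat completions endpoint and that you deployed the After Bundle above (the endpoint URL and auth header are placeholders for your environment):

curl -s https://<your-sambastack-endpoint>/v1/chat/completions \
  -H "Authorization: Bearer $SAMBASTACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Custom-Target-Llama-3.3-70B",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 64
  }'

Note that the model field names the target model only; the draft model is never called directly.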

Troubleshooting

| Issue | Possible cause | Solution |
| --- | --- | --- |
| Validation fails | Mismatched vocabulary sizes | Ensure the draft and target use compatible tokenizers |
| Low acceptance rates | Poor draft–target alignment | Use a draft model more closely aligned with the target |
| Performance degradation | Long inputs, short outputs | Consider disabling speculative decoding for this workload |
| Deployment fails | Missing draft checkpoint in Bundle | Ensure both checkpoints are defined in spec.checkpoints |
| Model not found | Model name mismatch | Verify that models.<name> matches the Model Manifest name |
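For the vocabulary-size check in the first row, a quick sanity test on Hugging Face-style checkpoint directories (assuming each contains a config.json with a vocab_size field) is to compare the two values directly:

# Values should match (e.g., 128256 for Llama 3.x models)
grep '"vocab_size"' <TARGET_CHECKPOINT_DIR>/config.json <DRAFT_CHECKPOINT_DIR>/config.json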