SambaStack lets you use draft models, with either SambaNova-provided or user-provided (custom) checkpoints, to improve inference performance in certain deployment scenarios. Draft models enable speculative decoding, which can reduce latency and increase throughput when configured appropriately. This page explains how to deploy models using speculative decoding: when the technique is appropriate, how to validate compatibility between draft and target models, and how to configure custom draft checkpoints.
If you are unfamiliar with Models vs. Experts, Bundles vs. Bundle Templates, Model Manifests, or model deployment, see the Model deployment page.

What is speculative decoding?

Speculative decoding is an inference acceleration technique that pairs a fast, lightweight “draft” model with a larger, higher-quality “target” model:
  • The draft model proposes multiple next tokens
  • The target model quickly verifies or rejects them in parallel
  • This approach significantly speeds up inference decode time, especially when the draft model is closely “aligned” to the target model
Key benefit: Speculative decoding accelerates inference without affecting the output distributions (and thereby accuracy) of the target model.

When NOT to use speculative decoding

Speculative decoding is most effective when the draft model can predict a meaningful portion of the target model’s next tokens. In some situations, however, it can provide little benefit—or even reduce overall performance:
| Scenario | Why it hurts performance |
| --- | --- |
| Long inputs with short outputs | Both the draft and target models must process the entire input before generation begins. The added overhead of running the draft model can outweigh any speedup. |
| Very large gaps between model sizes | When the draft model is much smaller than (or otherwise poorly aligned with) the target model, acceptance rates tend to drop. Low acceptance rates mean more rejected tokens, more corrective forward passes on the target model, and ultimately decreased performance. |
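As a rough illustration of why low acceptance rates hurt (this uses the standard expected-throughput analysis from the speculative decoding literature, not a SambaStack-specific formula): if the draft model proposes k tokens per round and each is accepted with probability a, the target model emits about (1 − a^(k+1)) / (1 − a) tokens per forward pass. With k = 5, an acceptance rate of a = 0.8 yields roughly 3.7 tokens per target pass, while a = 0.3 yields only about 1.4, which is barely better than decoding without a draft model once the draft's own overhead is counted.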

Using custom draft checkpoints for speculative decoding

Some model experts in SambaStack are configured to use speculative decoding. When speculative decoding is enabled for a model expert, the draft and target models must be deployed together, and the checkpoints for both models can be customized. This section walks through the full workflow for deploying a custom draft checkpoint, including:
  • How to identify which deployable Bundles use speculative decoding
  • How to validate draft–target compatibility
  • How to update your Bundle

Workflow overview

Deploying a custom draft checkpoint for speculative decoding follows these high-level steps:
| Step | Action |
| --- | --- |
| 1 | Confirm your deployment uses speculative decoding by checking the Bundle Template |
| 2 | Convert your custom draft checkpoint into the SambaNova-compatible format |
| 3 | Validate compatibility between your draft and target checkpoints (recommended) |
| 4 | Upload the converted draft checkpoint to your GCS bucket |
| 5 | Register your draft checkpoint by creating a Model Manifest |
| 6 | Update your Bundle configuration to reference the custom draft checkpoint |

Steps to deploy a custom draft checkpoint

Step 1: Confirm your deployment uses speculative decoding

To deploy a custom draft checkpoint, first confirm that the model experts in the Bundle you want to deploy are configured for speculative decoding. This information is found in the Bundle Template definition. In the example below, the Meta-Llama-3.3-70B-Instruct model has an expert (8k) whose configuration includes a spec_decoding block, which identifies both the draft expert and the draft model:
apiVersion: sambanova.ai/v1alpha1
kind: BundleTemplate
metadata:
  name: 70b-ss-4-8k-tk
spec:
  models:
    Meta-Llama-3.3-70B-Instruct:
      experts:
        8k:
          configs:
          - batch_size: 1
            num_tokens_at_a_time: 1
            pef: LLAMA3_70B_8K_PEF_BS1
            resubmit_to: Meta-Llama-3.3-70B-Instruct-16k
            spec_decoding:
              draft_expert: 8k
              draft_model: Meta-Llama-3.2-1B-Instruct
              k: 5
          ...
      ...
    ...
    Meta-Llama-3.2-1B-Instruct:
      experts:
        8k:
          configs:
          - batch_size: 1
            num_tokens_at_a_time: 20
            pef: LLAMA3D2_1B_8K_PEF_BS1
          - batch_size: 2
            ...
In this example:
  • Meta-Llama-3.3-70B-Instruct is the target model
  • Meta-Llama-3.2-1B-Instruct is the draft model used for speculative decoding
  • Both Model Templates appear in the Bundle Template because speculative decoding requires both models to be part of the deployment
  • Both Model Template names must appear in your eventual Bundle
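If you want to inspect a running cluster rather than a YAML file, you can query the Bundle Template like any other custom resource. A minimal sketch, assuming your installation registers the BundleTemplate kind under the resource name bundletemplates (verify with kubectl api-resources):

# List available Bundle Templates
kubectl get bundletemplates

# Print one template and search for speculative decoding configuration
kubectl get bundletemplate 70b-ss-4-8k-tk -o yaml | grep -n -B 1 -A 3 "spec_decoding"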
Step 2: Convert your custom draft checkpoint

If you are replacing the draft model’s checkpoint with your own fine-tuned or custom version, convert it into the SambaNova-compatible format using the Checkpoint Conversion Tool.
  1. If you have not already set up the tool, follow the Download and set up instructions on the Checkpoint Conversion Tool documentation page.
  2. Then convert the draft checkpoint using the steps in Convert and validate checkpoint.
SambaStack-provided checkpoints are already in the correct format. If you are using a SambaStack-provided draft checkpoint, skip this step.
Step 3: Validate draft–target checkpoint compatibility

When deploying custom checkpoints with speculative decoding, we recommend verifying that the draft checkpoint is compatible with the target checkpoint. This helps prevent unexpected errors during inference. The Checkpoint Conversion Tool provides a built-in utility that checks draft–target compatibility. Use the validation utility to confirm:
  • The draft and target models align structurally
  • Their tokenizers are compatible
  • Speculative decoding can run safely with this pair

Command template

docker run -v $HOST_WORKING_DIR:$DOCKER_WORKING_DIR --rm -it \
    --platform linux/amd64 \
    $IMAGE_NAME \
    validate-sd \
    --target_model $TARGET_MODEL \
    --target_checkpoint_path "$DOCKER_WORKING_DIR/$TARGET_CHECKPOINT_DIR" \
    --draft_model $DRAFT_MODEL \
    --draft_checkpoint_path "$DOCKER_WORKING_DIR/$DRAFT_CHECKPOINT_DIR" \
    --server $SERVER \
    --cache_location $CACHE_LOCATION

Parameters

These are the primary input flags to the validate-sd command:
| Flag / Variable | Type | Description |
| --- | --- | --- |
| --target_model / TARGET_MODEL | str | The target model family to validate against (e.g., "llama3-70b"). This should match the model family used in your target deployment. |
| --target_checkpoint_path | str | Path inside the container to the directory containing the converted target checkpoint. Typically $DOCKER_WORKING_DIR/<subdir> mounted from HOST_WORKING_DIR. |
| --draft_model / DRAFT_MODEL | str | The draft model family to validate (e.g., "llama3-1b"). This should match the model family used as the draft model in your speculative decoding configuration. |
| --draft_checkpoint_path | str | Path inside the container to the directory containing the converted draft checkpoint. Typically $DOCKER_WORKING_DIR/<subdir> mounted from HOST_WORKING_DIR. |
| --server / SERVER | str | Source of serving metadata. Can be embedded, a local path, or a remote URL such as https://api.sambanova.ai/. Typically this is the base endpoint URL for your SambaStack instance. |
| --cache_location / CACHE_LOCATION | str | Location (inside the container) where serving metadata/configs are stored. Usually a subdirectory of $DOCKER_WORKING_DIR and visible on the host under $HOST_WORKING_DIR/$CACHE_LOCATION. |

Host-level variables

In addition to the flags above, you will typically set the following environment variables for the Docker command:
| Variable | Description |
| --- | --- |
| HOST_WORKING_DIR | Directory on the host that contains your converted target and draft checkpoints. Must be writable. |
| DOCKER_WORKING_DIR | Directory inside the container where HOST_WORKING_DIR is mounted (matches the right side of -v). |
| IMAGE_NAME | Full image name of the Checkpoint Conversion Tool container. |
| TARGET_CHECKPOINT_DIR | Subdirectory under HOST_WORKING_DIR containing the target checkpoint directory (mirrored under DOCKER_WORKING_DIR). |
| DRAFT_CHECKPOINT_DIR | Subdirectory under HOST_WORKING_DIR containing the draft checkpoint directory (mirrored under DOCKER_WORKING_DIR). |
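For reference, here is one way these variables might be set before running the command template above. The paths, directory names, and image name are illustrative placeholders, not defaults:

# Host directory containing both converted checkpoints (must be writable)
export HOST_WORKING_DIR=/home/user/sn-checkpoints
# Mount point inside the container (right-hand side of -v)
export DOCKER_WORKING_DIR=/work
# Checkpoint Conversion Tool image from your registry
export IMAGE_NAME=<your-registry>/<checkpoint-conversion-tool>:<tag>
# Subdirectories under HOST_WORKING_DIR holding the converted checkpoints
export TARGET_CHECKPOINT_DIR=llama3-70b-target-converted
export DRAFT_CHECKPOINT_DIR=llama3-1b-draft-converted
# Model families to validate (see the parameter table above)
export TARGET_MODEL=llama3-70b
export DRAFT_MODEL=llama3-1b
# Serving metadata source and cache directory inside the container
export SERVER=https://api.sambanova.ai/
export CACHE_LOCATION=cache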
Important: Both draft and target checkpoints must be converted before validation. If you are using a SambaStack-provided target checkpoint, you do not need to convert it—all SambaStack-provided checkpoints are already in the correct format.
This step is optional but strongly recommended to avoid runtime errors.
Step 4: Upload your custom draft checkpoint

Upload the converted draft checkpoint directory to your GCS bucket and make sure it is readable by your SambaStack service account.
gcloud storage cp -r <LOCAL_DRAFT_CHECKPOINT_DIR> gs://<BUCKET_NAME>/<PATH>/
For detailed upload instructions and GCS permission configuration, see Deploying custom checkpoints → Upload your custom checkpoint.
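If the service account cannot yet read the bucket, one common approach (assuming you know the service account email; your organization may manage bucket permissions differently) is:

# Verify the upload landed where you expect
gcloud storage ls gs://<BUCKET_NAME>/<PATH>/

# Grant the SambaStack service account read access to the bucket
gcloud storage buckets add-iam-policy-binding gs://<BUCKET_NAME> \
    --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
    --role="roles/storage.objectViewer"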
Step 5: Register your draft checkpoint

Similar to Step 3 in the workflow to deploy a custom checkpoint, register your draft checkpoint by creating a Model Manifest. The Model Manifest stores relevant information about your checkpoint, such as:
  • The name to use in API requests or other YAMLs
  • Supported languages
  • Description of the checkpoint

Example Model Manifest

apiVersion: sambanova.ai/v1alpha1
kind: Model
metadata:
  name: My-Custom-Llama3.1-8B
spec:
  aliases:
  - my-custom-llama3.1-8b
  - My-Custom-Llama3.1-8B
  metadata:
    architecture: Llama 3.1
    category:
    - General
    - Instruct
    github_link: https://huggingface.co/path/to/My-Custom-Llama3.1-8B
    hf_link: 'https://huggingface.co/path/to/My-Custom-Llama3.1-8B'
    languages:
    - English
    - German
    - French
    - Italian
    - Portuguese
    - Hindi
    - Spanish
    - Thai
    license: llama3.1
    name: Salesforce/My-Custom-Llama3.1-8B
    overview: A description or overview of My-Custom-Llama3.1-8B. Typically this is the description found in a modelcard.
    status: active
    vocabulary_size: 128256
  name: My-Custom-Llama3.1-8B
  owner: jane@doe.ai
  price:
    input_tokens: 10
    output_tokens: 20
  public: true
  tokenizer:
    endpointUrl: ''
    path: ./Meta-Llama-3.1-8B-Instruct_tokenizer
The name field (My-Custom-Llama3.1-8B in the example above) is the name you will reference in the subsequent steps.

Select tokenizer fields

The tokenizer field in the Model Manifest can be set to the base model used for your custom checkpoint. For instance, if your custom checkpoint is fine-tuned from Meta-Llama-3.1-70B-Instruct, then the tokenizer path can be set as follows:
tokenizer:
  endpointUrl: ''
  path: ./Meta-Llama-3.1-70B-Instruct_tokenizer
The tokenizer field in the Model Manifest is used only for input checks that calculate sequence-length requirements before generation.

Apply the Model Manifest

After you have created your Model Manifest, apply it using:
kubectl apply -f your_model_manifest_filename.yaml
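You can then confirm the manifest was registered, for example:

# The new model should appear in the list
kubectl get models

# Inspect the registered manifest
kubectl describe model My-Custom-Llama3.1-8B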
Step 6: Update your Bundle configuration

Once your draft checkpoint is uploaded and registered, update your Bundle configuration the same way you would for any custom checkpoint.

Before: Using SambaNova-provided draft

Below is a Bundle before introducing a custom draft checkpoint—only the target checkpoint has been customized:
apiVersion: sambanova.ai/v1alpha1
kind: Bundle
metadata:
  name: 70b-3dot3-ss-4-8-64-128k
spec:
  checkpoints:
    LLAMA3D2_1B_CKPT:
      source: gs://service-account-gcs-bucket/.../ckpts/meta-llama3-1b-instruct
      toolSupport: true
    CUSTOM_70B_TARGET_CKPT:
      source: gs://your-gcs-bucket/path/to/custom/ckpt/dir
      toolSupport: false
  models:
    Meta-Llama-3.2-1B-Instruct:
      checkpoint: LLAMA3D2_1B_CKPT
      template: Meta-Llama-3.2-1B-Instruct
    Custom-Target-Llama-3.3-70B:
      checkpoint: CUSTOM_70B_TARGET_CKPT
      template: Meta-Llama-3.3-70B-Instruct
...

After: Using custom draft

To replace the draft model checkpoint, update the Bundle with your custom draft checkpoint just as you would for the target model by editing both the checkpoints and models sections:
apiVersion: sambanova.ai/v1alpha1
kind: Bundle
metadata:
  name: 70b-3dot3-ss-4-8-64-128k
spec:
  checkpoints:
    CUSTOM_1B_DRAFT_CKPT:
      source: gs://your-gcs-bucket/path/to/custom/draft/ckpt/dir
      toolSupport: true
    CUSTOM_70B_TARGET_CKPT:
      source: gs://your-gcs-bucket/path/to/custom/ckpt/dir
      toolSupport: false
  models:
    Custom-Draft-Llama-3.2-1B:
      checkpoint: CUSTOM_1B_DRAFT_CKPT
      template: Meta-Llama-3.2-1B-Instruct
    Custom-Target-Llama-3.3-70B:
      checkpoint: CUSTOM_70B_TARGET_CKPT
      template: Meta-Llama-3.3-70B-Instruct
...

Apply the Bundle

kubectl apply -f your_bundle_filename.yaml

Important notes

  • You do not need to call the draft model directly in your inference API requests
  • When you send requests to the target model (e.g., Custom-Target-Llama-3.3-70B), SambaStack automatically runs speculative decoding using the corresponding draft checkpoint
  • Both checkpoints must be present in the Bundle for speculative decoding to work
  • The steps for modifying the deployment to point to your custom draft checkpoint are similar to the steps for custom checkpoints. See Reference the checkpoint in your deployment yaml in the Deploying custom checkpoints page for details.

Verify deployment

After applying the Bundle, verify your deployment:
# Check Bundle status
kubectl get bundles

# Check Bundle details
kubectl describe bundle <your-bundle-name>

# Verify models are registered
kubectl get models
Test by sending an inference request to your target model. SambaStack automatically uses the draft model for speculative decoding.
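For example, here is a minimal request sketch, assuming your SambaStack instance exposes an OpenAI-compatible chat completions endpoint and that you deployed the After Bundle above (the endpoint URL and auth header are placeholders for your environment):

curl -s https://<your-sambastack-endpoint>/v1/chat/completions \
  -H "Authorization: Bearer $SAMBASTACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Custom-Target-Llama-3.3-70B",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 64
  }'

Note that the model field names the target model only; the draft model is never called directly.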

Troubleshooting

| Issue | Possible cause | Solution |
| --- | --- | --- |
| Validation fails | Mismatched vocabulary sizes | Ensure the draft and target use compatible tokenizers |
| Low acceptance rates | Poor draft–target alignment | Use a draft model more closely aligned with the target |
| Performance degradation | Long inputs, short outputs | Consider disabling speculative decoding for this workload |
| Deployment fails | Missing draft checkpoint in Bundle | Ensure both checkpoints are defined in spec.checkpoints |
| Model not found | Model name mismatch | Verify that models.<name> matches the Model Manifest name |
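For the vocabulary-size check in the first row, a quick sanity test on Hugging Face-style checkpoint directories (assuming each contains a config.json with a vocab_size field) is to compare the two values directly:

# Values should match (e.g., 128256 for Llama 3.x models)
grep '"vocab_size"' <TARGET_CHECKPOINT_DIR>/config.json <DRAFT_CHECKPOINT_DIR>/config.json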