If you are unfamiliar with Models vs. Experts, Bundles vs. Bundle Templates, Model Manifests, or model deployment, see the Model deployment page.
What is speculative decoding
Speculative decoding is an inference acceleration technique that pairs a fast, lightweight “draft” model with a larger, higher-quality “target” model:
- The draft model proposes multiple next tokens
- The target model quickly verifies or rejects them in parallel
- This approach significantly speeds up inference decode time, especially when the draft model is closely “aligned” to the target model
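To make the draft/verify loop concrete, here is a toy sketch in Python. The “models” are stand-in functions rather than real LLMs, and the greedy accept/correct rule is a simplification of the sampling-based acceptance used in practice:

```python
# Toy sketch of the speculative decoding loop (greedy variant).
# draft_next and target_next are stand-ins for real models.

def draft_next(tokens):
    # Fast draft model: cheap guess for the next token.
    return tokens[-1] + 1

def target_next(tokens):
    # Slow target model: the ground truth we want to match.
    return tokens[-1] + 1 if tokens[-1] % 5 != 0 else 0

def speculative_decode(prompt, n_new, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1. Draft proposes k tokens autoregressively (cheap).
        proposal = list(tokens)
        for _ in range(k):
            proposal.append(draft_next(proposal))
        # 2. Target verifies the k proposals (in a real system, one
        #    parallel forward pass): accept the longest matching prefix,
        #    then emit the target's correction at the first mismatch.
        accepted, ctx = [], list(tokens)
        for tok in proposal[len(tokens):]:
            expected = target_next(ctx)
            if tok == expected:
                accepted.append(tok)
                ctx.append(tok)
            else:
                accepted.append(expected)  # target's corrective token
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new]
```

The output is identical to decoding with the target alone; the speedup comes from verifying several drafted tokens per target pass whenever the draft is well aligned.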
When NOT to use speculative decoding
Speculative decoding is most effective when the draft model can predict a meaningful portion of the target model’s next tokens. In some situations, however, it can provide little benefit, or even reduce overall performance:
| Scenario | Why it hurts performance |
|---|---|
| Long inputs with short outputs | Both the draft and target models must process the entire input before generation begins. The added overhead of running the draft model can outweigh any speedup. |
| Very large gaps between model sizes | When the draft model is much smaller (or otherwise poorly aligned) compared to the target model, acceptance rates tend to drop. Low acceptance rates mean more rejected tokens, more corrective forward passes on the target model, and ultimately decreased performance. |
Using custom draft checkpoints for speculative decoding
Some model experts in SambaStack are configured to use speculative decoding. When speculative decoding is enabled for a model expert, the draft and target models must be deployed together, and the checkpoints for both the target and the draft model may be customized. This section walks through the full workflow for deploying a custom draft checkpoint, including:
- How to identify which deployable Bundles use speculative decoding
- How to validate draft–target compatibility
- How to update your Bundle
Workflow overview
Deploying a custom draft checkpoint for speculative decoding follows these high-level steps:
| Step | Action |
|---|---|
| 1 | Confirm your deployment uses speculative decoding by checking the Bundle Template |
| 2 | Convert your custom draft checkpoint into the SambaNova-compatible format |
| 3 | Validate compatibility between your draft and target checkpoints (recommended) |
| 4 | Upload the converted draft checkpoint to your GCS bucket |
| 5 | Register your draft checkpoint by creating a Model Manifest |
| 6 | Update your Bundle configuration to reference the custom draft checkpoint |
Steps to deploy a custom draft checkpoint
1
Confirm your deployment uses speculative decoding
To deploy a custom draft checkpoint, first confirm that the model experts in the Bundle you would like to deploy are configured for speculative decoding. This information is found in the Bundle Template definition.
In this example, the Meta-Llama-3.3-70B-Instruct model has an expert (8k) whose configuration includes a spec_decoding block. This block identifies both the draft expert and the draft model:
- Meta-Llama-3.3-70B-Instruct is the target model
- Meta-Llama-3.2-1B-Instruct is the draft model used for speculative decoding
- Both Model Templates appear in the Bundle Template because speculative decoding requires both models to be part of the deployment
- Both Model Template names must appear in your eventual Bundle
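As a rough sketch, the relevant portion of such a Bundle Template could look like the following. Only the spec_decoding block and the two model names come from the example above; the surrounding field names are illustrative, not the exact schema:

```yaml
# Illustrative only; the exact Bundle Template schema may differ.
model_templates:
  - name: Meta-Llama-3.3-70B-Instruct       # target model
    experts:
      - name: 8k
        spec_decoding:
          draft_model: Meta-Llama-3.2-1B-Instruct
          draft_expert: 8k                  # draft expert name (assumption)
  - name: Meta-Llama-3.2-1B-Instruct        # draft model, also in the template
```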
2
Convert your custom draft checkpoint
If you are replacing the draft model’s checkpoint with your own fine-tuned or custom version, convert it into the SambaNova-compatible format using the Checkpoint Conversion Tool.
- If you have not already set up the tool, follow the Download and set up instructions on the Checkpoint Conversion Tool documentation page.
- Then convert the draft checkpoint using the steps in Convert and validate checkpoint.
SambaStack-provided checkpoints are already in the correct format. If you are using a SambaStack-provided draft checkpoint, skip this step.
3
Validate draft–target checkpoint compatibility
When deploying custom checkpoints with speculative decoding, we recommend verifying that the draft checkpoint is compatible with the target checkpoint. This helps prevent unexpected errors during inference. The Checkpoint Conversion Tool provides a built-in utility that checks draft–target compatibility. Use the validation utility to confirm:
- The draft and target models align structurally
- Their tokenizers are compatible
- Speculative decoding can run safely with this pair
This step is optional but strongly recommended to avoid runtime errors.
Command template
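The validation run follows the usual Checkpoint Conversion Tool pattern: mount the host working directory into the container and invoke validate-sd with the flags described below. A sketch of the command (the exact container entrypoint may differ in your tool version):

```shell
# Command template (illustrative); set the environment variables first
# (see "Host-level variables" below).
docker run --rm \
  -v "$HOST_WORKING_DIR:$DOCKER_WORKING_DIR" \
  "$IMAGE_NAME" \
  validate-sd \
    --target_model "$TARGET_MODEL" \
    --target_checkpoint_path "$DOCKER_WORKING_DIR/$TARGET_CHECKPOINT_DIR" \
    --draft_model "$DRAFT_MODEL" \
    --draft_checkpoint_path "$DOCKER_WORKING_DIR/$DRAFT_CHECKPOINT_DIR" \
    --server "$SERVER" \
    --cache_location "$CACHE_LOCATION"
```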
Parameters
These are the primary input flags to the validate-sd command:
| Flag / Variable | Type | Description |
|---|---|---|
| --target_model / TARGET_MODEL | str | The target model family to validate against (e.g., "llama3-70b"). This should match the model family used in your target deployment. |
| --target_checkpoint_path | str | Path inside the container to the directory containing the converted target checkpoint. Typically $DOCKER_WORKING_DIR/<subdir> mounted from HOST_WORKING_DIR. |
| --draft_model / DRAFT_MODEL | str | The draft model family to validate (e.g., "llama3-1b"). This should match the model family used as the draft model in your speculative decoding configuration. |
| --draft_checkpoint_path | str | Path inside the container to the directory containing the converted draft checkpoint. Typically $DOCKER_WORKING_DIR/<subdir> mounted from HOST_WORKING_DIR. |
| --server / SERVER | str | Source of serving metadata. Can be embedded, a local path, or a remote URL such as https://api.sambanova.ai/. Typically this is the base endpoint URL for your SambaStack instance. |
| --cache_location / CACHE_LOCATION | str | Location (inside the container) where serving metadata/configs are stored. Usually a subdirectory of $DOCKER_WORKING_DIR and visible on the host under $HOST_WORKING_DIR/$CACHE_LOCATION. |
Host-level variables
In addition to the flags above, you will typically set the following environment variables for the Docker command:
| Variable | Description |
|---|---|
| HOST_WORKING_DIR | Directory on the host that contains your converted target and draft checkpoints. Must be writable. |
| DOCKER_WORKING_DIR | Directory inside the container where HOST_WORKING_DIR is mounted (matches the right side of -v). |
| IMAGE_NAME | Full image name of the Checkpoint Conversion Tool container. |
| TARGET_CHECKPOINT_DIR | Subdirectory under HOST_WORKING_DIR containing the target checkpoint directory (mirrored under DOCKER_WORKING_DIR). |
| DRAFT_CHECKPOINT_DIR | Subdirectory under HOST_WORKING_DIR containing the draft checkpoint directory (mirrored under DOCKER_WORKING_DIR). |
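For example, the host-level variables might be set like this before running the validation command. All values below are placeholders for your environment:

```shell
# Placeholder values; substitute your real paths and image name.
export HOST_WORKING_DIR="$HOME/sambanova/checkpoints"   # must be writable
export DOCKER_WORKING_DIR="/workspace"                  # mount point inside the container
export IMAGE_NAME="<checkpoint-conversion-tool-image>"  # full image name
export TARGET_CHECKPOINT_DIR="target-ckpt"              # under HOST_WORKING_DIR
export DRAFT_CHECKPOINT_DIR="draft-ckpt"                # under HOST_WORKING_DIR
```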
Important: Both draft and target checkpoints must be converted before validation. If you are using a SambaStack-provided target checkpoint, you do not need to convert it—all SambaStack-provided checkpoints are already in the correct format.
4
Upload your custom draft checkpoint
Upload the converted draft checkpoint directory to your GCS bucket and make sure it is readable by your SambaStack service account. For detailed upload instructions and GCS permission configuration, see Deploying custom checkpoints → Upload your custom checkpoint.
5
Register your draft checkpoint
Similar to Step 3 in the workflow to deploy a custom checkpoint, register your checkpoint by creating a Model Manifest. The Model Manifest stores relevant information about your checkpoint, such as:
- The name to use in API requests or other YAMLs
- Supported languages
- Description of the checkpoint
Example Model Manifest
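A sketch of what the Model Manifest might contain. The name matches the example used in the following steps; the exact field names depend on your SambaStack schema, and the description, language, and tokenizer values are placeholders:

```yaml
# Illustrative Model Manifest sketch; consult your SambaStack schema
# for the exact field names.
name: My-Custom-Llama3.1-8B                 # name used in API requests and other YAMLs
description: "Custom fine-tuned draft checkpoint"    # placeholder
supported_languages: [en]                   # placeholder
tokenizer: meta-llama/Meta-Llama-3.1-8B-Instruct     # base-model tokenizer (assumption)
```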
The name field (My-Custom-Llama3.1-8B in the above example) will be referenced again in the subsequent steps.
Select tokenizer fields
The tokenizer field in the Model Manifest can be set to the base model used for your custom checkpoint. For instance, if your custom checkpoint is fine-tuned from Meta-Llama-3.1-70B-Instruct, the tokenizer path can point at that base model. Note that the tokenizer field is only used for input checks that calculate sequence length requirements prior to generation time.
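For example, for a checkpoint fine-tuned from Meta-Llama-3.1-70B-Instruct, the field might look like this (the exact path format is an assumption):

```yaml
tokenizer: meta-llama/Meta-Llama-3.1-70B-Instruct   # base model of the fine-tune
```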
Apply the Model Manifest
After you have created your Model Manifest, apply it the same way you apply other manifests (see the Deploying custom checkpoints page for details).
6
Update your Bundle configuration
Once your draft checkpoint is uploaded and registered, update your Bundle configuration the same way you would for any custom checkpoint.
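In rough shape, the fields involved look like this. Everything below is illustrative: only spec.checkpoints and the models.<name> keys are referenced elsewhere in this page, and all paths and names are placeholders:

```yaml
# Illustrative sketch only; field names beyond spec.checkpoints and
# models.<name> are assumptions.
spec:
  checkpoints:
    - name: custom-target-checkpoint
      source: gs://<your-bucket>/<target-checkpoint-dir>   # custom target
    - name: custom-draft-checkpoint
      source: gs://<your-bucket>/<draft-checkpoint-dir>    # custom draft
  models:
    Custom-Target-Llama-3.3-70B:          # target; name from its Model Manifest
      checkpoint: custom-target-checkpoint
    My-Custom-Llama3.1-8B:                # draft; name from its Model Manifest
      checkpoint: custom-draft-checkpoint
```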
Before: Using SambaNova-provided draft
Below is a Bundle before introducing a custom draft checkpoint, where only the target checkpoint has been customized:
After: Using custom draft
To replace the draft model checkpoint, update the Bundle with your custom draft checkpoint just as you would for the target model by editing both the checkpoints and models sections:
Apply the Bundle
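Assuming your SambaStack Bundles are applied like other Kubernetes manifests, the updated Bundle can be re-applied with a command along these lines (the file name is a placeholder):

```shell
kubectl apply -f my-bundle.yaml   # re-apply the updated Bundle
```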
Important notes
- You do not need to call the draft model directly in your inference API requests
- When you send requests to the target model (e.g., Custom-Target-Llama-3.3-70B), SambaStack automatically runs speculative decoding using the corresponding draft checkpoint
- Both checkpoints must be present in the Bundle for speculative decoding to work
- The steps for modifying the deployment to point to your custom draft checkpoint are similar to the steps for custom checkpoints. See Reference the checkpoint in your deployment yaml in the Deploying custom checkpoints page for details.
Verify deployment
After applying the Bundle, verify your deployment by sending an inference request to the target model and confirming a successful response.
Troubleshooting
| Issue | Possible Cause | Solution |
|---|---|---|
| Validation fails | Mismatched vocabulary sizes | Ensure draft and target use compatible tokenizers |
| Low acceptance rates | Poor draft-target alignment | Use a draft model more closely aligned with target |
| Performance degradation | Long inputs, short outputs | Consider disabling speculative decoding for this workload |
| Deployment fails | Missing draft checkpoint in Bundle | Ensure both checkpoints are defined in spec.checkpoints |
| Model not found | Model name mismatch | Verify models.<name> matches Model Manifest name |
Related pages
- Checkpoint Conversion Tool — Convert checkpoints to SambaNova format
- Deploying custom checkpoints — Deploy custom target checkpoints
- Model deployment — Bundle and deployment concepts
