Managing Training Artifacts

Learn how to handle training outputs, model files, and other artifacts in Trainwave.

Quick Start


# List artifacts from a job
wave storage list j-xyz789
 
# Download artifacts
wave storage download j-xyz789 --output ./results

Artifact Storage

Storage Structure

Trainwave automatically manages artifacts in the following directory structure within your job’s container:


/workspace/
├── artifacts/              # Main artifacts directory
│   ├── models/             # Trained models
│   ├── checkpoints/        # Training checkpoints
│   ├── logs/               # Training logs
│   └── results/            # Evaluation results
├── data/                   # Input data
└── src/                    # Your source code
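
For local testing outside a Trainwave job, it can help to recreate the same layout on your own machine. The following is a minimal sketch, not part of Trainwave itself: the helper name `ensure_artifact_dirs` and its `workspace` parameter are hypothetical, and the subdirectory names are copied from the layout above.

```python
from pathlib import Path

# Subdirectory names copied from the documented layout.
ARTIFACT_SUBDIRS = ("models", "checkpoints", "logs", "results")


def ensure_artifact_dirs(workspace="/workspace"):
    """Create any missing artifact subdirectories and return their paths.

    Hypothetical helper for local runs; inside a job the platform is
    described as managing this layout for you.
    """
    root = Path(workspace) / "artifacts"
    paths = {}
    for name in ARTIFACT_SUBDIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths[name] = path
    return paths
```

Passing a scratch directory as `workspace` lets the same training script run unchanged on a laptop and in a job.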

Saving Artifacts

Save your training outputs to the appropriate directories:


# PyTorch example (assumes model, optimizer, epoch, and loss come from your training loop)
import torch
 
# Save model
torch.save(model.state_dict(), '/workspace/artifacts/models/model.pt')
 
# Save checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, '/workspace/artifacts/checkpoints/checkpoint.pt')


# TensorFlow example (assumes model and optimizer come from your training code)
import tensorflow as tf
 
# Save model (recent Keras releases require a .keras or .h5 extension)
model.save('/workspace/artifacts/models/model.keras')
 
# Save checkpoint
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
checkpoint.save('/workspace/artifacts/checkpoints/ckpt')
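
If a job is interrupted while a file is being written, a truncated checkpoint can be left behind. One common way to guard against this is to write to a temporary file and rename it into place once the write has finished. The sketch below assumes this pattern is acceptable for your storage setup; `atomic_save` and its `write_fn` callback are hypothetical names, not a Trainwave API.

```python
import os
import tempfile
from pathlib import Path


def atomic_save(path, write_fn):
    """Write via a temporary file in the target directory, then rename it.

    write_fn is a callable that receives the temporary file name and
    writes the artifact there (hypothetical interface for illustration).
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    os.close(fd)  # write_fn reopens the file by name
    try:
        write_fn(tmp_name)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the partial file so it is never mistaken for a finished artifact.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return path
```

With PyTorch, a call could look like `atomic_save('/workspace/artifacts/checkpoints/checkpoint.pt', lambda tmp: torch.save(state, tmp))`, where `state` is the checkpoint dictionary.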

Artifact Management

CLI Commands


# List artifacts
wave storage list j-xyz789
 
# Download specific artifacts
wave storage download j-xyz789 \
  --include "*.pt" \
  --output ./models
 
# Download all artifacts
wave storage download j-xyz789 \
  --output ./results
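
The same commands can be driven from a Python script, for example in an evaluation pipeline. This is a rough sketch that uses only the subcommand and flags shown above; the function names are hypothetical, and it assumes the CLI returns a non-zero exit code on failure, which is not confirmed here.

```python
import subprocess


def build_download_command(job_id, output, include=None):
    """Assemble the argument list for `wave storage download`."""
    cmd = ["wave", "storage", "download", job_id]
    if include:
        cmd += ["--include", include]
    cmd += ["--output", str(output)]
    return cmd


def download_artifacts(job_id, output, include=None):
    """Run the download; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_download_command(job_id, output, include), check=True)
```

Passing the arguments as a list avoids shell quoting problems: a pattern such as `*.pt` reaches the CLI unchanged instead of being expanded by the local shell.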

Automatic Artifact Collection

Trainwave automatically collects:

Training logs (/workspace/artifacts/logs/)
Model files (/workspace/artifacts/models/)
Metrics and results (/workspace/artifacts/results/)
Environment information
Resource usage statistics
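
For metrics to land in the results directory, your training code has to write them there. A minimal sketch follows, assuming a plain JSON file is a suitable format; the helper name `save_metrics` and the `metrics.json` filename are illustrative choices, not something Trainwave requires.

```python
import json
from pathlib import Path


def save_metrics(metrics, results_dir="/workspace/artifacts/results",
                 filename="metrics.json"):
    """Write a dictionary of metrics as JSON and return the file path."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / filename
    with open(path, "w", encoding="utf-8") as f:
        # sort_keys keeps the file stable across runs, which makes diffs readable
        json.dump(metrics, f, indent=2, sort_keys=True)
    return path
```

Values must be JSON-serializable, so convert tensors or NumPy scalars to plain Python numbers first (for example with `float(loss)`).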

Integration with ML Frameworks

PyTorch


import torch
from pathlib import Path
 
class ModelCheckpoint:
    def __init__(self, model, optimizer, save_dir):
        self.model = model
        self.optimizer = optimizer
        self.save_dir = Path('/workspace/artifacts/checkpoints') / save_dir
        self.save_dir.mkdir(parents=True, exist_ok=True)
 
    def save(self, epoch, loss):
        checkpoint_path = self.save_dir / f'checkpoint_epoch_{epoch}.pt'
        torch.save({
            'epoch': epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'loss': loss,
        }, checkpoint_path)
 
    def load(self, checkpoint_path):
        # map_location lets a checkpoint saved on a GPU load on a CPU-only machine
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        return checkpoint['epoch'], checkpoint['loss']
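
To resume training, you first need to find the most recent checkpoint. The sketch below assumes files follow the `checkpoint_epoch_{epoch}.pt` naming used by the class above; `latest_checkpoint` is a hypothetical helper. It compares epoch numbers numerically, so epoch 10 ranks above epoch 9, which a plain alphabetical sort would get wrong.

```python
import re
from pathlib import Path

# Matches the filenames written by ModelCheckpoint.save above.
_EPOCH_PATTERN = re.compile(r"checkpoint_epoch_(\d+)\.pt")


def latest_checkpoint(checkpoint_dir):
    """Return the checkpoint path with the highest epoch, or None if none exist."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_epoch, best_path = -1, None
    for path in directory.iterdir():
        match = _EPOCH_PATTERN.fullmatch(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

The returned path can be passed straight to `ModelCheckpoint.load`; a `None` result means training should start from scratch.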

TensorFlow


import tensorflow as tf
import os
 
class TrainingCallback(tf.keras.callbacks.Callback):
    def __init__(self, checkpoint_dir):
        super().__init__()
        self.checkpoint_dir = os.path.join('/workspace/artifacts/checkpoints', checkpoint_dir)
        os.makedirs(self.checkpoint_dir, exist_ok=True)
 
    def on_epoch_end(self, epoch, logs=None):
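        # Keras passes a zero-based epoch index; recent Keras releases require the .weights.h5 suffix
        checkpoint_path = os.path.join(self.checkpoint_dir, f'checkpoint_epoch_{epoch}.weights.h5')
        self.model.save_weights(checkpoint_path)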

Best Practices

Organization

Use a consistent directory structure
Follow clear naming conventions
Separate different types of artifacts

Storage Efficiency

Compress large files where possible
Implement retention policies to clean up old artifacts
Use appropriate file formats (e.g., safetensors over raw pickle for models)
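
A simple retention policy keeps only the newest few checkpoints. The following sketch assumes the `checkpoint_epoch_{epoch}.pt` naming from the PyTorch example on this page; `prune_checkpoints` and its `keep` parameter are hypothetical names. It deletes files permanently, so try it on a copy first.

```python
import re
from pathlib import Path


def prune_checkpoints(checkpoint_dir, keep=3):
    """Delete all but the `keep` highest-epoch checkpoints; return deleted paths."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    pattern = re.compile(r"checkpoint_epoch_(\d+)\.pt")
    found = []
    for path in Path(checkpoint_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path))
    found.sort()  # ascending by epoch number
    stale = [path for _, path in found[:-keep]]
    for path in stale:
        path.unlink()
    return stale
```

Calling this right after each save keeps disk usage roughly constant over a long run. Files that do not match the pattern are left untouched.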

Versioning

Include version information in artifact filenames or metadata
Save the training configuration alongside model weights
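
One way to apply both points is to write a small JSON sidecar next to each model file. This is a sketch under stated assumptions: the `write_model_metadata` helper, the `<model file>.json` naming, and the metadata fields are illustrative choices, not a format Trainwave defines.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_model_metadata(model_path, config, version):
    """Write `<model file>.json` holding the version, config, and a checksum."""
    model_path = Path(model_path)
    digest = hashlib.sha256()
    with open(model_path, "rb") as f:
        # Hash in 1 MiB chunks so large model files are not read into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    metadata = {
        "version": version,
        "model_file": model_path.name,
        "sha256": digest.hexdigest(),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "config": config,
    }
    sidecar = model_path.parent / (model_path.name + ".json")
    with open(sidecar, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    return sidecar
```

The checksum makes it possible to confirm after download that a model file is the one the metadata describes.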

Troubleshooting

Storage Space


# Check storage usage
wave storage list j-xyz789
 
# Ensure your hdd_size_mb in trainwave.toml is large enough for your artifacts
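
To estimate how much space your artifacts need, you can measure the artifacts directory from inside the training script. A minimal sketch: `directory_size_mb` is a hypothetical helper, and it reports mebibytes (1024 * 1024 bytes); whether `hdd_size_mb` uses the same unit is an assumption to verify.

```python
from pathlib import Path


def directory_size_mb(root):
    """Return the total size of all regular files under root, in MiB."""
    total_bytes = sum(
        path.stat().st_size for path in Path(root).rglob("*") if path.is_file()
    )
    return total_bytes / (1024 * 1024)
```

Logging `directory_size_mb('/workspace/artifacts')` at the end of each epoch shows how fast usage grows, and therefore how much headroom the configured disk size leaves.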

Missing Artifacts


# Verify artifact paths
wave storage list j-xyz789
 
# Check job logs for save errors
wave jobs logs j-xyz789
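
Missing artifacts are easier to diagnose if the training script checks its own outputs before it exits. The sketch below treats a file as missing if it does not exist or is empty; `find_missing_artifacts` and the example paths are hypothetical.

```python
from pathlib import Path


def find_missing_artifacts(expected, artifacts_root="/workspace/artifacts"):
    """Return the expected relative paths that are absent or empty."""
    root = Path(artifacts_root)
    missing = []
    for relative in expected:
        path = root / relative
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing
```

For example, calling `find_missing_artifacts(['models/model.pt', 'results/metrics.json'])` at the end of training and raising an error on a non-empty result makes the failure show up in the job logs instead of at download time.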

Support

[email protected]