Managing Training Artifacts

Learn how to handle training outputs, model files, and other artifacts in Trainwave.

Quick Start

    # List artifacts from a job
    wave storage list j-xyz789

    # Download artifacts
    wave storage download j-xyz789 --output ./results

Artifact Storage

Storage Structure

Trainwave automatically manages artifacts in the following directory structure within your job’s container:

    /workspace/
    ├── artifacts/          # Main artifacts directory
    │   ├── models/         # Trained models
    │   ├── checkpoints/    # Training checkpoints
    │   ├── logs/           # Training logs
    │   └── results/        # Evaluation results
    ├── data/               # Input data
    └── src/                # Your source code
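The layout above can be created up front so that later save calls do not fail on a missing directory. This is a minimal sketch that assumes the subdirectories are not guaranteed to exist when your code starts; `ensure_artifact_dirs` and its `root` parameter are hypothetical names, not part of Trainwave.

```python
from pathlib import Path

# Subdirectory names taken from the layout described above.
ARTIFACT_SUBDIRS = ("models", "checkpoints", "logs", "results")


def ensure_artifact_dirs(root="/workspace/artifacts"):
    """Create the artifact subdirectories under `root` and return their paths.

    Hypothetical helper, not a Trainwave API.
    """
    root = Path(root)
    paths = {}
    for name in ARTIFACT_SUBDIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths[name] = path
    return paths
```

Calling `ensure_artifact_dirs()` once at the start of a training script returns a dictionary such as `paths["models"]` that the rest of the script can use instead of hard-coded strings.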

Saving Artifacts

Save your training outputs to the appropriate directories:

    # PyTorch example
    import torch

    # Save model
    torch.save(model.state_dict(), '/workspace/artifacts/models/model.pt')

    # Save checkpoint
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, '/workspace/artifacts/checkpoints/checkpoint.pt')

    # TensorFlow example
    import tensorflow as tf

    # Save model
    # Keras 3 requires a .keras (or .h5) extension on the target path.
    model.save('/workspace/artifacts/models/model.keras')

    # Save checkpoint
    checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
    checkpoint.save('/workspace/artifacts/checkpoints/ckpt')

Artifact Management

CLI Commands

    # List artifacts
    wave storage list j-xyz789

    # Download specific artifacts
    wave storage download j-xyz789 \
        --include "*.pt" \
        --output ./models

    # Download all artifacts
    wave storage download j-xyz789 \
        --output ./results
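If you want to fetch artifacts from a script rather than a shell, the download command can be wrapped in Python. The sketch below assumes the `wave` CLI is on your `PATH` and that it exits with a non-zero status on failure. It uses only the `--include` and `--output` flags shown above, and the function names are hypothetical.

```python
import subprocess


def build_download_command(job_id, output_dir, include=None):
    """Build the argument list for `wave storage download`.

    Only the flags shown in the commands above (--include, --output) are used.
    """
    cmd = ["wave", "storage", "download", job_id]
    if include:
        cmd += ["--include", include]
    cmd += ["--output", str(output_dir)]
    return cmd


def download_artifacts(job_id, output_dir, include=None):
    """Run the download; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_download_command(job_id, output_dir, include), check=True)
```

Passing the arguments as a list avoids shell quoting problems with patterns such as `*.pt`.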

Automatic Artifact Collection

Trainwave automatically collects:

  1. Training logs (/workspace/artifacts/logs/)
  2. Model files (/workspace/artifacts/models/)
  3. Metrics and results (/workspace/artifacts/results/)
  4. Environment information
  5. Resource usage statistics
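Because the results directory is collected automatically, writing evaluation metrics there is enough to have them stored with the job. The following is a minimal sketch; the JSON format, the `metrics.json` filename, and the `save_metrics` helper are assumptions for illustration, not something Trainwave requires.

```python
import json
from pathlib import Path


def save_metrics(metrics, filename="metrics.json",
                 results_dir="/workspace/artifacts/results"):
    """Write a dict of metrics as JSON into the results directory.

    Hypothetical helper; the file name and format are illustrative choices.
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / filename
    with path.open("w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2, sort_keys=True)
    return path
```

Metric values must be JSON-serializable, so convert tensors or NumPy scalars to plain Python numbers (for example with `float(...)`) before passing them in.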

Integration with ML Frameworks

PyTorch

    import torch
    from pathlib import Path

    class ModelCheckpoint:
        def __init__(self, model, optimizer, save_dir):
            self.model = model
            self.optimizer = optimizer
            # save_dir is a subdirectory name under the checkpoints directory
            self.save_dir = Path('/workspace/artifacts/checkpoints') / save_dir
            self.save_dir.mkdir(parents=True, exist_ok=True)

        def save(self, epoch, loss):
            # One file per epoch, e.g. checkpoint_epoch_3.pt
            checkpoint_path = self.save_dir / f'checkpoint_epoch_{epoch}.pt'
            torch.save({
                'epoch': epoch,
                'model_state_dict': self.model.state_dict(),
                'optimizer_state_dict': self.optimizer.state_dict(),
                'loss': loss,
            }, checkpoint_path)

        def load(self, checkpoint_path):
            # Restore model and optimizer state in place
            checkpoint = torch.load(checkpoint_path)
            self.model.load_state_dict(checkpoint['model_state_dict'])
            self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
            return checkpoint['epoch'], checkpoint['loss']
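To resume training you need to know which checkpoint is the most recent. The sketch below relies only on the `checkpoint_epoch_{epoch}.pt` naming used in the class above and picks the file with the highest epoch number; `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path

# Matches the file names written by ModelCheckpoint.save above.
_EPOCH_PATTERN = re.compile(r"checkpoint_epoch_(\d+)\.pt$")


def latest_checkpoint(save_dir):
    """Return the checkpoint path with the highest epoch number, or None."""
    best_epoch, best_path = -1, None
    for path in Path(save_dir).glob("checkpoint_epoch_*.pt"):
        match = _EPOCH_PATTERN.search(path.name)
        # Compare epochs numerically so that epoch 10 sorts after epoch 9.
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

With a `ModelCheckpoint` instance `ckpt`, you could call `latest_checkpoint(ckpt.save_dir)` and, if the result is not `None`, pass it to `ckpt.load(...)` to recover the epoch and loss.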

TensorFlow

    import tensorflow as tf
    import os

    class TrainingCallback(tf.keras.callbacks.Callback):
        def __init__(self, checkpoint_dir):
            super().__init__()
            # checkpoint_dir is a subdirectory name under the checkpoints directory
            self.checkpoint_dir = os.path.join('/workspace/artifacts/checkpoints', checkpoint_dir)
            os.makedirs(self.checkpoint_dir, exist_ok=True)

        def on_epoch_end(self, epoch, logs=None):
            # Keras 3 requires weight filenames to end in .weights.h5
            checkpoint_path = os.path.join(self.checkpoint_dir, f'checkpoint_epoch_{epoch}.weights.h5')
            self.model.save_weights(checkpoint_path)

Best Practices

Organization

  • Use a consistent directory structure
  • Follow clear naming conventions
  • Separate different types of artifacts

Storage Efficiency

  • Compress large files where possible
  • Implement retention policies to clean up old artifacts
  • Use appropriate file formats (e.g., safetensors over raw pickle for models)
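Two of the points above, retention and compression, can be sketched with the standard library alone. This is an illustration under stated assumptions: checkpoints are `.pt` files, "old" means least recently modified, and both function names are hypothetical.

```python
import tarfile
from pathlib import Path


def prune_old_checkpoints(checkpoint_dir, keep_last=3, pattern="*.pt"):
    """Delete all but the `keep_last` most recently modified matching files.

    Returns the list of deleted paths.
    """
    files = sorted(Path(checkpoint_dir).glob(pattern),
                   key=lambda p: p.stat().st_mtime)
    # files[:-0] would be empty, so handle keep_last == 0 explicitly.
    to_delete = files[:-keep_last] if keep_last > 0 else files
    for path in to_delete:
        path.unlink()
    return to_delete


def compress_directory(source_dir, archive_path):
    """Pack `source_dir` into a gzip-compressed tar archive."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=source_dir.name)
    return Path(archive_path)
```

Write the archive outside `source_dir`, otherwise the partially written archive would be picked up while the directory is being packed.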

Versioning

  • Include version information in artifact filenames or metadata
  • Save the training configuration alongside model weights
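One way to follow both points is to give the weights file and its configuration file the same versioned stem, so the two can always be matched up later. The naming scheme (`<name>_v<version>.pt` plus `<name>_v<version>.config.json`) and the helper names below are assumptions for illustration.

```python
import json
from pathlib import Path


def versioned_paths(name, version, models_dir="/workspace/artifacts/models"):
    """Return (weights_path, config_path) sharing one versioned stem."""
    stem = f"{name}_v{version}"
    models_dir = Path(models_dir)
    return models_dir / f"{stem}.pt", models_dir / f"{stem}.config.json"


def save_training_config(config, config_path):
    """Write the training configuration as JSON next to the model weights."""
    config_path = Path(config_path)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2, sort_keys=True),
                           encoding="utf-8")
    return config_path
```

You would then save the weights to the first returned path (for example with `torch.save`) and pass the second path to `save_training_config` together with your hyperparameters.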

Troubleshooting

Storage Space

    # Check storage usage
    wave storage list j-xyz789

    # Ensure your hdd_size_mb in trainwave.toml is large enough for your artifacts
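You can also measure artifact size from inside the job and compare it with the value you configured. The sketch below takes the `hdd_size_mb` value as a plain argument rather than reading `trainwave.toml`, treats a megabyte as 1024 × 1024 bytes, and uses an 80% warning threshold. All three are assumptions, and the function names are hypothetical.

```python
import os


def directory_size_mb(root):
    """Return the total size of regular files under `root` in MB (1024 * 1024 bytes)."""
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks to avoid double counting
                total_bytes += os.path.getsize(path)
    return total_bytes / (1024 * 1024)


def exceeds_budget(root, hdd_size_mb, fraction=0.8):
    """True when files under `root` use more than `fraction` of `hdd_size_mb`."""
    return directory_size_mb(root) > hdd_size_mb * fraction
```

Calling `exceeds_budget("/workspace/artifacts", hdd_size_mb)` after each checkpoint gives an early warning before the disk actually fills up.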

Missing Artifacts

    # Verify artifact paths
    wave storage list j-xyz789

    # Check job logs for save errors
    wave jobs logs j-xyz789
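Missing artifacts are easier to diagnose if the job itself checks for them before it exits. This is a minimal sketch in which the list of expected files is something you supply and `missing_artifacts` is a hypothetical helper. It reports files that are absent or empty under the artifacts directory.

```python
from pathlib import Path


def missing_artifacts(expected, root="/workspace/artifacts"):
    """Return the relative paths in `expected` that are absent or empty under `root`."""
    root = Path(root)
    missing = []
    for rel in expected:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

For example, calling `missing_artifacts(["models/model.pt"])` at the end of training and printing the result puts any problem into the job logs, where `wave jobs logs` will show it.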

Support

[email protected]
