Troubleshooting

Common issues and solutions.

Installation Issues

Import Error: No module named 'worldflux'

ModuleNotFoundError: No module named 'worldflux'

Solution: Install WorldFlux in development mode:

cd worldflux
uv sync

Missing training dependencies

ModuleNotFoundError: No module named 'worldflux.training'

Solution: Install with training extras:

uv sync --extra training

CUDA not available

>>> import torch
>>> torch.cuda.is_available()
False

Solution:

  1. Check NVIDIA drivers: nvidia-smi
  2. Reinstall PyTorch with CUDA: uv pip install torch --index-url https://download.pytorch.org/whl/cu118

Plotting Issues

Matplotlib backend errors (headless servers)

If you see errors like cannot connect to X server, use a non-interactive backend:

export MPLBACKEND=Agg
export MPLCONFIGDIR=/tmp/matplotlib

The example scripts set these automatically when plotting.
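If you prefer to handle this inside your own scripts, the same variables can be set from Python, provided it happens before matplotlib is first imported. A minimal stdlib-only sketch:

```python
import os

# Must run before matplotlib is imported anywhere in the process.
os.environ.setdefault("MPLBACKEND", "Agg")                # non-interactive backend
os.environ.setdefault("MPLCONFIGDIR", "/tmp/matplotlib")  # writable config/cache dir
```

Using setdefault keeps any backend you already exported in the shell.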

Memory Issues

CUDA Out of Memory

RuntimeError: CUDA out of memory

Solutions:

  1. Use a smaller model:

     model = create_world_model("dreamerv3:size12m")  # Instead of size200m

  2. Reduce the batch size:

     config = TrainingConfig(batch_size=8)  # Instead of 16

  3. Reduce the sequence length:

     config = TrainingConfig(sequence_length=25)  # Instead of 50

  4. Enable mixed precision on CUDA:

     config = TrainingConfig(mixed_precision=True)

  5. Use the CPU for smaller experiments:

     model = create_world_model("dreamerv3:size12m", device="cpu")
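Before turning knobs blindly, a back-of-envelope estimate shows which one matters most: activation memory scales linearly with batch size, sequence length, and latent width. All sizes below are hypothetical placeholders, not WorldFlux defaults.

```python
# Rough activation-memory estimate for one batch (hypothetical sizes).
batch_size = 16
seq_len = 50
latent_dim = 1024
bytes_per_float = 4  # float32; roughly halved with mixed precision

approx_mb = batch_size * seq_len * latent_dim * bytes_per_float / 2**20
print(f"~{approx_mb:.3f} MB per latent activation tensor")
```

Halving batch_size or seq_len halves the estimate linearly, so either is usually the first fix to try.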

Memory grows during training

Cause: Logged tensors keep their autograd graph alive, so memory grows every step.

Solution: Ensure losses are detached for logging:

# Bad
metrics["loss"] = loss

# Good
metrics["loss"] = loss.item()

Training Issues

Loss is NaN

Causes:

  • Learning rate too high
  • Gradient explosion
  • Bad data (NaN/Inf values)

Solutions:

  1. Reduce the learning rate:

     config = TrainingConfig(learning_rate=1e-4)

  2. Enable gradient clipping:

     config = TrainingConfig(grad_clip=100.0)  # Default

  3. Check the data for NaN/Inf:

     import numpy as np

     data = np.load("data.npz")
     for key in data:
         assert not np.isnan(data[key]).any(), f"NaN in {key}"
         assert not np.isinf(data[key]).any(), f"Inf in {key}"
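If the check fails, NumPy can locate the bad values and, when dropping the data isn't an option, sanitize them. A sketch on a hypothetical array; `np.nan_to_num` replaces NaN with 0 and clamps infinities to finite bounds:

```python
import numpy as np

# Hypothetical array containing bad values.
arr = np.array([1.0, np.nan, np.inf, -np.inf, 2.0])

bad = ~np.isfinite(arr)
print(f"{bad.sum()} bad values at indices {np.flatnonzero(bad)}")

# Replace NaN with 0.0 and clamp +/-Inf to large finite numbers.
clean = np.nan_to_num(arr, nan=0.0, posinf=1e6, neginf=-1e6)
assert np.isfinite(clean).all()
```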

Loss not decreasing

Causes:

  • Insufficient data
  • Learning rate too low
  • Model too small

Solutions:

  1. Collect more data (100+ episodes)
  2. Increase the learning rate:

     config = TrainingConfig(learning_rate=1e-3)

  3. Use a larger model:

     model = create_world_model("dreamerv3:size50m")

  4. Train longer:

     config = TrainingConfig(total_steps=200_000)

KL loss dominates (DreamerV3)

Solution: Adjust KL balancing:

model = create_world_model(
    "dreamerv3:size50m",
    kl_free=1.0,     # Allow some free nats
    kl_balance=0.8,  # Balance prior/posterior
)

Model Issues

TD-MPC2: stochastic is None

This is expected. TD-MPC2 is an implicit model without stochastic state:

state = tdmpc.encode(obs)
print(state.tensors["latent"]) # SimNorm embedding

# Use latent for features
features = state.tensors["latent"]

TD-MPC2: trajectory.continues is None

This is expected. TD-MPC2 doesn't predict episode continuation:

trajectory = tdmpc.rollout(state, actions)
print(trajectory.continues) # None - this is normal!
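Downstream code that expects a continuation signal should guard for None rather than crash. A minimal sketch with a hypothetical horizon, falling back to "never terminates" inside the imagined rollout:

```python
# trajectory.continues is None for TD-MPC2; guard before using it.
continues = None  # what TD-MPC2 returns
horizon = 10      # hypothetical rollout length

if continues is None:
    # Assume episodes never terminate within the imagined horizon.
    continues = [1.0] * horizon

# Example use: fold the continuation signal into per-step discounts.
discounts = [0.99 * c for c in continues]
```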

DreamerV3: Blurry reconstructions

Causes:

  • Model too small
  • Insufficient training
  • KL too high

Solutions:

  1. Use larger model
  2. Train longer
  3. Reduce KL weight:
model = create_world_model("dreamerv3:size50m", kl_balance=0.5)

Data Issues

ReplayBuffer sample error

ValueError: Not enough data to sample

Cause: Buffer has fewer transitions than batch_size * seq_len.

Solution:

# Check buffer size
print(f"Buffer size: {len(buffer)}")
print(f"Required: {batch_size * seq_len}")

# Collect more data or reduce batch/seq
batch = buffer.sample(batch_size=4, seq_len=10)
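The same arithmetic can gate sampling so training only starts once enough transitions exist, or shrink the batch to fit the data on hand. A plain-Python sketch with hypothetical numbers (`buffer_len` stands in for `len(buffer)`):

```python
# Gate sampling on buffer size (hypothetical numbers).
batch_size, seq_len = 16, 50
buffer_len = 300  # e.g. len(buffer)

required = batch_size * seq_len
if buffer_len < required:
    # Shrink the batch so sampling can proceed with the data we have.
    batch_size = max(1, buffer_len // seq_len)
print(f"sampling with batch_size={batch_size}")
```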

Episode boundary issues

Symptom: Model learns wrong transitions at episode boundaries.

Solution: Ensure proper dones array:

buffer.add_episode(
    obs=obs_array,
    actions=action_array,
    rewards=reward_array,
    dones=done_array,  # True at episode end, False elsewhere
)
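A common source of boundary bugs is a dones array that never flags the terminal step (or flags every step). For a single episode the correct array has exactly one True, at the end; a minimal NumPy sketch with a hypothetical episode length:

```python
import numpy as np

T = 100  # hypothetical episode length

# True only at the final transition of the episode.
dones = np.zeros(T, dtype=bool)
dones[-1] = True

assert dones.sum() == 1 and dones[-1]
```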

Loading Issues

Can't load saved model

FileNotFoundError: config.json not found

Solution: Check model directory structure:

my_model/
├── config.json # Must exist
└── model.pt # Must exist
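A quick stdlib check for the layout above; the directory name is the one you passed to the loader:

```python
from pathlib import Path

model_dir = Path("my_model")  # directory you passed to the loader
required = ["config.json", "model.pt"]

missing = [name for name in required if not (model_dir / name).is_file()]
if missing:
    print(f"{model_dir} is missing: {', '.join(missing)}")
```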

Version mismatch

KeyError: 'new_parameter'

Cause: Model saved with different WorldFlux version.

Solution:

# Load with strict=False to tolerate added/removed parameters
import torch

state_dict = torch.load("model.pt", map_location="cpu")
model.load_state_dict(state_dict, strict=False)

Performance Issues

Training too slow

Solutions:

  1. Use a GPU:

     model = create_world_model(..., device="cuda")

  2. Increase the batch size (if memory allows):

     config = TrainingConfig(batch_size=64)

  3. Use DataLoader workers:

     # In a custom training loop
     dataloader = DataLoader(dataset, num_workers=4)

Imagination too slow

Solutions:

  1. Use torch.no_grad():

     with torch.no_grad():
         trajectory = model.rollout(state, actions)

  2. Batch multiple rollouts:

     # Instead of multiple single rollouts, do one batched rollout
     states = model.encode(obs_batch)             # [B, ...]
     trajectory = model.rollout(states, actions)  # Batched

Getting Help

If your issue isn't listed here:

  1. Check GitHub Issues
  2. Search existing issues for similar problems
  3. Open a new issue with:
    • WorldFlux version (worldflux.__version__)
    • Python version
    • PyTorch version
    • Full error traceback
    • Minimal reproduction code