Troubleshooting

Common issues and solutions.

Installation Issues

Import Error: No module named 'worldflux'

ModuleNotFoundError: No module named 'worldflux'

Solution: Install WorldFlux in development mode:

cd worldflux
uv sync

Missing training dependencies

ModuleNotFoundError: No module named 'worldflux.training'

Solution: Install with training extras:

uv sync --extra training

CUDA not available

>>> import torch
>>> torch.cuda.is_available()
False

Solution:

  1. Check NVIDIA drivers: nvidia-smi
  2. Reinstall PyTorch with CUDA: uv pip install torch --index-url https://download.pytorch.org/whl/cu118

Plotting Issues

Matplotlib backend errors (headless servers)

If you see errors like cannot connect to X server, use a non-interactive backend:

export MPLBACKEND=Agg
export MPLCONFIGDIR=/tmp/matplotlib

The example scripts set these automatically when plotting.
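If you prefer to handle this inside your own scripts, the same variables can be set from Python, provided it happens before matplotlib is first imported. A minimal stdlib-only sketch:

```python
import os

# Must run before matplotlib is imported anywhere in the process.
os.environ.setdefault("MPLBACKEND", "Agg")                # non-interactive backend
os.environ.setdefault("MPLCONFIGDIR", "/tmp/matplotlib")  # writable config/cache dir
```

Using setdefault keeps any backend you already exported in the shell.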

Memory Issues

CUDA Out of Memory

RuntimeError: CUDA out of memory

Solutions:

  1. Use a smaller model:

     model = create_world_model("dreamerv3:size12m")  # Instead of size200m

  2. Reduce the batch size:

     config = TrainingConfig(batch_size=8)  # Instead of 16

  3. Reduce the sequence length:

     config = TrainingConfig(sequence_length=25)  # Instead of 50

  4. Enable mixed precision on CUDA:

     config = TrainingConfig(mixed_precision=True)

  5. Use the CPU for smaller experiments:

     model = create_world_model("dreamerv3:size12m", device="cpu")
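Before turning knobs blindly, a back-of-envelope estimate shows which one matters most: activation memory scales linearly with batch size, sequence length, and latent width. All sizes below are hypothetical placeholders, not WorldFlux defaults.

```python
# Rough activation-memory estimate for one batch (hypothetical sizes).
batch_size = 16
seq_len = 50
latent_dim = 1024
bytes_per_float = 4  # float32; roughly halved with mixed precision

approx_mb = batch_size * seq_len * latent_dim * bytes_per_float / 2**20
print(f"~{approx_mb:.3f} MB per latent activation tensor")
```

Halving batch_size or seq_len halves the estimate linearly, so either is usually the first fix to try.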

Memory grows during training

Cause: Logged tensors keep their autograd graph alive, so memory grows every step.

Solution: Ensure losses are detached for logging:

# Bad
metrics["loss"] = loss

# Good
metrics["loss"] = loss.item()

Training Issues

Loss is NaN

Causes:

  • Learning rate too high
  • Gradient explosion
  • Bad data (NaN/Inf values)

Solutions:

  1. Reduce the learning rate:

     config = TrainingConfig(learning_rate=1e-4)

  2. Enable gradient clipping:

     config = TrainingConfig(grad_clip=100.0)  # Default

  3. Check the data for NaN/Inf:

     import numpy as np

     data = np.load("data.npz")
     for key in data:
         assert not np.isnan(data[key]).any(), f"NaN in {key}"
         assert not np.isinf(data[key]).any(), f"Inf in {key}"
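If the check fails, NumPy can locate the bad values and, when dropping the data isn't an option, sanitize them. A sketch on a hypothetical array; `np.nan_to_num` replaces NaN with 0 and clamps infinities to finite bounds:

```python
import numpy as np

# Hypothetical array containing bad values.
arr = np.array([1.0, np.nan, np.inf, -np.inf, 2.0])

bad = ~np.isfinite(arr)
print(f"{bad.sum()} bad values at indices {np.flatnonzero(bad)}")

# Replace NaN with 0.0 and clamp +/-Inf to large finite numbers.
clean = np.nan_to_num(arr, nan=0.0, posinf=1e6, neginf=-1e6)
assert np.isfinite(clean).all()
```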

Loss not decreasing

Causes:

  • Insufficient data
  • Learning rate too low
  • Model too small

Solutions:

  1. Collect more data (100+ episodes)
  2. Increase the learning rate:

     config = TrainingConfig(learning_rate=1e-3)

  3. Use a larger model:

     model = create_world_model("dreamerv3:size50m")

  4. Train longer:

     config = TrainingConfig(total_steps=200_000)

KL loss dominates (DreamerV3)

Solution: Adjust KL balancing:

model = create_world_model(
    "dreamerv3:size50m",
    kl_free=1.0,     # Allow some free nats
    kl_balance=0.8,  # Balance prior/posterior
)

Model Issues

TD-MPC2: stochastic is None

This is expected. TD-MPC2 is an implicit model without stochastic state:

state = tdmpc.encode(obs)
print(state.tensors["latent"]) # SimNorm embedding

# Use latent for features
features = state.tensors["latent"]

TD-MPC2: trajectory.continues is None

This is expected. TD-MPC2 doesn't predict episode continuation:

trajectory = tdmpc.rollout(state, actions)
print(trajectory.continues) # None - this is normal!
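Downstream code that expects a continuation signal should guard for None rather than crash. A minimal sketch with a hypothetical horizon, falling back to "never terminates" inside the imagined rollout:

```python
# trajectory.continues is None for TD-MPC2; guard before using it.
continues = None  # what TD-MPC2 returns
horizon = 10      # hypothetical rollout length

if continues is None:
    # Assume episodes never terminate within the imagined horizon.
    continues = [1.0] * horizon

# Example use: fold the continuation signal into per-step discounts.
discounts = [0.99 * c for c in continues]
```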

DreamerV3: Blurry reconstructions

Causes:

  • Model too small
  • Insufficient training
  • KL too high

Solutions:

  1. Use larger model
  2. Train longer
  3. Reduce KL weight:
model = create_world_model("dreamerv3:size50m", kl_balance=0.5)

Data Issues

ReplayBuffer sample error

ValueError: Not enough data to sample

Cause: Buffer has fewer transitions than batch_size * seq_len.

Solution:

# Check buffer size
print(f"Buffer size: {len(buffer)}")
print(f"Required: {batch_size * seq_len}")

# Collect more data or reduce batch/seq
batch = buffer.sample(batch_size=4, seq_len=10)
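The same arithmetic can gate sampling so training only starts once enough transitions exist, or shrink the batch to fit the data on hand. A plain-Python sketch with hypothetical numbers (`buffer_len` stands in for `len(buffer)`):

```python
# Gate sampling on buffer size (hypothetical numbers).
batch_size, seq_len = 16, 50
buffer_len = 300  # e.g. len(buffer)

required = batch_size * seq_len
if buffer_len < required:
    # Shrink the batch so sampling can proceed with the data we have.
    batch_size = max(1, buffer_len // seq_len)
print(f"sampling with batch_size={batch_size}")
```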

Episode boundary issues

Symptom: Model learns wrong transitions at episode boundaries.

Solution: Ensure proper dones array:

buffer.add_episode(
    obs=obs_array,
    actions=action_array,
    rewards=reward_array,
    dones=done_array,  # True at episode end, False elsewhere
)
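A common source of boundary bugs is a dones array that never flags the terminal step (or flags every step). For a single episode the correct array has exactly one True, at the end; a minimal NumPy sketch with a hypothetical episode length:

```python
import numpy as np

T = 100  # hypothetical episode length

# True only at the final transition of the episode.
dones = np.zeros(T, dtype=bool)
dones[-1] = True

assert dones.sum() == 1 and dones[-1]
```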

Loading Issues

Can't load saved model

FileNotFoundError: config.json not found

Solution: Check model directory structure:

my_model/
├── config.json # Must exist
└── model.pt # Must exist
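A quick stdlib check for the layout above; the directory name is the one you passed to the loader:

```python
from pathlib import Path

model_dir = Path("my_model")  # directory you passed to the loader
required = ["config.json", "model.pt"]

missing = [name for name in required if not (model_dir / name).is_file()]
if missing:
    print(f"{model_dir} is missing: {', '.join(missing)}")
```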

Version mismatch

KeyError: 'new_parameter'

Cause: Model saved with different WorldFlux version.

Solution:

# Load with strict=False to tolerate added/removed parameters
import torch

state_dict = torch.load("model.pt", map_location="cpu")
model.load_state_dict(state_dict, strict=False)

Performance Issues

Training too slow

Solutions:

  1. Use a GPU:

     model = create_world_model(..., device="cuda")

  2. Increase the batch size (if memory allows):

     config = TrainingConfig(batch_size=64)

  3. Use DataLoader workers:

     # In a custom training loop
     dataloader = DataLoader(dataset, num_workers=4)

Imagination too slow

Solutions:

  1. Use torch.no_grad():

     with torch.no_grad():
         trajectory = model.rollout(state, actions)

  2. Batch multiple rollouts:

     # Instead of multiple single rollouts, do one batched rollout
     states = model.encode(obs_batch)             # [B, ...]
     trajectory = model.rollout(states, actions)  # Batched

Getting Help

If your issue isn't listed here:

  1. Check GitHub Issues
  2. Search existing issues for similar problems
  3. Open a new issue with:
    • WorldFlux version (worldflux.__version__)
    • Python version
    • PyTorch version
    • Full error traceback
    • Minimal reproduction code