# Troubleshooting

Common issues and solutions.
## Installation Issues

### Import Error: No module named 'worldflux'

```
ModuleNotFoundError: No module named 'worldflux'
```

**Solution:** Install WorldFlux in development mode:

```bash
cd worldflux
uv sync
```
### Missing training dependencies

```
ModuleNotFoundError: No module named 'worldflux.training'
```

**Solution:** Install with the training extras:

```bash
uv sync --extra training
```
### CUDA not available

```python
>>> import torch
>>> torch.cuda.is_available()
False
```

**Solutions:**

- Check NVIDIA drivers:

  ```bash
  nvidia-smi
  ```

- Reinstall PyTorch with CUDA support:

  ```bash
  uv pip install torch --index-url https://download.pytorch.org/whl/cu118
  ```
## Plotting Issues

### Matplotlib backend errors (headless servers)

If you see errors like `cannot connect to X server`, use a non-interactive backend:

```bash
export MPLBACKEND=Agg
export MPLCONFIGDIR=/tmp/matplotlib
```

The example scripts set these automatically when plotting.
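The same backend selection can be made from Python, as long as it happens before `matplotlib.pyplot` is imported (a minimal sketch; it only touches the process environment):

```python
import os

# Select the non-interactive Agg backend before matplotlib is imported.
# setdefault() keeps any backend the user has already chosen.
os.environ.setdefault("MPLBACKEND", "Agg")
os.environ.setdefault("MPLCONFIGDIR", "/tmp/matplotlib")

# Any later `import matplotlib.pyplot as plt` will now render
# off-screen instead of trying to open an X display.
```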
## Memory Issues

### CUDA Out of Memory

```
RuntimeError: CUDA out of memory
```

**Solutions:**

- Use a smaller model:

  ```python
  model = create_world_model("dreamerv3:size12m")  # Instead of size200m
  ```

- Reduce the batch size:

  ```python
  config = TrainingConfig(batch_size=8)  # Instead of 16
  ```

- Reduce the sequence length:

  ```python
  config = TrainingConfig(sequence_length=25)  # Instead of 50
  ```

- Enable mixed precision on CUDA:

  ```python
  config = TrainingConfig(mixed_precision=True)
  ```

- Use the CPU for smaller experiments:

  ```python
  model = create_world_model("dreamerv3:size12m", device="cpu")
  ```
### Memory grows during training

**Cause:** Tensors that are still attached to the autograd graph accumulate in memory.

**Solution:** Ensure losses are detached for logging:

```python
# Bad: storing the tensor keeps the whole computation graph alive
metrics["loss"] = loss

# Good: .item() extracts a plain Python float
metrics["loss"] = loss.item()
```
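If many metrics are logged, a small helper (hypothetical, not part of WorldFlux) can do the detaching in one place:

```python
def to_scalar(value):
    """Convert a tensor-like value to a plain Python float for logging.

    Calling .item() (when available) detaches the value from the
    autograd graph, so the graph can be freed after each step.
    """
    item = getattr(value, "item", None)
    return item() if callable(item) else float(value)

# Usage in a training loop (sketch):
# metrics = {name: to_scalar(t) for name, t in raw_metrics.items()}
```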
## Training Issues

### Loss is NaN

**Causes:**

- Learning rate too high
- Gradient explosion
- Bad data (NaN/Inf values)

**Solutions:**

- Reduce the learning rate:

  ```python
  config = TrainingConfig(learning_rate=1e-4)
  ```

- Enable gradient clipping:

  ```python
  config = TrainingConfig(grad_clip=100.0)  # Default
  ```

- Check the data for NaN/Inf values:

  ```python
  import numpy as np

  data = np.load("data.npz")
  for key in data:
      assert not np.isnan(data[key]).any(), f"NaN in {key}"
      assert not np.isinf(data[key]).any(), f"Inf in {key}"
  ```
### Loss not decreasing

**Causes:**

- Insufficient data
- Learning rate too low
- Model too small

**Solutions:**

- Collect more data (100+ episodes)
- Increase the learning rate:

  ```python
  config = TrainingConfig(learning_rate=1e-3)
  ```

- Use a larger model:

  ```python
  model = create_world_model("dreamerv3:size50m")
  ```

- Train longer:

  ```python
  config = TrainingConfig(total_steps=200_000)
  ```
### KL loss dominates (DreamerV3)

**Solution:** Adjust KL balancing:

```python
model = create_world_model(
    "dreamerv3:size50m",
    kl_free=1.0,     # Allow some free nats
    kl_balance=0.8,  # Balance prior/posterior
)
```
## Model Issues

### TD-MPC2: `stochastic` is None

This is expected. TD-MPC2 is an implicit model without a stochastic state:

```python
state = tdmpc.encode(obs)
print(state.tensors["latent"])  # SimNorm embedding

# Use the latent as the feature vector
features = state.tensors["latent"]
```
### TD-MPC2: `trajectory.continues` is None

This is expected. TD-MPC2 doesn't predict episode continuation:

```python
trajectory = tdmpc.rollout(state, actions)
print(trajectory.continues)  # None - this is normal!
```
### DreamerV3: Blurry reconstructions

**Causes:**

- Model too small
- Insufficient training
- KL weight too high

**Solutions:**

- Use a larger model
- Train longer
- Reduce the KL weight:

  ```python
  model = create_world_model("dreamerv3:size50m", kl_balance=0.5)
  ```
## Data Issues

### ReplayBuffer sample error

```
ValueError: Not enough data to sample
```

**Cause:** The buffer holds fewer transitions than `batch_size * seq_len`.

**Solution:**

```python
# Check the buffer size against what sampling requires
print(f"Buffer size: {len(buffer)}")
print(f"Required: {batch_size * seq_len}")

# Collect more data, or reduce the batch size / sequence length
batch = buffer.sample(batch_size=4, seq_len=10)
```
### Episode boundary issues

**Symptom:** The model learns incorrect transitions across episode boundaries.

**Solution:** Ensure the `dones` array is set correctly:

```python
buffer.add_episode(
    obs=obs_array,
    actions=action_array,
    rewards=reward_array,
    dones=done_array,  # True at episode end, False elsewhere
)
```
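For a complete episode, the `dones` array can be built with NumPy like this (a minimal sketch; `T` is a hypothetical episode length):

```python
import numpy as np

T = 50  # hypothetical episode length

# Mark only the final transition as terminal
done_array = np.zeros(T, dtype=bool)
done_array[-1] = True
```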
## Loading Issues

### Can't load saved model

```
FileNotFoundError: config.json not found
```

**Solution:** Check the model directory structure:

```
my_model/
├── config.json   # Must exist
└── model.pt      # Must exist
```
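A quick way to verify the structure before loading is a small `pathlib` check (a hypothetical helper, not part of WorldFlux):

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return the required files that are absent from a saved-model directory."""
    required = ("config.json", "model.pt")
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

If the returned list is non-empty, the directory is incomplete and loading will fail.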
### Version mismatch

```
KeyError: 'new_parameter'
```

**Cause:** The model was saved with a different WorldFlux version.

**Solution:**

```python
# Load with strict=False to tolerate missing/unexpected keys
import torch

state_dict = torch.load("model.pt")
model.load_state_dict(state_dict, strict=False)
```
## Performance Issues

### Training too slow

**Solutions:**

- Use a GPU:

  ```python
  model = create_world_model(..., device="cuda")
  ```

- Increase the batch size (if memory allows):

  ```python
  config = TrainingConfig(batch_size=64)
  ```

- Use DataLoader workers:

  ```python
  # In a custom training loop
  dataloader = DataLoader(dataset, num_workers=4)
  ```
### Imagination too slow

**Solutions:**

- Use `torch.no_grad()`:

  ```python
  with torch.no_grad():
      trajectory = model.rollout(state, actions)
  ```

- Batch multiple rollouts:

  ```python
  # Instead of many single rollouts, do one batched rollout
  states = model.encode(obs_batch)             # [B, ...]
  trajectory = model.rollout(states, actions)  # Batched
  ```
## Getting Help

If your issue isn't listed here:

- Check GitHub Issues
- Search existing issues for similar problems
- Open a new issue with:
  - WorldFlux version (`worldflux.__version__`)
  - Python version
  - PyTorch version
  - Full error traceback
  - Minimal reproduction code