Troubleshooting¶
Common issues and solutions.
Installation Issues¶
Import Error: No module named 'worldflux'¶
Solution: Install WorldFlux in development mode:
Missing training dependencies¶
Solution: Install with training extras:
CUDA not available¶
Solution:
1. Check NVIDIA drivers: nvidia-smi
2. Reinstall PyTorch with CUDA: uv pip install torch --index-url https://download.pytorch.org/whl/cu118
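After reinstalling, it can help to confirm what PyTorch itself reports. The following is a minimal sketch using standard PyTorch calls; the helper name cuda_report is purely illustrative and not part of WorldFlux.

```python
import torch


def cuda_report() -> dict:
    """Collect basic facts about the installed PyTorch build and visible GPUs."""
    return {
        "torch_version": torch.__version__,
        # None here means the installed wheel is a CPU-only build
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


report = cuda_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If built_with_cuda is None, the wheel is CPU-only and a reinstall is needed. If it shows a CUDA version but cuda_available is False, the driver is the more likely culprit.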
Plotting Issues¶
Matplotlib backend errors (headless servers)¶
If you see errors like "cannot connect to X server", switch Matplotlib to a non-interactive backend such as Agg.
The example scripts configure this automatically when plotting.
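For your own scripts, one way to select a non-interactive backend is Matplotlib's MPLBACKEND environment variable. This is a minimal sketch; the helper name force_headless_backend is hypothetical, and the example scripts may use a different mechanism.

```python
import os


def force_headless_backend(env=os.environ) -> str:
    """Default Matplotlib to the file-only Agg backend unless one is already chosen."""
    # setdefault keeps an explicit user choice such as MPLBACKEND=pdf
    return env.setdefault("MPLBACKEND", "Agg")


# Call this before matplotlib is imported for the first time
backend = force_headless_backend()
print(f"Matplotlib backend: {backend}")
```

With Agg active, save figures to files with savefig rather than calling show, since no window can be opened.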
Memory Issues¶
CUDA Out of Memory¶
Solutions:
- Use a smaller model.
- Reduce the batch size.
- Reduce the sequence length.
- Enable mixed precision on CUDA.
- Use the CPU for smaller experiments.
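The mixed-precision and CPU-fallback suggestions above can be sketched in plain PyTorch as follows. This is not WorldFlux's own training loop: the linear layer is a toy stand-in for a world model, and train_step is a hypothetical helper.

```python
import torch
from torch import nn

# Fall back to CPU when no GPU is visible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # mixed precision is only enabled on CUDA here

model = nn.Linear(16, 4).to(device)  # toy stand-in for a world model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    # Eligible ops in the forward pass run in float16 when use_amp is True
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss = nn.functional.mse_loss(model(inputs), targets)
    # The scaler multiplies the loss to limit float16 gradient underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales gradients, then calls optimizer.step()
    scaler.update()
    return loss.item()


loss_value = train_step(
    torch.randn(8, 16, device=device),
    torch.randn(8, 4, device=device),
)
print(f"loss: {loss_value:.4f}")
```

When CUDA is unavailable, both autocast and the scaler are disabled, so the same code runs as an ordinary float32 step.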
Memory grows during training¶
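Cause: Loss tensors that are stored or summed for logging keep their autograd history alive, so memory accumulates across steps.
Solution: Ensure losses are detached for logging, for example by storing loss.item() or loss.detach() rather than the loss tensor itself: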
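A minimal sketch of the pattern in plain PyTorch, with a toy model standing in for the real one:

```python
import torch
from torch import nn

model = nn.Linear(8, 1)  # toy stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_history = []  # holds plain Python floats, never tensors
running_total = 0.0

for step in range(5):
    inputs, targets = torch.randn(4, 8), torch.randn(4, 1)
    loss = nn.functional.mse_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # .item() copies the scalar out as a float with no autograd history.
    # Writing `running_total += loss` instead would chain every step's history.
    running_total += loss.item()
    loss_history.append(loss.item())

print(f"mean loss: {running_total / len(loss_history):.4f}")
```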
Training Issues¶
Loss is NaN¶
Causes:
- Learning rate too high
- Gradient explosion
- Bad data (NaN/Inf values)
Solutions:
- Reduce the learning rate.
- Enable gradient clipping.
- Check the data for NaN or Inf values.
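The three fixes above can be sketched with standard PyTorch calls. The helper assert_finite and the toy model are illustrative assumptions, not WorldFlux APIs, and the learning rate and clipping threshold are placeholder values.

```python
import torch
from torch import nn


def assert_finite(name: str, tensor: torch.Tensor) -> None:
    """Raise early if a batch contains NaN or Inf values."""
    finite = torch.isfinite(tensor)
    if not finite.all():
        bad = (~finite).sum().item()
        raise ValueError(f"{name} has {bad} non-finite values")


model = nn.Linear(8, 1)  # toy stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # reduced learning rate

obs, target = torch.randn(4, 8), torch.randn(4, 1)
assert_finite("obs", obs)
assert_finite("target", target)

loss = nn.functional.mse_loss(model(obs), target)
optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global L2 norm is at most max_norm;
# the return value is the norm measured before clipping
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
print(f"grad norm before clipping: {grad_norm.item():.4f}")
```

Logging grad_norm over time is a cheap way to spot gradient explosion before the loss turns into NaN.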
Loss not decreasing¶
Causes:
- Insufficient data
- Learning rate too low
- Model too small
Solutions:
- Collect more data (100+ episodes).
- Increase the learning rate.
- Use a larger model.
- Train longer.
KL loss dominates (DreamerV3)¶
Solution: Adjust KL balancing:
model = create_world_model(
    "dreamerv3:size50m",
    kl_free=1.0,  # Allow some free nats
    kl_balance=0.8,  # Balance prior/posterior
)
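To make the two parameters concrete, here is one common way free nats and KL balancing are combined in Dreamer-style losses. It is a sketch under stated assumptions and not necessarily what WorldFlux does internally: Gaussian latents are used only for brevity (DreamerV3 implementations typically use categorical latents), and balanced_kl is a hypothetical helper.

```python
import torch
from torch.distributions import Normal, kl_divergence


def balanced_kl(post_mean, post_std, prior_mean, prior_std,
                kl_free: float = 1.0, kl_balance: float = 0.8) -> torch.Tensor:
    """Weighted sum of two KL terms, each clipped from below at kl_free nats."""
    post = Normal(post_mean, post_std)
    prior = Normal(prior_mean, prior_std)
    # detach() acts as a stop-gradient on one side of each term
    post_sg = Normal(post_mean.detach(), post_std.detach())
    prior_sg = Normal(prior_mean.detach(), prior_std.detach())

    # Trains the prior toward the (frozen) posterior
    kl_prior = kl_divergence(post_sg, prior).sum(-1)
    # Regularizes the posterior toward the (frozen) prior
    kl_post = kl_divergence(post, prior_sg).sum(-1)

    # Free nats: KL below kl_free contributes a constant, hence no gradient
    kl_prior = torch.clamp(kl_prior, min=kl_free)
    kl_post = torch.clamp(kl_post, min=kl_free)
    return (kl_balance * kl_prior + (1.0 - kl_balance) * kl_post).mean()


mean, std = torch.zeros(2, 3), torch.ones(2, 3)
print(balanced_kl(mean, std, mean + 5.0, std).item())
```

Under this formulation, raising kl_free lets the latent carry more information before the KL penalty applies, and kl_balance shifts the weight between training the prior and regularizing the posterior.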
Model Issues¶
TD-MPC2: stochastic is None¶
This is expected. TD-MPC2 is an implicit model without stochastic state:
state = tdmpc.encode(obs)
print(state.tensors["latent"]) # SimNorm embedding
# Use latent for features
features = state.tensors["latent"]
TD-MPC2: trajectory.continues is None¶
This is expected. TD-MPC2 doesn't predict episode continuation:
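Downstream code that expects continuation flags can treat a missing value as "never terminates". The sketch below uses a hypothetical ToyTrajectory class with plain lists rather than the WorldFlux trajectory type, so only the None-handling pattern carries over.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyTrajectory:
    """Hypothetical stand-in for a rollout result; not the WorldFlux class."""
    rewards: List[float]
    continues: Optional[List[float]] = None


def discounted_return(traj: ToyTrajectory, gamma: float = 0.99) -> float:
    # With no continuation prediction, treat every step as non-terminal (1.0)
    continues = traj.continues
    if continues is None:
        continues = [1.0] * len(traj.rewards)
    total, weight = 0.0, 1.0
    for reward, cont in zip(traj.rewards, continues):
        total += weight * reward
        weight *= gamma * cont  # a 0.0 flag cuts off all later rewards
    return total


print(discounted_return(ToyTrajectory(rewards=[1.0, 1.0, 1.0]), gamma=0.5))
```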
DreamerV3: Blurry reconstructions¶
Causes:
- Model too small
- Insufficient training
- KL too high
Solutions:
1. Use a larger model
2. Train longer
3. Reduce the KL weight
Data Issues¶
ReplayBuffer sample error¶
Cause: Buffer has fewer transitions than batch_size * seq_len.
Solution:
# Check buffer size
print(f"Buffer size: {len(buffer)}")
print(f"Required: {batch_size * seq_len}")
# Collect more data or reduce batch/seq
batch = buffer.sample(batch_size=4, seq_len=10)
Episode boundary issues¶
Symptom: Model learns wrong transitions at episode boundaries.
Solution: Ensure the dones array marks the end of each episode correctly:
buffer.add_episode(
    obs=obs_array,
    actions=action_array,
    rewards=reward_array,
    dones=done_array,  # True at episode end, False elsewhere
)
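A small NumPy check before calling add_episode can catch malformed arrays early. This sketch assumes every array has one entry per step and that each call covers exactly one episode; make_dones and check_episode are hypothetical helpers, not part of the WorldFlux API.

```python
import numpy as np


def make_dones(episode_length: int) -> np.ndarray:
    """Boolean array that is True only on the final step of one episode."""
    dones = np.zeros(episode_length, dtype=bool)
    dones[-1] = True  # assumes episode_length >= 1
    return dones


def check_episode(obs, actions, rewards, dones) -> None:
    """Sanity checks before handing one episode to a replay buffer."""
    lengths = {len(obs), len(actions), len(rewards), len(dones)}
    if len(lengths) != 1:
        raise ValueError(f"array lengths differ: {sorted(lengths)}")
    dones = np.asarray(dones, dtype=bool)
    if not dones[-1] or dones[:-1].any():
        raise ValueError("dones must be True only at the final step")


steps = 5
done_array = make_dones(steps)
check_episode(np.zeros((steps, 3)), np.zeros((steps, 2)), np.zeros(steps), done_array)
print(done_array)
```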
Loading Issues¶
Can't load saved model¶
Solution: Check model directory structure:
Version mismatch¶
Cause: Model saved with different WorldFlux version.
Solution:
# Load with strict=False so mismatched keys do not raise an error
import torch
state_dict = torch.load("model.pt", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
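Because strict=False silently skips mismatched entries, it is worth inspecting what was skipped. load_state_dict returns the missing and unexpected keys; the sketch below simulates a mismatched checkpoint on a toy model rather than a real WorldFlux one.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Simulate a checkpoint from another version: one key dropped, one key added
state_dict = model.state_dict()
state_dict.pop("2.bias")
state_dict["legacy.weight"] = torch.zeros(1)

result = model.load_state_dict(state_dict, strict=False)
# Parameters listed as missing keep whatever values the model already had
print("Missing keys:", result.missing_keys)
# Unexpected keys from the checkpoint are ignored
print("Unexpected keys:", result.unexpected_keys)
```

A long list of missing keys usually means the checkpoint and the model architecture do not match, in which case the loaded model is largely untrained.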
Performance Issues¶
Training too slow¶
Solutions:
- Use a GPU.
- Increase the batch size (if memory allows).
- Use DataLoader workers.
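The DataLoader suggestion can be sketched with the standard PyTorch API. The tensor dataset here is random placeholder data and make_loader is a hypothetical helper; how WorldFlux's own buffers plug into a DataLoader is not shown.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size: int = 32, num_workers: int = 0) -> DataLoader:
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,  # 0 loads batches in the main process
        pin_memory=torch.cuda.is_available(),  # faster host-to-GPU copies
        persistent_workers=num_workers > 0,  # keep workers alive across epochs
    )


dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 4))
loader = make_loader(dataset, batch_size=64, num_workers=0)
obs_batch, target_batch = next(iter(loader))
print(obs_batch.shape, target_batch.shape)
```

Raising num_workers (for example to 2 or 4) moves batch preparation into background processes. On platforms that spawn workers, the script's entry point should be guarded with if __name__ == "__main__".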
Imagination too slow¶
Solutions:
- Wrap imagination in torch.no_grad() when gradients are not needed.
- Batch multiple rollouts into a single call.
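Both suggestions can be combined. The sketch below uses a toy linear layer as the latent dynamics instead of a WorldFlux model, and imagine is a hypothetical helper; the dimensions are arbitrary.

```python
import torch
from torch import nn

latent_dim, action_dim = 32, 4
dynamics = nn.Linear(latent_dim + action_dim, latent_dim)  # toy latent dynamics


@torch.no_grad()  # no autograd graph is recorded inside this function
def imagine(start: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Roll out every start state in parallel.

    start: (batch, latent_dim); actions: (horizon, batch, action_dim).
    """
    latent, latents = start, []
    for action in actions:  # one step for the whole batch at a time
        latent = torch.tanh(dynamics(torch.cat([latent, action], dim=-1)))
        latents.append(latent)
    return torch.stack(latents)  # (horizon, batch, latent_dim)


# 64 rollouts of 15 steps in one call instead of 64 separate calls
rollouts = imagine(torch.randn(64, latent_dim), torch.randn(15, 64, action_dim))
print(rollouts.shape)
```

Disabling gradients is only appropriate for planning or evaluation. If a policy is trained by backpropagating through imagined rollouts, the graph must be kept.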
Getting Help¶
If your issue isn't listed here:
- Check GitHub Issues
- Search existing issues for similar problems
- Open a new issue with:
  - WorldFlux version (worldflux.__version__)
  - Python version
  - PyTorch version
  - Full error traceback
  - Minimal reproduction code