Troubleshooting Guide
Solutions to common problems when installing, running, or building with Atelier.
Contents
- Browser Access
- Installation
- Networking
- Authentication & Password Recovery
- App Build & Deploy Issues
- LLM & Nova Issues
- WebSocket & Long-Lived Connections
- Registry & Images
- Kubernetes & Storage
Browser Access
Can’t reach atelier.home.arpa from my machine
Your machine’s /etc/hosts file must have an entry for the portal domain. On macOS/Linux:
echo "192.168.0.x atelier.home.arpa gitea.atelier.home.arpa" | sudo tee -a /etc/hostsOn Windows, edit C:\Windows\System32\drivers\etc\hosts as administrator.
Replace 192.168.0.x with your server’s IP.
.local domain broken on macOS/Linux
macOS and Linux (with Avahi) reserve .local for mDNS, which can cause unpredictable DNS resolution failures. The default domain has been changed to atelier.home.arpa — if your install is older, patch the ingress hosts:
```bash
kubectl patch ingress atelier-ui -n atelier --type json \
  -p '[{"op":"replace","path":"/spec/rules/0/host","value":"atelier.home.arpa"}]'
```

On Windows, .local works fine. You can install with --domain atelier.local if preferred.
Blank white screen / app won’t load
Clear browser local storage for the site and log in again:
- Chrome DevTools → Application → Local Storage → delete entries for your portal domain
- Safari → Preferences → Privacy → Manage Website Data → remove
Or use an incognito/private window to rule out stale state.
404 when opening a generated app
Check the app’s ingress and middlewares:
```bash
# Verify ingress host and backend
kubectl get ingress <app-name> -n atelier-apps -o yaml

# Verify middlewares exist
kubectl get middleware -n atelier-apps | grep <app-name>
```

Expected: one trailing-slash-<app-name> middleware and one strip-<app-name> middleware, and the ingress annotation must reference both with trailing-slash before strip.
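If both middlewares exist but the 404 persists, it is usually the ordering in the ingress annotation. A quick way to eyeball it, assuming the app uses Traefik's standard traefik.ingress.kubernetes.io/router.middlewares annotation:

```bash
# The router.middlewares value should list trailing-slash-<app-name> before strip-<app-name>
kubectl get ingress <app-name> -n atelier-apps -o yaml | grep 'router.middlewares'
```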
Copy button fails with “Failed to copy to clipboard”
Fixed in newer builds (PR #314). The Clipboard API requires a secure context (HTTPS or localhost), which plain HTTP doesn’t provide; the current build falls back to document.execCommand('copy') via a hidden textarea.
Installation
Disk space requirements
- Base install: ~30 GB (K3s, registry, containerd images, PVCs for atelier-core and Gitea)
- With Kasten K10 backup: add ~70 GB for snapshot storage
- Each generated app: ~1-3 GB depending on images
Check with df -h /var/lib/rancher/k3s and df -h /var/lib/longhorn before installing.
K3s kubeconfig permissions reset after reboot
K3s resets /etc/rancher/k3s/k3s.yaml to 0600 on every restart, breaking non-root kubectl access. The installer fixes this but it needs re-running after K3s restarts:
```bash
sudo chmod 644 /etc/rancher/k3s/k3s.yaml
```

Or add to the K3s systemd override:

```bash
sudo systemctl edit k3s
# Add:
# [Service]
# ExecStartPost=/bin/chmod 644 /etc/rancher/k3s/k3s.yaml
```

Longhorn not ready — PVCs stay Pending
Apps with PVCs (backend /data mount, Gitea, registry) can’t start until Longhorn is healthy:
```bash
# Check Longhorn status
kubectl get pods -n longhorn-system

# All longhorn-manager, longhorn-driver-deployer, csi-* pods must be Running
kubectl describe pvc <pvc-name> -n <namespace>
```

Common causes: insufficient disk space, kernel not supporting iSCSI, nodes not labelled. See the Longhorn docs.
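A quick host-side sanity check for the usual suspects — a sketch; the iSCSI package name varies by distro (open-iscsi on Debian/Ubuntu, iscsi-initiator-utils on RHEL-family):

```bash
# iSCSI daemon must be installed and running for Longhorn volumes to attach
sudo systemctl status iscsid --no-pager

# Kernel iSCSI modules loaded?
lsmod | grep -i iscsi

# Enough space on the Longhorn data path?
df -h /var/lib/longhorn
```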
MCP images missing — ImagePullBackOff
From v0.9.8-beta onwards, mcp-fetch is pulled directly from ghcr.io/atelier-project/atelier-mcp-fetch:<version> (matching the running atelier-core’s build version) — no in-cluster registry indirection. If you see ImagePullBackOff on a fresh mcp-fetch deployment, check that the cluster node can reach ghcr.io over HTTPS.
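A rough reachability check from the node — a 401/403 response from ghcr.io is fine (it proves HTTPS connectivity); only timeouts or DNS errors indicate a real problem:

```bash
# Expect an HTTP status line (typically 401 Unauthorized)
curl -sSI https://ghcr.io/v2/ | head -1

# If that fails, check DNS resolution on the node
getent hosts ghcr.io
```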
Custom MCP servers brought in via the Settings → MCP Servers → Docker image flow still need a manual import + push (the local-registry path is preserved for user-supplied images):
```bash
sudo k3s ctr -n k8s.io images tag \
  registry.atelier.local/mcp-my-server:latest \
  localhost:32000/mcp-my-server:latest

sudo k3s ctr -n k8s.io images push --plain-http \
  localhost:32000/mcp-my-server:latest
```

For pre-v0.9.8-beta clusters whose existing deployment still references registry.atelier.local/mcp-fetch:latest, the tag/push commands above (with mcp-fetch substituted for mcp-my-server) remain valid for that deployment. To migrate to GHCR after upgrading core/ui, delete the deployment via Settings → MCP Servers so the catalog bootstrap recreates it with the GHCR ref on the next core start.
Installer missing registries.yaml — apps fail with ImagePullBackOff
The installer creates /etc/rancher/k3s/registries.yaml which tells K3s to mirror registry.atelier.local to localhost:32000. If missing:
```bash
sudo tee /etc/rancher/k3s/registries.yaml <<EOF
mirrors:
  "registry.atelier.local":
    endpoint:
      - "http://localhost:32000"
EOF
sudo systemctl restart k3s
```

Networking
Required ports
| Port | Service | Purpose |
|---|---|---|
| 80 | Traefik (web entrypoint) | Portal, app ingresses |
| 32000 | Registry (NodePort) | Docker image push/pull |
| 32001 | Gitea (NodePort) | Git push/pull, UI |
| 30302 | Gitea (NodePort) | SSH git access (optional) |
| 6443 | K3s API | kubectl access |
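To confirm these ports are actually reachable from a client machine, a quick sweep with netcat (assumes an nc that supports -z; replace 192.168.0.x with your server's IP):

```bash
for port in 80 6443 32000 32001 30302; do
  nc -zv -w 3 192.168.0.x $port
done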
Outbound connections required
- LLM provider API — api.anthropic.com, api.openai.com, or your Ollama host
- Docker Hub / ghcr.io — Trivy image, Kasten K10, base images (alpine, node, python)
- Longhorn Helm chart — charts.longhorn.io
- Gitea image — gitea/gitea:1.22
No inbound internet required — everything serves over LAN.
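If you suspect outbound egress is blocked, a rough connectivity check from the server — drop any hosts you don't use (for example the hosted LLM APIs when running Ollama locally):

```bash
# Any HTTP status (200/301/401/...) means the host is reachable; 000 means it is not
for host in api.anthropic.com api.openai.com ghcr.io charts.longhorn.io registry-1.docker.io; do
  printf '%-26s ' "$host"
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 "https://$host" 2>/dev/null
done
```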
WebSocket connections fail to upgrade
Traefik generally upgrades WebSocket connections automatically. If your app uses nginx (or another reverse proxy) in front of the WS backend inside the container, ensure that proxy includes these headers:
```nginx
location /ws {
  proxy_pass http://backend:3001/ws;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "Upgrade";
  proxy_read_timeout 3600s;
  proxy_send_timeout 3600s;
}
```

Also ensure the frontend uses window.location.pathname to derive the WS URL — not import.meta.env.BASE_URL. See the WebSocket section below.
Authentication & Password Recovery
Locked out of the portal
Reset your password from the server — no UI access needed:
```bash
# Interactive (reads from stdin, doesn't leak to shell history)
kubectl exec -it -n atelier deployment/atelier-core -- \
  atelier-core reset-password --user <username>
```

Requires kubectl access (which implies cluster admin). The new password must meet the standard policy: 10+ characters, uppercase, lowercase, digit.
Forgot Gitea admin password
```bash
kubectl exec -n atelier deployment/gitea -- su git -c \
  "gitea admin user change-password --username atelier-admin --password <new-password>"
```

Forgot K10 dashboard password
```bash
# Generate a new bcrypt htpasswd
NEW_PASSWORD='new-password-here'
HASHED=$(htpasswd -nbB atelier "$NEW_PASSWORD" | base64 -w 0)

# Patch the secret
kubectl patch secret k10-basic-auth -n kasten-io \
  -p "{\"data\":{\"auth\":\"$HASHED\"}}"

# Restart the auth service
kubectl rollout restart deployment/auth-svc -n kasten-io
```

JWT token expired in MCP config
Atelier JWTs expire after 24 hours. Regenerate by logging in again:
```bash
curl -s http://atelier.home.arpa/api/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"<name>","password":"<pass>"}' | jq -r .token
```

Or copy it from your browser’s DevTools → Application → Local Storage → atelier_token.
App Build & Deploy Issues
Rebuild doesn’t create new pods
Fixed in recent builds. If you’re on an old version, force a rollout:
```bash
kubectl rollout restart deployment/<app-name>-backend \
  deployment/<app-name>-frontend -n atelier-apps
```

The newer pipeline injects an atelier.io/build-tag annotation into the pod template, which forces K8s to create a new ReplicaSet on every build.
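To check whether your build pipeline is already injecting the annotation (deployment names follow the <app-name>-backend / <app-name>-frontend convention used above):

```bash
# Empty output means the annotation is missing and you are on the older pipeline
kubectl get deployment <app-name>-backend -n atelier-apps -o yaml | grep 'atelier.io/build-tag'
```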
App in error state even though pods are healthy
Use the Reset Status button in the app detail header (appears when status is error). It checks actual pod health in K8s and resets to running or paused as appropriate.
If the UI button isn’t available, reset via DB:
```bash
kubectl exec -n atelier deployment/atelier-core -- sqlite3 /data/atelier.db \
  "UPDATE apps SET status='running', last_error=NULL WHERE name='<app-name>'"
```

ErrImagePull on app pods
Common causes:
- registries.yaml missing — see Installation
- Image not actually built — check kubectl logs -n atelier job/<build-job> for build errors
- Image pushed to wrong registry — verify with curl http://localhost:32000/v2/_catalog
PVC stays Pending
Longhorn must be ready. See Longhorn not ready.
Rollout timeout
```bash
# Get the actual error
kubectl describe pod -n atelier-apps -l atelier.io/app=<app-name>

# Check recent events
kubectl get events -n atelier-apps --sort-by=.lastTimestamp | tail -20
```

Common causes: image pull failures, PVC pending, failing readiness probe, container crash.
Service selector mismatch — pods running but no endpoints / 502
After an LLM rebuild, pod labels may not match the service selector:
```bash
# Compare
kubectl get svc <name>-frontend -n atelier-apps -o jsonpath='{.spec.selector}'
kubectl get pods -n atelier-apps -l app=<name>-frontend

# Patch selector to match actual pod labels
kubectl patch service/<name>-frontend -n atelier-apps \
  -p '{"spec":{"selector":{"app":"<actual-label>"}}}'
```

LLM & Nova Issues
Planner or Nova silently does nothing
Symptom: You describe an app and the plan never arrives, or Nova accepts your question but never responds. No error banner appears in the UI.
Diagnose: Tail atelier-core logs while you trigger another request:
```bash
kubectl -n atelier logs -f deploy/atelier-core --tail=0 2>&1 | grep -iE 'llm|noop|role'
```

Look for a WARN line that starts with No usable LLM provider could be constructed for this role. The message names:
- The role that tried to resolve (build, plan, nova, supervisor, code_review)
- The underlying error that made the final fallback fail (missing API key, unparseable base URL, unreachable host, etc.)
- A concrete fix hint listing the env vars or profile fields to set
If no WARN appears but the request still hangs, the problem is downstream (provider is returning a 4xx/5xx, or a local endpoint is timing out). Check the profile’s Base URL is correct — for OpenAI-compatible endpoints like Ollama or LM Studio it must include /v1.
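A quick way to confirm the Base URL is right for an OpenAI-compatible endpoint: the /v1/models route should return a JSON model list. Default ports assumed here (11434 for Ollama, 1234 for LM Studio):

```bash
# Ollama
curl -s http://<ollama-host>:11434/v1/models | jq .

# LM Studio
curl -s http://<lmstudio-host>:1234/v1/models | jq .
```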
Assigned role ignored / using wrong profile
Symptom: You assigned a profile to Nova (for example) but the actual requests go to a different provider.
Check: Role assignments are stored in the settings table under keys like llm_role.nova. Verify via the API (any logged-in token works):
curl -sS -H "Authorization: Bearer $TOKEN" http://atelier.home.arpa/api/settings | jq .llm_rolesThe value for each role is the profile id. If it’s missing, you haven’t assigned a profile for that role — there is no implicit fallback to any Default profile (the migrated Default is just a regular entry in the Profiles panel). In that case runtime resolution falls back to the legacy flat llm.* settings, then to the startup env-var configuration. If it’s present and points at a profile, the role uses that profile on the next LLM call (no pod restart needed).
LM Studio / Ollama / llama.cpp returns 400 Bad Request with context-length error
Symptom: Local OpenAI-compatible backend returns something like:
```
The number of tokens to keep from the initial prompt is greater than the context length
(n_keep: 4348 >= n_ctx: 4096). Try to load the model with a larger context length.
```

Cause: Atelier’s system prompts (especially Nova’s, which embeds the MCP tool catalogue) exceed the model’s configured context window.
Fix: Reload the model in your local runner with a larger context window (n_ctx). Starting point: 8192, bump to 16384 if you still overflow as conversation history grows. For Nova specifically, you can also trim the system prompt by disabling MCP servers Nova doesn’t need (Settings → MCP Servers → uncheck the Chat / Pipeline checkbox on unused tools).
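How you raise the context window depends on the runner. Two hedged examples — a llama.cpp server flag and an Ollama Modelfile override (model and file names here are placeholders):

```bash
# llama.cpp: start the server with an 8k context
llama-server -m ./model.gguf -c 8192

# Ollama: bake a larger num_ctx into a model variant
ollama show --modelfile <model> > Modelfile
echo "PARAMETER num_ctx 8192" >> Modelfile
ollama create <model>-8k -f Modelfile
```

For LM Studio, set the context length in the model load dialog instead.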
Role-profile mapping out of date after schema changes
Symptom: After an upgrade, some roles have no profile assigned and fall back to env-var config.
Cause: Role assignments are persisted — they survive upgrades — but new roles added by the platform don’t auto-populate for existing installs.
Fix: Open Settings → LLM Profiles → Role Assignments, pick a profile for each unset role. Or create any new profile to trigger the auto-assign-on-create path, which fills every unassigned role at once.
WebSocket & Long-Lived Connections
LLM-generated apps with WebSockets hit two common bugs:
1. Broken WS URL (ws://host./ws)
Cause: LLM used import.meta.env.BASE_URL which is ./ at runtime, concatenated with the host.
Fix: Use window.location.pathname:
```js
const loc = window.location;
const proto = loc.protocol === 'https:' ? 'wss:' : 'ws:';
const base = loc.pathname.replace(/\/$/, '');
const wsUrl = `${proto}//${loc.host}${base}/ws`;
```

Newer LLM prompts include this pattern — if you still hit it, push the fix directly to Gitea and re-deploy.
2. WebSocket closes immediately after connecting
Cause: React useEffect re-fires when connection state changes (e.g. onConnected flips connecting: false), triggering cleanup that closes the WS.
Fix: Use a ref for volatile connection state, not a useEffect dep:
```js
const connectingRef = useRef(false);
connectingRef.current = tab.connecting;

useEffect(() => {
  if (!connectingRef.current) return;
  const ws = new WebSocket(url);
  // ...
  return () => {
    if (wsRef.current === ws) {
      ws.close();
      wsRef.current = null;
    }
  };
}, [tab.config]); // NOT tab.connecting
```

Registry & Images
Can’t push to registry from Mac
```
dial tcp: lookup registry.atelier.local: no such host
```

Add to the Mac’s /etc/hosts:

```
192.168.0.x registry.atelier.local
```

The registry stays on .local intentionally — it works fine via /etc/hosts and doesn’t have mDNS issues in this path.
Also configure Docker Desktop to treat registry.atelier.local:32000 as an insecure registry (Settings → Docker Engine → add to insecure-registries).
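After adding it and restarting Docker Desktop, you can confirm the setting took effect:

```bash
# The registry should appear under "Insecure Registries"
docker info | grep -A3 'Insecure Registries'
```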
Registry image delete fails with MANIFEST_UNKNOWN
Fixed in PR #178. If you’re on an old build, you can delete directly:
```bash
# Find all manifest types and try each
for accept in \
  "application/vnd.oci.image.index.v1+json" \
  "application/vnd.docker.distribution.manifest.v2+json" \
  "application/vnd.oci.image.manifest.v1+json"; do
  DIGEST=$(curl -sI -H "Accept: $accept" \
    "http://registry.atelier.local:32000/v2/<repo>/manifests/<tag>" \
    | grep -i docker-content-digest | awk '{print $2}' | tr -d '\r')
  [ -n "$DIGEST" ] && curl -X DELETE \
    "http://registry.atelier.local:32000/v2/<repo>/manifests/$DIGEST" && break
done
```

Registry growing too large
Run garbage collection from Settings → Registry in the UI, or:
```bash
kubectl exec -n atelier deployment/registry -- registry garbage-collect \
  --delete-untagged /etc/docker/registry/config.yml
```

Kubernetes & Storage
Check cluster health at a glance
```bash
kubectl get nodes
kubectl get pods -A --field-selector='status.phase!=Running'
kubectl top nodes   # requires metrics-server
```

Stale resources left behind after topology change
Known issue (#222). When an app’s build produces a different set of K8s resources, old ones aren’t auto-deleted. Manually:
```bash
# List all resources for an app
kubectl get all -n atelier-apps -l atelier.io/app=<app-name>

# Delete the obsolete ones
kubectl delete <kind>/<name> -n atelier-apps
```

The supervisor will detect orphaned resources (0 endpoints, selector mismatch) and surface them as warnings.
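To spot these yourself before the supervisor flags them, list services that have no endpoints:

```bash
# Services showing <none> have no matching pods — likely orphaned or selector-mismatched
kubectl get endpoints -n atelier-apps
```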
Longhorn volume stuck Attached but unused
```bash
# Force detach
kubectl patch volume.longhorn.io <pv-name> -n longhorn-system \
  --type=merge -p '{"spec":{"nodeID":""}}'
```

K10 backups not running
```bash
# Check K10 pods
kubectl get pods -n kasten-io

# Check policy status
kubectl get policies.config.kio.kasten.io -n kasten-io
kubectl describe policy <policy-name> -n kasten-io

# Check recent runs
kubectl get runactions.actions.kio.kasten.io -n kasten-io --sort-by=.metadata.creationTimestamp | tail
```

Atelier DB lost / corrupted
The DB lives at /data/atelier.db in the atelier-core pod. Restore from K10 backup, or if unrecoverable, start fresh:
```bash
# Delete the PVC (WARNING: destroys all app metadata, not Gitea repos)
kubectl delete pvc atelier-core-data-longhorn -n atelier
kubectl rollout restart deployment/atelier-core -n atelier
```

Apps in Gitea will still exist — you can re-import them via the “From Gitea” button.
Getting Help
- Check logs first:

  ```bash
  kubectl logs deployment/atelier-core -n atelier --tail=100
  kubectl logs deployment/atelier-ui -n atelier --tail=50
  ```

- Check pod events:

  ```bash
  kubectl get events -n atelier-apps --sort-by=.lastTimestamp | tail -30
  ```

- Platform health — Settings → System shows node IP, versions, uptime, and backup state.

- File an issue at atelier-project/atelier/issues with logs, error messages, and reproduction steps.