
Troubleshooting Guide

Solutions to common problems when installing, running, or building with Atelier.

Browser Access

Can’t reach atelier.home.arpa from my machine

Your machine’s /etc/hosts file must have an entry for the portal domain. On macOS/Linux:

Terminal window
echo "192.168.0.x atelier.home.arpa gitea.atelier.home.arpa" | sudo tee -a /etc/hosts

On Windows, edit C:\Windows\System32\drivers\etc\hosts as administrator.

Replace 192.168.0.x with your server’s IP.
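To confirm the entry took effect, a small check you can run afterwards (a sketch; it greps a hosts-format file for an uncommented line containing the name):

Terminal window
```shell
# check_hosts FILE NAME: succeed if FILE has an uncommented entry for NAME.
check_hosts() {
  grep -Ev '^[[:space:]]*#' "$1" | grep -qw "$2"
}

# e.g. check_hosts /etc/hosts atelier.home.arpa && echo "hosts entry present"
```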

.local domain broken on macOS/Linux

macOS and Linux (with Avahi) reserve .local for mDNS, which can cause unpredictable DNS resolution failures. The default domain has been changed to atelier.home.arpa — if your install is older, patch the ingress hosts:

Terminal window
kubectl patch ingress atelier-ui -n atelier --type json \
-p '[{"op":"replace","path":"/spec/rules/0/host","value":"atelier.home.arpa"}]'

On Windows, .local works fine. You can install with --domain atelier.local if preferred.

Blank white screen / app won’t load

Clear browser local storage for the site and log in again:

  • Chrome DevTools → Application → Local Storage → delete entries for your portal domain
  • Safari → Preferences → Privacy → Manage Website Data → remove

Or use an incognito/private window to rule out stale state.

404 when opening a generated app

Check the app’s ingress and middlewares:

Terminal window
# Verify ingress host and backend
kubectl get ingress <app-name> -n atelier-apps -o yaml
# Verify middlewares exist
kubectl get middleware -n atelier-apps | grep <app-name>

Expected: one trailing-slash-<app-name> middleware and one strip-<app-name> middleware, and the ingress annotation must reference both with trailing-slash before strip.
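As a sketch of what correct wiring looks like (Traefik references CRD middlewares as `<namespace>-<name>@kubernetescrd`; the middleware names below are assumed from the expected names above, so check yours with the commands first):

```yaml
metadata:
  annotations:
    # Order matters: trailing-slash first, then strip.
    traefik.ingress.kubernetes.io/router.middlewares: >-
      atelier-apps-trailing-slash-<app-name>@kubernetescrd,atelier-apps-strip-<app-name>@kubernetescrd
```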

Copy button fails with “Failed to copy to clipboard”

Fixed in newer builds (PR #314). The Clipboard API requires a secure context (HTTPS or localhost), which plain HTTP doesn’t provide; the current build falls back to document.execCommand('copy') via a hidden textarea.


Installation

Disk space requirements

  • Base install: ~30 GB (K3s, registry, containerd images, PVCs for atelier-core and Gitea)
  • With Kasten K10 backup: add ~70 GB for snapshot storage
  • Each generated app: ~1-3 GB depending on images

Check with df -h /var/lib/rancher/k3s and df -h /var/lib/longhorn before installing.

K3s kubeconfig permissions reset after reboot

K3s resets /etc/rancher/k3s/k3s.yaml to 0600 on every restart, breaking non-root kubectl access. The installer applies the fix, but it must be re-applied after each K3s restart:

Terminal window
sudo chmod 644 /etc/rancher/k3s/k3s.yaml

Or add to the K3s systemd override:

Terminal window
sudo systemctl edit k3s
# Add:
# [Service]
# ExecStartPost=/bin/chmod 644 /etc/rancher/k3s/k3s.yaml

Longhorn not ready — PVCs stay Pending

Apps with PVCs (backend /data mount, Gitea, registry) can’t start until Longhorn is healthy:

Terminal window
# Check Longhorn status
kubectl get pods -n longhorn-system
# All longhorn-manager, longhorn-driver-deployer, csi-* pods must be Running
kubectl describe pvc <pvc-name> -n <namespace>

Common causes: insufficient disk space, kernel not supporting iSCSI, nodes not labelled. See the Longhorn docs.
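A quick preflight for two of those causes (a sketch; Longhorn requires the open-iscsi tools on every node, and free space under /var/lib/longhorn):

Terminal window
```shell
# Run on each node host, not in a pod.
check_tool() {
  command -v "$1" >/dev/null 2>&1 && echo "$1: ok" || echo "$1: MISSING"
}

check_tool iscsiadm                              # provided by the open-iscsi package
df -h /var/lib/longhorn 2>/dev/null || echo "/var/lib/longhorn not mounted"
```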

MCP images missing — ImagePullBackOff

From v0.9.8-beta onwards, mcp-fetch is pulled directly from ghcr.io/atelier-project/atelier-mcp-fetch:<version> (matching the running atelier-core’s build version) — no in-cluster registry indirection. If you see ImagePullBackOff on a fresh mcp-fetch deployment, check that the cluster node can reach ghcr.io over HTTPS.

Custom MCP servers brought in via the Settings → MCP Servers → Docker image flow still need a manual import + push (the local-registry path is preserved for user-supplied images):

Terminal window
sudo k3s ctr -n k8s.io images tag \
registry.atelier.local/mcp-my-server:latest \
localhost:32000/mcp-my-server:latest
sudo k3s ctr -n k8s.io images push --plain-http \
localhost:32000/mcp-my-server:latest

For pre-v0.9.8-beta clusters that still reference registry.atelier.local/mcp-fetch:latest in their existing deployment: the original tag/push commands above (with mcp-fetch substituted for mcp-my-server) are still valid for that deployment. To migrate to GHCR after upgrading core/ui, delete the deployment via Settings → MCP Servers so the catalog bootstrap recreates it with the GHCR ref on next core start.

Installer missing registries.yaml — apps fail with ImagePullBackOff

The installer creates /etc/rancher/k3s/registries.yaml which tells K3s to mirror registry.atelier.local to localhost:32000. If missing:

Terminal window
sudo tee /etc/rancher/k3s/registries.yaml <<EOF
mirrors:
  "registry.atelier.local":
    endpoint:
      - "http://localhost:32000"
EOF
sudo systemctl restart k3s

Networking

Required ports

Port    Service                     Purpose
80      Traefik (web entrypoint)    Portal, app ingresses
32000   Registry (NodePort)         Docker image push/pull
32001   Gitea (NodePort)            Git push/pull, UI
30302   Gitea (NodePort)            SSH git access (optional)
6443    K3s API                     kubectl access
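To check reachability from a client machine, a sketch using bash's /dev/tcp (replace the placeholder IP; "closed/filtered" only means this client can't reach the port):

Terminal window
```shell
# probe_port HOST PORT: succeed if a TCP connection can be opened.
probe_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

SERVER=192.168.0.x   # replace with your server's IP
for port in 80 32000 32001 30302 6443; do
  probe_port "$SERVER" "$port" && echo "port $port: open" || echo "port $port: closed/filtered"
done
```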

Outbound connections required

  • LLM provider API — api.anthropic.com, api.openai.com, or your Ollama host
  • Docker Hub / ghcr.io — Trivy image, Kasten K10, base images (alpine, node, python)
  • Longhorn Helm chart — charts.longhorn.io
  • Gitea image — gitea/gitea:1.22

No inbound internet required — everything serves over LAN.

WebSocket connections fail to upgrade

Traefik generally upgrades WebSocket connections automatically. If your app uses nginx (or another reverse proxy) in front of the WS backend inside the container, ensure that proxy includes these headers:

location /ws {
    proxy_pass http://backend:3001/ws;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Upgrade";
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
}

Also ensure the frontend uses window.location.pathname to derive the WS URL — not import.meta.env.BASE_URL. See WebSocket section below.


Authentication & Password Recovery

Locked out of the portal

Reset your password from the server — no UI access needed:

Terminal window
# Interactive (reads from stdin, doesn't leak to shell history)
kubectl exec -it -n atelier deployment/atelier-core -- \
atelier-core reset-password --user <username>

Requires kubectl access (which implies cluster admin). The new password must meet the standard policy: 10+ characters, uppercase, lowercase, digit.
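To pre-check a candidate password against that policy before running the reset, a sketch (the authoritative check lives in atelier-core):

Terminal window
```shell
# check_password_policy PASS: exit 0 iff 10+ chars with at least one
# uppercase letter, one lowercase letter, and one digit.
check_password_policy() {
  p="$1"
  [ "${#p}" -ge 10 ] || return 1
  case "$p" in *[A-Z]*) ;; *) return 1 ;; esac
  case "$p" in *[a-z]*) ;; *) return 1 ;; esac
  case "$p" in *[0-9]*) ;; *) return 1 ;; esac
}
```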

Forgot Gitea admin password

Terminal window
kubectl exec -n atelier deployment/gitea -- su git -c \
"gitea admin user change-password --username atelier-admin --password <new-password>"

Forgot K10 dashboard password

Terminal window
# Generate a new bcrypt htpasswd
NEW_PASSWORD='new-password-here'
HASHED=$(htpasswd -nbB atelier "$NEW_PASSWORD" | base64 -w 0)
# Patch the secret
kubectl patch secret k10-basic-auth -n kasten-io \
-p "{\"data\":{\"auth\":\"$HASHED\"}}"
# Restart the auth service
kubectl rollout restart deployment/auth-svc -n kasten-io

JWT token expired in MCP config

Atelier JWTs expire after 24 hours. Regenerate by logging in again:

Terminal window
curl -s http://atelier.home.arpa/api/auth/login \
-H 'Content-Type: application/json' \
-d '{"username":"<name>","password":"<pass>"}' | jq -r .token

Or copy from your browser’s DevTools → Application → Local Storage → atelier_token.
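To see when a token expires without logging in again, a sketch that decodes the exp claim from the JWT payload (assumes a standard three-part JWT with a numeric exp claim, as described above):

Terminal window
```shell
# decode_jwt_exp TOKEN: print the exp claim (unix seconds) from a JWT.
decode_jwt_exp() {
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url strips padding; restore it before decoding
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d | grep -o '"exp":[0-9]*' | cut -d: -f2
}
# Compare the result against the current time: date +%s
```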


App Build & Deploy Issues

Rebuild doesn’t create new pods

Fixed in recent builds. If you’re on an old version, force a rollout:

Terminal window
kubectl rollout restart deployment/<app-name>-backend \
deployment/<app-name>-frontend -n atelier-apps

The newer pipeline injects an atelier.io/build-tag annotation into the pod template, which forces K8s to create a new ReplicaSet on every build.

App in error state even though pods are healthy

Use the Reset Status button in the app detail header (appears when status is error). It checks actual pod health in K8s and resets to running or paused as appropriate.

If the UI button isn’t available, reset via DB:

Terminal window
kubectl exec -n atelier deployment/atelier-core -- sqlite3 /data/atelier.db \
"UPDATE apps SET status='running', last_error=NULL WHERE name='<app-name>'"

ErrImagePull on app pods

Common causes:

  1. registries.yaml missing — see Installation
  2. Image not actually built — check kubectl logs -n atelier job/<build-job> for build errors
  3. Image pushed to wrong registry — verify with curl http://localhost:32000/v2/_catalog
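Check 3 can be scripted; a sketch that assumes the standard registry v2 _catalog response shape and avoids a jq dependency:

Terminal window
```shell
# check_in_catalog CATALOG_JSON REPO: succeed if REPO is listed.
check_in_catalog() {
  printf '%s' "$1" | grep -q "\"$2\""
}

# usage on the server:
#   check_in_catalog "$(curl -s http://localhost:32000/v2/_catalog)" <app-name>-backend
```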

PVC stays Pending

Longhorn must be ready. See Longhorn not ready.

Rollout timeout

Terminal window
# Get the actual error
kubectl describe pod -n atelier-apps -l atelier.io/app=<app-name>
# Check recent events
kubectl get events -n atelier-apps --sort-by=.lastTimestamp | tail -20

Common causes: image pull failures, PVC pending, failing readiness probe, container crash.

Service selector mismatch — pods running but no endpoints / 502

After an LLM rebuild, pod labels may not match the service selector:

Terminal window
# Compare
kubectl get svc <name>-frontend -n atelier-apps -o jsonpath='{.spec.selector}'
kubectl get pods -n atelier-apps -l app=<name>-frontend
# Patch selector to match actual pod labels
kubectl patch service/<name>-frontend -n atelier-apps \
-p '{"spec":{"selector":{"app":"<actual-label>"}}}'

LLM & Nova Issues

Planner or Nova silently does nothing

Symptom: You describe an app and the plan never arrives, or Nova accepts your question but never responds. No error banner appears in the UI.

Diagnose: Tail atelier-core logs while you trigger another request:

Terminal window
kubectl -n atelier logs -f deploy/atelier-core --tail=0 2>&1 | grep -iE 'llm|noop|role'

Look for a WARN line that starts with No usable LLM provider could be constructed for this role. The message names:

  • The role that tried to resolve (build, plan, nova, supervisor, code_review)
  • The underlying error that made the final fallback fail (missing API key, unparseable base URL, unreachable host, etc.)
  • A concrete fix hint listing the env vars or profile fields to set

If no WARN appears but the request still hangs, the problem is downstream (provider is returning a 4xx/5xx, or a local endpoint is timing out). Check the profile’s Base URL is correct — for OpenAI-compatible endpoints like Ollama or LM Studio it must include /v1.
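A quick way to normalize a base URL into the expected form (a sketch; appends /v1 only when missing):

Terminal window
```shell
# normalize_base_url URL: print the URL with exactly one /v1 suffix.
normalize_base_url() {
  u="${1%/}"                       # drop a trailing slash
  case "$u" in
    */v1) printf '%s\n' "$u" ;;
    *)    printf '%s/v1\n' "$u" ;;
  esac
}
```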

Assigned role ignored / using wrong profile

Symptom: You assigned a profile to Nova (for example) but the actual requests go to a different provider.

Check: Role assignments are stored in the settings table under keys like llm_role.nova. Verify via the API (any logged-in token works):

Terminal window
curl -sS -H "Authorization: Bearer $TOKEN" http://atelier.home.arpa/api/settings | jq .llm_roles

The value for each role is the profile id. If it’s missing, you haven’t assigned a profile for that role — there is no implicit fallback to any Default profile (the migrated Default is just a regular entry in the Profiles panel). In that case runtime resolution falls back to the legacy flat llm.* settings, then to the startup env-var configuration. If it’s present and points at a profile, the role uses that profile on the next LLM call (no pod restart needed).

LM Studio / Ollama / llama.cpp returns 400 Bad Request with context-length error

Symptom: Local OpenAI-compatible backend returns something like:

The number of tokens to keep from the initial prompt is greater than the context length
(n_keep: 4348 >= n_ctx: 4096). Try to load the model with a larger context length.

Cause: Atelier’s system prompts (especially Nova’s, which embeds the MCP tool catalogue) exceed the model’s configured context window.

Fix: Reload the model in your local runner with a larger context window (n_ctx). Starting point: 8192, bump to 16384 if you still overflow as conversation history grows. For Nova specifically, you can also trim the system prompt by disabling MCP servers Nova doesn’t need (Settings → MCP Servers → uncheck the Chat / Pipeline checkbox on unused tools).
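To gauge whether a prompt or exported conversation will fit, a rough estimate from byte count (assumes ~4 characters per token; real tokenizers vary, so treat it as an order-of-magnitude check):

Terminal window
```shell
# estimate_tokens FILE: rough token estimate using the ~4 chars/token heuristic.
estimate_tokens() {
  echo $(( $(wc -c < "$1") / 4 ))
}
```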

Role-profile mapping out of date after schema changes

Symptom: After an upgrade, some roles have no profile assigned and fall back to env-var config.

Cause: Role assignments are persisted — they survive upgrades — but new roles added by the platform don’t auto-populate for existing installs.

Fix: Open Settings → LLM Profiles → Role Assignments, pick a profile for each unset role. Or create any new profile to trigger the auto-assign-on-create path, which fills every unassigned role at once.


WebSocket & Long-Lived Connections

LLM-generated apps with WebSockets hit two common bugs:

1. Broken WS URL (ws://host./ws)

Cause: LLM used import.meta.env.BASE_URL which is ./ at runtime, concatenated with the host.

Fix: Use window.location.pathname:

const loc = window.location;
const proto = loc.protocol === 'https:' ? 'wss:' : 'ws:';
const base = loc.pathname.replace(/\/$/, '');
const wsUrl = `${proto}//${loc.host}${base}/ws`;

Newer LLM prompts include this pattern — if you still hit it, push the fix directly to Gitea and re-deploy.

2. WebSocket closes immediately after connecting

Cause: React useEffect re-fires when connection state changes (e.g. onConnected flips connecting: false), triggering cleanup that closes the WS.

Fix: Use a ref for volatile connection state, not a useEffect dep:

const connectingRef = useRef(false);
connectingRef.current = tab.connecting;
useEffect(() => {
  if (!connectingRef.current) return;
  const ws = new WebSocket(url);
  // ...
  return () => {
    if (wsRef.current === ws) { ws.close(); wsRef.current = null; }
  };
}, [tab.config]); // NOT tab.connecting

Registry & Images

Can’t push to registry from Mac

dial tcp: lookup registry.atelier.local: no such host

Add to Mac’s /etc/hosts:

192.168.0.x registry.atelier.local

Registry stays on .local intentionally — it works fine via /etc/hosts and doesn’t have mDNS issues in this path.

Also configure Docker Desktop to treat registry.atelier.local:32000 as an insecure registry (Settings → Docker Engine → add to insecure-registries).
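That corresponds to this fragment of the engine config (daemon.json); merge it with whatever is already there:

```json
{
  "insecure-registries": ["registry.atelier.local:32000"]
}
```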

Registry image delete fails with MANIFEST_UNKNOWN

Fixed in PR #178. If you’re on an old build, you can delete directly:

Terminal window
# Find all manifest types and try each
for accept in \
  "application/vnd.oci.image.index.v1+json" \
  "application/vnd.docker.distribution.manifest.v2+json" \
  "application/vnd.oci.image.manifest.v1+json"; do
  DIGEST=$(curl -sI -H "Accept: $accept" \
    "http://registry.atelier.local:32000/v2/<repo>/manifests/<tag>" \
    | grep -i docker-content-digest | awk '{print $2}' | tr -d '\r')
  [ -n "$DIGEST" ] && curl -X DELETE \
    "http://registry.atelier.local:32000/v2/<repo>/manifests/$DIGEST" && break
done

Registry growing too large

Run garbage collection from Settings → Registry in the UI, or:

Terminal window
kubectl exec -n atelier deployment/registry -- registry garbage-collect \
--delete-untagged /etc/docker/registry/config.yml

Kubernetes & Storage

Check cluster health at a glance

Terminal window
kubectl get nodes
kubectl get pods -A --field-selector='status.phase!=Running'
kubectl top nodes # requires metrics-server

Stale resources left behind after topology change

Known issue (#222). When an app’s build produces a different set of K8s resources, old ones aren’t auto-deleted. Manually:

Terminal window
# List all resources for an app
kubectl get all -n atelier-apps -l atelier.io/app=<app-name>
# Delete the obsolete ones
kubectl delete <kind>/<name> -n atelier-apps

The supervisor will detect orphaned resources (0 endpoints, selector mismatch) and surface them as warnings.

Longhorn volume stuck Attached but unused

Terminal window
# Force detach
kubectl patch volume.longhorn.io <pv-name> -n longhorn-system \
--type=merge -p '{"spec":{"nodeID":""}}'

K10 backups not running

Terminal window
# Check K10 pods
kubectl get pods -n kasten-io
# Check policy status
kubectl get policies.config.kio.kasten.io -n kasten-io
kubectl describe policy <policy-name> -n kasten-io
# Check recent runs
kubectl get runactions.actions.kio.kasten.io -n kasten-io --sort-by=.metadata.creationTimestamp | tail

Atelier DB lost / corrupted

The DB lives at /data/atelier.db in the atelier-core pod. Restore from K10 backup, or if unrecoverable, start fresh:

Terminal window
# Delete the PVC (WARNING: destroys all app metadata, not Gitea repos)
kubectl delete pvc atelier-core-data-longhorn -n atelier
kubectl rollout restart deployment/atelier-core -n atelier

Apps in Gitea will still exist — you can re-import them via the “From Gitea” button.


Getting Help

  1. Check logs first:

    Terminal window
    kubectl logs deployment/atelier-core -n atelier --tail=100
    kubectl logs deployment/atelier-ui -n atelier --tail=50
  2. Check pod events:

    Terminal window
    kubectl get events -n atelier-apps --sort-by=.lastTimestamp | tail -30
  3. Platform health — Settings → System shows node IP, versions, uptime, and backup state.

  4. File an issue at atelier-project/atelier/issues with logs, error messages, and reproduction steps.