Troubleshooting Guide
Solutions to common problems when installing, running, or building with Atelier.
Contents
- Browser Access
- Installation
- Networking
- Authentication & Password Recovery
- App Build & Deploy Issues
- LLM & Nova Issues
- WebSocket & Long-Lived Connections
- Registry & Images
- Kubernetes & Storage
Browser Access
Can’t reach atelier.home.arpa from my machine
Your machine’s /etc/hosts file must have an entry for the portal domain. On macOS/Linux:
echo "192.168.0.x atelier.home.arpa gitea.atelier.home.arpa" | sudo tee -a /etc/hostsOn Windows, edit C:\Windows\System32\drivers\etc\hosts as administrator.
Replace 192.168.0.x with your server’s IP.
.local domain broken on macOS/Linux
macOS and Linux (with Avahi) reserve .local for mDNS, which can cause unpredictable DNS resolution failures. The default domain has been changed to atelier.home.arpa — if your install is older, patch the ingress hosts:
```bash
kubectl patch ingress atelier-ui -n atelier --type json \
  -p '[{"op":"replace","path":"/spec/rules/0/host","value":"atelier.home.arpa"}]'
```

On Windows, .local works fine. You can install with --domain atelier.local if preferred.
Blank white screen / app won’t load
Clear browser local storage for the site and log in again:
- Chrome DevTools → Application → Local Storage → delete entries for your portal domain
- Safari → Preferences → Privacy → Manage Website Data → remove
Or use an incognito/private window to rule out stale state.
404 when opening a generated app
Check the app’s ingress and middlewares:
```bash
# Verify ingress host and backend
kubectl get ingress <app-name> -n atelier-apps -o yaml

# Verify middlewares exist
kubectl get middleware -n atelier-apps | grep <app-name>
```

Expected: one trailing-slash-<app-name> middleware and one strip-<app-name> middleware, and the ingress annotation must reference both with trailing-slash before strip.
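If both middlewares exist but the 404 persists, it is usually the ordering in the ingress annotation. A quick way to eyeball it, assuming the app uses Traefik's standard traefik.ingress.kubernetes.io/router.middlewares annotation:

```bash
# The router.middlewares value should list trailing-slash-<app-name> before strip-<app-name>
kubectl get ingress <app-name> -n atelier-apps -o yaml | grep 'router.middlewares'
```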
Copy button fails with “Failed to copy to clipboard”
Fixed in newer builds (PR #314). The Clipboard API requires a secure context (HTTPS or localhost), which plain HTTP doesn’t provide; the current build falls back to document.execCommand('copy') via a hidden textarea.
Installation
Disk space requirements
- Base install: ~30 GB (K3s, registry, containerd images, PVCs for atelier-core and Gitea)
- With Kasten K10 backup: add ~70 GB for snapshot storage
- Each generated app: ~1-3 GB depending on images
Check with df -h /var/lib/rancher/k3s and df -h /var/lib/longhorn before installing.
K3s kubeconfig permissions reset after reboot
K3s resets /etc/rancher/k3s/k3s.yaml to 0600 on every restart, breaking non-root kubectl access. The installer fixes this but it needs re-running after K3s restarts:
```bash
sudo chmod 644 /etc/rancher/k3s/k3s.yaml
```

Or add to the K3s systemd override:

```bash
sudo systemctl edit k3s
# Add:
# [Service]
# ExecStartPost=/bin/chmod 644 /etc/rancher/k3s/k3s.yaml
```

Longhorn not ready — PVCs stay Pending
Apps with PVCs (backend /data mount, Gitea, registry) can’t start until Longhorn is healthy:
```bash
# Check Longhorn status
kubectl get pods -n longhorn-system

# All longhorn-manager, longhorn-driver-deployer, csi-* pods must be Running
kubectl describe pvc <pvc-name> -n <namespace>
```

Common causes: insufficient disk space, kernel not supporting iSCSI, nodes not labelled. See the Longhorn docs.
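A quick host-side sanity check for the usual suspects — a sketch; the iSCSI package name varies by distro (open-iscsi on Debian/Ubuntu, iscsi-initiator-utils on RHEL-family):

```bash
# iSCSI daemon must be installed and running for Longhorn volumes to attach
sudo systemctl status iscsid --no-pager

# Kernel iSCSI modules loaded?
lsmod | grep -i iscsi

# Enough space on the Longhorn data path?
df -h /var/lib/longhorn
```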
MCP images missing — ImagePullBackOff
From v0.9.8-beta onwards, mcp-fetch is pulled directly from ghcr.io/atelier-project/atelier-mcp-fetch:<version> (matching the running atelier-core’s build version) — no in-cluster registry indirection. If you see ImagePullBackOff on a fresh mcp-fetch deployment, check that the cluster node can reach ghcr.io over HTTPS.
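A rough reachability check from the node — a 401/403 response from ghcr.io is fine (it proves HTTPS connectivity); only timeouts or DNS errors indicate a real problem:

```bash
# Expect an HTTP status line (typically 401 Unauthorized)
curl -sSI https://ghcr.io/v2/ | head -1

# If that fails, check DNS resolution on the node
getent hosts ghcr.io
```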
Custom MCP servers brought in via the Settings → MCP Servers → Docker image flow still need a manual import + push (the local-registry path is preserved for user-supplied images):
```bash
sudo k3s ctr -n k8s.io images tag \
  registry.atelier.local/mcp-my-server:latest \
  localhost:32000/mcp-my-server:latest

sudo k3s ctr -n k8s.io images push --plain-http \
  localhost:32000/mcp-my-server:latest
```

For pre-v0.9.8-beta clusters whose existing deployment still references registry.atelier.local/mcp-fetch:latest, the tag/push commands above (with mcp-fetch substituted for mcp-my-server) remain valid for that deployment. To migrate to GHCR after upgrading core/ui, delete the deployment via Settings → MCP Servers so the catalog bootstrap recreates it with the GHCR ref on the next core start.
Installer missing registries.yaml — apps fail with ImagePullBackOff
The installer creates /etc/rancher/k3s/registries.yaml which tells K3s to mirror registry.atelier.local to localhost:32000. If missing:
```bash
sudo tee /etc/rancher/k3s/registries.yaml <<EOF
mirrors:
  "registry.atelier.local":
    endpoint:
      - "http://localhost:32000"
EOF
sudo systemctl restart k3s
```

Networking
Required ports
| Port | Service | Purpose |
|---|---|---|
| 80 | Traefik (web entrypoint) | Portal, app ingresses |
| 32000 | Registry (NodePort) | Docker image push/pull |
| 32001 | Gitea (NodePort) | Git push/pull, UI |
| 30302 | Gitea (NodePort) | SSH git access (optional) |
| 6443 | K3s API | kubectl access |
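To confirm these ports are actually reachable from a client machine, a quick sweep with netcat (assumes an nc that supports -z; replace 192.168.0.x with your server's IP):

```bash
for port in 80 6443 32000 32001 30302; do
  nc -zv -w 3 192.168.0.x $port
done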
Outbound connections required
- LLM provider API — api.anthropic.com, api.openai.com, or your Ollama host
- Docker Hub / ghcr.io — Trivy image, Kasten K10, base images (alpine, node, python)
- Longhorn Helm chart — charts.longhorn.io
- Gitea image — gitea/gitea:1.22
No inbound internet required — everything serves over LAN.
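If you suspect outbound egress is blocked, a rough connectivity check from the server — drop any hosts you don't use (for example the hosted LLM APIs when running Ollama locally):

```bash
# Any HTTP status (200/301/401/...) means the host is reachable; 000 means it is not
for host in api.anthropic.com api.openai.com ghcr.io charts.longhorn.io registry-1.docker.io; do
  printf '%-26s ' "$host"
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 "https://$host" 2>/dev/null
done
```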
WebSocket connections fail to upgrade
Traefik generally upgrades WebSocket connections automatically. If your app uses nginx (or another reverse proxy) in front of the WS backend inside the container, ensure that proxy includes these headers:
```nginx
location /ws {
  proxy_pass http://backend:3001/ws;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "Upgrade";
  proxy_read_timeout 3600s;
  proxy_send_timeout 3600s;
}
```

Also ensure the frontend uses window.location.pathname to derive the WS URL — not import.meta.env.BASE_URL. See the WebSocket section below.
Authentication & Password Recovery
Locked out of the portal
Reset your password from the server — no UI access needed:
```bash
# Interactive (reads from stdin, doesn't leak to shell history)
kubectl exec -it -n atelier deployment/atelier-core -- \
  atelier-core reset-password --user <username>
```

Requires kubectl access (which implies cluster admin). The new password must meet the standard policy: 10+ characters, uppercase, lowercase, digit.
Forgot Gitea admin password
```bash
kubectl exec -n atelier deployment/gitea -- su git -c \
  "gitea admin user change-password --username atelier-admin --password <new-password>"
```

Forgot K10 dashboard password
```bash
# Generate a new bcrypt htpasswd
NEW_PASSWORD='new-password-here'
HASHED=$(htpasswd -nbB atelier "$NEW_PASSWORD" | base64 -w 0)

# Patch the secret
kubectl patch secret k10-basic-auth -n kasten-io \
  -p "{\"data\":{\"auth\":\"$HASHED\"}}"

# Restart the auth service
kubectl rollout restart deployment/auth-svc -n kasten-io
```

JWT token expired in MCP config
Atelier JWTs expire after 24 hours. Regenerate by logging in again:
```bash
curl -s http://atelier.home.arpa/api/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"<name>","password":"<pass>"}' | jq -r .token
```

Or copy it from your browser’s DevTools → Application → Local Storage → atelier_token.
App Build & Deploy Issues
Rebuild doesn’t create new pods
Fixed in recent builds. If you’re on an old version, force a rollout:
```bash
kubectl rollout restart deployment/<app-name>-backend \
  deployment/<app-name>-frontend -n atelier-apps
```

The newer pipeline injects an atelier.io/build-tag annotation into the pod template, which forces K8s to create a new ReplicaSet on every build.
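To check whether your build pipeline is already injecting the annotation (deployment names follow the <app-name>-backend / <app-name>-frontend convention used above):

```bash
# Empty output means the annotation is missing and you are on the older pipeline
kubectl get deployment <app-name>-backend -n atelier-apps -o yaml | grep 'atelier.io/build-tag'
```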
App in error state even though pods are healthy
Use the Reset Status button in the app detail header (appears when status is error). It checks actual pod health in K8s and resets to running or paused as appropriate.
If the UI button isn’t available, reset via DB:
```bash
kubectl exec -n atelier deployment/atelier-core -- sqlite3 /data/atelier.db \
  "UPDATE apps SET status='running', last_error=NULL WHERE name='<app-name>'"
```

ErrImagePull on app pods
Common causes:
- registries.yaml missing — see Installation
- Image not actually built — check kubectl logs -n atelier job/<build-job> for build errors
- Image pushed to wrong registry — verify with curl http://localhost:32000/v2/_catalog
PVC stays Pending
Longhorn must be ready. See Longhorn not ready.
Rollout timeout
```bash
# Get the actual error
kubectl describe pod -n atelier-apps -l atelier.io/app=<app-name>

# Check recent events
kubectl get events -n atelier-apps --sort-by=.lastTimestamp | tail -20
```

Common causes: image pull failures, PVC pending, failing readiness probe, container crash.
Service selector mismatch — pods running but no endpoints / 502
After an LLM rebuild, pod labels may not match the service selector:
```bash
# Compare
kubectl get svc <name>-frontend -n atelier-apps -o jsonpath='{.spec.selector}'
kubectl get pods -n atelier-apps -l app=<name>-frontend

# Patch selector to match actual pod labels
kubectl patch service/<name>-frontend -n atelier-apps \
  -p '{"spec":{"selector":{"app":"<actual-label>"}}}'
```

LLM & Nova Issues
Planner or Nova silently does nothing
Symptom: You describe an app and the plan never arrives, or Nova accepts your question but never responds. No error banner appears in the UI.
Diagnose: Tail atelier-core logs while you trigger another request:
```bash
kubectl -n atelier logs -f deploy/atelier-core --tail=0 2>&1 | grep -iE 'llm|noop|role'
```

Look for a WARN line that starts with No usable LLM provider could be constructed for this role. The message names:
- The role that tried to resolve (build, plan, nova, supervisor, code_review)
- The underlying error that made the final fallback fail (missing API key, unparseable base URL, unreachable host, etc.)
- A concrete fix hint listing the env vars or profile fields to set
If no WARN appears but the request still hangs, the problem is downstream (provider is returning a 4xx/5xx, or a local endpoint is timing out). Check the profile’s Base URL is correct — for OpenAI-compatible endpoints like Ollama or LM Studio it must include /v1.
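A quick way to confirm the Base URL is right for an OpenAI-compatible endpoint: the /v1/models route should return a JSON model list. Default ports assumed here (11434 for Ollama, 1234 for LM Studio):

```bash
# Ollama
curl -s http://<ollama-host>:11434/v1/models | jq .

# LM Studio
curl -s http://<lmstudio-host>:1234/v1/models | jq .
```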
Assigned role ignored / using wrong profile
Symptom: You assigned a profile to Nova (for example) but the actual requests go to a different provider.
Check: Role assignments are stored in the settings table under keys like llm_role.nova. Verify via the API (any logged-in token works):
curl -sS -H "Authorization: Bearer $TOKEN" http://atelier.home.arpa/api/settings | jq .llm_rolesThe value for each role is the profile id. If it’s missing, you haven’t assigned a profile for that role — there is no implicit fallback to any Default profile (the migrated Default is just a regular entry in the Profiles panel). In that case runtime resolution falls back to the legacy flat llm.* settings, then to the startup env-var configuration. If it’s present and points at a profile, the role uses that profile on the next LLM call (no pod restart needed).
LM Studio / Ollama / llama.cpp returns 400 Bad Request with context-length error
Symptom: Local OpenAI-compatible backend returns something like:
```
The number of tokens to keep from the initial prompt is greater than the context length
(n_keep: 4348 >= n_ctx: 4096). Try to load the model with a larger context length.
```

Cause: Atelier’s system prompts (especially Nova’s, which embeds the MCP tool catalogue) exceed the model’s configured context window.
Fix: Reload the model in your local runner with a larger context window (n_ctx). Starting point: 8192, bump to 16384 if you still overflow as conversation history grows. For Nova specifically, you can also trim the system prompt by disabling MCP servers Nova doesn’t need (Settings → MCP Servers → uncheck the Chat / Pipeline checkbox on unused tools).
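How you raise the context window depends on the runner. Two hedged examples — a llama.cpp server flag and an Ollama Modelfile override (model and file names here are placeholders):

```bash
# llama.cpp: start the server with an 8k context
llama-server -m ./model.gguf -c 8192

# Ollama: bake a larger num_ctx into a model variant
ollama show --modelfile <model> > Modelfile
echo "PARAMETER num_ctx 8192" >> Modelfile
ollama create <model>-8k -f Modelfile
```

For LM Studio, set the context length in the model load dialog instead.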
Role-profile mapping out of date after schema changes
Symptom: After an upgrade, some roles have no profile assigned and fall back to env-var config.
Cause: Role assignments are persisted — they survive upgrades — but new roles added by the platform don’t auto-populate for existing installs.
Fix: Open Settings → LLM Profiles → Role Assignments, pick a profile for each unset role. Or create any new profile to trigger the auto-assign-on-create path, which fills every unassigned role at once.
WebSocket & Long-Lived Connections
LLM-generated apps with WebSockets hit two common bugs:
1. Broken WS URL (ws://host./ws)
Cause: LLM used import.meta.env.BASE_URL which is ./ at runtime, concatenated with the host.
Fix: Use window.location.pathname:
```js
const loc = window.location;
const proto = loc.protocol === 'https:' ? 'wss:' : 'ws:';
const base = loc.pathname.replace(/\/$/, '');
const wsUrl = `${proto}//${loc.host}${base}/ws`;
```

Newer LLM prompts include this pattern — if you still hit it, push the fix directly to Gitea and re-deploy.
2. WebSocket closes immediately after connecting
Cause: React useEffect re-fires when connection state changes (e.g. onConnected flips connecting: false), triggering cleanup that closes the WS.
Fix: Use a ref for volatile connection state, not a useEffect dep:
```js
const connectingRef = useRef(false);
connectingRef.current = tab.connecting;

useEffect(() => {
  if (!connectingRef.current) return;
  const ws = new WebSocket(url);
  // ...
  return () => {
    if (wsRef.current === ws) {
      ws.close();
      wsRef.current = null;
    }
  };
}, [tab.config]); // NOT tab.connecting
```

Registry & Images
Can’t push to registry from Mac
```
dial tcp: lookup registry.atelier.local: no such host
```

Add to the Mac’s /etc/hosts:

```
192.168.0.x registry.atelier.local
```

The registry stays on .local intentionally — it works fine via /etc/hosts and doesn’t have mDNS issues in this path.
Also configure Docker Desktop to treat registry.atelier.local:32000 as an insecure registry (Settings → Docker Engine → add to insecure-registries).
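After adding it and restarting Docker Desktop, you can confirm the setting took effect:

```bash
# The registry should appear under "Insecure Registries"
docker info | grep -A3 'Insecure Registries'
```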
Registry image delete fails with MANIFEST_UNKNOWN
Fixed in PR #178. If you’re on an old build, you can delete directly:
```bash
# Find all manifest types and try each
for accept in \
  "application/vnd.oci.image.index.v1+json" \
  "application/vnd.docker.distribution.manifest.v2+json" \
  "application/vnd.oci.image.manifest.v1+json"; do
  DIGEST=$(curl -sI -H "Accept: $accept" \
    "http://registry.atelier.local:32000/v2/<repo>/manifests/<tag>" \
    | grep -i docker-content-digest | awk '{print $2}' | tr -d '\r')
  [ -n "$DIGEST" ] && curl -X DELETE \
    "http://registry.atelier.local:32000/v2/<repo>/manifests/$DIGEST" && break
done
```

Registry growing too large
Run garbage collection from Settings → Registry in the UI, or:
```bash
kubectl exec -n atelier deployment/registry -- registry garbage-collect \
  --delete-untagged /etc/docker/registry/config.yml
```

Kubernetes & Storage
Check cluster health at a glance
```bash
kubectl get nodes
kubectl get pods -A --field-selector='status.phase!=Running'
kubectl top nodes   # requires metrics-server
```

Stale resources left behind after topology change
Known issue (#222). When an app’s build produces a different set of K8s resources, old ones aren’t auto-deleted. Manually:
```bash
# List all resources for an app
kubectl get all -n atelier-apps -l atelier.io/app=<app-name>

# Delete the obsolete ones
kubectl delete <kind>/<name> -n atelier-apps
```

The supervisor will detect orphaned resources (0 endpoints, selector mismatch) and surface them as warnings.
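To spot these yourself before the supervisor flags them, list services that have no endpoints:

```bash
# Services showing <none> have no matching pods — likely orphaned or selector-mismatched
kubectl get endpoints -n atelier-apps
```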
Longhorn volume stuck Attached but unused
```bash
# Force detach
kubectl patch volume.longhorn.io <pv-name> -n longhorn-system \
  --type=merge -p '{"spec":{"nodeID":""}}'
```

K10 backups not running
```bash
# Check K10 pods
kubectl get pods -n kasten-io

# Check policy status
kubectl get policies.config.kio.kasten.io -n kasten-io
kubectl describe policy <policy-name> -n kasten-io

# Check recent runs
kubectl get runactions.actions.kio.kasten.io -n kasten-io --sort-by=.metadata.creationTimestamp | tail
```

Atelier DB lost / corrupted
The DB lives at /data/atelier.db in the atelier-core pod. Restore from K10 backup, or if unrecoverable, start fresh:
```bash
# Delete the PVC (WARNING: destroys all app metadata, not Gitea repos)
kubectl delete pvc atelier-core-data-longhorn -n atelier
kubectl rollout restart deployment/atelier-core -n atelier
```

Apps in Gitea will still exist — you can re-import them via the “From Gitea” button.
Getting Help
- Check logs first:

  ```bash
  kubectl logs deployment/atelier-core -n atelier --tail=100
  kubectl logs deployment/atelier-ui -n atelier --tail=50
  ```

- Check pod events:

  ```bash
  kubectl get events -n atelier-apps --sort-by=.lastTimestamp | tail -30
  ```

- Platform health — Settings → System shows node IP, versions, uptime, and backup state.

- File an issue at atelier-project/atelier/issues with logs, error messages, and reproduction steps.