The Learning System

How NightShift accumulates knowledge, evolves strategies, and improves with every run, across five interconnected feedback loops.

Why run 50 is different from run 1

Most AI tools are stateless. They start fresh every session, making the same mistakes and rediscovering the same approaches. NightShift has five interlocking feedback loops that persist across runs:

Loop 1: Knowledge Base. Strategies, errors, and domain facts, queried before every action.

Loop 2: Agent Resources. Team patterns with UCB1 scoring; better teams get proposed more often.

Loop 3: Episodic Memory. A per-run record of the strategy used, cost, quality score, and criteria met.

Loop 4: Predictor. A pre-flight risk query against the KB for past failures on similar nodes.

Loop 5: Librarian Consolidation. After each run, a Haiku agent consolidates the KB: merging duplicates, dropping noise, and keeping actionable insights.

Knowledge Base (KB)

The KB is a hybrid vector store backed by LanceDB with ModernBERT embeddings. Hybrid search combines dense vector similarity (semantic) with BM25 (keyword). Typical query latency: 3ms.
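
One common way to combine a dense ranking with a BM25 ranking is reciprocal rank fusion; the sketch below assumes RRF purely for illustration (the docs only say the two signals are combined), with rrf_merge and the constant k as hypothetical names.

# Sketch of hybrid retrieval via reciprocal rank fusion (RRF); the fusion
# method and constant are illustrative, not NightShift's documented
# implementation. Each input is a ranked list of KB entry IDs.
def rrf_merge(vector_rank: list[str], bm25_rank: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (vector_rank, bm25_rank):
        for position, entry_id in enumerate(ranking):
            # An entry scores well overall if either the semantic (dense)
            # or keyword (BM25) ranker places it near the top.
            scores[entry_id] = scores.get(entry_id, 0.0) + 1.0 / (k + position + 1)
    return sorted(scores, key=scores.get, reverse=True)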

Two tiers

The KB has two tiers: a local tier scoped to the current project, and a global tier that holds more abstract, cross-domain knowledge. Every run reads both tiers. Writes go to local by default; important cross-domain insights are elevated to global.

What gets written to KB

Strategies, errors, and domain facts (Loop 1 above), plus Auditor anomaly diagnoses and Investor valuations (see the Auditor + Investor section below).

Reading KB in practice

# KB is queried automatically before every agent action
# You can inspect KB content with:
nightshift kb list                    # show recent entries
nightshift kb search "react hooks"   # search specific topic
nightshift kb stats                  # size, entry count, query latency

Contextual embeddings: NightShift uses Anthropic's contextual embedding technique — each chunk is embedded with surrounding context, not in isolation. This significantly improves retrieval accuracy for longer documents.
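
In sketch form, the technique amounts to encoding each chunk together with document-level context rather than alone; embed_chunk_with_context and doc_summary are illustrative names, and embed stands in for whatever ModernBERT encoding call is in use.

# Sketch of contextual embedding: the chunk is encoded together with
# document-level context rather than in isolation. `embed` is a stand-in
# for the ModernBERT encoder; names here are illustrative.
def embed_chunk_with_context(doc_summary: str, chunk: str, embed) -> list[float]:
    contextualized = f"Document context: {doc_summary}\n\nChunk: {chunk}"
    return embed(contextualized)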

Agent Resources (AR) and UCB1

AR stores team patterns — which agents to use, in what order, with what dependencies. Each pattern has a UCB1 score that balances exploitation (use what worked) with exploration (try what's untested).

UCB1 formula

UCB1 = μᵢ + √(2 ln N / nᵢ)
where:
μᵢ = mean quality score for pattern i
N = total number of pattern selections across all patterns
nᵢ = number of times pattern i has been selected

The second term — √(2 ln N / nᵢ) — is the exploration bonus. A pattern that has never been tried has nᵢ = 0, so its bonus is effectively infinite (in practice, treated as very large), ensuring it gets selected eventually. A pattern used 1000 times has a tiny bonus — it lives or dies on its average quality.
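
Read directly as selection code, the formula looks roughly like this; the pattern fields are illustrative names, not NightShift's actual schema.

# Sketch of UCB1 selection over stored patterns; field names are illustrative.
import math

def ucb1_select(patterns: list[dict]) -> dict:
    total = sum(p["n_selections"] for p in patterns)  # N in the formula
    def ucb1(p: dict) -> float:
        if p["n_selections"] == 0:
            return float("inf")  # untried patterns always get selected eventually
        exploration = math.sqrt(2 * math.log(total) / p["n_selections"])
        return p["mean_quality"] + exploration  # mu_i + exploration bonus
    return max(patterns, key=ucb1)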

Pattern lifecycle

  1. Coordinator asks AR to propose a team for the problem type
  2. AR runs UCB1 across stored patterns → returns top candidate
  3. Coordinator can accept, modify, or override
  4. Modified patterns are saved back to AR as new variants
  5. After the run, AR updates the pattern's quality statistics, which feed its UCB1 score
  6. Successful novel patterns mutate: slight variations are created and added to AR (steps 5 and 6 are sketched below)
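
Steps 5 and 6 reduce to a running-mean update plus an occasional variant. A minimal sketch, assuming the same illustrative fields as above; the specific mutation shown (inserting a verifier agent) is a made-up example.

# Sketch of the post-run update (step 5) and mutation (step 6).
# The mutation shown, inserting a "verifier" agent, is a made-up example.
import random

def record_result(pattern: dict, quality: float) -> None:
    pattern["n_selections"] += 1
    n = pattern["n_selections"]
    # incremental running mean of quality scores
    pattern["mean_quality"] += (quality - pattern["mean_quality"]) / n

def maybe_mutate(pattern: dict, quality: float, threshold: float = 4.0) -> dict | None:
    if quality < threshold:
        return None  # only successful patterns spawn variants
    agents = list(pattern["agents"])
    agents.insert(random.randrange(len(agents) + 1), "verifier")
    return {"agents": agents, "n_selections": 0, "mean_quality": 0.0}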

Per-node performance tracking

AR tracks performance at the individual agent level too:

# Each node in AR has:
perf_runs: 47          # total times this agent type ran
perf_successes: 41     # times it completed without error
perf_avg_quality: 3.8  # average quality score from Evaluator

This data informs the Predictor (see below) and helps the Coordinator decide which agent type to assign for a given task.
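
As a sketch, those counters might feed an assignment decision like this; the fitness formula is invented for illustration, not NightShift's actual policy.

# Sketch: rank candidate agent types by success rate weighted by quality.
# The fitness formula is illustrative, not NightShift's actual policy.
def rank_agent_types(nodes: dict[str, dict]) -> list[str]:
    def fitness(stats: dict) -> float:
        success_rate = stats["perf_successes"] / max(stats["perf_runs"], 1)
        return success_rate * stats["perf_avg_quality"]
    return sorted(nodes, key=lambda name: fitness(nodes[name]), reverse=True)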

Episodic Memory

After each run, NightShift writes an episode record to local storage. This is the highest-level memory — it captures what approach was taken and whether it worked.

# Episode record structure:
{
  "run_id": "2024-01-15T14:32:00",
  "problem_type": "code_fix",
  "pattern_used": "researcher+implementer+verifier",
  "cost_usd": 0.42,
  "quality_score": 4,
  "criteria_met": ["all tests pass", "no regression"],
  "criteria_unmet": [],
  "attempts": 2,
  "investor_signal": "exploit"
}

UCB1 strategy scoring uses episodic memory: when considering whether to retry a pattern, the system looks at past quality scores for that pattern on similar problem types.
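
In sketch form, that lookup is a filter over episode records shaped like the example above; the helper name is illustrative.

# Sketch: average past quality for a pattern on a given problem type,
# using episode records shaped like the example above.
def past_quality(episodes: list[dict], pattern: str, problem_type: str) -> float | None:
    scores = [e["quality_score"] for e in episodes
              if e["pattern_used"] == pattern and e["problem_type"] == problem_type]
    return sum(scores) / len(scores) if scores else None  # None = no history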

Predictor

Before each node runs, the Predictor queries KB for past failures on similar nodes. This is a pre-flight risk assessment.

If KB returns a strong match (e.g., "the implementer agent consistently fails on files over 500 lines — it produces truncated output"), the Predictor flags this risk in the Coordinator's context. The Coordinator may then adjust its plan, for example by modifying the proposed team, assigning a different agent type, or restructuring the task to avoid the known failure mode.

Key insight: The Predictor turns historical failure data into forward-looking risk signals. This is how the system gets more reliable over time, not just faster.
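
A minimal sketch of the pre-flight check, where kb_search stands in for the KB query interface and the score threshold is invented for illustration:

# Sketch of the Predictor's pre-flight risk check. `kb_search` stands in
# for the KB query interface; the threshold is illustrative.
def preflight_risks(node_description: str, kb_search, min_score: float = 0.8) -> list[str]:
    hits = kb_search(f"past failures: {node_description}")
    # Only strong matches are surfaced to the Coordinator as risk flags.
    return [hit["text"] for hit in hits if hit["score"] >= min_score]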

Librarian Consolidation

Raw KB writes accumulate duplicates and noise. After each run, the Librarian (a cheap Haiku agent) runs consolidation:

  1. Queries KB for semantically similar entries (cosine similarity > 0.92)
  2. Merges duplicate entries, keeping the most recent and specific
  3. Drops low-quality entries (vague, unhelpful, or superseded)
  4. Promotes entries that have been confirmed useful across multiple runs

Without consolidation, KB would grow unboundedly and retrieval quality would degrade. The Librarian keeps the KB dense and actionable — signal, not noise.
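
Step 2 can be sketched as a newest-first sweep that keeps an entry only if nothing already kept is within the similarity threshold; the entry fields are illustrative.

# Sketch of duplicate merging (step 2): iterate newest-first, keeping an
# entry only if no already-kept entry is a near-duplicate. Field names
# are illustrative.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_duplicates(entries: list[dict], threshold: float = 0.92) -> list[dict]:
    kept: list[dict] = []
    for entry in sorted(entries, key=lambda e: e["timestamp"], reverse=True):
        if all(cosine(entry["vector"], k["vector"]) < threshold for k in kept):
            kept.append(entry)
    return kept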

Auditor + Investor Learning Loop

The Auditor and Investor are themselves part of the learning system:

Auditor

Diagnoses anomalies after each attempt using a cheap Haiku call. The diagnosis text is written to KB, so future Auditors (on future runs) can recognize similar anomaly patterns.

Cross-attempt patterns (same error type recurring) are flagged and written with higher priority, making them more likely to surface in future KB queries.

Investor

Valuations are written to KB. Future Investors read past valuations to calibrate their own risk assessment. A pattern like "on problems with no KB entries, early explore signals tend to unlock better approaches" gets captured and reused.

This creates a meta-learning effect: the exploration strategy itself improves over time.

What to expect across runs

Here's the typical trajectory for a recurring problem type:

Run 1–3 — Discovery

KB has no entries for this problem type. AR has no performance data. Investor pushes explore. The system tries different patterns, makes mistakes, writes everything to KB. Cost is higher, quality is lower.

Run 4–10 — Calibration

KB begins to return relevant results. AR has initial UCB scores. Predictor starts flagging known failure modes before they happen. Quality improves, cost drops as fewer dead ends are explored.

Run 11–50 — Compounding

KB is dense with validated knowledge. Best patterns have high UCB scores and get selected reliably. Predictor prevents known failure modes. Librarian keeps KB clean. Investor can confidently push exploit. Cost drops by 40–70%, quality stabilizes at 4–5.

Stale KB: If your codebase changes significantly (major refactor, new framework), existing KB entries may become misleading. You can flush local KB with nightshift kb flush or selectively remove entries with nightshift kb remove <id>. Global KB is less affected since it stores more abstract knowledge.