The Learning System
How NightShift accumulates knowledge, evolves strategies, and improves with every run — across five interconnected feedback loops.
Why run 50 is different from run 1
Most AI tools are stateless. They start fresh every session, making the same mistakes and rediscovering the same approaches. NightShift has five interlocking feedback loops that persist across runs:
- Knowledge Base (KB) — strategies, errors, domain facts. Queried before every action.
- Agent Resources (AR) — team patterns with UCB1 scoring. Better teams get proposed more.
- Episodic Memory — per-run record: strategy used, cost, quality score, criteria met.
- Predictor — pre-flight risk query: checks KB for past failures on similar nodes.
- Librarian — after each run, a Haiku agent consolidates the KB: merges duplicates, drops noise, keeps actionable insights.
Knowledge Base (KB)
The KB is a hybrid vector store backed by LanceDB with ModernBERT embeddings. Hybrid search combines dense vector similarity (semantic) with BM25 (keyword). Typical query latency: 3ms.
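The dense-plus-BM25 combination can be sketched as a weighted sum over normalized scores. This is an illustrative model, not NightShift's actual ranking code — the `alpha` weight and min-max normalization of BM25 are assumptions:

```python
def hybrid_score(dense_sim: float, bm25: float, max_bm25: float,
                 alpha: float = 0.5) -> float:
    """Blend a dense cosine similarity (already in [0, 1]) with a
    min-max-normalized BM25 keyword score. alpha weights the dense side."""
    keyword = bm25 / max_bm25 if max_bm25 > 0 else 0.0
    return alpha * dense_sim + (1 - alpha) * keyword

def rank(entries):
    """entries: list of (entry_id, dense_sim, bm25). Returns ids best-first."""
    max_bm25 = max((b for _, _, b in entries), default=0.0)
    scored = [(hybrid_score(d, b, max_bm25), eid) for eid, d, b in entries]
    return [eid for _, eid in sorted(scored, reverse=True)]
```

An entry with a mediocre semantic match but a strong exact-keyword hit can still outrank a purely semantic neighbor — which is the point of hybrid search.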
Two tiers
- Global — `~/.nightshift/kb/` — cross-project knowledge. Strategies that worked on any project you've run.
- Local — `.nightshift/kb/` (in your project directory) — project-specific facts, domain knowledge, file-specific errors.
Every run reads both tiers. Writes go to local by default; important cross-domain insights are elevated to global.
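The read-both / write-local / elevate-on-demand behavior can be modeled in a few lines. This is a hypothetical in-memory sketch (the class and method names are illustrative, not NightShift's API):

```python
class TwoTierKB:
    """Toy model of the two KB tiers: reads consult both, writes default
    to local, and flagged cross-domain insights are elevated to global."""

    def __init__(self):
        self.global_tier = {}   # ~/.nightshift/kb/  (cross-project)
        self.local_tier = {}    # .nightshift/kb/    (project-specific)

    def write(self, key, entry, elevate=False):
        self.local_tier[key] = entry      # writes go to local by default
        if elevate:                       # important cross-domain insight
            self.global_tier[key] = entry

    def read(self, key):
        # Both tiers are consulted; the project-specific entry wins
        # on a key collision because it is more specific.
        if key in self.local_tier:
            return self.local_tier[key]
        return self.global_tier.get(key)
```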
What gets written to KB
- Skill entries — successful agent approaches: "for React hooks, always check cleanup functions first"
- Evaluator reviews — what quality issues were found and why; what the Evaluator recommended
- Node failures — which agent type failed, what error, what context led to failure
- Investor valuations — exploration vs. exploitation assessments and why
- Auditor diagnoses — anomaly patterns detected across attempts
- Domain facts — information discovered while solving (API behavior, file structure, dependencies)
Reading KB in practice
```
# KB is queried automatically before every agent action
# You can inspect KB content with:
nightshift kb list                  # show recent entries
nightshift kb search "react hooks"  # search specific topic
nightshift kb stats                 # size, entry count, query latency
```
Agent Resources (AR) and UCB1
AR stores team patterns — which agents to use, in what order, with what dependencies. Each pattern has a UCB1 score that balances exploitation (use what worked) with exploration (try what's untested).
UCB1 formula
UCB1ᵢ = μᵢ + √(2 ln N / nᵢ)

- μᵢ — mean quality score for pattern i
- N — total number of pattern selections across all patterns
- nᵢ — number of times pattern i has been selected

The second term — √(2 ln N / nᵢ) — is the exploration bonus. A pattern that has never been tried (nᵢ = 0) gets an unbounded bonus (in practice, a very large value), ensuring it gets selected eventually. A pattern used 1000 times has a tiny bonus — it lives or dies on its average quality.
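The formula translates directly into code. A minimal sketch of UCB1 selection over stored patterns (the data layout — `{name: (mean_quality, n_selected)}` — is an assumption for illustration):

```python
import math

def ucb1(mean_quality: float, total_selections: int,
         pattern_selections: int) -> float:
    """UCB1 score: exploitation term plus exploration bonus."""
    if pattern_selections == 0:
        return float("inf")  # never-tried patterns are selected first
    bonus = math.sqrt(2 * math.log(total_selections) / pattern_selections)
    return mean_quality + bonus

def pick_pattern(patterns):
    """patterns: {name: (mean_quality, n_selected)}. Highest UCB1 wins."""
    total = sum(n for _, n in patterns.values())
    return max(patterns, key=lambda p: ucb1(patterns[p][0], total, patterns[p][1]))
```

Note how a pattern with a lower mean but only two trials can outscore a well-established pattern: its exploration bonus is still large.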
Pattern lifecycle
- Coordinator asks AR to propose a team for the problem type
- AR runs UCB1 across stored patterns → returns top candidate
- Coordinator can accept, modify, or override
- Modified patterns are saved back to AR as new variants
- After the run, AR updates the pattern's UCB score with the quality result
- Successful novel patterns mutate (slight variations are created and added to AR)
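Steps 5 and 6 of the lifecycle — updating a pattern's score and mutating successful teams — can be sketched as follows. The incremental-mean update is standard; the swap-one-agent mutation operator is an assumption for illustration, not NightShift's documented operator:

```python
import random

def update_pattern(pattern, quality):
    """Step 5: fold a run's quality score into the pattern's running mean."""
    pattern["n"] += 1
    pattern["mean_quality"] += (quality - pattern["mean_quality"]) / pattern["n"]

def mutate(team, agent_pool, rng=random):
    """Step 6 (sketch): create a slight variation of a successful team by
    swapping one agent for a different one from the pool."""
    variant = list(team)
    i = rng.randrange(len(variant))
    alternatives = [a for a in agent_pool if a != variant[i]]
    if alternatives:
        variant[i] = rng.choice(alternatives)
    return variant
```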
Per-node performance tracking
AR tracks performance at the individual agent level too:
```
# Each node in AR has:
perf_runs: 47          # total times this agent type ran
perf_successes: 41     # times it completed without error
perf_avg_quality: 3.8  # average quality score from Evaluator
```
This data informs the Predictor (see below) and helps the Coordinator decide which agent type to assign for a given task.
Episodic Memory
After each run, NightShift writes an episode record to local storage. This is the highest-level memory — it captures what approach was taken and whether it worked.
```
# Episode record structure:
{
  "run_id": "2024-01-15T14:32:00",
  "problem_type": "code_fix",
  "pattern_used": "researcher+implementer+verifier",
  "cost_usd": 0.42,
  "quality_score": 4,
  "criteria_met": ["all tests pass", "no regression"],
  "criteria_unmet": [],
  "attempts": 2,
  "investor_signal": "exploit"
}
```
UCB1 strategy scoring uses episodic memory: when considering whether to retry a pattern, the system looks at past quality scores for that pattern on similar problem types.
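A retry decision over episode records might look like the sketch below. The field names match the episode structure above; the `min_quality` threshold is an illustrative assumption:

```python
def should_retry(episodes, pattern, problem_type, min_quality=3.5):
    """Decide whether a pattern is worth retrying on this problem type,
    based on past episode records (hypothetical helper, not NightShift's API)."""
    scores = [e["quality_score"] for e in episodes
              if e["pattern_used"] == pattern
              and e["problem_type"] == problem_type]
    if not scores:
        return True  # no history on this problem type — nothing argues against it
    return sum(scores) / len(scores) >= min_quality
```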
Predictor
Before each node runs, the Predictor queries KB for past failures on similar nodes. This is a pre-flight risk assessment.
If KB returns a strong match (e.g., "the implementer agent consistently fails on files over 500 lines — it produces truncated output"), the Predictor flags this risk in the Coordinator's context. The Coordinator may then:
- Split the task into smaller sub-problems
- Use a different agent archetype
- Add an explicit validation step after the risky node
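The pre-flight check amounts to filtering KB hits by similarity and surfacing the strong ones as risks. A minimal sketch, assuming `kb_search` returns `(similarity, failure_text)` pairs and the 0.8 threshold is illustrative:

```python
def preflight_risks(kb_search, node, threshold=0.8):
    """Query KB for past failures resembling this node; return the
    failure descriptions that match strongly enough to flag."""
    query = f"failure {node['agent_type']} {node['task']}"
    return [failure for sim, failure in kb_search(query) if sim >= threshold]
```

The Coordinator then receives these strings in its context and can choose one of the mitigations listed above.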
Librarian Consolidation
Raw KB writes accumulate duplicates and noise. After each run, the Librarian (a cheap Haiku agent) runs consolidation:
- Queries KB for semantically similar entries (cosine similarity > 0.92)
- Merges duplicate entries, keeping the most recent and specific
- Drops low-quality entries (vague, unhelpful, or superseded)
- Promotes entries that have been confirmed useful across multiple runs
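The deduplication step can be sketched as a greedy pass over entries ordered newest-first: keep an entry only if no already-kept (newer) entry is a near-duplicate above the 0.92 cosine threshold. The data layout is an assumption for illustration:

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def consolidate(entries, threshold=0.92):
    """entries: newest-first list of (embedding, text). Drop entries that
    are near-duplicates (cosine > threshold) of a newer kept entry."""
    kept = []
    for emb, text in entries:
        if all(cosine(emb, k_emb) <= threshold for k_emb, _ in kept):
            kept.append((emb, text))
    return kept
```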
Without consolidation, KB would grow without bound and retrieval quality would degrade. The Librarian keeps the KB dense and actionable — signal, not noise.
Auditor + Investor Learning Loop
The Auditor and Investor are themselves part of the learning system:
- Auditor — diagnoses anomalies after each attempt using a cheap Haiku call. The diagnosis text is written to KB, so future Auditors (on future runs) can recognize similar anomaly patterns. Cross-attempt patterns (the same error type recurring) are flagged and written with higher priority, making them more likely to surface in future KB queries.
- Investor — valuations are written to KB. Future Investors read past valuations to calibrate their own risk assessment. A pattern like "on problems with no KB entries, early explore signals tend to unlock better approaches" gets captured and reused.
This creates a meta-learning effect: the exploration strategy itself improves over time.
What to expect across runs
Here's the typical trajectory for a recurring problem type:
- Early runs — KB has no entries for this problem type. AR has no performance data. Investor pushes explore. The system tries different patterns, makes mistakes, writes everything to KB. Cost is higher, quality is lower.
- Middle runs — KB begins to return relevant results. AR has initial UCB scores. Predictor starts flagging known failure modes before they happen. Quality improves, cost drops as fewer dead ends are explored.
- Later runs — KB is dense with validated knowledge. Best patterns have high UCB scores and get selected reliably. Predictor prevents known failure modes. Librarian keeps KB clean. Investor can confidently push exploit. Cost drops by 40–70%, quality stabilizes at 4–5.
If local KB accumulates stale entries (for example, after a major refactor), clear it with `nightshift kb flush` or selectively remove entries with `nightshift kb remove <id>`. Global KB is less affected since it stores more abstract knowledge.