v0.42 · Multi-agent runtime

Spec-first multi-agent orchestration for serious AI-assisted development.

Dark Factory transforms rough feature ideas into structured implementation artifacts, review workflows, and reproducible engineering outputs, the way infrastructure should work.

Pipelines: Multi-agent, 5-stage
Models: Local & cloud
Review: Human-in-the-loop
Publish: GitHub, Slack
run-mnuhgcmz-7nm9 · slack-visibility-panel LIVE
STAGE · 01 · intake · PM · 12.4s · 18,420 tok
STAGE · 02 · architecture · ARCH · 18.2s · 24,108 tok
STAGE · 03 · spec · PM, BE, FE · running…
STAGE · 04 · qa · QA, SEC · pending
STAGE · 05 · planning · PM · pending
REVISE
HUMAN REVIEW · Awaiting reviewer · verdict: pending · feedback channel open · UNREVIEWED
PUBLISH · GitHub · Slack · artifacts × 6 · ready on accept
01 · The cost of unstructured AI work

AI-assisted development is fast.
Without structure, it isn't trustworthy.

Most teams have already wired a chat model into their IDE. What they haven't built is the orchestration layer above it. Without one, the same problems compound across every feature.

01

Specification drift

Decisions live in chat threads. Two weeks later, no one remembers which version of "the plan" actually shipped.

02

Context loss

Every prompt re-explains the problem. Architectural constraints, ADRs, and prior trade-offs are continually re-discovered.

03

Weak reproducibility

The same request, an hour later, returns a different design. There is no run ID, no seed, no trace of what produced what.

04

Reviewability gap

Reviewers see code, not reasoning. There is no diff at the spec layer, no verdict beyond LGTM, no way to compare alternatives.

02 · The orchestration layer

A factory floor for engineering artifacts.

Dark Factory treats every feature as a pipeline run. Inputs are structured. Outputs are versioned. Agents are specialised. Reviewers get a first-class surface, not a comment thread.

  • A.

    Structured specs by construction

    Every run produces SPEC.md, ARCH.md, QA.md, TASKS.md, tasks.json, and RUN_REPORT.md. The set is fixed; deviations are flagged as quality failures, not silently skipped.

  • B.

    Specialised agents, not one generalist

    PM, ARCH, BE, FE, QA, SEC, UXW. Each has explicit responsibilities, non-goals, and synthesis variants. Configurable per template.

  • C.

    Cycles, not single shots

    Iterative quality modes critique their own output and revise. Each cycle is preserved with its feedback, score, and verdict.

  • D.

    Reviewable, not just runnable

    Human review is a first-class state, with ACCEPTED / NEEDS_CHANGES verdicts that feed back into the next revise loop.
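
As an illustration of the fixed artifact set in point A, here is a minimal sketch of a completeness check. The artifact names come from the list above; the function name `missing_artifacts`, the one-directory-per-run layout, and the choice to treat an empty file as missing are assumptions made for this example, not Dark Factory's actual interface.

```python
from pathlib import Path

# The six artifact names listed above. Storing them together in one
# directory per run is an assumption made for this sketch.
REQUIRED_ARTIFACTS = (
    "SPEC.md", "ARCH.md", "QA.md", "TASKS.md", "tasks.json", "RUN_REPORT.md",
)


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the required artifacts that are absent or empty in run_dir."""
    missing = []
    for name in REQUIRED_ARTIFACTS:
        path = run_dir / name
        # Treat an empty file as missing so a placeholder cannot pass the gate.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

A gate built on this would fail the run whenever the returned list is non-empty, which is one way to make a skipped artifact visible instead of silent.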

BEFORE
chat → code →
"does it work?" →
maybe →
merge →
drift
WITH DARK FACTORY
spec → run →
artifacts × 6 →
quality score →
review verdict →
publish
Reproducibility: opaque → run-id + seed
Reviewability: code only → spec diff + verdict
Reasoning trace: lost → stage-level ADRs
03 · Workflow

One pipeline. Seven deterministic stages.

Every run is a state machine. State transitions are recorded, scored, and replayable. Failed gates surface as actionable diffs, not free-form error text.

REQUEST
Intake
Rough idea or spec. Template selection, stack preset, agent roster.
ANALYSIS
PM clarifies
Goals, non-goals, success metrics. Surfaces missing context.
SPEC
Synthesis
Structured SPEC.md, ARCH.md with ADRs and data model.
PLAN
Task graph
tasks.json with owners, estimates, acceptance criteria.
QA
Quality gates
Test matrix, regression risks, semantic verdict score out of 168.
REVISE
Feedback loop
Reviewer feedback drives the next cycle. State is preserved.
PUBLISH
Ship artifacts
GitHub PR, Slack dispatch, artifact export. Run-ID anchored.
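The "every run is a state machine" idea can be sketched in a few lines. The seven stage names are taken from the list above; the transition table (in particular the REVISE back to SPEC loop and the direct QA to PUBLISH edge), the `Run` class, and the `replay` helper are hypothetical and only illustrate recorded, replayable transitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    REQUEST = "request"
    ANALYSIS = "analysis"
    SPEC = "spec"
    PLAN = "plan"
    QA = "qa"
    REVISE = "revise"
    PUBLISH = "publish"


# Allowed transitions. The REVISE -> SPEC loop and the QA -> PUBLISH edge
# are assumptions for this sketch, not documented behaviour.
TRANSITIONS = {
    Stage.REQUEST: {Stage.ANALYSIS},
    Stage.ANALYSIS: {Stage.SPEC},
    Stage.SPEC: {Stage.PLAN},
    Stage.PLAN: {Stage.QA},
    Stage.QA: {Stage.REVISE, Stage.PUBLISH},
    Stage.REVISE: {Stage.SPEC},
    Stage.PUBLISH: set(),
}


@dataclass
class Run:
    run_id: str
    stage: Stage = Stage.REQUEST
    history: list = field(default_factory=list)  # recorded (source, target) pairs

    def advance(self, target: Stage) -> None:
        """Move to target, rejecting any transition not in the table."""
        if target not in TRANSITIONS[self.stage]:
            raise ValueError(f"illegal transition {self.stage.name} -> {target.name}")
        self.history.append((self.stage, target))
        self.stage = target


def replay(run_id: str, history: list) -> Run:
    """Rebuild a run by re-applying its recorded transitions in order."""
    run = Run(run_id)
    for _, target in history:
        run.advance(target)
    return run
```

Because every transition is validated and appended to `history`, replaying the log reaches the same stage or fails loudly at the first illegal step.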
04 · Capabilities

Built for operators, not demos.

The surface is designed around the same primitives an SRE expects: runs, traces, quality gates, replay, audit, budget.

Multi-agent orchestration

14 configurable agent roles with explicit responsibilities and synthesis variants.

Quality modes

Fast, balanced, or thorough. Thorough adds critique rounds with explicit cycle limits.

Human review loop

First-class ACCEPTED / NEEDS_CHANGES states that feed the next revise cycle.

Provider abstraction

Swap providers per agent. LM Studio, Ollama, OpenAI, Anthropic, or deterministic mock.

Local model support

Run the full pipeline against a local LLM. Air-gapped reproducibility on a workstation.

GitHub publishing

Run artifacts become PRs with linked issues. Rate-limit aware, branch-anchored.

Slack integration

Dispatch run completions, request reviews, monitor inbound events from a visibility panel.

Run comparison

Two-up diff at the artifact level. Deltas in score, tokens, calls, and verdicts.

Insights dashboard

Quality and efficiency over time. Template performance. Top failing checks.

Reproducible execution

Seeded runs, deterministic mock mode, full provider-call trace per stage.
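
To make "swap providers per agent" and "deterministic mock" concrete, here is a minimal sketch. The `Provider` protocol, `MockProvider`, and `run_agents` are hypothetical names invented for this example; real providers such as LM Studio or Ollama would sit behind the same interface, but their client code is not shown or assumed here.

```python
import hashlib
from typing import Protocol


class Provider(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def complete(self, prompt: str, seed: int) -> str: ...


class MockProvider:
    """Deterministic stand-in: the same (prompt, seed) always yields the same text."""

    def complete(self, prompt: str, seed: int) -> str:
        digest = hashlib.sha256(f"{seed}:{prompt}".encode()).hexdigest()
        return f"mock-{digest[:12]}"


def run_agents(roster: dict, prompt: str, seed: int) -> dict:
    """Call each role's provider; the roster maps role name -> Provider."""
    return {
        role: provider.complete(f"[{role}] {prompt}", seed)
        for role, provider in roster.items()
    }
```

With a roster such as `{"PM": MockProvider(), "QA": MockProvider()}`, two runs with the same seed produce identical outputs, which is the property a seeded, replayable run depends on.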

05 · Example output

Six artifacts. One run. Replayable forever.

A real run from the console. Every cell — score, verdict, agents, cost — is anchored to a run ID and can be diffed, replayed, or published.

run-mnuhgcmz-7nm9
SPEC.md
ARCH.md
QA.md
TASKS.md
tasks.json
RUN_REPORT.md
PROVIDER TRACE
stages × 5
calls × 88
cycles × 2
SPEC.md 156 lines · 4.2 KB
# Specification: Slack event visibility panel

## Overview
A small internal visibility surface for Slack events handled by the
run workflow. Inspect recent inbound events, their handling result,
and retry failed processing without touching core orchestration.

## Constraints
- No new external dependencies
- Respect existing RLS policies
- No secrets in UI

## Data Model
// slack_events
id: uuid pk
channel: text
user_id: text
type: text
payload: jsonb
processed_at: timestamptz
status: text // received | processed | failed
error: text?

## State machine
received → processing → (processed | failed)

## API contract
GET /api/slack/events?limit=50&cursor=…
POST /api/slack/events/:id/retry

## Acceptance criteria
- A user can see recent Slack events with timestamp, channel, user, and processing status.
- Failed events are visibly marked and retryable.
- No secrets are surfaced.
- Existing Slack commands and events continue to work.
Quality score: 96 / 168 · STRONG
14 / 14 gates passed · cycle 02
Run metadata
run-id: mnuhgcmz-7nm9 · template: feature-spec · stack: sveltekit · mode: real · quality: thorough · seed: 1688773968
Provider: LMSTUDIO
model: qwen2.5-coder-32b · calls: 88 · tokens-in: 222,400 · tokens-out: 108,200 · cost: $2.42 · elapsed: 3m 04s
Review: ACCEPTED
"Data model and API contract are crisp. Acceptance criteria align with intake. Ship it."
— @reviewer · 12m ago
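
The data model and state machine in the SPEC.md excerpt above can be sketched as follows. The field names and the received → processing → (processed | failed) transitions come from the excerpt; the `SlackEvent` class, the `transition` and `retry` helpers, and the choice to re-queue a retried event as `received` are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Transitions from the excerpt: received -> processing -> (processed | failed).
ALLOWED = {
    "received": {"processing"},
    "processing": {"processed", "failed"},
    "processed": set(),
    "failed": set(),
}


@dataclass
class SlackEvent:
    id: str
    channel: str
    user_id: str
    type: str
    payload: dict
    status: str = "received"
    processed_at: Optional[datetime] = None
    error: Optional[str] = None


def transition(event: SlackEvent, new_status: str, error: Optional[str] = None) -> None:
    """Apply one state change, rejecting anything outside the state machine."""
    if new_status not in ALLOWED[event.status]:
        raise ValueError(f"illegal transition {event.status} -> {new_status}")
    event.status = new_status
    event.error = error if new_status == "failed" else None
    if new_status == "processed":
        event.processed_at = datetime.now(timezone.utc)


def retry(event: SlackEvent) -> None:
    """Assumed retry semantics: a failed event is re-queued as 'received'."""
    if event.status != "failed":
        raise ValueError("only failed events can be retried")
    event.status = "received"
    event.error = None
```

A handler for `POST /api/slack/events/:id/retry` could call `retry` and then let the normal processing path pick the event up again.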
06 · Human-in-the-loop revision

AI generates. Humans govern. The system revises.

Reviewer feedback is a typed input that drives the next cycle. Every cycle is preserved with its score, verdict, and the feedback that produced the change — not just the change itself.
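
One way to picture "feedback as a typed input" is the sketch below. The ACCEPTED / NEEDS_CHANGES verdicts and the 168-point scale appear on this page; the `Review` and `Cycle` classes, `next_step`, and `score_delta` are hypothetical names, and the routing rule is an assumption.

```python
from dataclasses import dataclass
from enum import Enum

MAX_SCORE = 168  # the scale used by the quality score on this page


class Verdict(Enum):
    ACCEPTED = "ACCEPTED"
    NEEDS_CHANGES = "NEEDS_CHANGES"


@dataclass(frozen=True)
class Review:
    reviewer: str
    verdict: Verdict
    feedback: str


@dataclass(frozen=True)
class Cycle:
    number: int
    score: int  # out of MAX_SCORE
    gates_passed: int
    gates_total: int


def next_step(review: Review) -> str:
    """Assumed routing: accepted runs publish, everything else revises."""
    return "publish" if review.verdict is Verdict.ACCEPTED else "revise"


def score_delta(prev: Cycle, curr: Cycle) -> tuple:
    """Return (absolute change, percent change rounded to a whole number)."""
    diff = curr.score - prev.score
    pct = round(100 * diff / prev.score) if prev.score else 0
    return diff, pct
```

Applied to the two cycles shown in this section (52 and 96 out of 168), `score_delta` gives a change of +44 points, about +85%, matching the delta displayed for cycle 02.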

CYCLE 01 · ARCHIVED · started 11:34 · 1m 58s
52 / 168 · MIXED
VERDICT: mixed
GATES PASSED: 10 / 14
FAILED GATES: 4
COST: $1.18
CYCLE 02 · CURRENT · started 11:36 · 3m 04s
96 / 168 · STRONG
VERDICT: strong
GATES PASSED: 14 / 14
DELTA vs C-01: +44 (+85%)
COST: $2.42
@jdpinetta REVIEWER · OPERATOR-1 turnaround: 4m 12s
● ACCEPTED CYCLE 02 / 02
Data model and API contract are crisp. Acceptance criteria align with intake intent. Constraint about RLS captured properly. Approved for publish — please open the PR against feature/slack-visibility.
VERDICT ACCEPTED · ROUTED TO publish/github · ACKED 11:42 UTC
SPEC.md · DIFF C-01 → C-02 · +38 −11
@@ new section · data model + state machine @@
- User can browse Slack stuff and retry things.
+ ## Data model
+ slack_events { id, channel, user_id, type,
+ payload, processed_at, status, error? }
+
+ ## State machine
+ received → processing → (processed | failed)
+
+ ## Acceptance criteria
+ - Recent events visible w/ timestamp + status
+ - Failed events visibly marked + retryable
+ - No secrets surfaced
Inspect the full run · RUN-ID mnuhgcmz-7nm9 · CYCLE 2 / 2 · SEED 1688773968
07 · Positioning
Dark Factory is applied AI systems engineering — not a chatbot productivity app, not a prompt library, not a copilot wrapper.

It is the orchestration layer for teams who treat AI like infrastructure.
BUILT FOR · infra-minded teams
OPTIMIZED FOR · reproducibility
PROVES OUT · 1 operator, full stack