v0.42 · Multi-agent runtime

Spec-first multi-agent orchestration for serious AI-assisted development.

Dark Factory transforms rough feature ideas into structured implementation artifacts, review workflows, and reproducible engineering outputs, with the rigor teams expect of infrastructure.


Pipelines

Multi-agent, 5-stage

Models

Local & cloud

Review

Human-in-the-loop

Publish

GitHub, Slack

run-mnuhgcmz-7nm9 · slack-visibility-panel LIVE

01 · The cost of unstructured AI work

AI-assisted development is fast.
Without structure, it isn't trustworthy.

Most teams have already wired a chat model into their IDE. What they haven't built is the orchestration layer above it. Without one, the same problems compound across every feature.

Specification drift

Decisions live in chat threads. Two weeks later, no one remembers which version of "the plan" actually shipped.

Context loss

Every prompt re-explains the problem. Architectural constraints, ADRs, and prior trade-offs are continually re-discovered.

Weak reproducibility

The same request, an hour later, returns a different design. There is no run ID, no seed, no trace of what produced what.

Reviewability gap

Reviewers see code, not reasoning. There is no diff at the spec layer, no verdict beyond LGTM, no way to compare alternatives.

02 · The orchestration layer

A factory floor for engineering artifacts.

Dark Factory treats every feature as a pipeline run. Inputs are structured. Outputs are versioned. Agents are specialised. Reviewers get a first-class surface, not a comment thread.

A.

Structured specs by construction

Every run produces SPEC.md, ARCH.md, QA.md, TASKS.md, tasks.json, and RUN_REPORT.md. The set is fixed; deviations are flagged as quality failures, not silent skips.

B.

Specialised agents, not one generalist

PM, ARCH, BE, FE, QA, SEC, UXW. Each has explicit responsibilities, non-goals, and synthesis variants. Configurable per template.

C.

Cycles, not single shots

Iterative quality modes critique their own output and revise. Each cycle is preserved with its feedback, score, and verdict.

D.

Reviewable, not just runnable

Human review is a first-class state, with ACCEPTED / NEEDS_CHANGES verdicts that feed back into the next revise loop.
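
As a rough illustration of the fixed artifact set in A, the sketch below compares a run's output files against the six names listed there. The artifact names come from this page; `check_artifacts`, `ArtifactCheck`, and the choice to treat extra files as a deviation are assumptions made for the example, not Dark Factory's actual API.

```python
from __future__ import annotations

from dataclasses import dataclass

# Artifact names are taken from the page; everything else is illustrative.
REQUIRED_ARTIFACTS = frozenset({
    "SPEC.md", "ARCH.md", "QA.md", "TASKS.md", "tasks.json", "RUN_REPORT.md",
})


@dataclass(frozen=True)
class ArtifactCheck:
    missing: tuple[str, ...]     # required artifacts the run did not produce
    unexpected: tuple[str, ...]  # files outside the fixed set (assumed to count as a deviation)

    @property
    def passed(self) -> bool:
        return not self.missing and not self.unexpected


def check_artifacts(produced: set[str]) -> ArtifactCheck:
    """Compare a run's output files against the fixed artifact set."""
    return ArtifactCheck(
        missing=tuple(sorted(REQUIRED_ARTIFACTS - produced)),
        unexpected=tuple(sorted(produced - REQUIRED_ARTIFACTS)),
    )


result = check_artifacts({"SPEC.md", "ARCH.md", "QA.md", "TASKS.md", "tasks.json"})
print(result.passed, result.missing)  # False ('RUN_REPORT.md',)
```

Returning a structured result rather than a bare boolean keeps the failure actionable: the caller can report exactly which artifact was skipped.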

BEFORE

chat → code → "does it work?" → maybe → merge → drift

WITH DARK FACTORY

spec → run → artifacts × 6 → quality score → review verdict → publish

Reproducibility: opaque → run-id + seed

Reviewability: code only → spec diff + verdict

Reasoning trace: lost → stage-level ADRs

03 · Workflow

One pipeline. Seven deterministic stages.

Every run is a state machine. State transitions are recorded, scored, and replayable. Failed gates surface as actionable diffs, not free-form error text.

REQUEST

Intake

Rough idea or spec. Template selection, stack preset, agent roster.

ANALYSIS

PM clarifies

Goals, non-goals, success metrics. Surfaces missing context.

SPEC

Synthesis

Structured SPEC.md, ARCH.md with ADRs and data model.

PLAN

Task graph

tasks.json with owners, estimates, acceptance criteria.

Quality gates

Test matrix, regression risks, semantic verdict score out of 168.

REVISE

Feedback loop

Reviewer feedback drives the next cycle. State is preserved.

PUBLISH

Ship artifacts

GitHub PR, Slack dispatch, artifact export. Run-ID anchored.
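
The stages above can be sketched as a small state machine with a recorded, replayable history. This is a minimal sketch under stated assumptions: the transition table (a linear flow plus a loop from REVISE back to SPEC) is not given on this page, the `QUALITY_GATES` label is made up because the stage list shows only its title, and `Run` and `replay` are hypothetical names.

```python
from __future__ import annotations

from enum import Enum


class Stage(Enum):
    REQUEST = "request"
    ANALYSIS = "analysis"
    SPEC = "spec"
    PLAN = "plan"
    QUALITY_GATES = "quality_gates"  # label assumed; the page gives only the title
    REVISE = "revise"
    PUBLISH = "publish"


# Assumed transition table: linear flow, plus REVISE -> SPEC when changes are requested.
TRANSITIONS = {
    Stage.REQUEST: {Stage.ANALYSIS},
    Stage.ANALYSIS: {Stage.SPEC},
    Stage.SPEC: {Stage.PLAN},
    Stage.PLAN: {Stage.QUALITY_GATES},
    Stage.QUALITY_GATES: {Stage.REVISE},
    Stage.REVISE: {Stage.SPEC, Stage.PUBLISH},
    Stage.PUBLISH: set(),
}


class Run:
    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.stage = Stage.REQUEST
        self.history = [Stage.REQUEST]  # every stage entered, in order

    def advance(self, target: Stage) -> None:
        """Move to the target stage, rejecting transitions the table does not allow."""
        if target not in TRANSITIONS[self.stage]:
            raise ValueError(f"illegal transition {self.stage.name} -> {target.name}")
        self.stage = target
        self.history.append(target)


def replay(run_id: str, history: list[Stage]) -> Run:
    """Rebuild a run by re-applying its recorded transitions."""
    run = Run(run_id)
    for stage in history[1:]:
        run.advance(stage)
    return run


run = Run("run-mnuhgcmz-7nm9")
one_pass = [Stage.ANALYSIS, Stage.SPEC, Stage.PLAN, Stage.QUALITY_GATES, Stage.REVISE]
for stage in one_pass + one_pass[1:] + [Stage.PUBLISH]:  # two cycles, then publish
    run.advance(stage)
print(run.stage.name, len(run.history))  # PUBLISH 11
```

Because `replay` goes through the same `advance` check, a tampered or truncated history fails loudly instead of producing a run in an impossible state.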

04 · Capabilities

Built for operators, not demos.

The surface is designed around the same primitives an SRE expects: runs, traces, quality gates, replay, audit, budget.

Multi-agent orchestration

14 configurable agent roles with explicit responsibilities and synthesis variants.

Quality modes

Fast, balanced, or thorough. Thorough adds critique rounds with explicit cycle limits.

Human review loop

First-class ACCEPTED / NEEDS_CHANGES states that feed the next revise cycle.

Provider abstraction

Swap providers per agent. LM Studio, Ollama, OpenAI, Anthropic, or deterministic mock.

Local model support

Run the full pipeline against a local LLM. Air-gapped reproducibility on a workstation.

GitHub publishing

Run artifacts become PRs with linked issues. Rate-limit aware, branch-anchored.

Slack integration

Dispatch run completions, request reviews, monitor inbound events from a visibility panel.

Run comparison

Two-up diff at the artifact level. Deltas in score, tokens, calls, and verdicts.

Insights dashboard

Quality and efficiency over time. Template performance. Top failing checks.

Reproducible execution

Seeded runs, deterministic mock mode, full provider-call trace per stage.
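
To make per-agent provider swapping and seeded, deterministic mock runs concrete, here is a hedged sketch. `Provider`, `MockProvider`, `Call`, and `run_stage` are hypothetical names, and hashing the seed together with the prompt is just one simple way to get repeatable output; the page does not describe how the real mock works.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Protocol


class Provider(Protocol):
    """Anything that can turn a prompt into text for a given seed."""

    def complete(self, prompt: str, seed: int) -> str: ...


class MockProvider:
    """Deterministic stand-in: the same (prompt, seed) always yields the same text."""

    def complete(self, prompt: str, seed: int) -> str:
        digest = hashlib.sha256(f"{seed}:{prompt}".encode("utf-8")).hexdigest()
        return f"mock-{digest[:12]}"


@dataclass(frozen=True)
class Call:
    agent: str
    prompt: str
    output: str


def run_stage(providers: dict[str, Provider], prompt: str, seed: int) -> list[Call]:
    """Call each agent's provider once and return the trace in roster order."""
    trace = []
    for agent, provider in providers.items():
        scoped = f"[{agent}] {prompt}"  # each agent sees a role-scoped prompt
        trace.append(Call(agent, scoped, provider.complete(scoped, seed)))
    return trace


roster = {"PM": MockProvider(), "ARCH": MockProvider()}
first = run_stage(roster, "Slack event visibility panel", seed=1688773968)
second = run_stage(roster, "Slack event visibility panel", seed=1688773968)
print(first == second)  # True
```

Typing the roster as `dict[str, Provider]` is what lets a local model, a hosted API, or the mock be swapped per agent without touching the stage logic.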

05 · Example output

Six artifacts. One run. Replayable forever.

A real run from the console. Every cell — score, verdict, agents, cost — is anchored to a run ID and can be diffed, replayed, or published.

run-mnuhgcmz-7nm9

SPEC.md

ARCH.md

QA.md

TASKS.md

tasks.json

RUN_REPORT.md

PROVIDER TRACE

stages × 5

calls × 88

cycles × 2

SPEC.md 156 lines · 4.2 KB

# Specification: Slack event visibility panel

## Overview
A small internal visibility surface for Slack events handled by the
run workflow. Inspect recent inbound events, their handling result,
and retry failed processing without touching core orchestration.

## Constraints
- No new external dependencies
- Respect existing RLS policies
- No secrets in UI

## Data Model
// slack_events
id: uuid pk
channel: text
user_id: text
type: text
payload: jsonb
processed_at: timestamptz
status: text // received | processing | processed | failed
error: text?

## State machine
received → processing → (processed | failed)

## API contract
GET /api/slack/events?limit=50&cursor=…
POST /api/slack/events/:id/retry

## Acceptance criteria
- A user can see recent Slack events with timestamp,
channel, user, and processing status.
- Failed events are visibly marked and retryable.
- No secrets are surfaced.
- Existing Slack commands and events continue to work.
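
The data model, state machine, and retry endpoint in this SPEC could be exercised with something like the sketch below. It is illustrative only: the `failed → processing` retry edge is an assumption (the SPEC says failed events are retryable but does not name the transition), `id` is a plain string rather than a uuid, and `SlackEvent.transition` and `retry` are hypothetical helpers.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Transitions from the SPEC's state machine, plus an assumed retry edge.
ALLOWED = {
    "received": {"processing"},
    "processing": {"processed", "failed"},
    "processed": set(),
    "failed": {"processing"},  # assumption: a retry re-enters processing
}


@dataclass
class SlackEvent:
    id: str
    channel: str
    user_id: str
    type: str
    payload: dict = field(default_factory=dict)
    processed_at: Optional[datetime] = None
    status: str = "received"
    error: Optional[str] = None

    def transition(self, new_status: str, error: Optional[str] = None) -> None:
        """Apply a status change, rejecting moves the state machine does not allow."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
        self.error = error if new_status == "failed" else None
        if new_status == "processed":
            self.processed_at = datetime.now(timezone.utc)


def retry(event: SlackEvent) -> None:
    """Rough analogue of POST /api/slack/events/:id/retry: only failed events qualify."""
    if event.status != "failed":
        raise ValueError("only failed events can be retried")
    event.transition("processing")


event = SlackEvent(id="evt-1", channel="C123", user_id="U456", type="message")
event.transition("processing")
event.transition("failed", error="timeout")
retry(event)
event.transition("processed")
print(event.status, event.error)  # processed None
```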

Quality score STRONG

96 / 168

14 / 14 gates passed · cycle 02

Run metadata

run-id mnuhgcmz-7nm9 · template feature-spec · stack sveltekit · mode real · quality thorough · seed 1688773968

Provider LMSTUDIO

model qwen2.5-coder-32b · calls 88 · tokens-in 222,400 · tokens-out 108,200 · cost $2.42 · elapsed 3m 04s

Review ACCEPTED

"Data model and API contract are crisp. Acceptance criteria align with intake. Ship it."
— @reviewer · 12m ago

06 · Human-in-the-loop revision

AI generates. Humans govern. The system revises.

Reviewer feedback is a typed input that drives the next cycle. Every cycle is preserved with its score, verdict, and the feedback that produced the change — not just the change itself.
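
One way to picture "typed" feedback is a small record attached to each cycle, with the verdict deciding what happens next. The sketch below is an assumption-laden illustration: `Feedback`, `Cycle`, and `next_step` are hypothetical names, the `halt` outcome at the cycle limit is invented for the example, and cycle 01's reviewer verdict is assumed to be NEEDS_CHANGES because it triggered a rerun.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ACCEPTED = "ACCEPTED"
    NEEDS_CHANGES = "NEEDS_CHANGES"


@dataclass(frozen=True)
class Feedback:
    reviewer: str
    verdict: Verdict
    comment: str


@dataclass(frozen=True)
class Cycle:
    number: int
    score: int                           # out of 168
    feedback: Optional[Feedback] = None  # review received on this cycle, if any


def next_step(cycle: Cycle, max_cycles: int) -> str:
    """Decide what follows a cycle based on its review and the cycle limit."""
    if cycle.feedback is None:
        return "await_review"
    if cycle.feedback.verdict is Verdict.ACCEPTED:
        return "publish"
    return "revise" if cycle.number < max_cycles else "halt"


history = [
    Cycle(1, 52, Feedback("@reviewer", Verdict.NEEDS_CHANGES, "Tighten and rerun thorough.")),
    Cycle(2, 96, Feedback("@reviewer", Verdict.ACCEPTED, "Ship it.")),
]
print([next_step(c, max_cycles=2) for c in history])  # ['revise', 'publish']
```

Keeping each cycle immutable, with its feedback attached, is what preserves the reason for a change alongside the change itself.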

CYCLE 01 · ARCHIVED started 11:34 · 1m 58s

52 / 168 MIXED

VERDICT mixed

GATES PASSED 10 / 14

FAILED GATES 4

COST $1.18

REVIEWER FEEDBACK INJECTED "SPEC missing structured Data model, State machine, and API contract sections. Acceptance criteria too vague. Tighten and rerun thorough."

— @reviewer · 11:36 UTC

CYCLE 02 · CURRENT started 11:36 · 3m 04s

96 / 168 STRONG

VERDICT strong

GATES PASSED 14 / 14

DELTA vs C-01 +44 (+85%)

COST $2.42

CHANGES APPLIED
+ Added Data model / State machine / API contract sections
+ Tightened acceptance criteria to 4 testable bullets
+ Added explicit non-goals (no UI changes to existing endpoints)
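
The cycle-over-cycle delta shown for cycle 02 is plain arithmetic on the two scores: 96 − 52 = 44, and 44 / 52 ≈ 84.6%, which rounds to 85%. A minimal sketch, with `score_delta` as a hypothetical helper name:

```python
def score_delta(previous: int, current: int) -> tuple[int, int]:
    """Return (absolute change, percent change rounded to the nearest integer)."""
    if previous == 0:
        raise ValueError("percent change is undefined for a previous score of 0")
    change = current - previous
    percent = round(100 * change / previous)
    return change, percent


change, percent = score_delta(52, 96)
print(f"{change:+d} ({percent:+d}%)")  # +44 (+85%)
```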


@jdpinetta REVIEWER · OPERATOR-1 turnaround: 4m 12s

● ACCEPTED CYCLE 02 / 02

Data model and API contract are crisp. Acceptance criteria align with intake intent. Constraint about RLS captured properly. Approved for publish — please open the PR against feature/slack-visibility.

VERDICT ACCEPTED · ROUTED TO publish/github · ACKED 11:42 UTC

SPEC.md · DIFF C-01 → C-02 · +38 −11

@@ new section · data model + state machine @@

- User can browse Slack stuff and retry things.
+ ## Data model
+ slack_events { id, channel, user_id, type,
+   payload, processed_at, status, error? }
+
+ ## State machine
+ received → processing → (processed | failed)
+
+ ## Acceptance criteria
+ - Recent events visible w/ timestamp + status
+ - Failed events visibly marked + retryable
+ - No secrets surfaced

Inspect the full run RUN-ID mnuhgcmz-7nm9 · CYCLE 2 / 2 · SEED 1688773968

07 · Positioning

Dark Factory is applied AI systems engineering — not a chatbot productivity app, not a prompt library, not a copilot wrapper.

It is the orchestration layer for teams who treat AI like infrastructure.

BUILT FOR · infra-minded teams

OPTIMIZED FOR · reproducibility

PROVES OUT · 1 operator, full stack