v2 · Local MLOps platform

Local fine-tuning, run like production.

TuneKit turns local LLM fine-tuning into a tracked, reproducible operation — datasets, recipes, training runs, a versioned adapter registry, and eval suites in one operator console.

Models

Local & API

Runs

Tracked & replayable

Adapters

Versioned registry

Evals

Suites & baselines

run-tk04f · qwen3-4b · lora r16 · TRAINING

TRAINING LOSS

step 840 / 1000

loss 0.284

learning rate 2.0e-4

throughput 1,840 tok/s

eta 3m 12s

01 · The cost of ad-hoc tuning

Fine-tuning is easy to start.
Hard to operate.

Getting a LoRA to train is a one-liner. Running fine-tuning as a repeatable practice — where results are trackable, comparable, and reproducible — is a different problem entirely.

01

Scattered runs

Training lives in shell scripts and notebooks. Two weeks later, no one remembers which config produced the good adapter.

02

Lost adapters

Output weights pile up in folders with no version, no metadata, no lineage back to the dataset and recipe that made them.

03

No baseline

"Is it actually better?" has no answer without an eval. Vibes-based validation doesn't survive the next model.

04

Opaque cost

Token spend, throughput, and training time vanish into terminal scrollback. Efficiency is invisible.

02 · The operator surface

One console for the
whole tuning loop.

TuneKit treats every fine-tune as a tracked run. Datasets are structured. Adapters are versioned. Evals are built in. The output is a registry you can trust — not a folder of mystery weights.

A.

Datasets as first-class

Upload, validate, and compose from recipes. Bad rows surface before a run wastes an hour of compute.

B.

Runs you can replay

Every job records base model, LoRA config, hyperparameters, and seed. Reproducibility is the default, not a spreadsheet.
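
As a rough illustration of what "records base model, LoRA config, hyperparameters, and seed" could look like, here is a minimal sketch. The `RunRecord` class and its `fingerprint` method are hypothetical names chosen for this example, not TuneKit's actual schema; the field values are borrowed from the example run shown on this page.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of everything needed to replay one training run."""
    base_model: str
    lora_rank: int
    learning_rate: float
    epochs: int
    seed: int

    def fingerprint(self) -> str:
        # Serialize with sorted keys so identical configs always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Values taken from the example run on this page; the structure is an assumption.
record = RunRecord(
    base_model="qwen3-4b",
    lora_rank=16,
    learning_rate=2.0e-4,
    epochs=3,
    seed=1688773,
)

# Rebuilding the record from its stored fields yields the same fingerprint.
replay = RunRecord(**asdict(record))
print(record.fingerprint() == replay.fingerprint())  # True
```

Two runs with the same fingerprint were launched from the same configuration, which is one simple way to make "replayable" checkable rather than a matter of memory.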

C.

A registry, not a folder

Adapters are versioned, activated, and compared. Lineage links each one back to its dataset, recipe, and run.
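
A versioned registry with lineage can be sketched in a few lines. Everything below (`AdapterVersion`, `AdapterRegistry`, and their methods) is a hypothetical illustration of the idea, not TuneKit's real data model; the dataset, recipe, and run names come from the example on this page.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AdapterVersion:
    """Hypothetical registry entry: one adapter version plus its lineage."""
    name: str
    version: int
    dataset: str
    recipe: str
    run_id: str

    def lineage(self) -> str:
        return f"{self.dataset} -> {self.recipe} -> {self.run_id} -> {self.name} v{self.version}"


class AdapterRegistry:
    """In-memory sketch: versions are append-only, one version per name is active."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[AdapterVersion]] = {}
        self._active: Dict[str, AdapterVersion] = {}

    def register(self, name: str, dataset: str, recipe: str, run_id: str) -> AdapterVersion:
        versions = self._versions.setdefault(name, [])
        # Version numbers start at 1 and increase with each registration.
        entry = AdapterVersion(name, len(versions) + 1, dataset, recipe, run_id)
        versions.append(entry)
        return entry

    def activate(self, name: str, version: int) -> AdapterVersion:
        for entry in self._versions.get(name, []):
            if entry.version == version:
                self._active[name] = entry
                return entry
        raise KeyError(f"{name} v{version} is not registered")

    def active(self, name: str) -> Optional[AdapterVersion]:
        return self._active.get(name)


registry = AdapterRegistry()
entry = registry.register("qwen3-4b-support", "support-tickets-v2", "lora-support", "tk04f-9m2a")
registry.activate("qwen3-4b-support", entry.version)
print(registry.active("qwen3-4b-support").lineage())
# support-tickets-v2 -> lora-support -> tk04f-9m2a -> qwen3-4b-support v1
```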

D.

Evals built in

Suites score every adapter against a baseline, so "better" is a number — and Chat Lab lets you feel it live.

— BEFORE

train.py → checkpoint/ →
"did it work?" →
rename folder →
forget which one

— WITH TUNEKIT

dataset → recipe →
run + live metrics →
adapter vN registered →
eval score →
activate

REPRODUCIBILITY · RUN-ID + SEED

ADAPTER LINEAGE · DATASET → RECIPE → RUN

VALIDATION · EVAL SUITE + CHAT LAB

03 · Workflow

Dataset to adapter,
in six tracked steps.

Every stage writes structured state. A run is a state machine, not a script — inspectable at any point, replayable from the top.

SOURCE

Dataset

Upload & validate, or compose from a dataset recipe.

CONFIG

Recipe

Pick base model, LoRA rank, and hyperparameters.

RUN

Train

Launch the job, watch live loss and throughput.

OUTPUT

Adapter

Weights land in the versioned registry with metadata.

SCORE

Eval

Run a suite; score the adapter against a baseline.

VALIDATE

Chat Lab

Talk to the adapter live before you ship it.
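
The six stages above can be sketched as a small state machine. This is an illustrative model only: the `Stage` enum and `Run` class are hypothetical names, and a real implementation would also persist state and handle failures.

```python
from enum import Enum
from typing import List


class Stage(Enum):
    """The six workflow stages, in order (names are assumptions for this sketch)."""
    DATASET = "dataset"
    RECIPE = "recipe"
    TRAIN = "train"
    ADAPTER = "adapter"
    EVAL = "eval"
    CHAT_LAB = "chat_lab"


ORDER: List[Stage] = list(Stage)  # Enum iteration preserves definition order.


class Run:
    """A run that can only move forward one stage at a time and logs each step."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.stage = Stage.DATASET
        self.history: List[Stage] = [Stage.DATASET]

    def advance(self) -> Stage:
        index = ORDER.index(self.stage)
        if index == len(ORDER) - 1:
            raise RuntimeError(f"{self.run_id} is already at the final stage")
        self.stage = ORDER[index + 1]
        self.history.append(self.stage)
        return self.stage


run = Run("run-tk04f")
while run.stage is not Stage.CHAT_LAB:
    run.advance()

print([stage.value for stage in run.history])
# ['dataset', 'recipe', 'train', 'adapter', 'eval', 'chat_lab']
```

Because every transition is appended to `history`, the run can be inspected at any point, which is the property the state-machine framing is meant to provide.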

04 · Capabilities

Built for operators,
not one-off experiments.

The surface is designed around the primitives a fine-tuning practice actually needs: datasets, runs, adapters, evals, and cost.

Dataset validation

Upload CSV/JSONL, catch malformed rows and schema drift before training.
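
To make "malformed rows and schema drift" concrete, here is a minimal JSONL check. The `validate_jsonl` function and the `prompt`/`completion` schema are assumptions for illustration; TuneKit's actual validation rules may differ.

```python
import json
from typing import FrozenSet, Iterable, List, Tuple

# Hypothetical schema: every row must have exactly these keys.
REQUIRED_KEYS: FrozenSet[str] = frozenset({"prompt", "completion"})


def validate_jsonl(
    lines: Iterable[str],
    required: FrozenSet[str] = REQUIRED_KEYS,
) -> Tuple[int, List[Tuple[int, str]]]:
    """Return (valid_row_count, [(line_number, problem), ...])."""
    valid = 0
    errors: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Blank lines are skipped, not counted as errors.
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"malformed JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            errors.append((number, "row is not a JSON object"))
            continue
        keys = set(row)
        if keys != required:
            missing = sorted(required - keys)
            extra = sorted(keys - required)
            errors.append((number, f"schema drift: missing={missing} extra={extra}"))
            continue
        valid += 1
    return valid, errors


sample = [
    '{"prompt": "hi", "completion": "hello"}',
    '{"prompt": "hi"',                         # truncated row
    '{"prompt": "hi", "response": "hello"}',  # wrong field name
]
valid_count, problems = validate_jsonl(sample)
print(valid_count)                      # 1
print([line for line, _ in problems])  # [2, 3]
```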

Recipes & starter kits

Reusable training configs and curated presets for common base models.

Live training metrics

Loss curves, throughput, and logs streaming as the job runs.

Versioned adapter registry

Every adapter gets a version, base model, mode, and lifecycle state.

Eval suites & baselines

Score adapters against a baseline so improvement is measurable.
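
Scoring against a baseline reduces to comparing two pass rates. The sketch below assumes each eval case yields a simple pass/fail result; the function names and the per-case results are hypothetical, chosen so the totals match the 92% and +18 pts figures in the example output on this page.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Percentage of eval cases that passed."""
    if not results:
        raise ValueError("an eval suite needs at least one case")
    return 100.0 * sum(results) / len(results)


def delta_vs_baseline(adapter: Sequence[bool], baseline: Sequence[bool]) -> float:
    """Improvement in percentage points; positive means the adapter is better."""
    return pass_rate(adapter) - pass_rate(baseline)


# Hypothetical results on the same 50 cases (not real TuneKit output).
adapter_results = [True] * 46 + [False] * 4    # 46 of 50 pass
baseline_results = [True] * 37 + [False] * 13  # 37 of 50 pass

print(pass_rate(adapter_results))                            # 92.0
print(delta_vs_baseline(adapter_results, baseline_results))  # 18.0
```

Running both models on the same cases is what makes the difference meaningful: the delta is in percentage points, not a relative percentage.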

Chat Lab validation

Load an active adapter and validate behavior in live conversation.

Efficiency analytics

Token spend, tokens/sec, and cost surfaced per run and per model.
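
Per-run efficiency figures can be derived from two recorded quantities, tokens processed and wall-clock time. The `RunUsage` class, the price field, and the sample numbers below are all hypothetical; a local run would typically carry a token price of zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunUsage:
    """Hypothetical usage totals recorded for one training run."""
    tokens: int
    seconds: float
    price_per_million_tokens: float = 0.0  # assumed 0.0 for local hardware

    def tokens_per_second(self) -> float:
        return self.tokens / self.seconds

    def cost(self) -> float:
        return self.tokens / 1_000_000 * self.price_per_million_tokens


# Illustrative numbers only, not measurements.
local_run = RunUsage(tokens=1_840_000, seconds=1_000.0)
api_run = RunUsage(tokens=1_840_000, seconds=1_000.0, price_per_million_tokens=0.5)

print(local_run.tokens_per_second())  # 1840.0
print(local_run.cost())               # 0.0
print(round(api_run.cost(), 2))       # 0.92
```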

Provider-agnostic

Fine-tune against local MLX models or route to API providers.

05 · The console

Real screens.
Real runs.

Not a mockup — the operating console for a local fine-tuning practice. Dashboard analytics, the adapter registry, the dataset builder, and the advisor.

Dashboard: training efficiency, adapter quality, and run analytics at a glance.

Adapter registry: versioned, with base model, lifecycle state, and eval scores.

Dataset builder: compose and validate structured training data.

Advisor: guidance on training configuration and next steps.

06 · Example output

One run.
A registered, scored adapter.

Every training job resolves to a versioned adapter with full lineage, final metrics, and an eval score — ready to activate or compare.

qwen3-4b-support-v3

Qwen3 4B · LoRA r16 · MLX 4-bit

● ACTIVE

FINAL LOSS

0.284

EVAL PASS

92%

VS BASELINE

+18 pts

TRAIN TIME

14m 06s

EVAL SUITE — support-quality

tone-match

94%

policy-recall

88%

refusal-correctness

96%

hallucination

9%

RUN METADATA

run-id tk04f-9m2a

dataset support-tickets-v2

recipe lora-support

rows 3,120

epochs 3

lr 2.0e-4

seed 1688773

mode local · mlx

LINEAGE

support-tickets-v2

↓

lora-support recipe

↓

qwen3-4b-support-v3

07 · Positioning

TuneKit is applied MLOps for local models — not a notebook, not a script farm, not a hosted black box.

It is the operator layer for teams who fine-tune on their own hardware and need every run to be reproducible.

BUILT FOR · local-first teams

OPTIMIZED FOR · reproducibility

PROVES OUT · dataset → adapter → eval
