v2 · Local MLOps platform

Local fine-tuning, run like production.

TuneKit turns local LLM fine-tuning into a tracked, reproducible operation — datasets, recipes, training runs, a versioned adapter registry, and eval suites in one operator console.

Models · Local & API
Runs · Tracked & replayable
Adapters · Versioned registry
Evals · Suites & baselines
run-tk04f · qwen3-4b · lora r16 · TRAINING
TRAINING LOSS
step: 840 / 1000
loss: 0.284
learning rate: 2.0e-4
throughput: 1,840 tok/s
eta: 3m 12s
01 · The cost of ad-hoc tuning

Fine-tuning is easy to start.
Hard to operate.

Getting a LoRA to train is a one-liner. Running fine-tuning as a repeatable practice — where results are trackable, comparable, and reproducible — is a different problem entirely.

01
Scattered runs

Training lives in shell scripts and notebooks. Two weeks later, no one remembers which config produced the good adapter.

02
Lost adapters

Output weights pile up in folders with no version, no metadata, no lineage back to the dataset and recipe that made them.

03
No baseline

"Is it actually better?" has no answer without an eval. Vibes-based validation doesn't survive the next model.

04
Opaque cost

Token spend, throughput, and training time vanish into terminal scrollback. Efficiency is invisible.

02 · The operator surface

One console for the
whole tuning loop.

TuneKit treats every fine-tune as a tracked run. Datasets are structured. Adapters are versioned. Evals are built in. The output is a registry you can trust — not a folder of mystery weights.

A.
Datasets as first-class

Upload, validate, and compose from recipes. Bad rows surface before a run wastes an hour of compute.

B.
Runs you can replay

Every job records base model, LoRA config, hyperparameters, and seed. Reproducibility is the default, not a spreadsheet.

C.
A registry, not a folder

Adapters are versioned, activated, and compared. Lineage links each one back to its dataset, recipe, and run.

D.
Evals built in

Suites score every adapter against a baseline, so "better" is a number — and Chat Lab lets you feel it live.

BEFORE
train.py → checkpoint/ → "did it work?" → rename folder → forget which one
WITH TUNEKIT
dataset → recipe → run + live metrics → adapter vN registered → eval score → activate
REPRODUCIBILITY · RUN-ID + SEED
ADAPTER LINEAGE · DATASET → RECIPE → RUN
VALIDATION · EVAL SUITE + CHAT LAB
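
To make "runs you can replay" concrete, here is a minimal sketch of what a run record could look like. `RunRecord`, its field names, and `fingerprint()` are hypothetical illustrations and not TuneKit's actual schema; the values are borrowed from the example run shown under "Example output".

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of everything needed to replay a run."""
    base_model: str
    dataset: str
    recipe: str
    lora_rank: int
    learning_rate: float
    epochs: int
    seed: int

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the same config always hashes the same.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


run = RunRecord(
    base_model="qwen3-4b",
    dataset="support-tickets-v2",
    recipe="lora-support",
    lora_rank=16,
    learning_rate=2.0e-4,
    epochs=3,
    seed=1688773,
)

# Rebuilding the record from its stored fields yields the same fingerprint.
replay = RunRecord(**asdict(run))
assert replay.fingerprint() == run.fingerprint()
```

Two runs with the same fingerprint were launched from the same inputs, which is the property that lets an adapter be traced back to its dataset, recipe, and seed.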
03 · Workflow

Dataset to adapter,
in six tracked steps.

Every stage writes structured state. A run is a state machine, not a script — inspectable at any point, replayable from the top.

SOURCE
Dataset

Upload & validate, or compose from a dataset recipe.

CONFIG
Recipe

Pick base model, LoRA rank, and hyperparameters.

RUN
Train

Launch the job, watch live loss and throughput.

OUTPUT
Adapter

Weights land in the versioned registry with metadata.

SCORE
Eval

Run a suite; score the adapter against a baseline.

VALIDATE
Chat Lab

Talk to the adapter live before you ship it.
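
The six stages above can be sketched as a small forward-only state machine. This is an illustration only: the `Stage` and `Run` names, and the rule that a run advances one stage at a time, are assumptions rather than a description of TuneKit's internals.

```python
from enum import Enum


class Stage(Enum):
    DATASET = "dataset"
    RECIPE = "recipe"
    TRAIN = "train"
    ADAPTER = "adapter"
    EVAL = "eval"
    CHAT_LAB = "chat_lab"


# Enum members iterate in definition order, which is the workflow order.
ORDER = list(Stage)


class Run:
    """Hypothetical run that only moves forward one stage at a time."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.stage = Stage.DATASET
        self.history = [Stage.DATASET]  # inspectable at any point

    def advance(self) -> Stage:
        index = ORDER.index(self.stage)
        if index == len(ORDER) - 1:
            raise RuntimeError("run is already at the final stage")
        self.stage = ORDER[index + 1]
        self.history.append(self.stage)
        return self.stage

    def replay(self) -> "Run":
        # Replaying starts a fresh run from the first stage.
        return Run(self.run_id)


run = Run("tk04f")
for _ in range(5):
    run.advance()
assert run.stage is Stage.CHAT_LAB
```

Keeping the stage history on the run object is one simple way to make progress inspectable without reading logs.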

04 · Capabilities

Built for operators,
not one-off experiments.

The surface is designed around the primitives a fine-tuning practice actually needs: datasets, runs, adapters, evals, and cost.

Dataset validation

Upload CSV/JSONL, catch malformed rows and schema drift before training.

Recipes & starter kits

Reusable training configs and curated presets for common base models.

Live training metrics

Loss curves, throughput, and logs streaming as the job runs.

Versioned adapter registry

Every adapter gets a version, base model, mode, and lifecycle state.

Eval suites & baselines

Score adapters against a baseline so improvement is measurable.

Chat Lab validation

Load an active adapter and validate behavior in live conversation.

Efficiency analytics

Token spend, tokens/sec, and cost surfaced per run and per model.

Provider-agnostic

Fine-tune against local MLX models or route to API providers.
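
As an illustration of the dataset validation capability listed above, the following sketch checks JSONL rows before any training starts. `validate_jsonl` and the `prompt`/`response` keys are hypothetical; the schema a real dataset must follow is not specified on this page.

```python
import json


def validate_jsonl(lines, required_keys):
    """Return (valid_rows, errors) for an iterable of JSONL lines.

    Hypothetical helper: required_keys stands in for whatever schema
    a dataset is expected to follow.
    """
    valid, errors = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"malformed JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            errors.append((number, "row is not a JSON object"))
            continue
        missing = [key for key in required_keys if key not in row]
        if missing:
            errors.append((number, f"missing keys: {missing}"))
            continue
        valid.append(row)
    return valid, errors


sample = [
    '{"prompt": "Reset my password", "response": "Sure, here is how."}',
    '{"prompt": "Where is my order?"}',
    '{"prompt": "broken row",',
]
rows, problems = validate_jsonl(sample, required_keys=("prompt", "response"))
for line_number, reason in problems:
    print(f"line {line_number}: {reason}")
```

Reporting the line number with each problem lets bad rows be fixed at the source instead of being discovered an hour into a run.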

05 · The console

Real screens.
Real runs.

Not a mockup — the operating console for a local fine-tuning practice. Dashboard analytics, the adapter registry, the dataset builder, and the advisor.

tunekit · /dashboard
Dashboard — fine-tuning efficiency and quality analytics at a glance.
tunekit · /adapters
Adapter registry — versioned, with lifecycle state and eval scores.
tunekit · /datasets
Dataset builder — compose and validate training data.
tunekit · /advisor
Advisor — guidance on configuration and next steps.
06 · Example output

One run.
A registered, scored adapter.

Every training job resolves to a versioned adapter with full lineage, final metrics, and an eval score — ready to activate or compare.

qwen3-4b-support-v3
Qwen3 4B · LoRA r16 · MLX 4-bit
● ACTIVE
FINAL LOSS · 0.284
EVAL PASS · 92%
VS BASELINE · +18 pts
TRAIN TIME · 14m 06s
EVAL SUITE · support-quality
tone-match: 94%
policy-recall: 88%
refusal-correctness: 96%
hallucination: 9%
RUN METADATA
run-id: tk04f-9m2a
dataset: support-tickets-v2
recipe: lora-support
rows: 3,120
epochs: 3
lr: 2.0e-4
seed: 1688773
mode: local · mlx
LINEAGE
support-tickets-v2 → lora-support recipe → qwen3-4b-support-v3
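
This page does not say how the "+18 pts" figure is computed. The sketch below assumes it is a plain percentage-point difference between the adapter's pass rate and the baseline's on the same suite. The function names and the 50-case suite are hypothetical, chosen only so the arithmetic matches the 92% and +18 shown above.

```python
def pass_rate(results):
    """Percentage of eval cases that passed, given a list of booleans."""
    if not results:
        raise ValueError("no eval results to score")
    return 100.0 * sum(results) / len(results)


def delta_vs_baseline(adapter_results, baseline_results):
    """Percentage-point difference between adapter and baseline pass rates."""
    return pass_rate(adapter_results) - pass_rate(baseline_results)


# Hypothetical 50-case suite: the adapter passes 46 cases, the baseline 37.
adapter = [True] * 46 + [False] * 4
baseline = [True] * 37 + [False] * 13

adapter_score = pass_rate(adapter)
delta = delta_vs_baseline(adapter, baseline)
print(f"eval pass {adapter_score:.0f}% · vs baseline {delta:+.0f} pts")
```

Under that assumption, a 92% pass rate with a +18 point delta would imply a baseline of 74% on the same cases.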
07 · Positioning
TuneKit is applied MLOps for local models — not a notebook, not a script farm, not a hosted black box.

It is the operator layer for teams who fine-tune on their own hardware and need every run to be reproducible.
BUILT FOR · local-first teams
OPTIMIZED FOR · reproducibility
PROVES OUT · dataset → adapter → eval