META · 2025–2026

Model Management

Turning a sprawling model catalog into something you can group, read, and debug

Senior Product Designer

Groupings

custom + AI-recommended sets

Overview

Lifecycle, freshness, health

Freshness

cost-vs-quality tradeoff

The problem

The model management surface was where engineers went to find, group, and debug their ML models. The broader platform around it was heavily used, but this surface reached only a small fraction of those people, and its information architecture was the reason. You navigated by toggling between models, model types, and inference services; the details that told you what a model actually was stayed buried; and model names weren’t editable, so search rarely landed on the model you were after.

So people built their own workarounds. Teams kept private spreadsheets mapping model and model-type IDs to the fields and metrics they tracked, updated by hand every week or two. No one was sure which copy was current, and every new joiner inherited the confusion instead of a system. Underneath the workaround was the harder problem: status, lifecycle stage, freshness, cost, metrics, and quality lived across different tools, so reasoning about a single model meant assembling the picture by hand. I led a full revamp. It rebuilt the information architecture so that model names were editable and searchable and the model, model-type, and inference-service views no longer had to be toggled between, and it added a new model overview, custom and AI-recommended groupings, and a freshness analysis surface.

My role and constraints

I came on as the new designer after a reorg in August 2025 moved the previous designer into management. The same reorg moved most of the team that had built model management into a new org, the UX researchers included, so much of the context that usually lives in a team’s memory left with them. The people who stayed were newer to the surface or had come from other domains, and we were all onboarding into something none of us had built, so part of the work was rebuilding that lost context before I could improve on the surface. I still ramped fast: I shipped the freshness designs within a month of joining and the rest of the model management redesign by mid-February 2026, covering the overview, freshness analysis, groupings and projects, search across models and schedules, onboarding, and project sharing. The team was five engineers, and with limited PM coverage and no dedicated researcher I drove much of the feature definition and direction-setting myself, leaning on the evidence I could gather firsthand. Two things shaped the work:

Expert users, scattered signal.

The people using this could reason about a model deeply, but only if the relevant state was in front of them. The job was synthesis, not more dashboards.

AI that operators would trust.

The surface introduced AI-recommended groupings and actions, and a recommendation only helps if the person believes it and can override it, so trust had to be designed in rather than bolted on.

Featured design decision 01

Make the freshness decision legible as a tradeoff

The hardest decision on the surface was how fresh to keep a model, and that is really a cost-versus-quality tradeoff that nothing surfaced legibly. The surface starts with the diagnosis: it breaks down where latency accrues across the ML pipeline, stage by stage, so an engineer can see which stage to target:

ATS distribution across pipeline stages. Highlighted P50/P90 show where latency accrues, the diagnosis before the freshness decision. (Recreation)

To decide the offset, you configure a study: a snapshot window, the NE metrics that matter and how to weight them, and a target latency offset. Once it completes, the study lines up the tradeoff per offset, so the decision reads in one table:

• −2h / 0 / +2h / +4h latency offset, one per row

• projected NE % regression and GAS change, the quality side

• estimated cost and estimated ROI, the business side

Freshness study results. The completed study lines up quality (NE / GAS), cost, and ROI per latency offset. (Recreation)
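
To make the shape of that table concrete, here is a minimal sketch in Python. It is an illustration only, not the production system: the field names, the placeholder values, and the selection rule (the cheapest offset that stays within an NE regression budget) are all assumptions I am introducing for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OffsetRow:
    """One row of a completed freshness study. All field names are hypothetical."""
    offset_hours: int         # latency offset relative to the current setting
    ne_regression_pct: float  # projected NE % regression (quality side)
    gas_change_pct: float     # projected GAS change (quality side)
    est_cost: float           # estimated cost, arbitrary units (business side)
    est_roi: float            # estimated ROI, arbitrary units (business side)


def cheapest_within_budget(
    rows: Sequence[OffsetRow], max_ne_regression_pct: float
) -> Optional[OffsetRow]:
    """Return the lowest-cost row whose NE regression fits the budget.

    This rule is an assumed example of how an engineer might read the table;
    the real surface leaves the decision to the operator.
    """
    eligible = [r for r in rows if r.ne_regression_pct <= max_ne_regression_pct]
    return min(eligible, key=lambda r: r.est_cost, default=None)


# Placeholder values, invented purely to exercise the function.
ROWS = [
    OffsetRow(offset_hours=-2, ne_regression_pct=0.00, gas_change_pct=0.10, est_cost=140.0, est_roi=0.8),
    OffsetRow(offset_hours=0, ne_regression_pct=0.00, gas_change_pct=0.00, est_cost=100.0, est_roi=1.0),
    OffsetRow(offset_hours=2, ne_regression_pct=0.05, gas_change_pct=-0.02, est_cost=80.0, est_roi=1.3),
    OffsetRow(offset_hours=4, ne_regression_pct=0.20, gas_change_pct=-0.10, est_cost=65.0, est_roi=1.1),
]

choice = cheapest_within_budget(ROWS, max_ne_regression_pct=0.10)
if choice is not None:
    print(f"Chosen offset: {choice.offset_hours:+d}h at estimated cost {choice.est_cost}")
```

The point of the sketch is the same as the point of the table: once quality, cost, and ROI sit in one row per offset, the comparison becomes a simple read rather than something assembled from raw metrics.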

The bet was that an ML expert makes a better and faster freshness call when quality, cost, and ROI are lined up per offset rather than left as raw metrics to assemble. It is the same principle as my ML Guardian work: lead with the decision-relevant synthesis, not the underlying data.

Featured design decision 02

Overview and groupings

The model overview surface did the same synthesis at the model level, pulling lifecycle stage (author, train, serve), ATS freshness, and snapshot health into a single read. A snapshot is a model state that the system captures on a schedule or that an engineer triggers, and its health reads as passed, bypassed, rejected, or purged. Groupings let people carve the catalog into the working set that matched their task, and the system recommended groupings by suggesting similar models based on features, datasets, and purpose. The line I held throughout was that AI proposed and the operator disposed, so that 1) people moved faster when the suggestion was right and 2) they kept full control when it was not.

Model overview. Lifecycle stage, ATS freshness, and snapshot health over time (passed / bypassed / rejected / purged) in one read. (Recreation)

New project modal. A model assistant proposes similar models by features, datasets, and purpose; the operator adds or dismisses each. (Recreation)

Make the case for custom groupings

Groupings almost didn’t ship. I was scoped to fix the information architecture, and the team wanted that done fast and was wary of a net-new feature, so the proposal drew real pushback from a mostly absent product manager and a risk-averse engineering manager. The spreadsheet workaround was the proof that the need was real, but I didn’t argue it on instinct. I synthesized the findings from every prior research report, ran my own round of testing on the designs with eight users, and had an engineer field a survey on which direction people preferred. All three pointed the same way. It took about a month of back-and-forth, but the case held, and we built and shipped it.

Outcome

I was laid off in the May 2026 reduction before the launch metrics were analyzed, so I cannot claim a concrete adoption or savings number on this one. What I can show is the design reasoning and the surfaces that shipped.

What I learned

Synthesis beats more data.

For expert users the design win is usually pulling the decision-relevant signal into one read, not adding another dashboard.

Recommendations earn trust by being overridable.

An AI suggestion the operator cannot inspect or reject is just an unaccountable decision, and the override is what makes the recommendation usable at all.

Evidence wins the room.

With the researchers gone and the team still ramping, there was no shared context to appeal to, so I built the case from evidence: prior research, a usability round, and a preference survey that all pointed the same way. In an engineering-heavy org, that triangulation is what turns a designer’s conviction into something a reluctant team will fund and build.

⚠ Screens shown are portfolio reconstructions with fictional data. Product names, financial figures, and internal references have been replaced. Design decisions, interaction patterns, and outcomes reflect the actual work.
