Maria Jimbo | Senior Product Designer

ML Notifications

Alerts that drive decisions, not just inform them

META · 2025 · LEAD Product Designer

3.6x

Faster resolution

22%

early termination

+20.5%

clickthrough rate

78.8%

active engagement

The problem

Meta runs an enormous number of ML training jobs daily, and the longer a problem goes undetected, the more compute and engineer time is wasted. The legacy system was dashboard-based: the backend could detect issues in real time but had no way to surface them to the right person at the right moment. As a result, most issues were caught days late, or the ML engineer had to manually "babysit" the job (keep checking its job page for status updates). The surfacing problem was the design problem.

My role and constraints

I joined mid-half, two weeks before a planning deadline, and led end-to-end design with two engineers and a PM. Three constraints shaped the work:

Notification fatigue on a different level.

Meta engineers get hundreds of notifications a week, so the bar to earn engagement was already high.

High stakes.

A typical action is killing a training job or investigating a pipeline, both with real cost (dollars/GPU hours, engineer hours).

Trust in the detection.

Early false positives would make engineers dismiss the whole system.

Featured design decision

Connect notifications to the decision flow

The core decision was simple to state: don’t make alerts inform, make alerts decide. Most systems follow inform-then-redirect, which means two context switches. Faced with that, most engineers dismiss the alert and never come back. I designed the notification to carry enough context to act directly. Each one included:

• what was detected, in plain language

• the category of alert, so engineers could weight false-positive risk

• the specific job, its owner, and the time window

• two actions in the notification itself: Terminate and Investigate

A decision-ready notification: plain-language detection with the values bolded, alert category for weighting false-positive risk, job ownership and time window, and both actions available without leaving the alert. (Recreation)
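
As a rough sketch, the four items above could map onto a payload like the one below. Every field name, the `AlertAction` type, and the `missingContext` helper are hypothetical and chosen for illustration; they are not the production schema. The helper only shows the idea that an alert missing any of this context would push the engineer back into inform-then-redirect.

```typescript
// Hypothetical shape of a decision-ready alert. All names are illustrative
// assumptions, not the real schema.
type AlertAction = "terminate" | "investigate";

interface DecisionReadyAlert {
  summary: string;        // what was detected, in plain language
  category: string;       // alert category, used to weigh false-positive risk
  jobId: string;          // the specific job
  owner: string;          // the job's owner
  windowStart: Date;      // start of the time window the detection covers
  windowEnd: Date;        // end of that time window
  actions: AlertAction[]; // actions offered inside the notification itself
}

// Returns the names of the context items that are missing or empty, so an
// alert that would force a redirect can be caught before it is sent.
function missingContext(alert: Partial<DecisionReadyAlert>): string[] {
  const missing: string[] = [];
  if (!alert.summary) missing.push("summary");
  if (!alert.category) missing.push("category");
  if (!alert.jobId) missing.push("jobId");
  if (!alert.owner) missing.push("owner");
  if (!alert.windowStart || !alert.windowEnd) missing.push("window");
  const actions = alert.actions || [];
  if (actions.indexOf("terminate") === -1 || actions.indexOf("investigate") === -1) {
    missing.push("actions");
  }
  return missing;
}
```

Treating both actions as required is a design choice in this sketch: an alert that can only inform is the case the notification was meant to avoid.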

Terminate was the controversial one. I argued the costs are asymmetric: a false-positive termination loses a few restartable hours, while a missed real issue can burn days of compute, so the asymmetry favors making termination easy, behind a confirm dialog showing the recent metric trend. The 22% early-termination rate validated it, because the system surfaced the decision when it was actually decidable.

The terminate confirm dialog shows the relevant metric trend chart against baseline/threshold, so engineers can verify the alert against the data in one glance before committing to an action. (Recreation)
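
The asymmetry argument can be written as a small expected-cost comparison. This is a simplified sketch rather than the analysis used on the project: it assumes a false-positive termination costs a fixed number of restartable hours, a missed real issue costs a fixed number of wasted compute hours, and nothing else is at stake. The `CostModel` type, both function names, and all numbers are hypothetical.

```typescript
// Hypothetical cost model; the case study reports no per-job costs.
interface CostModel {
  falseTerminationHours: number; // restartable hours lost if the alert was wrong
  missedIssueHours: number;      // compute hours burned if a real issue keeps running
}

// Expected cost of terminating:  (1 - p) * falseTerminationHours
// Expected cost of continuing:   p * missedIssueHours
// Setting the two equal gives the break-even probability of a real issue.
function breakEvenProbability(c: CostModel): number {
  return c.falseTerminationHours / (c.falseTerminationHours + c.missedIssueHours);
}

// True when terminating has the lower expected cost under this model.
function shouldTerminate(pRealIssue: number, c: CostModel): boolean {
  return pRealIssue * c.missedIssueHours > (1 - pRealIssue) * c.falseTerminationHours;
}

// Illustrative numbers only: 4 restartable hours vs. 48 hours (two days) of compute.
const example: CostModel = { falseTerminationHours: 4, missedIssueHours: 48 };
const threshold = breakEvenProbability(example); // 4 / 52, roughly 0.077
```

With these made-up numbers, terminating has the lower expected cost once the chance of a real issue exceeds about 8%, which is why a bounded, restartable loss can justify putting Terminate one click away.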

What I learned

Notification design is decision-flow design.

I thought I was designing a component; I was designing a decision flow that happened to start in a notification.

Asymmetric-cost analysis is a tool I now reach for.

On any false-positive-versus-false-negative trade-off, find which error is bounded, because that’s the one you can afford to make more often.

Fast ramp-up is a skill.

I became useful by pairing with people and watching them work in their existing tools rather than asking them to explain things in meetings. Observe, then ask (promptly).

⚠ Screens shown are portfolio reconstructions with fictional data. Product names, metrics, and internal references have been replaced. Design decisions, interaction patterns, and outcomes reflect the actual work.
