
How We Stayed Reliable During Enrollment Season: Capacity, Change Discipline, and Clarity


Perspective


By Nick Howard

Jan 16, 2026

4 min read

A ship weathering a storm

Medicare AEP and ACA Open Enrollment are predictable surges where dialer and CRM reliability directly shows up as enrollments and revenue. Onyx Platform went into AEP 2026 as a young company, but our team has lived through many enrollment seasons as operators and builders.

Our thesis is simple: enrollment season reliability is not luck. It comes from explicit design and infrastructure choices, disciplined change management, and clear communication with clients.

During AEP we had no customer-reported platform outages. On the day in late October when an AWS us-east-1 outage broke much of the internet, our software kept running all day, although one of our service providers had two short outages.

Enrollment season is a complex operating environment, not just higher volume

AEP and OEP are not just "more traffic." They are periods where small platform failures compound quickly because agents are stacked back-to-back, queues are full, and agencies have invested heavily in marketing.

For a telephonic enrollment operation, reliability has four operational requirements:

  • Sustained call connectivity so connects do not turn into dead air, one-way audio, or failed transfers.

  • Consistent CRM responsiveness so agents can load a person record and write outcomes without lag or timeouts.

  • Complete recording and transcript retention so you do not create compliance gaps when volume is highest.

  • A support team that understands what matters, so that if degradation happens, systems and people prioritize enrollment-critical workflows first.

The most common enrollment season failure pattern is avoidable: unplanned changes land during peak usage windows, infrastructure capacity runs too tight for hours instead of minutes, and ownership gets blurry when something breaks. If you treat enrollment season as a distinct operating environment with its own rules, you can set guardrails before you are under pressure.
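
To make that concrete, here is a minimal sketch of one such guardrail: a pre-deploy check that refuses routine releases during peak enrollment-season hours. The months, hours, and script are illustrative assumptions for this post, not a picture of our actual release tooling.

```python
# Minimal sketch of a deploy-window guardrail (illustrative only; the season
# months and peak hours below are hypothetical examples, not real policy).
from datetime import datetime, time
from zoneinfo import ZoneInfo

PEAK_START = time(8, 0)               # hypothetical start of peak dialing hours
PEAK_END = time(21, 0)                # hypothetical end of peak dialing hours
ENROLLMENT_SEASON_MONTHS = {10, 11, 12, 1}  # roughly AEP plus ACA Open Enrollment

def deploy_allowed(now: datetime) -> bool:
    """Refuse routine deploys during peak enrollment-season hours."""
    if now.month not in ENROLLMENT_SEASON_MONTHS:
        return True
    return not (PEAK_START <= now.time() <= PEAK_END)

if __name__ == "__main__":
    now = datetime.now(ZoneInfo("America/New_York"))
    if not deploy_allowed(now):
        raise SystemExit("Blocked: inside an enrollment-season peak window.")
    print("Deploy window open.")
```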

Four levers to maximize uptime

Extra infrastructure capacity headroom buys decision time

We added database and compute capacity far beyond what was necessary or expected. That was not the cheapest option; we optimized for customer success over our own short-term margins, and it was an easy decision. Paying an extra 20% on our AWS bill to make sure our customers don’t lose one of their most critical days is the sort of tradeoff the Onyx Platform brand is built on. We always prioritize our clients’ needs over our own.

When you have room, a spike just becomes a slow queue you can drain, not a cascade of timeouts that drops calls and loses CRM records.
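
A rough back-of-the-envelope sketch shows why: with the same backlog and arrival rate, a thin capacity surplus takes well over an hour to drain while a comfortable one takes minutes. The numbers below are illustrative, not measurements from our fleet.

```python
# Minimal sketch of why capacity headroom buys decision time
# (illustrative numbers only; not measured figures from production).

def drain_minutes(backlog_jobs: float, arrival_per_min: float, capacity_per_min: float) -> float:
    """Minutes to drain a backlog when capacity exceeds arrivals; infinite otherwise."""
    surplus = capacity_per_min - arrival_per_min
    return float("inf") if surplus <= 0 else backlog_jobs / surplus

backlog = 5_000      # hypothetical backlog of CRM writes left by a short spike
arrivals = 900       # hypothetical steady-state writes per minute during peak

tight = drain_minutes(backlog, arrivals, capacity_per_min=950)    # ~6% headroom
roomy = drain_minutes(backlog, arrivals, capacity_per_min=1_200)  # ~33% headroom

print(f"tight fleet drains in ~{tight:.0f} min")   # ~100 minutes of elevated latency
print(f"roomy fleet drains in ~{roomy:.0f} min")   # ~17 minutes: a slow queue, not a cascade
```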

A feature freeze reduces surprise and protects core paths

We froze product changes ahead of peak windows and limited deployments to low-risk updates. A change qualified as "safe" only if it met all of these criteria:

  • It avoided telephony, recording, and data-integrity paths.

  • It shipped behind a feature flag with a fast off switch (a minimal sketch follows this list).

  • It had a rollback playbook and an explicit monitoring plan (our changes always do).
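
Here is the kind of "fast off switch" we mean, sketched in a few lines. The flag store and names are hypothetical, not our actual flag system; the point is that the flag is read at call time, so flipping it takes effect immediately without a redeploy.

```python
# Minimal sketch of a feature flag with a fast off switch (illustrative only;
# the flag file path and flag name are hypothetical).
import json
from pathlib import Path

FLAGS_PATH = Path("/etc/app/flags.json")  # hypothetical runtime-editable flag file

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read the flag on every call so flipping the file takes effect immediately."""
    try:
        flags = json.loads(FLAGS_PATH.read_text())
        return bool(flags.get(name, default))
    except (OSError, json.JSONDecodeError):
        return default  # fail closed: a missing or broken flag file disables the change

def render_disposition_panel(record: dict) -> str:
    if flag_enabled("new_disposition_panel"):
        return f"[new panel] {record['name']}"
    return f"[current panel] {record['name']}"

print(render_disposition_panel({"name": "example caller"}))
```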

Reviews included customer workflow risk, not just code correctness

Tests passing is not the bar during enrollment season (it should never be the only bar). We reviewed changes through the lens of customer workflow: what screen loads when a call connects, what gets written at disposition, and what the agent does next.

For any change that touched those flows, we required clear answers to two questions: what is the worst plausible failure mode, and how do we detect it fast?

That kept us focused on user impact, not just implementation details.
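
One way to answer the detection question is a cheap synthetic probe of the workflow itself, rather than waiting on infrastructure metrics or support tickets. A minimal sketch follows; the endpoint, thresholds, and paging hook are hypothetical, not our production monitoring.

```python
# Minimal sketch of fast failure detection for an enrollment-critical path
# (illustrative; the probe URL, latency budget, and alert hook are hypothetical).
import time
import urllib.request

PROBE_URL = "https://example.invalid/api/person/health"  # hypothetical read-path probe
LATENCY_BUDGET_S = 2.0

def probe_once() -> tuple[bool, float]:
    """Return (healthy, seconds) for one synthetic check of the person-record path."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=LATENCY_BUDGET_S) as resp:
            healthy = resp.status == 200
    except Exception:
        healthy = False
    return healthy, time.monotonic() - start

def alert(message: str) -> None:
    print(f"PAGE ON-CALL: {message}")  # stand-in for a real paging integration

if __name__ == "__main__":
    failures = sum(1 for _ in range(3) if not probe_once()[0])
    if failures >= 2:  # two of three failed probes: page, do not wait for tickets
        alert("person-record reads failing or exceeding the latency budget")
```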

Simple infrastructure choices limit blast radius

We are intentional about every hardware and software choice. The more variety you add, the less resilient you become.

Our platform does not rely on higher-level infrastructure features like AWS Lambda, and we set the minimum and maximum sizes of the container fleet ourselves rather than leaning on auto-scaling. When you have strong engineers, you can build software that runs on simpler infrastructure, and that makes a more reliable platform. During the October 2025 AWS disruption, some services that depended on those higher-level features experienced extended outages; our minimalist stack stayed up throughout.

We also scheduled major migrations outside core seasons. During AEP and OEP, the default stance was to preserve enrollment-critical behavior.

Practically, that meant defining what stays up first:

  • Dialing and call control.

  • Loading person records and writing the outcomes that drive follow-up work.

  • Call recording capture and durable storage.

And it meant defining what can slow down or pause if needed without blocking enrollments: deep analytics, long-running scoring jobs, and non-critical data exports.
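
In code, that priority split can be as simple as tagging work by tier and pausing deferrable jobs whenever the platform is degraded. The tier names and job list below are an illustrative sketch, not our actual scheduler.

```python
# Minimal sketch of tiered degradation (illustrative; tier names and jobs
# are hypothetical examples of enrollment-critical vs. deferrable work).
from enum import Enum

class Tier(Enum):
    ENROLLMENT_CRITICAL = 1   # dialing, call control, person records, recording capture
    DEFERRABLE = 2            # deep analytics, long-running scoring, bulk exports

JOBS = {
    "place_outbound_call": Tier.ENROLLMENT_CRITICAL,
    "write_disposition": Tier.ENROLLMENT_CRITICAL,
    "store_call_recording": Tier.ENROLLMENT_CRITICAL,
    "nightly_lead_scoring": Tier.DEFERRABLE,
    "analytics_rollup": Tier.DEFERRABLE,
    "bulk_data_export": Tier.DEFERRABLE,
}

def should_run(job: str, degraded: bool) -> bool:
    """Under degradation, pause deferrable work so critical paths keep their capacity."""
    return not degraded or JOBS[job] is Tier.ENROLLMENT_CRITICAL

degraded = True  # e.g., an upstream provider is having a bad day
for job, tier in JOBS.items():
    state = "run" if should_run(job, degraded) else "pause"
    print(f"{state:>5}  {job} ({tier.name})")
```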

That approach mattered during the late-October AWS us-east-1 disruption that affected multiple SaaS dependencies used across the industry, including Twilio for some call flows. While AWS worked through its outage for most of the day, we limited customer impact to two 30-minute windows.

Make uptime predictable with an SLA, maintenance windows, and clear ownership

Clear expectations reduce operational risk for both sides. We publish an SLA that separates enrollment-critical functionality from secondary features, and we define maintenance windows that minimize agent impact.

In practice, that means:

  • Labeling which workflows are "must stay up" during enrollment season and which can slow down if needed.

  • Communicating planned changes early, with operator-facing impact statements.

  • Keeping on-call ownership clear, with runbooks that match real enrollment season failure modes.

We plan to keep treating enrollment season as a distinct operating environment, with the same discipline around capacity, change control, and post-incident learning.

If you want to talk through AEP or OEP readiness for your agency, contact us.