Operating a two-service stack: what we learned in the first year

Switchboard ships as two services and a Solana program. After a year of running it for paying customers, here is what worked, what broke, and what we changed.

A year ago we shipped Switchboard with a deliberately simple architecture: a Customer API on port 3000, a Core Engine on port 3001, a Postgres or Mongo database behind them, and a Solana coordinator program. We told ourselves we would resist the temptation to add a third service until we had proven we could not get away with two. Twelve months in, we have proven we can get away with two — but it has not been a free ride. This post is the operational retro.

What worked

The two-service split was the right call. Every time we got an “outage” page in the first six months, the on-call engineer could identify within ninety seconds whether it was a Customer API problem (auth, rate limiting, request shape) or a Core Engine problem (oracle, chain ops, billing). Two services means two log streams, two metric prefixes, two runbooks. Three services means a six-way decision tree. We have shipped roughly thirty incident reports in twelve months; in none of them was the diagnosis confused by service boundaries.

The Postgres-or-Mongo choice was popular. We almost did not ship it — supporting two databases is a real ongoing cost — but the customer reaction was overwhelmingly positive. Some teams have a strict Postgres-shop policy; some have a strict Mongo-shop policy; one had a strict “whichever you support, we will pick the other” policy. Supporting both let us close deals we would otherwise have lost on the database religion question.

The MIT license worked as a sales tool. This was a surprise. We expected the license to mostly drive open-source contributors. What it actually did was disarm procurement teams: when the contract negotiation got tense, we could point at the repo and say “if this ever stops working for you, here is the eject button.” Three deals went from “we need a discount” to “we need a contract” the moment the OSS exit option was confirmed.

The 92% test coverage number was load-bearing. We hit that number early and protected it. When a customer or a regulator asked about reliability, the number was there. We did not have to argue, we did not have to talk about ad-hoc QA processes, we just pointed at the CI badge. Coverage is not correctness, but the conversation is much shorter when the number is high.

What broke

Solana RPC rate limits, three different times. We had leaned heavily on public RPCs in early dev. The moment we hit production volume, the RPCs throttled us, and the symptom was “Core Engine intermittently times out on coordinator reads.” It took us longer than it should have to figure out that the coordinator was fine and our RPC budget was the problem. Fix: we now run a dedicated Solana RPC pool with a paid provider, plus a self-hosted RPC node as a fallback. We document this explicitly in the deployment guide.
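A minimal sketch of the endpoint-selection idea, not the production implementation. The class name `RpcPool`, the `.example` URLs, and the 30-second cooldown are all illustrative assumptions; the only thing taken from the text above is the shape: a paid provider first, a fallback second, and throttling treated as a signal about the RPC budget rather than about the coordinator.

```typescript
// Illustrative only: RpcPool, the endpoint URLs, and the 30-second cooldown
// are assumptions made for this sketch, not Switchboard's actual code.
class RpcPool {
  private readonly urls: string[];
  private readonly cooldownMs: number;
  // url -> timestamp (ms) before which the endpoint should be skipped
  private readonly coolingUntil = new Map<string, number>();

  constructor(urls: string[], cooldownMs = 30_000) {
    if (urls.length === 0) throw new Error("RpcPool needs at least one endpoint");
    this.urls = [...urls]; // order is priority: paid provider first, fallback last
    this.cooldownMs = cooldownMs;
  }

  // Call this when an endpoint throttles the request or times out.
  reportThrottled(url: string, nowMs: number = Date.now()): void {
    this.coolingUntil.set(url, nowMs + this.cooldownMs);
  }

  // Returns the highest-priority endpoint that is not cooling down. If every
  // endpoint is cooling down, returns the one that recovers soonest.
  pick(nowMs: number = Date.now()): string {
    let soonest: string | undefined;
    let soonestUntil = Infinity;
    for (const url of this.urls) {
      const until = this.coolingUntil.get(url) ?? 0;
      if (until <= nowMs) return url;
      if (until < soonestUntil) {
        soonest = url;
        soonestUntil = until;
      }
    }
    if (soonest === undefined) throw new Error("unreachable: pool is never empty");
    return soonest;
  }
}

// Usage: the caller picks an endpoint per request and reports throttling back.
const pool = new RpcPool(["https://paid.example/rpc", "https://backup.example/rpc"]);
const endpoint = pool.pick();
```

Keeping the selection logic free of network calls makes it easy to log which endpoint served each coordinator read, which is the signal that separates an RPC budget problem from a coordinator problem.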

Lerna’s bootstrap surprised us. We chose Lerna early because the monorepo had multiple publishable packages. Its bootstrap step occasionally produced inconsistent dependency trees that worked in dev but failed in CI. We have now mostly moved to npm workspaces, with selective use of Lerna for publishing. The lesson was generic: pick the smallest tooling that solves the problem, and migrate only when a replacement is genuinely better.

The first benchmark suite was a lie. Our internal benchmarks ran on a local devnet and showed brilliant latency numbers. Production numbers were 4x worse and we got rightfully clowned in customer demos for a month. We rebuilt the suite to run on mainnet with real fees and produce per-release reports. The lesson: if your benchmark does not pay real money, it does not produce honest numbers.

The “default” verifier choice was too aggressive. We initially defaulted new routes to light-client verification on every chain that supported it. The problem: light-client verification requires the destination chain’s light client to be up to date. It usually is, but during chain congestion the light client falls behind and the route stalls. We changed the default to light-client verification with a BLS fallback after 5 seconds, and the stall rate dropped by 90%. The change is documented in the upgrade notes.
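The timeout-then-fallback pattern can be sketched as below. This is an illustration under assumptions, not the shipped verifier: `verifyWithFallback`, the `Verifier` type, and the `via` labels are hypothetical names, and falling back when the light client fails outright (rather than only when it is slow) is a choice made for the sketch.

```typescript
// Hypothetical names throughout: Verifier, Verified, and verifyWithFallback
// are illustrative and are not taken from the Switchboard SDK.
type Verifier<T> = () => Promise<T>;

interface Verified<T> {
  value: T;
  via: "light-client" | "bls";
}

async function verifyWithFallback<T>(
  lightClient: Verifier<T>,
  bls: Verifier<T>,
  timeoutMs = 5_000, // the 5-second default described above
): Promise<Verified<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const primary = lightClient().then(
    (value): Verified<T> => ({ value, via: "light-client" }),
  );

  try {
    // Whichever settles first wins: the light client or the timer.
    const winner = await Promise.race([primary, timedOut]);
    if (winner !== "timeout") return winner;
  } catch {
    // Assumption: an outright light-client failure also falls back to BLS.
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }

  // The light client was too slow (or failed), so use the BLS path instead.
  return { value: await bls(), via: "bls" };
}
```

Returning `via` alongside the result lets a caller record which verifier actually served each request, so a rise in BLS fallbacks shows up as a metric instead of a stalled route.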

What we changed

Three concrete changes in the first year:

Observability stack came in earlier than planned. We originally targeted Q3 for a per-route metrics dashboard. Customers asked for it in week six. We shipped a minimal version in week eight and a real one in month four. The takeaway: observability is not a “phase 2” feature. If your customers cannot see what their traffic is doing, they will assume the worst.

Customer API rate limiting got smarter. Initial rate limits were per-API-key, per-minute. This was the wrong shape: real apps have bursty traffic with quiet periods. We moved to a token-bucket model with per-key configurable burst and refill rates. The number of “rate limited in production” support tickets dropped from roughly one per day to roughly one per month.
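For readers who have not built one, a token bucket with per-key burst and refill settings looks roughly like this. It is a generic sketch of the technique, not the Customer API's code: `TokenBucket`, `KeyedRateLimiter`, `BucketConfig`, and the example numbers are all hypothetical, and the clock is passed in explicitly so the behavior is easy to test.

```typescript
// Generic token-bucket sketch. All names and numbers here are illustrative
// assumptions, not the Customer API's actual implementation.
interface BucketConfig {
  burst: number;           // maximum tokens the bucket can hold
  refillPerSecond: number; // tokens added back per second
}

class TokenBucket {
  private readonly config: BucketConfig;
  private tokens: number;
  private lastMs: number;

  constructor(config: BucketConfig, nowMs: number) {
    this.config = config;
    this.tokens = config.burst; // start full so a fresh key can burst immediately
    this.lastMs = nowMs;
  }

  tryConsume(nowMs: number): boolean {
    // Refill in proportion to elapsed time, capped at the burst size.
    const elapsedSec = Math.max(0, nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(
      this.config.burst,
      this.tokens + elapsedSec * this.config.refillPerSecond,
    );
    this.lastMs = Math.max(this.lastMs, nowMs);
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

class KeyedRateLimiter {
  private readonly buckets = new Map<string, TokenBucket>();
  private readonly defaults: BucketConfig;
  private readonly overrides: Map<string, BucketConfig>;

  constructor(defaults: BucketConfig, overrides: Map<string, BucketConfig> = new Map()) {
    this.defaults = defaults;
    this.overrides = overrides;
  }

  // One bucket per API key, created lazily with that key's configuration.
  allow(apiKey: string, nowMs: number = Date.now()): boolean {
    let bucket = this.buckets.get(apiKey);
    if (bucket === undefined) {
      bucket = new TokenBucket(this.overrides.get(apiKey) ?? this.defaults, nowMs);
      this.buckets.set(apiKey, bucket);
    }
    return bucket.tryConsume(nowMs);
  }
}
```

With `{ burst: 3, refillPerSecond: 1 }`, a key can send three requests back to back after a quiet period and then sustain one per second, which is the bursty shape a fixed per-minute window handles badly.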

We standardized on fail-closed as the default and made fail-open explicit. Early on, customers picked fail-open because it sounded better: “open” sounds like a feature. After the second tense email from a customer whose fail-open route had silently desynced their app, we flipped the default. Routes are now fail-closed unless the caller opts into fail-open with an explicit acknowledgement in the SDK (“I understand the consistency tradeoff”). Friction in the right place beats clarity nowhere.
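One way to put that friction into an API is to make the safe mode the default and reject the risky one unless a separate flag is set. The sketch below is hypothetical: `RouteOptions`, `failureMode`, `acknowledgeConsistencyTradeoff`, and `resolveFailureMode` are names invented for illustration and are not the real SDK surface.

```typescript
// Hypothetical option names; the real SDK's fields may differ.
type FailureMode = "fail-closed" | "fail-open";

interface RouteOptions {
  failureMode?: FailureMode;
  // Must be true to select fail-open ("I understand the consistency tradeoff").
  acknowledgeConsistencyTradeoff?: boolean;
}

function resolveFailureMode(options: RouteOptions = {}): FailureMode {
  // Omitting the option gives the conservative behavior.
  const mode = options.failureMode ?? "fail-closed";
  if (mode === "fail-open" && options.acknowledgeConsistencyTradeoff !== true) {
    throw new Error(
      "fail-open requires acknowledgeConsistencyTradeoff: true",
    );
  }
  return mode;
}

// Usage: the default needs no configuration; fail-open takes two deliberate fields.
const safeMode = resolveFailureMode();
const riskyMode = resolveFailureMode({
  failureMode: "fail-open",
  acknowledgeConsistencyTradeoff: true,
});
```

Throwing at configuration time, rather than logging a warning, means the tradeoff is surfaced when the route is created instead of when it first desyncs.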

What we are doing next year

A short list, with honest scoping:

Native chain support for two more L1s. Chain count is the most-requested feature. We are picking one Move chain and one alt-L1 based on customer pull.

A proper light-client SDK. Right now we ship the light clients per chain as part of the destination adapter. We are pulling the common code into a reusable library so external teams can deploy their own light clients and have us route through them.

A “bring your own coordinator” mode for self-hosters. Some self-hosters do not want the Solana dependency. We are designing a Postgres-coordinator mode that loses the latency win but keeps the API surface. It will not be the recommended path, but it will exist.

A more honest pricing calculator. The current calculator overestimates batch sizes in some configurations. We are rewriting it against real production data per chain.

The lesson worth saving

Most of the operational pain in the first year did not come from the architecture we shipped. It came from the second-order details: RPC budgets, tooling choices, default policies, observability rollout timing. The architecture was actually fine. The smallest stack that does the job, instrumented well, with conservative defaults and honest documentation, beats a clever architecture with hidden cliffs every time.

If you are about to ship a cross-chain stack of your own, that is the meta-lesson. Pick the boring architecture. Spend the saved complexity on getting the defaults right and the observability earlier than you planned. Your on-call engineer in twelve months will thank you.

We will probably write another one of these next year. The numbers should be more interesting by then.