How do you monitor something you can't detect?

I ran into an interesting (and frankly annoying) problem at work the other day. For context, I work on a personal finance app and my team is responsible for looking after our credit card product. We work with several vendors to make this happen, including one that handles our card provisioning. We’ve had several issues with this vendor before, and it didn’t come as a surprise that the incident we’ll discuss was caused by them as well.

The incident

As with all incidents, I got a page one morning from our customer success team that multiple users were facing issues adding their credit card to their mobile wallet. This is one of the flows where we rely on our vendor’s SDK, both for detecting whether a user already has our card in their wallet and to kick off the provisioning process. A quick investigation revealed that while starting the provisioning flow from the app was working fine, in the wallet app, users were unable to verify their identity (one-time passcode sent by our vendor to their email/phone) to finish adding the card to their wallet.

Obviously, this is not something we influence and we had to chase down the vendor, have several back-and-forths with them to get them to understand the issue and finally fix it. Overall, from when we were notified by customers, the issue was resolved in less than a day. Unfortunately, we took ages to detect the issue in the first place as we had to wait for our customers to start complaining and our customer success teams to run through the usual triage process and connect the dots to find the greater pattern. This meant that the issue went undetected for just shy of three days before we were notified.

A detection time of 3 days is terrible. Even though we didn’t cause the issue, we weren’t able to quickly notify our vendor, as we had no visibility into what was happening inside the flow. To make matters worse, the vendor SDK was reporting a successful card added event to us because technically the card was added to the user’s wallet, just in an unverified state where they couldn’t actually use it.

Reducing our detection time

As part of the root cause analysis, I had the idea to set up a proxy metric that would hopefully allow us to reduce the three-day time to detection and proposed it as a follow-up in the post-mortem. Since we can’t directly measure the success of the add to wallet process, I thought we could instead continuously sample how many of our users have a card but haven’t added it to their wallet.

This sounds expensive, but it was something we were already doing on app start to determine whether we need to render various “Add to wallet” buttons in the app. We could simply hook into this check and log a metric every time we detect that the user didn’t have our card added to their wallet. We could then set up a monitor and an alert on our monitoring platform to let us know if the baseline ever spiked in a 1 - 2 hour rolling window.

This way, we should be notified a lot sooner if customers are running into issues and can proactively get our vendor on the line to fix the issue.