
Upgrading Real-time Event Notifications: Unveiling Our Webhook Overhaul

By Etienne Hardy / February 20, 2024


APIs are ubiquitous these days. Almost all SaaS platforms offer APIs to allow their customers to automate their business processes and workflows. But APIs are not enough. Customers building on APIs often need to be notified of significant events happening on the platform they are integrating with in real time in order to execute some business workflow on their end. In the case of the AppDirect platform, such significant events might be that a new user was created, a subscription was purchased or canceled, or a new product was published. Such a real-time system-to-system event communication mechanism is commonly called webhooks.

Webhooks allow the AppDirect platform to notify interested customers when important events happen. This article provides a quick overview of what webhooks are and how they work and, most importantly, walks you through how we at AppDirect revamped our webhook architecture in 2023 and the exciting new features the new architecture offers. Let’s go!

Check out our Developer Center for more details about how we use Webhooks

Webhooks overview

So what exactly are webhooks? As mentioned in the introduction, webhooks are a mechanism to notify interested parties of important events happening within the AppDirect platform, such as a subscription being purchased.

If webhooks did not exist, how could AppDirect customers know when a subscription was purchased? One way would be to use the AppDirect API and continuously poll for new subscriptions, something like the following:

Polling - periodically making requests to a system to check for new events or data.

Such a polling approach has many drawbacks, but the three that stand out the most are complexity, inefficiency, and notification delays. It’s more complicated for customers to implement, as they need to write a lot more code to keep track of new events. It is also highly inefficient, as customers might issue API calls even though no new events have happened, consuming API resources and possibly counting against resource quotas. Lastly, there might be delays before customers get notified of a new event, depending on the polling frequency.
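
To make these drawbacks concrete, here is a minimal Python sketch of a polling client. It is an illustration only: `fetch_subscriptions` stands in for whatever API call returns the current subscriptions, and the function name, parameters, and the `id` field are assumptions, not part of the AppDirect API.

```python
import time
from typing import Callable, Iterable


def poll_for_new_subscriptions(
    fetch_subscriptions: Callable[[], Iterable[dict]],  # hypothetical API call
    handle_new: Callable[[dict], None],
    interval_seconds: float = 60.0,
    max_polls: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll repeatedly and return how many polls found nothing new."""
    seen_ids = set()  # complexity: the client must remember what it already saw
    empty_polls = 0
    for poll_number in range(max_polls):
        new_items = [s for s in fetch_subscriptions() if s["id"] not in seen_ids]
        if not new_items:
            empty_polls += 1  # inefficiency: an API call that returned no news
        for item in new_items:
            seen_ids.add(item["id"])
            handle_new(item)
        if poll_number < max_polls - 1:
            sleep(interval_seconds)  # delay: events wait until the next poll
    return empty_polls
```

Even in this small sketch, the client has to track what it has seen, pays for calls that return nothing, and only learns about an event at the next polling interval.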

So how exactly are webhooks different? A picture is worth a thousand words:

Webhook - Webhooks are a push mechanism as opposed to pull.
Webhooks are real-time system-to-system push notifications that relay events

Much simpler! Webhooks are a push mechanism as opposed to pull. Reusing the subscription purchase example, when such an event happens in the AppDirect platform, AppDirect will send an HTTP POST request to interested customers with a payload containing the important event data. The data is pushed to the listening system, which is much more efficient. Customers now simply have to process the request and execute their business specific logic, such as updating the number of subscriptions they purchased. Webhooks free integrating customers from having to determine when new subscriptions were created, which is a lot simpler. Customers are also notified in real time about events happening within the AppDirect platform, which might be important for certain use cases.
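
On the receiving side, a webhook listener can be as small as the following Python sketch, which uses only the standard library. The event type `SUBSCRIPTION_PURCHASED`, the payload shape, and the function names are assumptions made for illustration; the actual AppDirect payload format is described in the Developer Center.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory state standing in for the customer's own system.
STATE: dict = {}


def process_event(event: dict, state: dict) -> None:
    """Business-specific logic: count purchased subscriptions."""
    if event.get("type") == "SUBSCRIPTION_PURCHASED":  # hypothetical event type
        state["subscriptions"] = state.get("subscriptions", 0) + 1


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            event = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            event = None
        if not isinstance(event, dict):
            self.send_response(400)  # reject bodies that are not a JSON object
            self.end_headers()
            return
        process_event(event, STATE)
        self.send_response(204)  # acknowledge receipt with no response body
        self.end_headers()

    def log_message(self, format, *args) -> None:
        pass  # keep the example quiet


def make_server(port: int = 8080) -> HTTPServer:
    """Build the listener; call serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), WebhookHandler)
```

Running `make_server().serve_forever()` starts the listener. A production listener would also authenticate incoming requests, which this sketch leaves out.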

Now that we understand what webhooks are and what benefits they bring, let’s turn our attention to why AppDirect revamped its webhook architecture.

Why a webhook revamp?

There are a few reasons that prompted us to revamp our webhook architecture in 2023, but we will focus on the three critical ones.

  1. Now microservice driven: Firstly, our previous webhook implementation grew over time within AppDirect’s main monolithic module. There’s nothing wrong per se with monoliths, but AppDirect’s monolith is pretty large and the webhook functionality had become overly complex. With AppDirect’s platform now being primarily microservice driven, there was an opportunity to extract the webhook functionality from the monolith and make it available to all other microservices.

  2. Commitment to reliability and seamless integration: Secondly, our previous webhook implementation was simply not reliable enough for our customers. Webhooks are an important integration mechanism, and some customers have critical business logic that runs when a webhook is received, so webhooks need to be delivered reliably and in the expected order. When delivery failures occurred, customers had no visibility into them and had to resort to opening support cases.

  3. Erasing coupling and ownership questions: Last but not least, the webhook functionality, having grown inside the monolithic module, raised coupling and ownership questions. Who’s responsible for building domain-specific event payloads? Who’s responsible for delivering webhooks? These questions all arose from functionality that had grown organically over time. We needed a clear separation of concerns between webhook event payload building, which is domain specific, and the webhook delivery mechanism.

The previous implementation also did not provide the traceability required to properly support the feature and our customers. Finding the root cause of problems required a lot of log mining and database spelunking. Now customers have direct insight.

So it’s with these problems in mind that we set out to overhaul the AppDirect webhook functionality into a brand new microservice capable of scaling as platform usage grows. A side goal was to make it easy to implement new webhook-related functionality, a subject we’ll return to later in this article.

Domain-driven design applied to webhooks

Providing guaranteed webhook delivery, or as close to it as possible, does not happen by accident. There are many things that can go wrong at any point in time while delivering a webhook: there might be intermittent networking problems within our clusters, the customer’s webhook listener might be misconfigured or temporarily unavailable, webhook authentication might fail, and the list goes on and on. So how do we cope with these situations? We had some good foundations available to us, as we’ll see a bit later, but those were not sufficient in and of themselves. Being domain-driven design practitioners, we put failure scenarios at the core of our webhook domain model, and it started with a clear picture of the different possible states of a webhook:

Webhook states model

By having a clear webhook state model, we could make sure to capture these different scenarios in our code and test cases, so failure handling did not come as an afterthought, as is sometimes the case. You can see in the picture above that delivery and retries were first-class concerns, so they naturally appeared as objects in the codebase (we’ll discuss shortly how we were able to persist those domain objects to derive all the delivery metrics we wanted). The result is that the core webhook delivery logic is expressed in terms of abstractions and not technical details: abstractions are at the forefront and technical concerns have been pushed down as implementation details. There were many other design artifacts that would take too long to cover here, but suffice it to say that proper domain modeling and domain-driven design were key to the success of the project.
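
As a rough illustration of what treating delivery and retries as first-class domain objects can look like, here is a small Python sketch. The state names, the `DeliveryAttempt` and `Webhook` classes, and the `max_attempts` rule are all assumptions made for this example; they are not the actual AppDirect state model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class WebhookState(Enum):
    # Hypothetical states, chosen only to illustrate the idea.
    PENDING = "pending"
    RETRYING = "retrying"
    DELIVERED = "delivered"
    FAILED = "failed"


@dataclass(frozen=True)
class DeliveryAttempt:
    """One delivery attempt, kept so that metrics can be derived later."""
    number: int
    succeeded: bool
    error: Optional[str] = None


@dataclass
class Webhook:
    webhook_id: str
    max_attempts: int = 3
    state: WebhookState = WebhookState.PENDING
    attempts: List[DeliveryAttempt] = field(default_factory=list)

    def record_attempt(self, succeeded: bool, error: Optional[str] = None) -> None:
        """Record an attempt and move the webhook to its next state."""
        if self.state in (WebhookState.DELIVERED, WebhookState.FAILED):
            raise ValueError(f"webhook {self.webhook_id} is already {self.state.value}")
        self.attempts.append(DeliveryAttempt(len(self.attempts) + 1, succeeded, error))
        if succeeded:
            self.state = WebhookState.DELIVERED
        elif len(self.attempts) >= self.max_attempts:
            self.state = WebhookState.FAILED  # retries exhausted
        else:
            self.state = WebhookState.RETRYING
```

Because every attempt is an object with its own outcome, failure scenarios become ordinary test cases, and the recorded attempts are exactly the kind of data that delivery metrics can be built from.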

Technical architecture

To support this model, and to fully address some of the concerns that triggered the revamp, we needed some good foundational tools. Luckily, the AppDirect platform has a few we could leverage to build the new webhook architecture. The first one was Kafka, which is a durable streaming platform. Kafka provides well documented message ordering guarantees based on the partitioning key selected.

Read our blog about how we replaced our RabbitMQ layer with Kafka: 7 steps to replacing a message broker in a distributed system.

So Kafka looked interesting as the conduit between domain microservices and the webhooks service, as it could solve one aspect of the webhook ordering problem (Kafka is also used internally in the service implementation for webhook delivery). Storage-wise, we had MongoDB. Since we wanted to persist both webhook configurations and delivery metrics based on our domain model, MongoDB appeared to be an excellent solution, given its document-based storage format and the anticipated access patterns. MongoDB also offers Time-To-Live expiration on documents, which suits us well because we do not need to store delivery metrics forever.
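
For readers unfamiliar with MongoDB's Time-To-Live feature: a TTL index is declared on a date field, and MongoDB removes each document some time after that date plus the configured number of seconds. The Python sketch below shows the shape of such a setup. The field names, the 30-day retention period, and the record layout are assumptions for illustration, not the actual AppDirect schema.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical retention period for delivery metrics: 30 days.
DELIVERY_METRICS_TTL_SECONDS = 30 * 24 * 60 * 60

# A TTL index is a single-field index on a date field plus expireAfterSeconds.
TTL_INDEX_KEYS = [("createdAt", 1)]
TTL_INDEX_OPTIONS = {"expireAfterSeconds": DELIVERY_METRICS_TTL_SECONDS}


def build_delivery_record(
    webhook_id: str,
    status: str,
    attempts: int,
    now: Optional[datetime] = None,
) -> dict:
    """Build a delivery-metrics document with the date field the TTL index uses."""
    return {
        "webhookId": webhook_id,
        "status": status,
        "attempts": attempts,
        "createdAt": now or datetime.now(timezone.utc),
    }


# With pymongo and a running MongoDB, the index would be created once with:
#   collection.create_index(TTL_INDEX_KEYS, **TTL_INDEX_OPTIONS)
# and each record stored with:
#   collection.insert_one(build_delivery_record("wh-1", "DELIVERED", 1))
```

Expiration is handled by a background task in MongoDB, so documents disappear shortly after they expire rather than at the exact second.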

Having said all of that, here’s what our new webhooks architecture looks like at the component level:

Webhooks architecture

As already hinted, webhooks are now handled by a brand new standalone service, responsible for everything from webhook configuration management to webhook delivery. This new service exposes a GraphQL API for webhook configuration as well as queries allowing the retrieval of webhook delivery metrics. These webhook delivery metrics, based on persisted domain objects, are exposed directly to our customers via the webhook Event Log UI. This is brand new functionality allowed by the new architecture!

The fact that webhooks are now exposed as an independent service makes it easy for domain services to trigger webhooks. A domain service simply has to build the webhook payload and then request its delivery by sending a Kafka message to the webhooks service. As mentioned earlier, by leveraging Kafka, we get some level of webhook ordering out of the box by carefully selecting the partitioning key. Simply using Kafka is not enough for total ordering, though, as there can still be ordering problems between webhooks related to a specific resource (e.g., a user). We have addressed those by building an ordering mechanism within the webhooks service itself.
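
The role of the partitioning key can be illustrated without a Kafka cluster. Kafka only guarantees ordering within a partition, and a keyed message is assigned to a partition by hashing its key, so messages sharing a key stay in send order. The Python sketch below simulates that behavior. It uses CRC32 as a stand-in hash (the default Java client partitioner uses murmur2), and the choice of a user ID as the key is an assumption for illustration.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Message = Tuple[str, str]  # (partition key, payload)


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in for Kafka's own hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages: Iterable[Message], num_partitions: int) -> Dict[int, List[Message]]:
    """Append each message to its partition, preserving arrival order."""
    partitions: Dict[int, List[Message]] = defaultdict(list)
    for key, payload in messages:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions


# Hypothetical webhook requests keyed by the user they concern.
requests = [
    ("user-1", "created"),
    ("user-2", "created"),
    ("user-1", "updated"),
    ("user-1", "deleted"),
]
by_partition = route(requests, num_partitions=4)
user_1_events = [
    payload
    for key, payload in by_partition[partition_for("user-1", 4)]
    if key == "user-1"
]
```

All `user-1` events land in the same partition in the order they were sent. Events with different keys may land in different partitions, where Kafka makes no relative ordering promise.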

How does it look?

All of those architectural improvements are nice. Were there any improvements in the user experience? Of course! Here’s a screenshot of the new webhooks configuration management screen and the brand new Event Log functionality:

Webhook event log

The Event Log is new user visible functionality that lists all recent webhooks and their status information. Users can search for a specific webhook and see if it was delivered successfully or not, and if not, why it wasn’t delivered. If you pay close attention to the far right-end of the webhook in error state, you’ll notice another piece of new functionality: the manual webhook retry! In case all automated retries have been exhausted and a customer absolutely needs to receive the failed webhook, they have the ability to press the button and the service will make another attempt to deliver the webhook. This puts power into the hands of our customers and makes their lives (and ours!) easier!

Migrating to the new architecture

Implementing the new webhooks service and UI was only part of the story. As with any significant architecture overhaul, a migration was necessary. Here’s our basic migration process:

  1. First of all, we needed to migrate existing webhook configuration data from the old database to the new one.

  2. Secondly, we needed to move customers from the old webhook implementation to the new one in a risk-free and transparent fashion. For the data migration aspect, we developed a data migration facility that transferred existing webhook configuration data to the new service database. This facility silently copied data and kept it in sync until we were ready to flip the switch for the new webhook service.

  3. Third, we relied on our AppDirect feature flagging service. At AppDirect, we have a sophisticated feature flagging service that allows us to separate the deployment of new functionality from its enablement. With this service, we were able to activate the new webhook functionality on a per-tenant and per-webhook basis, greatly reducing the risk of activating such an important architectural revamp. We obviously did not go directly to production with the new webhooks: we first rolled them out on test environments and monitored their behavior.

  4. Finally, after fixing a small number of problems that could only be detected by running in a live environment, we began gradually promoting the new implementation to production environments.
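
The per-tenant and per-webhook activation described in step 3 can be pictured with the following Python sketch. It is a simplified illustration: the `WebhookRolloutFlag` class, its fields, and the `dispatch` function are hypothetical names, and the real AppDirect feature flagging service is certainly richer than this.

```python
from dataclasses import dataclass, field
from typing import Callable, Set, Tuple


@dataclass
class WebhookRolloutFlag:
    """Hypothetical flag: enable the new service per tenant or per webhook."""
    enabled_tenants: Set[str] = field(default_factory=set)
    enabled_webhooks: Set[Tuple[str, str]] = field(default_factory=set)  # (tenant, webhook)

    def use_new_service(self, tenant_id: str, webhook_id: str) -> bool:
        return (
            tenant_id in self.enabled_tenants
            or (tenant_id, webhook_id) in self.enabled_webhooks
        )


def dispatch(
    flag: WebhookRolloutFlag,
    tenant_id: str,
    webhook_id: str,
    payload: dict,
    new_sender: Callable[[dict], str],
    legacy_sender: Callable[[dict], str],
) -> str:
    """Route a webhook to the new or the legacy implementation."""
    if flag.use_new_service(tenant_id, webhook_id):
        return new_sender(payload)
    return legacy_sender(payload)
```

Because both code paths are deployed at the same time, enabling a single webhook for a single tenant is a configuration change, and reverting it is just as cheap.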

What’s next?

So where are we after all of this? Our new webhooks implementation has been running in production flawlessly for a few months now. It is still only handling a subset of all webhooks, but it should handle all of them very shortly. 

Our benchmarks tell us that the new implementation can deliver thousands of webhooks per minute in ideal conditions, so we know it will be able to scale as our platform usage continues to grow. Building webhook functionality from scratch was a challenging and fun project. Migrating customers from the old functionality to the new one without causing interruptions or problems was even more challenging! This new webhooks implementation opens the door to exciting new possibilities for the future.