Don’t Let Microservice Data Segregation Mess With Your Customers’ Search Experience

By Adam Demjen / Apr 01, 2022

Is your software system built on a monolithic or microservice architecture? If it’s the latter, how do you solve the problem of joining data from multiple services while running search queries?

In this article, I’ll take you through:

  • Three potential approaches for bridging the gap between distinct and isolated parts of searchable data
  • How we solved the problem at AppDirect with AppSearch using functional design choices, including handling the data ingestion pipeline, cross-domain enrichment, searching using a GraphQL API, and domain-specific search configuration.

In an ideal microservice-based ecosystem, the business flows are split into domains: targeted subject areas with well-defined boundaries. Each domain manages its data and services independently of the others. For example, at AppDirect, marketplace product data is handled by the product domain. Identity and Access Management owns user and company account information. We have repositories of subscriptions, licenses, invoices and so on, each managed within the appropriate domain-bounded context.

CRUDL operations within these repositories are usually optimized. A domain service can easily read and write its own data, since it’s readily available. If some external information is required at the time of reading or writing, it can be fetched from the respective domain using its API.

However, searching or filtering a listing of the same data is a more challenging problem.

Imagine an invoice listing UI where a user can search invoices by entering some query terms into a search bar, and can optionally apply some filters, such as invoice creation date range. This UI is served by an invoice search backend.


One of the main challenges with this backend is that not all the necessary information is available right away to perform the search. Invoice data is directly accessible and filterable, since it’s the domain’s own data. But suppose we wanted to narrow down the results by the billed company name. That piece of information is managed by a different domain, so it might not be at hand for this service.

The second challenge is that if the search service allows for complex queries (such as full text search, fuzzy search or partial matching), a conventional SQL database query will perform poorly against a large dataset. For example, if the user enters a character string such as “abc” and expects to get a list of all invoices that have this string anywhere in their attributes, this can incur a high performance cost for the database, unless it’s optimized for such scenarios.

Domain segregation therefore poses issues with data consistency and performance for search operations. Searching across these domain boundaries is called cross-domain search.

A Look at Three Potential Approaches

Let’s look at some ways we could solve this problem to bridge the gap between the isolated parts of searchable data.

In this article I refer to the main searchable data as domain data, and to the secondary attributes that are only used in a filtering or sorting context as reference data. Continuing the example from before, invoice data is domain data and company attributes are reference data.

Here are three of the possible approaches I’ve seen used to address this challenge—including the one we’ve chosen to work with at AppDirect.

1. The Ugly—Post-Processing Filtering

    In this scenario, whenever a search query is executed, the domain service first runs it on its own dataset, applying all the criteria it can. Then it trims down the results by cross-referencing them with reference data and discarding non-matches. In our example, invoices would first be searched by date, status, and so on, depending on the client’s selection. Then the company IDs from the hits would be sent in a bulk query to the domain that owns company data, along with the company criteria, and the system would discard invoices whose company IDs don’t match. These steps repeat until we have a full page’s worth of hits.

    Unfortunately, this approach is anything but scalable, and it also makes some features such as pagination or calculating the total count difficult, if not impossible, to tackle.
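To make the cost of this loop concrete, here is a minimal sketch of post-processing filtering. The two callables (`fetch_invoice_page` and `matching_company_ids`) and the in-memory data are hypothetical stand-ins for the invoice query and the bulk company lookup; they are not part of any AppDirect API.

```python
from typing import Callable, Dict, List, Set

Invoice = Dict[str, str]


def search_with_post_filter(
    fetch_invoice_page: Callable[[int, int], List[Invoice]],
    matching_company_ids: Callable[[Set[str]], Set[str]],
    page_size: int,
) -> List[Invoice]:
    """Collect one page of hits by repeatedly over-fetching and discarding."""
    hits: List[Invoice] = []
    offset = 0
    while len(hits) < page_size:
        # Step 1: search the domain's own data (date, status, and so on).
        candidates = fetch_invoice_page(offset, page_size)
        if not candidates:
            break  # domain data ran out before the page filled up
        offset += len(candidates)
        # Step 2: one bulk call to the other domain per round.
        company_ids = {invoice["companyId"] for invoice in candidates}
        allowed = matching_company_ids(company_ids)
        # Step 3: discard invoices whose company did not match.
        hits.extend(inv for inv in candidates if inv["companyId"] in allowed)
    return hits[:page_size]


# Hypothetical in-memory data: every fifth invoice belongs to ACME.
INVOICES = [{"id": str(i), "companyId": "c1" if i % 5 == 0 else "c2"} for i in range(20)]
COMPANIES = {"c1": "ACME Inc.", "c2": "Globex"}
calls = {"company_lookups": 0}


def fetch_invoice_page(offset: int, limit: int) -> List[Invoice]:
    return INVOICES[offset:offset + limit]


def matching_company_ids(ids: Set[str]) -> Set[str]:
    calls["company_lookups"] += 1
    return {cid for cid in ids if "acme" in COMPANIES[cid].lower()}


page = search_with_post_filter(fetch_invoice_page, matching_company_ids, page_size=3)
```

With this sample data, filling a page of three hits takes four rounds of fetching and cross-domain lookups, and the function still has no way of reporting a total count without scanning every invoice.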

2. The Bad—Synchronizing Data from Other Domains

      The domain service maintains a copy of the reference datasets that could play a role in searching. So the invoice service manages a live table of company records solely for searching purposes, but those records are otherwise unnecessary for invoice management. Not only does this tighten the coupling between domains, it also brings up the challenge of keeping the data from becoming stale. To solve this, the service would need to listen to events from that domain, or worse, it would have to periodically poll the domain and reconcile changes.
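As a rough illustration of what the invoice service would have to own under this approach, the sketch below keeps a search-only replica of company records up to date from events. The event shape and the type name `COMPANY_DELETED` are assumptions made for the example.

```python
from typing import Dict, Optional


class CompanyReplica:
    """Local, search-only copy of company records owned by another domain."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def on_company_event(self, event: Dict[str, str]) -> None:
        company_id = event["companyId"]
        if event["type"] == "COMPANY_DELETED":
            self._rows.pop(company_id, None)
        else:
            # Created and updated events are both treated as upserts.
            self._rows[company_id] = {"name": event["name"]}

    def name_of(self, company_id: str) -> Optional[str]:
        row = self._rows.get(company_id)
        return row["name"] if row else None


replica = CompanyReplica()
replica.on_company_event({"type": "COMPANY_CREATED", "companyId": "123", "name": "ACME"})
replica.on_company_event({"type": "COMPANY_UPDATED", "companyId": "123", "name": "ACME Inc."})
```

Every domain that needs company filters would have to carry a listener like this one, which is the coupling the paragraph above warns about.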

3. The Good—Dedicated Cross-Domain Search Service for Nimble, Efficient Search Queries

        The key part of an efficient solution is having all searchable data available in the same place at query time, so that all filters, sorting and other controls can be applied in a single, local operation. This requires a dedicated cross-domain search service that holds denormalized data for all domains and all fields that participate in searching, as well as a process that updates this data in near real time.

        This is the approach that we use at AppDirect for our AppSearch functionality to power various search UIs, such as the company listing page on a marketplace.
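The benefit of denormalization can be sketched with a plain in-memory list standing in for the search index. The documents and the `search_invoices` helper are illustrative assumptions; a real search engine would do this with an index rather than a list scan, but the point is the same: filtering, sorting, counting and paging all happen in one local operation.

```python
from typing import Dict, List

# Each document already carries the company name next to the invoice fields.
DOCUMENTS: List[Dict[str, str]] = [
    {"id": "inv-1", "invoiceCreationDate": "2021-04-03", "status": "PAID", "companyName": "ACME Inc."},
    {"id": "inv-2", "invoiceCreationDate": "2019-11-20", "status": "PAID", "companyName": "ACME Inc."},
    {"id": "inv-3", "invoiceCreationDate": "2021-06-15", "status": "OPEN", "companyName": "Globex"},
    {"id": "inv-4", "invoiceCreationDate": "2022-01-09", "status": "PAID", "companyName": "Acme Labs"},
]


def search_invoices(documents, company_name_part: str, created_from: str, first: int):
    """Filter, sort, count and page in a single pass over local data."""
    matches = [
        doc for doc in documents
        if company_name_part.lower() in doc["companyName"].lower()
        # ISO 8601 dates compare correctly as plain strings.
        and doc["invoiceCreationDate"] >= created_from
    ]
    matches.sort(key=lambda doc: doc["invoiceCreationDate"], reverse=True)
    return {"totalCount": len(matches), "nodes": matches[:first]}


result = search_invoices(DOCUMENTS, "acme", "2020-01-01", first=1)
```

Unlike the post-processing approach, the total count and the page boundaries fall out of the same operation.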


        Enter AppSearch

        Knowing that multiple teams at AppDirect were looking for an internal cross-domain search solution, we began building one. We analyzed some existing and potential future search use cases, and laid down a set of functional design decisions that shaped the service we ended up calling AppSearch:

        • Generic and multi-tenant—It should be a one-stop shop for all search features so that domains can outsource their search operations.
        • Optimized and feature rich—Being the expert in this area, a search service should support all desired capabilities related to searching—filtering, sorting, pagination, ranking, facet counting, auto-completion and so on.
        • Event-based data updates—The service should listen to events coming straight from the domain and apply the changes they carry to the searchable data right away.
        • Low latency GraphQL API—Fast responses and GraphQL should both make the service consumer friendly.
        • Redundant, but barely—Limiting search data to searchable attributes should keep coupling to a minimum. It also prevents domains from using it as an authoritative source of reference data for non-search operations.

        After doing some research and evaluation of technology, we selected Elasticsearch as the data store, Kafka as the event broker and the GraphQL language for the API.

        Let’s see how we glued them all together!

        Data In: Ingestion Pipeline

        Domain data and reference data must reach AppSearch before we can search it. This is what the data ingestion pipeline does: It captures relevant changes occurring in the domain—called domain events—then transforms these events to fit into AppSearch’s model, and guarantees their delivery in near real time.

        Events are captured by a component that was originally built for feeding data and analytics reports. It listens to the shared AppDirect event bus managed by a Kafka broker, but it also supports CDC (change data capture, that is, handling events triggered by database updates). The component therefore learns about relevant domain changes as they occur. These changes are then transformed by dedicated Flink stream processing jobs and published back to a Kafka topic. An Elasticsearch Sink Connector listens to the topic and writes directly to Elasticsearch, which then indexes or reindexes the data.

        This way all relevant changes eventually get reflected in the AppSearch data store.
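The transformation step in the middle of the pipeline can be pictured as a pure function from a domain event to a search document. The sketch below is plain Python rather than a Flink job, and the event shape (`type`, `entity`) and the `pdfUrl` attribute are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable


def to_search_document(event: Dict[str, Any], indexed_fields: Iterable[str]) -> Dict[str, Any]:
    """Keep only the attributes that take part in searching."""
    entity = event["entity"]
    return {
        "id": entity["id"],
        "fields": {name: entity[name] for name in indexed_fields if name in entity},
    }


event = {
    "type": "INVOICE_UPDATED",
    "entity": {
        "id": "inv-1",
        "invoiceNumber": "1239010",
        "status": "PAID",
        "pdfUrl": "https://example.com/inv-1.pdf",  # not searchable, so it is dropped
    },
}

document = to_search_document(event, ["invoiceNumber", "status", "invoiceCreationDate"])
```

Dropping non-searchable attributes at this point is what keeps the search data "redundant, but barely".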


        Cross-Domain Enrichment

        Ok, now we’re updating our search index whenever changes happen in the domain upstream, but how do we identify and apply reference data updates? Remember, we have a denormalized data model without any joins, so such changes need to be reflected in every affected dataset.

        The answer has two parts. By maintaining a configuration about indexed fields and domain relationships within each index, the ingestion process can pinpoint indices that are affected by a particular change. The name of company ID 123 has changed? Check which indices care about company name and update all of them.
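As a sketch of that first part, assume each index's configuration lists the reference fields it depends on. The `referenceFields` key and the index names are hypothetical; the lookup itself is a simple scan over the configurations.

```python
from typing import Dict, List

# Hypothetical per-index configuration: which reference fields each index holds.
DOMAIN_CONFIGS: Dict[str, Dict[str, List[str]]] = {
    "invoices": {"referenceFields": ["company.name"]},
    "subscriptions": {"referenceFields": ["company.name", "product.name"]},
    "products": {"referenceFields": []},
}


def affected_indices(configs: Dict[str, Dict[str, List[str]]], changed_field: str) -> List[str]:
    """Return the indices that store a copy of the changed reference field."""
    return sorted(
        index for index, config in configs.items()
        if changed_field in config["referenceFields"]
    )


targets = affected_indices(DOMAIN_CONFIGS, "company.name")
```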

        So we have identified the target indices, but we still need to scope the updates—company ID 123 probably relates to many entities. An optimization for AppSearch is underway to support this with the bulk update-by-query feature of Elasticsearch. This is a command telling Elasticsearch: “look for documents in which there is a company node and the company ID is 123, then set the company name to the new name”.
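For illustration, the command quoted above could translate into an Elasticsearch update-by-query request roughly like the one built below. The field paths `company.id` and `company.name` are assumptions about the document layout, and the function only constructs the request; sending it is left to whichever HTTP or Elasticsearch client is in use.

```python
from typing import Any, Dict


def build_company_rename_request(index: str, company_id: str, new_name: str) -> Dict[str, Any]:
    """Build the path and body for POST /<index>/_update_by_query."""
    return {
        "method": "POST",
        "path": f"/{index}/_update_by_query",
        "body": {
            # Scope: only documents that reference this company.
            "query": {"term": {"company.id": company_id}},
            # Change: overwrite the denormalized company name.
            "script": {
                "lang": "painless",
                "source": "ctx._source.company.name = params.name",
                "params": {"name": new_name},
            },
        },
    }


request = build_company_rename_request("invoices", "123", "ACME Holdings")
```

Passing the new name through `params` instead of concatenating it into the script source keeps the script constant across calls and avoids quoting problems with names that contain special characters.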


        Data Out: Searching with a GraphQL API

        Domain services interact with AppSearch directly. They pass their query and get the results back. AppSearch takes care of security and data transformation.


        GraphQL has several advantages over REST, so like any recently developed AppDirect service, AppSearch exposes its functionality only via a GraphQL API.

        We implemented a generic search query that takes a target index, along with optional filter, sorting and pagination options.

          search(
            index: String!,     # Target index
            filter: Filter,     # Filter criteria
            first: Int,         # Pagination controls
            last: Int,
            after: String,
            before: String,
            orderBy: [OrderBy!] # Sorting controls
          ): SearchConnection!

        Input arguments and return types of search are fully generic:

        • Each filter clause is a tuple of field name (what to match on), value(s) (what to match against) and operator (how to match):
          { field: "status", op: EQ, values: ["PAID", "CLOSED"] }, { field: "invoiceCreationDate", op: GTE, values: ["2020-01-01"] }
        • Sorting clauses contain field name and direction:
          { field: "invoiceCreationDate", direction: DESC }
        • The returned connection object contains a single page of found nodes and their attributes: the indexed fields as key-value pairs.
        • Since we’re dealing with a single page of hits, some pagination information is also included in the response that enables navigation through the full dataset (such as a cursor string for fetching the next page).
        { 
          "data": { 
            "search": { 
              "totalCount": 8738, 
              "pageInfo": { 
                "hasNextPage": true, 
                "endCursor": "b2Zmc2V0PTE5" 
              }, 
              "nodes": [ 
                { 
                  "id": "00b50a06-724e-4149-b3b6-b1c35cb44097", 
                  "fields": [ 
                    { 
                      "field": "invoiceNumber", 
                      "value": "1239010" 
                    }, 
                    { 
                      "field": "invoiceCreationDate", 
                      "value": "2021-04-03" 
                    }, 
                    { 
                      "field": "companyName", 
                      "value": "ACME Inc." 
                    }, 
                    { 
                      "field": "status", 
                      "value": "PAID" 
                    }, 
                    ... 
                  ] 
                }, 
                { 
                  "id": "ba9ed5ea-0b0d-4231-99e7-1ad4b994e884", 
                  ... 
                } 
              ] 
            } 
          } 
        }
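Because the response is generic, a client will typically flatten it before use. The following sketch shows one way to do that for the response shape above; the helper name `parse_search_response` is hypothetical and the sample response is abbreviated.

```python
from typing import Any, Dict, List, Optional, Tuple


def parse_search_response(
    response: Dict[str, Any],
) -> Tuple[List[Dict[str, str]], int, Optional[str]]:
    """Turn generic field/value pairs into flat rows and extract paging info."""
    search = response["data"]["search"]
    rows = [
        {**{pair["field"]: pair["value"] for pair in node.get("fields", [])}, "id": node["id"]}
        for node in search["nodes"]
    ]
    page_info = search["pageInfo"]
    # Only hand back a cursor when there is another page to fetch.
    next_cursor = page_info["endCursor"] if page_info["hasNextPage"] else None
    return rows, search["totalCount"], next_cursor


RESPONSE = {
    "data": {
        "search": {
            "totalCount": 8738,
            "pageInfo": {"hasNextPage": True, "endCursor": "b2Zmc2V0PTE5"},
            "nodes": [
                {
                    "id": "00b50a06-724e-4149-b3b6-b1c35cb44097",
                    "fields": [
                        {"field": "invoiceNumber", "value": "1239010"},
                        {"field": "status", "value": "PAID"},
                    ],
                }
            ],
        }
    }
}

rows, total_count, next_cursor = parse_search_response(RESPONSE)
```

The returned cursor can be passed as the `after` argument of the next `search` call to fetch the following page.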

        As mentioned earlier, AppSearch only deals with those attributes that are required for searching (in filtering or sorting criteria), so clients might need to enrich the data before passing the results to the search UI. However, this should be a quick and easy step on the domain side, involving running a bulk lookup of entities by IDs received from AppSearch.
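That enrichment step might look like the sketch below, where `bulk_lookup` is a hypothetical function on the domain side that returns full entities keyed by ID in a single call. The order of the search hits is preserved, since it reflects the sorting applied by the search.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def enrich(rows: List[Row], bulk_lookup: Callable[[List[str]], Dict[str, Row]]) -> List[Row]:
    """Merge display attributes into search hits with one bulk lookup."""
    details = bulk_lookup([row["id"] for row in rows])
    # Keep the order returned by the search; add whatever the domain knows.
    return [{**row, **details.get(row["id"], {})} for row in rows]


# Hypothetical domain-side store holding attributes that are not indexed.
INVOICE_STORE = {
    "inv-1": {"currency": "USD", "total": "120.00"},
    "inv-2": {"currency": "EUR", "total": "75.50"},
}


def bulk_lookup(ids: List[str]) -> Dict[str, Row]:
    return {i: INVOICE_STORE[i] for i in ids if i in INVOICE_STORE}


hits = [{"id": "inv-2", "status": "PAID"}, {"id": "inv-1", "status": "PAID"}]
enriched = enrich(hits, bulk_lookup)
```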

        Domain Configuration

        In each of these data manipulation and querying processes the application needs to understand the data and the surrounding context. AppSearch embeds this information in schema files we call domain configuration.

        Every domain team maintains its own config file (or several, if it has multiple search indices), which contains metadata that controls the processes related to the data:

        • Fields and their types—So that the query API can interpret each field and knows which kinds of filters are allowed on it
        • Reference data fields and paths—To perform cross-domain enrichment at ingestion time
        • Authorization—Permissions required to access the index
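To give a feel for how such a configuration could drive the query API, here is a hypothetical configuration expressed as a Python dictionary, together with a check that a filter clause is allowed. The key names, type names, permission name and the operator list per field are all assumptions; the article does not show the actual file format.

```python
from typing import Any, Dict, Optional

# Hypothetical domain configuration for an invoice index.
INVOICE_INDEX_CONFIG: Dict[str, Any] = {
    "index": "invoices",
    "fields": {
        "status": {"type": "keyword", "operators": ["EQ"]},
        "invoiceCreationDate": {"type": "date", "operators": ["EQ", "GTE"]},
    },
    # Reference data: field name in this index -> path in the owning domain.
    "referenceFields": {"companyName": "company.name"},
    "requiredPermission": "INVOICE_READ",
}


def validate_clause(config: Dict[str, Any], clause: Dict[str, Any]) -> Optional[str]:
    """Return an error message for a disallowed filter clause, or None if it is fine."""
    field_config = config["fields"].get(clause["field"])
    if field_config is None:
        return f"unknown field: {clause['field']}"
    if clause["op"] not in field_config["operators"]:
        return f"operator {clause['op']} is not allowed on {clause['field']}"
    return None


ok = validate_clause(INVOICE_INDEX_CONFIG, {"field": "status", "op": "EQ", "values": ["PAID"]})
bad = validate_clause(INVOICE_INDEX_CONFIG, {"field": "status", "op": "GTE", "values": ["PAID"]})
```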

        Conclusion

        Cross-domain searching is a problem that emerges naturally in a microservice architecture due to the segregation of domain services and their data. AppSearch solves this challenge for AppDirect’s internal domains in a robust and effective manner. Because it works with independent, denormalized datasets optimized for each domain, search queries are versatile and fast.

        However, this architecture has some caveats that need to be addressed for AppSearch:

        • Denormalized data works well for searching, but it comes with a maintenance overhead. Any change upstream must be applied to several datasets at the same time.
        • Using a generic schema has benefits (simple data model), but it also forces the input and return structure to be generic.
        • Moving data through an ingestion pipeline and indexing is simple, but it has some limitations. For example, hard deletion is a problem we’re working to address.

        In the next few quarters we’ll continue working on enhancing AppSearch while onboarding several AppDirect domains to solve their performance problems around searching.