
Stash: A Fast and Scalable Data Store

By Adam Demjen / Feb 19, 2019


Why Build a Data Store?

When your organization’s long-term goal includes “indexing content from all over the Internet,” you know that fun (and challenging) times are coming your way. And that is exactly the vision of AppWise, the intelligent workspace by AppDirect.

Think of AppWise as a combined Google Search and Facebook News Feed across AppDirect and multiple connected applications. You integrate your cloud app with AppDirect, data starts flowing in, and AppWise makes this data securely searchable and navigable for authorized users through various facets and platforms.

For an app like this, the biggest challenges are data volume, variety, and velocity. Consider receiving real-time Twitter updates from every author you follow, multiply this by millions of users, and channel all that into a single app. Yeah, that’s wave after wave of data, ingested 24/7, that we need to process and store somewhere. And that only covers the first half of the story; we also want to make that data available for users so that they can search, filter, sort, and aggregate it freely, all with sub-second response times.

It was clear from the beginning that in order to support requirements of this scale, we had to find the best database technology.

The Store, the Index, and the Broker

During our research and proof of concept evaluation, it quickly turned out that a single technology was not sufficient for the above requirements; we needed a combination of them. So we went with two strong NoSQL players in the field: ScyllaDB and Elasticsearch.

ScyllaDB is essentially Cassandra rewritten and optimized in C++ while maintaining feature parity. It’s extremely good at storing large amounts of data without blinking an eye. However, any engineer who has gone through the detailed process of modeling data for Cassandra knows that once it’s finished, it’s pretty much cast in stone. Queries always revolve around partitions: filtering is limited, and there is no ad hoc sorting or flexible aggregation. So unless all your information requirements look like “fetch me whatever is in that data partition exactly how it’s stored,” you’re stuck.

That’s where Elasticsearch comes in. Based on the Apache Lucene technology, it’s great at finding stuff fast via inverted indexes, and since you can index any part of a document, it supports a wide range of flexible and dynamic queries. It’s not a perfect data store, however, and it’s definitely not designed to hold terabytes of information that these queries never touch.

So how do we get the best of both worlds? Our answer was to build Stash, a lightweight data store microservice for AppWise written in Node.js. It gives us fast storage, fast retrieval, and virtually unlimited horizontal scalability by connecting these two technologies with a homegrown solution we call the Broker.

The Broker

Whether it’s about pushing data in or pulling data out, the Stash Broker is responsible for authenticating, validating, and processing the request. It interacts with the underlying storage technologies, translates requests into their respective languages, and then consolidates the fetched information before returning it to the caller.

Data In

Since 99 percent of data writes to Stash are upserts (i.e., “here’s the full snapshot of my data, store it; replace it if it already exists”) coming from other AppWise services, we chose a REST API for data ingestion. Simply POST to /api/v1/resources, where the request body contains the full content. The Broker translates this into both a ScyllaDB upsert and an Elasticsearch upsert and executes them asynchronously.
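As a rough illustration of that fan-out, here is a minimal TypeScript sketch. Everything in it is an assumption made for the example rather than Stash's actual code: the names (`upsertResource`, `ContentStore`, `SearchIndex`, `toIndexDocument`), the resource shape, and the idea that the index receives only a searchable projection of the resource. In-memory stand-ins take the place of the ScyllaDB and Elasticsearch clients so the sketch runs on its own.

```typescript
// Hypothetical resource shape; field names are borrowed from the GraphQL example in this post.
interface ResourceKey { source: string; resourceId: string; }
interface Resource {
  key: ResourceKey;
  title: string;
  lastUpdated: number;
  content: Record<string, unknown>; // full payload, only ever read back by key
}

// Assumed searchable projection that would be sent to the index.
type IndexDocument = Pick<Resource, "key" | "title" | "lastUpdated">;

// Minimal interfaces that real ScyllaDB / Elasticsearch clients would sit behind.
interface ContentStore { upsert(resource: Resource): Promise<void>; }
interface SearchIndex { upsert(doc: IndexDocument): Promise<void>; }

const keyOf = (key: ResourceKey): string => `${key.source}:${key.resourceId}`;

// In-memory stand-in for either backend: a second upsert with the same key replaces the row.
class InMemoryStore<T extends { key: ResourceKey }> {
  readonly rows = new Map<string, T>();
  async upsert(row: T): Promise<void> {
    this.rows.set(keyOf(row.key), row);
  }
}

// Keep only the fields the index needs for searching.
function toIndexDocument(resource: Resource): IndexDocument {
  const { key, title, lastUpdated } = resource;
  return { key, title, lastUpdated };
}

// Send the same snapshot to both backends concurrently.
// Promise.all rejects as soon as either write fails; retrying or
// reconciling a partial write is out of scope for this sketch.
async function upsertResource(
  resource: Resource,
  store: ContentStore,
  index: SearchIndex,
): Promise<void> {
  await Promise.all([
    store.upsert(resource),
    index.upsert(toIndexDocument(resource)),
  ]);
}
```

Calling `upsertResource` twice with the same key leaves a single row in each stand-in, holding the latest snapshot, which is the "replace it if it already exists" behavior described above.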

Of course, we validate the payload before doing anything. We don’t want any corrupt data to end up in either store (it’s difficult to identify inconsistencies in the data after it’s stored).
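A validation gate of this kind could look roughly like the TypeScript sketch below. The required fields are assumptions taken from the fields visible in the GraphQL example in this post; Stash's real schema and validation rules are not described here, and `validateResource` is a hypothetical name.

```typescript
// Type guard: true for plain JSON objects, false for null, arrays, and primitives.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of problems; an empty list means the payload may be written.
// The field rules below are assumptions for illustration only.
function validateResource(payload: unknown): string[] {
  if (!isRecord(payload)) {
    return ["payload must be a JSON object"];
  }
  const errors: string[] = [];

  const key = payload["key"];
  if (!isRecord(key)) {
    errors.push("key must be an object");
  } else {
    const source = key["source"];
    const resourceId = key["resourceId"];
    if (typeof source !== "string" || source.length === 0) {
      errors.push("key.source must be a non-empty string");
    }
    if (typeof resourceId !== "string" || resourceId.length === 0) {
      errors.push("key.resourceId must be a non-empty string");
    }
  }

  if (typeof payload["title"] !== "string") {
    errors.push("title must be a string");
  }

  const lastUpdated = payload["lastUpdated"];
  if (typeof lastUpdated !== "number" || !Number.isFinite(lastUpdated)) {
    errors.push("lastUpdated must be a finite number (epoch milliseconds)");
  }
  return errors;
}
```

Collecting every problem instead of stopping at the first one lets the caller fix a rejected payload in a single round trip; the write to either store would only be attempted when the list comes back empty.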

Data Out

The read pattern is more mixed; it follows both human and machine interaction patterns. That is: “Find all resources I have access to. Get me the full content of this resource. Narrow the list down to only those from Twitter which have the word ‘cat’ in them. How many are there? Give me the total count too. Oh, and I’m only interested in the visible details; my mobile app doesn’t care about control attributes and such.”

Supporting a wide range of query patterns called for a more flexible solution than REST, so we added GraphQL to our stack. Incoming GraphQL requests contain the desired response schema as well as filters and other control attributes, which the Broker translates into a two-hop query: (1) search for matching keys in Elasticsearch; (2) fetch the full content of the found items by key from ScyllaDB asynchronously. All this happens within a two-digit millisecond response time, even with hundreds of concurrent requests per second.

Here’s an example request, followed by the response it returns:

POST /api/v1/graphql
(Authenticated with user JWT)
 
{
  "query": "query findResourcesByUser($filters: FilterOptions, $pagination: PaginationOptions) {
    findResourcesByUser(filters: $filters, pagination: $pagination) {
      content {
        key { 
          source
          resourceId
        }
        title
        lastUpdated
      }
      performance {
        totalExecutionTimeInMillis
      }
    }
  }",
  "variables": {
    "filters": {
      "sources": ["jira", "github"]
    },
    "pagination": {
      "pageSize": 10,
      "pageNumber": 0
    }
  }
}
 
 
{
  "data": {
    "findResourcesByUser": {
      "content": [
        {
          "key": {
            "source": "github",
            "resourceId": "PR-........."
          },
          "title": "Updating config for test",
          "lastUpdated": 1548687264562
        },
        {
          "key": {
            "source": "jira",
            "resourceId": "........."
          },
          "title": "Clean up old JUnit references",
          "lastUpdated": 1547683882000
        },
        ...
      ],
      "performance": {
        "totalExecutionTimeInMillis": 38
      }
    }
  }
}
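The two-hop read behind a request like this can be sketched in TypeScript as follows. This is an illustration under stated assumptions, not the Broker's real implementation: `findResources`, `SearchIndex`, `ContentStore`, and `inMemoryBackends` are hypothetical names, and the in-memory stand-ins replace Elasticsearch and ScyllaDB so the sketch runs on its own.

```typescript
// Shapes modeled on the request and response above; the real schema may differ.
interface ResourceKey { source: string; resourceId: string; }
interface Resource { key: ResourceKey; title: string; lastUpdated: number; }
interface Filters { sources?: string[]; }
interface Pagination { pageSize: number; pageNumber: number; }

// Hop 1 backend: returns only the keys of matching resources.
interface SearchIndex {
  findKeys(filters: Filters, pagination: Pagination): Promise<ResourceKey[]>;
}
// Hop 2 backend: returns the full content for one key, if it exists.
interface ContentStore {
  getByKey(key: ResourceKey): Promise<Resource | undefined>;
}

async function findResources(
  filters: Filters,
  pagination: Pagination,
  index: SearchIndex,
  store: ContentStore,
): Promise<Resource[]> {
  // Hop 1: ask the index which keys match.
  const keys = await index.findKeys(filters, pagination);
  // Hop 2: fetch every key concurrently; Promise.all keeps the index's ordering.
  const rows = await Promise.all(keys.map((key) => store.getByKey(key)));
  // Skip keys the index knows about but the store no longer has.
  return rows.filter((row): row is Resource => row !== undefined);
}

// In-memory stand-ins so the sketch runs without Elasticsearch or ScyllaDB.
function inMemoryBackends(resources: Resource[]): { index: SearchIndex; store: ContentStore } {
  const idOf = (key: ResourceKey): string => `${key.source}:${key.resourceId}`;
  const byId = new Map<string, Resource>();
  for (const resource of resources) {
    byId.set(idOf(resource.key), resource);
  }
  return {
    index: {
      async findKeys(filters, pagination) {
        const sources = filters.sources;
        const matches = resources
          .filter((r) => sources === undefined || sources.indexOf(r.key.source) !== -1)
          .map((r) => r.key);
        const start = pagination.pageNumber * pagination.pageSize;
        return matches.slice(start, start + pagination.pageSize);
      },
    },
    store: {
      async getByKey(key) {
        return byId.get(idOf(key));
      },
    },
  };
}
```

Keeping filtering and pagination entirely in the first hop means the second hop only ever performs lookups by key, which is the access pattern a partition-oriented store handles best.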

Design Considerations

From the beginning we designed Stash to be “dull,” that is, unaware of almost any business logic. Even though it’s a data store, it doesn’t maintain any relationships or integrity constraints between the entities it holds; that’s up to the caller. Stash is not a database that serves a specific business purpose, but a low-level abstraction layer with one clear responsibility: interacting with the underlying stores and keeping them synchronized.

Where to Next?

Stash is already doing its job well for AppWise’s purposes, but there are always ways to improve a service’s capabilities. For Stash, AppWise’s goals and AppDirect’s engineering roadmap will drive the large-scale initiatives we embark on next. Incremental updates in addition to full upserts, to optimize processing even further? Templatizing schemas to make Stash more generic? Making Stash available across AppDirect to serve other organizations’ data requirements? I’m excited to find out.

Adam Demjen is a Senior Staff Backend Developer at AppDirect.
