
The Stale Knowledge Base Problem in Numbers: A Case Study

By R. J. Stangle / Apr 23, 2015

How big are the problems in an enterprise knowledge base?

Inaccurate, outdated knowledge base content is one of the most common issues for organizations today. According to a TSIA Annual Knowledge Management Survey, some of the top reasons knowledge management programs fail include:

  • Lack of a knowledge sharing culture
  • No incentive to use the [KM] system
  • Not capturing knowledge from Professional Services consultants
  • Out of date content and gaps in the Knowledge Base

But what does the “stale knowledge base problem” look like by the numbers? I ran an experiment with our own knowledge base to find out:

  • How fresh is the content in our knowledge base?
  • How well structured is the content? Is it intuitive and easy to access?

The knowledge base in numbers:

  • 9 years of content
  • Almost 9,000 documents
  • 78 spaces (our knowledge base is structured in spaces)
  • 700+ collections (inside each space, user-created collections)


To assess how outdated a piece of content was, we used the last update date for each document as our reference.

In the chart below, the blue bars show how many knowledge base documents were last updated in each month. The orange line is the cumulative frequency distribution: the percentage of documents last updated in or before a given month, that is, the share that would count as out of date if that month were the cutoff.

Note that the peaks in the number of documents correspond to the creation of new teams and products, new version releases, events, and problems with a knowledge base software upgrade.
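As a rough illustration of how the bars and the cumulative line can be derived from last-update dates, here is a minimal Python sketch. The function name `monthly_cumulative_share` and the sample dates are made up for this example; they are not taken from our knowledge base or from the tooling we actually used.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def monthly_cumulative_share(last_updated):
    """Return (month, count, cumulative_percent) rows, oldest month first.

    `month` is a (year, month) tuple, `count` is the number of documents
    last updated in that month (the bars), and `cumulative_percent` is the
    share of all documents last updated in or before it (the line).
    """
    counts = Counter((d.year, d.month) for d in last_updated)
    months = sorted(counts)
    total = len(last_updated)
    running = accumulate(counts[m] for m in months)
    return [(m, counts[m], 100.0 * r / total) for m, r in zip(months, running)]


# Hypothetical last-update dates, for illustration only.
sample = [date(2010, 3, 1), date(2011, 7, 15), date(2011, 7, 20), date(2014, 12, 2)]
rows = monthly_cumulative_share(sample)
for month, count, cumulative in rows:
    print(month, count, f"{cumulative:.0f}%")
```

Reading the cumulative column at a given month tells you what share of documents would be flagged if that month were chosen as the staleness cutoff.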


Research highlights:

  • In July 2011 (almost 4 years ago), the cumulative frequency distribution had already reached 50% (meaning half of our documents have not been updated since then)
  • Almost 85% of our documents have not been updated in the last 12 months
  • If we look at the last 6 months, 94% are out of date!

Possible solutions

Archiving: Assuming we could archive all the documents that have not been updated in the last two years, the number of documents could be reduced by more than 70%. Of course, we cannot apply a single rule like this blindly, because some “expired” documents could still be considered important references.
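A minimal sketch of such an archiving rule, with an explicit exemption list for documents that should be kept as references regardless of age. The function name, the 730-day threshold, and the document IDs are assumptions made for this illustration only.

```python
from datetime import date, timedelta


def archive_candidates(docs, today, max_age_days=730, keep=frozenset()):
    """Return IDs of documents not updated within `max_age_days`.

    `docs` maps a document ID to its last-update date. IDs listed in
    `keep` are never proposed for archiving, however old they are.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [
        doc_id
        for doc_id, updated in docs.items()
        if updated < cutoff and doc_id not in keep
    ]


# Hypothetical documents, for illustration only.
docs = {
    "release-notes-1.0": date(2012, 1, 1),
    "onboarding-guide": date(2015, 1, 1),
    "architecture-reference": date(2011, 5, 5),
}
candidates = archive_candidates(
    docs, today=date(2015, 4, 23), keep={"architecture-reference"}
)
print(candidates)
```

Keeping the exemption list separate from the age rule means the rule itself stays simple, while the judgement calls about "still important" documents remain with the people who know the content.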

Freshness: Search engines consider the “freshness” of a document an important ranking signal. We can apply a similar algorithm to determine a document’s relevance, ranking “fresh”, up to date content higher in search results.
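One simple way to fold freshness into ranking is to multiply a document's relevance score by a weight that decays with age. The sketch below uses exponential decay with a one-year half-life. This is an assumption chosen for illustration, not the formula of any particular search engine, and the names `freshness_weight` and `rank` are hypothetical.

```python
from datetime import date


def freshness_weight(last_updated, today, half_life_days=365.0):
    """Weight in (0, 1] that halves every `half_life_days` of age."""
    age_days = max((today - last_updated).days, 0)
    return 0.5 ** (age_days / half_life_days)


def rank(results, today, half_life_days=365.0):
    """Order results by relevance multiplied by freshness, best first.

    `results` is a list of (doc_id, relevance, last_updated) tuples.
    """
    scored = [
        (relevance * freshness_weight(updated, today, half_life_days), doc_id)
        for doc_id, relevance, updated in results
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Hypothetical search results, for illustration only.
today = date(2015, 4, 23)
results = [
    ("old-but-relevant", 1.0, date(2011, 4, 23)),
    ("recent", 0.8, date(2015, 4, 1)),
]
print(rank(results, today))
```

The half-life controls how aggressive the boost is: a short half-life pushes old content down quickly, while a long one lets strong but older references keep their place.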


One of the key factors that affects the knowledge base’s content structure is the number of documents per collection. Small collections can create big overhead for the user, while big collections are hard to browse.

We expected to find that the majority of our collections contained a reasonable number of documents, with few outliers.

We calculated that the average number of documents in a collection is 11. However, the median is 2, the mode is 1, the minimum is 1 and the maximum is 524. These numbers describe a distribution that is highly skewed towards small collections, with a few very large outliers pulling the average up.
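To see why the average can sit so far above the median, consider the small sketch below. The list of collection sizes is invented so that a toy dataset shows the same pattern (one large outlier among many tiny collections); it is not our real data.

```python
import statistics

# Made-up collection sizes: mostly tiny collections plus one large outlier.
sizes = [1, 1, 1, 2, 2, 3, 5, 73]

mean_size = statistics.mean(sizes)      # pulled up by the outlier
median_size = statistics.median(sizes)  # the "typical" collection
mode_size = statistics.mode(sizes)      # the most common size
share_single = sizes.count(1) / len(sizes)

print(mean_size, median_size, mode_size, min(sizes), max(sizes))
print(f"{share_single:.0%} of collections contain a single document")
```

A mean far above the median is the usual signature of a right-skewed distribution, which is why the median and mode describe a typical collection better here.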

Here is a histogram that describes the dataset.


Research highlights:

  • Highly skewed distribution
  • Average: 11 documents per collection
  • Median: 2 documents per collection
  • 46% of the collections have only 1 document
  • 70%+ of the collections have 5 documents or fewer
  • 17 collections have more than 100 documents

Possible solutions

Merge: For collections that contain only a few documents, a user could merge them based on criteria the collections have in common, such as content, timeline, team or project.

Using clustering techniques, we calculated the similarity between collections. The heat map below shows the similarity between collections from the same space: the hotter the colour, the higher the similarity; the colder the colour, the lower the similarity. One example is the collection “Review Reveal”, which could be merged with “Reveal Meeting Minutes” because both contain a small number of documents and share similar content themes.
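The values behind a heat map like this are pairwise similarity scores. As a hedged sketch, the code below scores collections with cosine similarity over simple word counts. This is a common baseline and only an assumption here, not necessarily the technique we used, and the collection texts are invented placeholders.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_pairs(collections):
    """Map each pair of collection names to its similarity score."""
    vectors = {name: term_counts(text) for name, text in collections.items()}
    return {
        (x, y): cosine_similarity(vectors[x], vectors[y])
        for x, y in combinations(sorted(vectors), 2)
    }


# Placeholder text standing in for the concatenated documents of each collection.
collections = {
    "Review Reveal": "reveal review meeting",
    "Reveal Meeting Minutes": "reveal meeting minutes",
    "Hadoop Notes": "hadoop cluster setup",
}
for pair, score in similarity_pairs(collections).items():
    print(pair, round(score, 2))
```

Pairs with a high score and few documents on both sides are natural merge candidates to suggest to the collection owners.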


Split: Using Machine Learning and Natural Language Processing techniques, we created a metric that measures how cohesive a collection is. Collections with many documents are more likely to have low cohesion.
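One plausible way to define such a metric is the average pairwise similarity between the documents in a collection. The sketch below takes that approach with Jaccard similarity over token sets. Both choices are assumptions for illustration and do not describe our actual metric.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of distinct tokens that two documents have in common."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cohesion(documents, similarity=jaccard):
    """Mean pairwise similarity of the documents in one collection.

    `documents` is a list of token lists. A collection with fewer than
    two documents is treated as fully cohesive.
    """
    pairs = list(combinations(documents, 2))
    if not pairs:
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical tokenised documents, for illustration only.
focused = [["hadoop", "cluster"], ["hadoop", "cluster", "setup"]]
mixed = [["hadoop", "cluster"], ["meeting", "minutes"], ["search", "ranking"]]
print(cohesion(focused), cohesion(mixed))
```

Collections whose cohesion falls below a chosen threshold could then be flagged as candidates for a split.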

In this case, the user could split the documents into new collections. As an example, we used the collection “Research and Discussions”. This collection has 97 documents and the following dendrogram (which indicates how strongly correlated items are) shows a possible split into 3 new collections.

In the chart below, the green branch joins documents related to “Big Data” and “Hadoop”, the red branch joins the “Reveal installation process” and the cyan branch joins “Data Science”, “Search” and “Machine Learning”.
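A dendrogram is the visual output of hierarchical (agglomerative) clustering: start with one cluster per document, repeatedly merge the two closest clusters, then cut the tree at the desired number of groups. The pure-Python sketch below uses average linkage and Jaccard distance on title words. These are assumed choices, and the titles are invented, so this is not a reproduction of the analysis shown in the chart.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the share of distinct words two titles have in common."""
    a, b = set(a.split()), set(b.split())
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 0.0)


def agglomerate(items, distance, k):
    """Merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(items))]

    def linkage(a, b):
        # Average distance over every cross-cluster pair of items.
        total = sum(distance(items[i], items[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]
    return [[items[i] for i in c] for c in clusters]


# Invented document titles, for illustration only.
titles = [
    "hadoop cluster setup",
    "hadoop big data notes",
    "reveal installation steps",
    "reveal installation checklist",
    "machine learning search ranking",
    "machine learning reading list",
]
groups = agglomerate(titles, jaccard_distance, k=3)
for group in groups:
    print(group)
```

Each resulting group is a proposal for a new collection; a person still needs to confirm the split and choose sensible names.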



In this experiment, we used a data-driven approach to address the questions “How fresh is the content in the KB?” and “How well structured is the content and is it accessible?”

The results indicate that, for our internal knowledge base, more than 70% of the content is outdated if we treat any document not updated in the last two years as out of date.

Also, over time, users tend to forget the current structure and create new collections for new documents, generating a large number of small collections. The four actions suggested in this post (archiving, freshness, merge and split) could be valid solutions to address both issues.

Read our blog post Spring Cleaning Knowledge Content with Curation to learn more about our upcoming content organization feature for Reveal, Collections.