AppDirect Blog

The Science of Search Engine Relevance

By Alexis / Jun 02, 2014

Our new Reveal solution captures support agents’ Google searches and provides them with the most trusted solutions to the tech support problems they’re trying to solve. It does this by leveraging the experience of other agents who have already searched the web to solve similar problems. One of the first challenges we faced with Reveal was that support agents also use the web for things other than tech support.

Thus, in developing Reveal’s clustering and search process, we needed to ensure that we captured only search queries related to tech support and ignored all personal queries, which would otherwise feed misleading signals into the results.

In this post, we’ll describe how we separated work from play when analyzing agents’ search queries.

HOW DO WE SEPARATE THE WHEAT FROM THE CHAFF?

For obvious privacy reasons, we had to rule out any solution that involved looking at an agent’s personal information, and we had to filter the searches without any human intervention. Several algorithms can already do this, but none of them is a silver bullet, so we developed our own solution using several layers of computational linguistics.

LAYER 1: IN OR OUT

Our first layer, a Maximum Entropy algorithm, used words as features. This means we converted query terms (i.e., strings of keywords) into points in a multidimensional space, with one dimension per distinct word. It’s important to note that in math we’re not constrained by the physical limitations of the real world, so we can have more than three dimensions, even hundreds of thousands of them if needed. Our brains cannot really grasp this, but computers can.
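To make the idea of “words as dimensions” concrete, here is a minimal Python sketch. It is only an illustration, not Reveal’s actual code: the function names and the sample queries are hypothetical. Each distinct word is assigned its own dimension, and a query becomes a sparse point that stores only the dimensions it actually uses.

```python
def build_vocabulary(queries):
    """Assign one dimension (an integer index) to each distinct word."""
    vocabulary = {}
    for query in queries:
        for word in query.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(query, vocabulary):
    """Return a sparse vector {dimension: count}; unknown words are skipped."""
    vector = {}
    for word in query.lower().split():
        if word in vocabulary:
            index = vocabulary[word]
            vector[index] = vector.get(index, 0) + 1
    return vector


# Hypothetical sample queries, for illustration only.
queries = [
    "router DNS problem",
    "router blinking red light",
]
vocabulary = build_vocabulary(queries)  # six distinct words, so six dimensions
point = vectorize("red light router", vocabulary)
```

Storing only the non-zero dimensions is what keeps a space with hundreds of thousands of dimensions manageable: a short query touches just a handful of them.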

So if an agent searches for “results Canadiens Lightning game,” the Maximum Entropy analysis will place these words along four dimensions. If the agent searches for “problem Linksys router DNS blinking red light,” that’s seven dimensions. To “teach” the algorithm which query terms to filter, we asked experts to label a series of tech support agents’ real-life web browsing data as “in” or “out.” The initial training only required about 1,000 query terms, but we included up to 1,500 for good measure. The result: a 78.7% positive identification rate for personal queries, with only 9.3% false positives (i.e., business-related queries lost in the process). However, one out of five personal queries was still in plain sight, which is not acceptable from a privacy standpoint.
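For readers who want to see what training such a classifier looks like, here is a small sketch. With only two labels (“in” and “out”), a Maximum Entropy model is equivalent to logistic regression, so that is what the code fits. Everything else is an assumption made for illustration: the function names, the learning rate, the number of passes, and the six training queries (two come from the examples above, the rest are made up). It is not the model we trained on the expert-labeled data.

```python
import math


def features(query):
    """Binary word-presence features: one feature per distinct word."""
    return set(query.lower().split())


def train_maxent(examples, epochs=200, learning_rate=0.5):
    """Fit word weights by stochastic gradient ascent on the log-likelihood.

    examples: list of (query, label) pairs, where 1 means "in" (tech support)
    and 0 means "out" (personal).
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for query, label in examples:
            words = features(query)
            score = bias + sum(weights.get(w, 0.0) for w in words)
            predicted = 1.0 / (1.0 + math.exp(-score))
            error = label - predicted  # gradient of the log-likelihood
            bias += learning_rate * error
            for w in words:
                weights[w] = weights.get(w, 0.0) + learning_rate * error
    return weights, bias


def probability_in(query, weights, bias):
    """Estimated probability that a query is tech support related."""
    score = bias + sum(weights.get(w, 0.0) for w in features(query))
    return 1.0 / (1.0 + math.exp(-score))


# Toy training set, for illustration only.
examples = [
    ("problem Linksys router DNS blinking red light", 1),
    ("router firmware update problem", 1),
    ("DNS server not responding", 1),
    ("results Canadiens Lightning game", 0),
    ("chicken soup recipe", 0),
    ("directions to airport", 0),
]
weights, bias = train_maxent(examples)
keep = probability_in("Linksys router problem", weights, bias) > 0.5
drop = probability_in("Canadiens game results", weights, bias) < 0.5
```

Words seen only in “in” queries end up with positive weights and words seen only in “out” queries with negative ones, so a new query is scored by the words it shares with the labeled examples.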

LAYER 2: FREQUENT OR RARE

We noticed that personal queries tended to be much less redundant than tech support searches. When it comes to their private lives, agents have unique needs, such as researching directions, recipes or health advice. Tech support issues, on the other hand, tend to be more repetitive. To identify clusters of similar query terms, we applied a second layer: a clustering algorithm that analyzed the position of keywords relative to each other in the multidimensional space. Keywords that didn’t cluster were more likely to be personal, so they were eliminated.
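The frequent-or-rare idea can be sketched in a few lines of Python. The post does not name the clustering algorithm we used, so the code below swaps in a deliberately simple stand-in: a greedy single-pass clustering based on Jaccard word overlap. The function names, the 0.4 similarity threshold, the minimum cluster size and the sample queries are all assumptions for illustration.

```python
def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_queries(queries, threshold=0.4):
    """Greedy clustering: a query joins the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of queries; the first one is its seed
    for query in queries:
        words = set(query.lower().split())
        for cluster in clusters:
            seed_words = set(cluster[0].lower().split())
            if jaccard(words, seed_words) >= threshold:
                cluster.append(query)
                break
        else:
            clusters.append([query])  # nothing similar yet: start a new cluster
    return clusters


def drop_rare(queries, threshold=0.4, min_size=2):
    """Keep only queries that landed in a cluster of at least min_size."""
    kept = []
    for cluster in cluster_queries(queries, threshold):
        if len(cluster) >= min_size:
            kept.extend(cluster)
    return kept


# Hypothetical sample queries, for illustration only.
queries = [
    "Linksys router blinking red light",
    "router blinking red light fix",
    "Linksys router red light",
    "chicken soup recipe",
    "directions to airport",
]
kept = drop_rare(queries)
```

In this toy run the three router queries form one cluster and survive, while the recipe and directions queries each sit alone and are dropped.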

LAYER 3: THE FINAL POLISH

At this point, our filtering achieved an impressive score of over 90%. Still, some fine-tuning was needed. The third and final layer of the process ran the remaining query terms through a binary classification algorithm. This time, however, the algorithm analyzed the actual content of the pages the agents clicked on. Since large webpages or documents contain many words, the chance of positive identification was much higher. This required more computing power, but it was applied only to a small data set. And that’s how we separated the nuggets from the dross, protecting agents’ privacy while obtaining precious tech support knowledge for their peers to leverage.
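The reason the expensive third layer stays affordable is the ordering: each layer only sees what the previous one let through. The sketch below shows that cascade shape in Python. The three callables are hypothetical stand-ins for the real layers, the toy rules passed to them are invented for the example, and running the clustering check against the survivors of layer 1 is an assumption.

```python
def run_cascade(queries, looks_work_related, clusters_with_others, page_is_technical):
    """Apply the three layers in order, cheapest first.

    All three callables are hypothetical stand-ins:
      looks_work_related(query) -> bool          layer 1, per-query classifier
      clusters_with_others(query, pool) -> bool  layer 2, rarity check
      page_is_technical(query) -> bool           layer 3, clicked-page classifier
    """
    after_layer1 = [q for q in queries if looks_work_related(q)]
    after_layer2 = [q for q in after_layer1 if clusters_with_others(q, after_layer1)]
    return [q for q in after_layer2 if page_is_technical(q)]


page_lookups = []


def page_is_technical(query):
    """Toy layer 3 that records how often the expensive step runs."""
    page_lookups.append(query)
    return "router" in query


# Hypothetical sample queries and toy rules, for illustration only.
queries = ["router red light", "router reset", "soup recipe", "game results"]
kept = run_cascade(
    queries,
    looks_work_related=lambda q: "recipe" not in q,
    clusters_with_others=lambda q, pool: any(
        q != other and set(q.split()) & set(other.split()) for other in pool
    ),
    page_is_technical=page_is_technical,
)
```

Of the four queries, only the two that survive the first two layers ever reach the page-content check, which is where the savings in computing power come from.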

This process was presented in more detail at the 2014 Canadian Artificial Intelligence Conference in Montreal, where it won the Best Application Paper award.

Based on our ongoing work in this area, we’re confident in our Reveal solution’s ability to maintain quality data by leveraging only real tech support search information, while at the same time protecting support agents’ privacy by ignoring their personal searches.

Interested in trying Reveal? Request an invitation here.