Technology Insights

Solving the JavaScript SEO Conundrum: Part One

By Dmitry Torba / February 24, 2013

Using a JS rendering framework like Backbone, ExtJS, or Angular is amazing. It’s like flying. You can create incredible dynamic user interfaces, forgetting about the archaic DOM. Scaling is magical too. Each user brings his or her own computational horsepower, freeing your backend to be an ultra-light performance beast.

Sounds great, but there’s one problem: search engines. Search engines are designed and implemented with documents in mind, not apps. This is significant, because search is the most important way users find your site and your apps.

So why doesn’t it just work? Well, the answer is quite simple: performance. Rendering millions of JS pages is a hugely resource-intensive task. Right now, search engine crawlers are just simple text parsers. A bot downloads the HTML document from your site and parses out the interesting bits. This is how Google can crawl hundreds of millions of pages across the Internet. The process is very light and very repeatable.

Not so with JS-rendered pages. To crawl a JS-rendered page, a crawler bot would literally have to run a browser core for each page. Open a million tabs and you will quickly see why that’s a very heavy task.

So what do we do? We could discard our shiny libraries and go back to the DOM. We could stop flying. But abandoning flight is a bit painful once you’ve spread your wings. Perhaps there is a more elegant solution?

One approach is to use sitemaps. Sitemaps are XML documents that map the JS-rendered territory. The idea is that when a search bot hits your site, it will use the map instead of your pages to get an idea of what’s going on.
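To make that concrete, here is a minimal Node sketch of what that server-side effort might look like: it writes a sitemap.xml from a hard-coded list of client-side routes. The domain and the routes are placeholder assumptions, not taken from any real app.

```javascript
// Sketch: generate sitemap.xml on the server from a list of client-side routes.
// The domain and routes below are placeholders for whatever your JS app exposes.
var fs = require('fs');

var domain = 'http://www.example.com';
var routes = ['/', '/products', '/products/42', '/about'];

var urls = routes.map(function (route) {
  return '  <url>\n' +
         '    <loc>' + domain + route + '</loc>\n' +
         '    <changefreq>daily</changefreq>\n' +
         '  </url>';
}).join('\n');

var sitemap = '<?xml version="1.0" encoding="UTF-8"?>\n' +
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
  urls +
  '\n</urlset>\n';

fs.writeFileSync('sitemap.xml', sitemap);
```

Even in this toy form you can see the duplication creeping in: the route list already lives in your client-side router, and now it has to live on the server too.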

However, there are several problems with sitemaps. First, sitemaps have to be rendered on the server side. This means you have to put effort into replicating the content you are building on the client side. This is doable, but it’s not clean. It requires more work, it’s hard to automate, and it’s prone to mistakes. Plus, the whole idea of using the new JS libraries is to have an ultra-light server side. Sitemaps derail this movement.

Most importantly, maps are not the territory. The search engine infrastructure has been built around the idea of documents, the actual content that the user sees. Like any codebase, the search infrastructure probably works optimally with its current input set. The search algorithms have been fine-tuned against actual HTML pages. Who knows how they work with sitemaps? How many site owners are brave enough to test sitemaps and jeopardize their site rank?

Another option, proposed by Google, is to have the site provide HTML snapshots. This way the computational burden is shifted toward the site owner. With this strategy, you mark your JS pages with special markers (a hash-bang in the URL or a fragment meta tag) and the search bot makes a special request (adding the _escaped_fragment_ URL parameter). When your server side sees the special request, it serves an HTML snapshot of your JS content. Sounds great, at least until you try to generate the HTML snapshot.
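To make the handshake concrete, here is a minimal sketch using Express (any server stack works the same way). The renderSnapshot function is a hypothetical stand-in for whatever actually produces the pre-rendered HTML.

```javascript
// Sketch: serve an HTML snapshot when the crawler sends the special
// _escaped_fragment_ request; serve the normal JS app otherwise.
var express = require('express');
var app = express();

// Hypothetical stand-in: a real implementation would ask a PhantomJS worker
// (or read a cache) for the fully rendered HTML of the requested page.
function renderSnapshot(path, fragment, callback) {
  callback(null, '<html><body>Snapshot of ' + path + '#!' + fragment + '</body></html>');
}

app.use(function (req, res, next) {
  var fragment = req.query._escaped_fragment_;
  if (fragment === undefined) {
    return next(); // normal browser traffic: serve the JS app as usual
  }
  // The bot requested /page?_escaped_fragment_=state, which maps back to /page#!state
  renderSnapshot(req.path, fragment, function (err, html) {
    if (err) {
      return res.send(500);
    }
    res.send(html);
  });
});

app.listen(3000);
```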

Google’s recommendation is to use HTMLUnit to generate snapshots. HTMLUnit is a headless browser written in Java, originally designed for Selenium-like unit testing. It is also a monster. It spews out exceptions at a phenomenal rate. It’s a resource hog, so you quickly realize that running it on the same machine as your backend is not a good idea. Use HTMLUnit at your own risk!

A better alternative is PhantomJS. This is essentially headless WebKit: fast, native C++ code. Controlling PhantomJS is tricky, but it does the snapshot job very well. And since PhantomJS runs JS, you can use Socket.IO to easily coordinate multiple instances in parallel.
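As a taste of what part two will expand on, here is a bare-bones PhantomJS script that takes a URL, waits briefly for the client-side rendering to finish, and prints the resulting HTML. The fixed two-second wait is a simplifying assumption; a real setup would detect when rendering has actually completed.

```javascript
// snapshot.js -- run as: phantomjs snapshot.js http://www.example.com/#!products
var system = require('system');
var page = require('webpage').create();
var url = system.args[1];

page.open(url, function (status) {
  if (status !== 'success') {
    console.log('Failed to load ' + url);
    phantom.exit(1);
    return;
  }
  // Crude but simple: give the page's JS a moment to build the DOM,
  // then dump the rendered document as the snapshot.
  setTimeout(function () {
    console.log(page.content);
    phantom.exit(0);
  }, 2000);
});
```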

In part two of this discussion, which will be coming soon, we’ll talk about how to set up a cluster of PhantomJS servers to generate snapshots of any JS site. Be sure to check back here for our next installment.

Dmitry Torba is a senior web developer at AppDirect.