Difficulties of content re-purposing

CMS

Jahia Team, 8/5/2013

This post was originally published on AIIM's Expert Blogs by Serge Huber, CTO at Jahia Solutions

********

A little over a year ago, I wrote an (unexpectedly) popular blog article entitled "After Flash, why PDF must die", and in this post I'd like to reflect on how things have evolved since, or how they have actually stagnated. As a technical person, I spend quite some time reading about new technologies, either by accessing content posted on the internet or by purchasing books. Like many, I have tried my best to protect the trees by avoiding printed books as much as possible and preferring electronic formats instead. But, as I wrote in my previous blog post, problems are still present in the file format arena, although you might be surprised as to which format I end up using the most :) Publishing content across multiple formats is no small feat, so I will quickly go over the difficulties and review some possible solutions to help improve the situation.

When shopping for e-books, you already have a myriad of choices: you could go to the Apple iTunes Store and purchase an iBook (based on the standardized ePUB format), go to Amazon (based on the proprietary Mobi format), or simply download a PDF directly from the book publisher. The problem I described previously is still the same: the PDF format is still the most readable, since it contains everything including advanced layouts, but it doesn't scale well to different screen sizes (unless you like to pan and zoom a lot), while the other book formats are not (yet) well mastered by publishers, and the quality of their layouts and formatting is usually much worse than in the PDF version. So even I always end up reading the PDF version on my iPad, which really bugs me because I would enjoy being able to comfortably read content on the Kindle’s e-Ink screen. But in the Kindle version most tables and even some illustrations are either too small to be readable or sometimes missing altogether, so I have to use the PDF version to be able to read the content as it was supposed to be presented.

Another interesting example concerns the editing and publishing of long (technical) documents. Again, PDF is fine for making sure that you stay as close as possible to the original formatting (after all, that is what it was designed for), but if you intend to publish content on the web and benefit from search engine optimization (SEO), using a PDF file is not recommended. Instead, it is much better to have your content readily available in HTML form, so that when search engines such as Google and Bing index it, search results point directly to the proper text location and the content gets all the exposure it deserves. Although this sounds simple, it is actually quite tricky to repurpose content that was authored in a word processor such as Microsoft Word or even Google Docs for a corporate website. Among the difficulties, building properly usable content navigation and browsing stands out. Sure, you could simply use the "Save as HTML" functionality of a word processor, but the generated HTML would be quite difficult to integrate well with a corporate website, and it would not have clean navigation, since in most cases the export simply generates one huge HTML file. Things might be a bit better if you choose a source file format that uses more structured content, such as LaTeX or even Maven’s APT format, but these tools are much harder to put in the hands of content producers, so they are usually not an appropriate choice.
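To make the "one huge HTML file" problem more concrete, here is a minimal sketch of how such an export could be cut into one page per top-level heading, with a navigation list generated alongside. It assumes the export marks chapters with `<h1>` elements, which is not guaranteed for every word processor, and the function names (`split_by_heading`, `build_nav`, `slugify`) are hypothetical helpers invented for this illustration. It uses only the Python standard library and a simple regular expression, so it is an illustration rather than a robust HTML processor.

```python
import re
from html import escape, unescape

# Assumption: chapters in the exported file start with an <h1> element.
HEADING = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def slugify(title):
    """Turn a heading into a lowercase, file-name-safe slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "section"


def split_by_heading(body_html):
    """Return (title, slug, fragment) tuples, one per <h1> section.

    Anything before the first <h1> is ignored in this sketch, and
    duplicate headings would produce duplicate slugs.
    """
    matches = list(HEADING.finditer(body_html))
    sections = []
    for i, match in enumerate(matches):
        # A section runs from its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body_html)
        title = unescape(TAG.sub("", match.group(1))).strip()
        fragment = body_html[match.start():end].strip()
        sections.append((title, slugify(title), fragment))
    return sections


def build_nav(sections):
    """Build a simple <ul> navigation list linking to one page per section."""
    items = "\n".join(
        f'  <li><a href="{slug}.html">{escape(title)}</a></li>'
        for title, slug, _ in sections
    )
    return f"<ul>\n{items}\n</ul>"
```

Each fragment could then be wrapped in the site's own page template and saved as `<slug>.html`. The harder part, which this sketch does not address, is cleaning the word processor's inline styles so the fragments fit the corporate website's design.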

You could also try to reverse the solution and make HTML files, or even better content elements in a CMS, the source of the content. But if at some point you need to generate a PDF or another electronic book format, you would also need to think twice about the quality of the transformation process. Using a CMS as a source is usually easier to repurpose, but the tooling for content authoring might not be as good as what is available in word processors. So it’s another trade-off, but it might still be an interesting one, especially if your CMS has an easy-to-use content editing interface. Long document editing is still usually done offline, however, so for these types of workloads the usual word processing suspects will be preferred.

Transformation technologies are therefore at the center of content repurposing, and while in some cases they may be automated, it is a good idea to regularly check and possibly improve the quality of the transformation. Take Amazon’s transformation tools: they definitely need to improve significantly to be capable of delivering results that are acceptable not only for novels or other forms of “simple” text, but also for more technical or graphical books. This is not an easy task and will probably take some time, and possibly even some changes in the source tools to provide better support for such transformations (such as hinting metadata or markers). Amazon, for its part, also issues lots of recommendations for simplifying content so that it works well on Kindles, but this is a manual task that will not necessarily translate to other publishing technologies such as iBooks.
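As an illustration of what a regular quality check could look like, the sketch below compares how many tables, images and preformatted blocks appear in a source document and in its converted version, and reports what went missing. It assumes both versions are available as HTML or XHTML text (EPUB content documents are XHTML; other formats would first need to be unpacked or converted), and the names `TagCounter`, `count_tags` and `missing_elements` are hypothetical, made up for this example. It only detects elements that disappeared, not ones that became too small to read.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each tracked tag is opened in a document."""

    def __init__(self, tracked):
        super().__init__()
        self.tracked = set(tracked)
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; self-closing tags such as
        # <img/> are also routed here by the default implementation.
        if tag in self.tracked:
            self.counts[tag] += 1


def count_tags(html_text, tracked=("table", "img", "pre")):
    """Return a Counter of tracked tags found in html_text."""
    parser = TagCounter(tracked)
    parser.feed(html_text)
    parser.close()
    return parser.counts


def missing_elements(source_html, converted_html, tracked=("table", "img", "pre")):
    """Return {tag: number lost} for tags that are rarer after conversion."""
    src = count_tags(source_html, tracked)
    out = count_tags(converted_html, tracked)
    return {tag: src[tag] - out[tag] for tag in tracked if src[tag] > out[tag]}
```

Run after each automated conversion, a check like this could flag a chapter whose tables were silently dropped before readers discover it on their device.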

In conclusion, despite my continuing wish to see PDF less present in the multi-screen arena, I think it is possibly still one of the best solutions for publishing content meant for reading, and I hope that will change in the (near) future. Content repurposing tooling must also improve significantly to make it easier for content publishers and distributors to provide quality content to users wherever they are, and on whatever device they use.