The surge in digitisation of publisher backlists has spawned an industry, much of it based in India, which specialises in converting printed books into ebooks. In this tutorial, we’ll look at what you need to supply to a conversion service, how they convert a book, the outputs, typical costs, and things to consider when working with outsourced conversion services.
Supplying the book for conversion
The three most common ways publishers provide books to an ebook conversion service are:
- Print-ready PDF
- Hard-copy book
- Original files from the page layout program
A print-ready PDF is produced from the page-layout software, typically Adobe InDesign, in a form that the printer can use to set up the printing press.
There might also be a (similar) version set up to work with a digital print-on-demand (POD) service.
If you can obtain this file (a challenge for many older books), it’s generally the best format for an outsourced conversion service.
Conversion services have developed sophisticated programs and processes that extract the text and layout information to produce a clean, marked-up source file for ebook production.
Note that the PDF file supplied to a printer is not suitable for distribution directly as a PDF ebook. Your service provider will need to produce a special screen version.
For hard-copy books, the process has a couple of extra steps. It starts with a page-by-page scan, which produces a large image file in which the text of each page is readable but cannot be edited. A process called optical character recognition (OCR) then converts these images of text into editable text.
OCR is an imperfect process, so a human proofreader corrects errors and applies tags to mark up the document structure. The result should be a clean, marked-up source file which is used for conversion.
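As a rough illustration of why a proofreading pass is needed, the sketch below flags words in OCR output that mix letters and digits, a common sign of a misread character (for example, the letter l read as the digit 1). The heuristic, the function name flag_suspects and the sample sentence are assumptions made for this example; they do not describe any particular conversion service's tooling.

```python
import re

def flag_suspects(text):
    """Return tokens that contain both a letter and a digit.

    Assumed heuristic for illustration only: it will also flag
    legitimate tokens such as "1st", so a human still reviews each hit.
    """
    tokens = re.findall(r"\w+", text)
    return [
        t for t in tokens
        if re.search(r"[A-Za-z]", t) and re.search(r"\d", t)
    ]

# Invented sample of OCR output with two misread words
sample = "The c1ean source fi1e was ready in 1998."
suspects = flag_suspects(sample)
print(suspects)
```

A check like this can only narrow down where a proofreader should look; it cannot decide what the correct word is.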
Original layout files
Some services will accept the original files produced by page layout programs, most commonly Adobe InDesign, QuarkXPress or Adobe FrameMaker. You’ll need to ask the service provider whether they accept these files, and with what restrictions. There might be problems using them if they are from very old versions, or have missing fonts or images.
If the book has never been published in print form, or has been subject to extensive revisions for a digital-only edition, the source format is likely to be Microsoft Word.
The conversion process
Let’s take a look at the conversion process.
As we saw above, the process starts with scanned book pages, a print-ready PDF or layout files.
Next, the text, images and other information are extracted from the source file for further processing. Conversion services invest a lot of their technical resources into making this part of the process more accurate, and for high volume customers they will fine-tune their extraction tools to match publishers’ content types and house styles.
Before the extracted content can be transformed into an ebook, the raw text must be marked up using a special markup language to tag the structure (and sometimes the meaning) of the content.
Markup languages are used to allow computer processing of documents. Markup is necessary because computer programs cannot recognise the elements of a page from their appearance. The programs read the tags to identify which elements are chapter headings, captions, paragraphs, etc.
The two most commonly used markup languages in ebook production are HTML (HyperText Markup Language) and XML (eXtensible Markup Language).
- HTML (HyperText Markup Language) is the language of the web and the underlying language of Kindle and EPUB ebooks. The latest version of HTML has about 100 tags which describe various elements such as heading, title, article, aside, and paragraph. A tag name is enclosed between angle brackets ‘<’ and ‘>’ to separate it from normal text. Examples are: <title>, <article>, <h1> (heading 1) and <p> (paragraph).
- XML (eXtensible Markup Language) uses tags like HTML’s to mark up documents. In fact, the two languages are closely related: they share a common parent (SGML), and XHTML, the form of HTML used inside EPUB files, follows XML’s rules. But unlike HTML, which must describe every piece of content using its fixed set of 100 or so tags, XML is extensible. This means that you can add as many tags as you like, including tags that describe both the structure and the meaning of the content. The latter is referred to as semantic markup. So XML can describe content in great detail, such as its subject matter or reading level, using tags specific to a particular industry or application.
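To illustrate how a program reads tags rather than appearance, here is a minimal sketch in Python using the standard library’s xml.etree.ElementTree module. The tag names (chapter, title, para, species, caption) and the reading-level attribute are invented for this example and are not taken from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <chapter>, <title>, <para> and <caption> mark
# structure; <species> and reading-level are invented semantic markup.
SOURCE = """
<chapter reading-level="intermediate">
  <title>Garden Birds</title>
  <para>The <species>European robin</species> sings all year.</para>
  <caption>A robin on a fence post.</caption>
</chapter>
"""

def describe(xml_text):
    """Return (tag, leading text) pairs for every element, in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, (el.text or "").strip()) for el in root.iter()]

elements = describe(SOURCE)
for tag, text in elements:
    print(f"{tag}: {text}")
```

The program never looks at fonts or positions on the page. It identifies the chapter title, the caption and the species name purely from their tags.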
XML Schemas. While it’s possible to create a completely customised system of XML markup (essentially a custom vocabulary to describe your works), in practice it is more common to adopt existing markup systems (called schemas) which have widespread support within an industry. A commonly used schema for books is DocBook, which provides an extensive markup ‘vocabulary’ specific to books.
Extending HTML markup. XML tends to be the markup language of choice for large-scale publishing and conversion projects. However, HTML has advocates who point to advantages including its relative simplicity and its much wider usage. HTML can be extended to provide richness similar to XML’s markup through the use of class attributes. For example, a paragraph tagged <p class="caption"> can be distinguished from an ordinary <p> paragraph.
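The sketch below shows how a program might use class attributes to tell otherwise identical HTML tags apart. It uses Python’s standard html.parser module. The class names (chapter-title, epigraph, caption) are hypothetical, and the parser assumes simple, un-nested elements, so it is an illustration rather than a production tool.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment; the class names are invented for this example.
SOURCE = """
<h1 class="chapter-title">Garden Birds</h1>
<p class="epigraph">Hope is the thing with feathers.</p>
<p>The robin sings all year.</p>
<p class="caption">A robin on a fence post.</p>
"""

class ClassCollector(HTMLParser):
    """Collect the text of each element, keyed by its class attribute."""

    def __init__(self):
        super().__init__()
        self.by_class = {}
        self._current = None  # class of the element being read, if any

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get("class", "(no class)")

    def handle_data(self, data):
        # Ignore whitespace between elements
        if self._current is not None and data.strip():
            self.by_class.setdefault(self._current, []).append(data.strip())

    def handle_endtag(self, tag):
        self._current = None

collector = ClassCollector()
collector.feed(SOURCE)
roles = collector.by_class
print(roles)
```

All three paragraphs use the same <p> tag, yet the class attribute lets the program separate the epigraph and the caption from body text.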
Once a file is marked up, a program (referred to as a transformation engine) can be applied to it to output it in new formats. If sound decisions are made at the markup stage, this input document can be used to produce a range of outputs, now and in the future as technology and formats change.
[Figure: graphical representation of the transformation process]
Output
As well as EPUB and Kindle, another common output is PDF, optimised for ebook reading, print-on-demand or web viewing.
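The idea of one marked-up source feeding several outputs can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the source tags (chapter, title, para), the HTML_TAGS mapping and both output functions are invented here, and the outputs are plain HTML and plain text, not real EPUB, Kindle or PDF files.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical marked-up source; tag names are invented for this example.
SOURCE = (
    "<chapter>"
    "<title>Garden Birds</title>"
    "<para>The robin sings all year.</para>"
    "<para>Wrens &amp; tits visit in winter.</para>"
    "</chapter>"
)

# Assumed mapping from source tags to HTML tags
HTML_TAGS = {"title": "h1", "para": "p"}

def to_html(root):
    """Render each child element as the HTML tag it maps to."""
    parts = []
    for el in root:
        tag = HTML_TAGS[el.tag]
        parts.append(f"<{tag}>{escape(el.text or '')}</{tag}>")
    return "\n".join(parts)

def to_plain_text(root):
    """Render the same source as plain text, with the title in capitals."""
    lines = []
    for el in root:
        text = el.text or ""
        lines.append(text.upper() if el.tag == "title" else text)
    return "\n\n".join(lines)

root = ET.fromstring(SOURCE)
html_output = to_html(root)
text_output = to_plain_text(root)
print(html_output)
print(text_output)
```

Adding another output format means writing another function over the same source, which is why decisions made at the markup stage matter so much.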
Many conversion services are geared to the needs of publishers with large backlists, or who regularly convert print editions to ebooks. These publishers will gain cost and efficiency benefits from investing in standards, such as a standardised markup system, and from reusing components such as templates, style sheets and common graphic elements that support house styles.
For small volumes and one-off projects, you’ll probably pay US$0.50 to US$1.00 per page or more, plus one-off charges, to produce a single book. Scanning from hard copy and a high proportion of complex pages will both increase this cost.
For higher-volume users, prices vary widely and differ significantly with the types of books being converted, but expect to pay half or less of the low-volume rates.
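As a back-of-envelope illustration of these figures, the sketch below computes a cost range from the per-page rates quoted above. The function name estimate_cost, the one_off parameter and the volume_factor parameter are assumptions for this example, and the 300-page book is invented. Real quotes depend on page complexity, scanning and the provider.

```python
def estimate_cost(pages, low_rate=0.50, high_rate=1.00,
                  one_off=0.0, volume_factor=1.0):
    """Return a (low, high) cost range in US dollars.

    low_rate and high_rate are the per-page rates quoted in the text.
    one_off and volume_factor are hypothetical parameters:
    volume_factor=0.5 models "half of the low-volume rates".
    """
    low = pages * low_rate * volume_factor + one_off
    high = pages * high_rate * volume_factor + one_off
    return round(low, 2), round(high, 2)

# A hypothetical 300-page book converted as a one-off project
single = estimate_cost(300)

# The same book at an assumed high-volume rate of half the one-off price
bulk = estimate_cost(300, volume_factor=0.5)

print(single, bulk)
```

Because the quoted rates are "or more" for one-off work and "half or less" for volume work, treat the first range as a floor and the second as a ceiling.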
Ebook conversion service providers
Here are a few examples of the many service providers who will convert books to ebooks from print or electronic sources. Each of these companies serves international publishers.
Ebook conversion service providers for medium to large publishers
This group will suit higher-volume publishers. Several of them, such as Aptara, Datamatics, DCL and Innodata, have a long history in XML document conversion and serve markets outside publishing, such as medical, legal, technical and scientific. Others, such as Infogrid Pacific and iPublishCentral, have built their organisations on digitising books. All of these companies provide services to the textbook, professional and STM (Scientific, Technical and Medical) sectors as well as trade publishers.
- Aptara. (http://www.aptaracorp.com/key-markets/digital-publishing/). US-headquartered, with 20 years’ experience and 5000 employees.
- CodeMantra. (http://www.codemantra.com/AssetConversionServices.htm). US-headquartered, provides production, distribution and digital asset management solutions.
- Datamatics. (http://www.datamatics.com/services/publishing-services/ebook-content-transformation). India-headquartered with offices worldwide, one of India’s longest-established outsourcing companies.
- DCL (Data Conversion Laboratory). (http://www.dclab.com/ebook_production_services.asp). US-based, with experience in electronic document markup and conversion.
- Infogrid Pacific. (http://www.infogridpacific.com/igp/). Singapore-headquartered, provides conversion services and hosted technology options for digital production and distribution, aimed at small, medium and large publishers worldwide.
- Innodata. (http://www.innodata.com/industries/publishing). US-headquartered. Large digital content services provider, operating across a range of industries.
- iPublishCentral. (http://www.ipublishcentral.com/convertion.php). US-headquartered. Services include both conversion and hosting services for publisher ebook portals.
Ebook conversion service providers for small publishers
This is a selection of service providers for publishers around the world who need to convert a single book or a small volume of books from a PDF or hard copy. Some also offer distribution or custom design services.
- Blue Leaf Book Scanning. (http://www.blueleaf-book-scanning.com). This online service will scan your books and OCR them to produce a text-readable file that you can then convert to an ebook. It scans using either destructive methods (you won’t get your book back) or non-destructive methods (your book will be returned undamaged). The service typically costs in the order of US$50-100. You can order an EPUB or Kindle file at the same time as the scan, but this isn’t recommended: the OCR process won’t produce 100% error-free results, so you should proofread the scanned document before having it converted into an ebook.
- BookBaby. (http://www.bookbaby.com). Produces an ebook for US$149 from a range of source files including PDF. Lulu (lulu.com), eBookIt (http://www.ebookit.com) and others offer a similar service. Note that some service providers will not accept the PDF files sent to printers, or will add a surcharge, because these files can be quite complex.
- DCL Epub on Demand. (http://www.dclab.com/epub_on_demand.asp). This is a one-off service offering from DCL, opening its high-end ebook conversion process to small publishers and self-publishers. Each job is quoted individually through an online quotation service.
- eBookPartnership. (http://www.ebookpartnership.com). UK-based service for self-publishers and small publishers. Works from a range of electronic and hard-copy formats, and provides custom design and distribution services. Distribution partner for UK retailer Waterstones.
- Stembuck Book Scans. (http://www.stembuckbookscans.com). Offers non-destructive and destructive scanning for printed editions, and a good range of source file formats for digital upload. Well-priced with convenient and comprehensive online ordering.