Looking at how eBooks are made

by Les

These days many books are being published in electronic form, either instead of being printed on paper, or as well as being printed. In this post I’m going to look at some of the formats used for this.

For a book to be useful as an electronic document it has to use a format which is readily available to the potential reading audience. Otherwise the intended audience will not read it. Printed books require nothing but the book itself to be read. Electronic books require some kind of device to make them readable.

The earliest widely accepted format for electronic documents was Portable Document Format (PDF). PDF began as a proprietary format developed by Adobe Systems. Adobe made the specification freely available in 1993, but the format remained proprietary until 2008, when Adobe published a royalty-free Public Patent License and PDF became an open standard published by the International Organization for Standardization as ISO 32000-1:2008.

PDF combines three components into a single file: a description of each page, including the actual text, the layout, and any graphics; the actual fonts used in the document; and a storage system to put all of this into a single file, including compression where appropriate.

PDF has a major drawback as an electronic document format: it is page-oriented. Each page is displayed exactly as planned, regardless of the size or shape of the display device. Most PDF documents use a US Letter or A4 page size in portrait orientation. Many electronic devices have much smaller displays, and most of those with displays as large as the specified page size use landscape orientation. PDF does have one advantage over many other formats for documents with graphics: it retains the relative locations of text and pictures.

PDF files can be displayed by almost all computers, eReaders, smartphones and tablets. Whether they are readable on all these devices is another matter.

In recent years most electronic books have become available in one or more of an array of formats specifically designed for eReaders. I’m going to concentrate on one of the most widely used formats: EPUB. This format is not supported by the most popular eReader (Amazon’s Kindle), but it is supported by almost every other eReader currently available. Software is available for reading EPUB files on most computers, smartphones, and tablets.

EPUB is a reflowable format. The reading software adjusts the line length and the number of lines per page to suit the screen size and resolution of the display device, and the user can usually change the font and the font size. For books which are nearly all text (without graphics) this is very convenient. If there are figures, however, the graphics tend to move about unpredictably.
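As a rough illustration of what reflowing means, the following Python sketch breaks the same paragraph into lines for two different widths. The widths are hypothetical stand-ins for a narrow and a wide screen, and a real eReader measures rendered glyphs rather than counting characters, so this shows the idea only.

```python
import textwrap

PARAGRAPH = (
    "EPUB is a reflowable format. The reading software breaks the same "
    "text into lines that fit whatever screen it is shown on."
)


def reflow(text: str, chars_per_line: int) -> list[str]:
    """Break text into lines no longer than chars_per_line characters."""
    return textwrap.wrap(text, width=chars_per_line)


if __name__ == "__main__":
    for width in (30, 60):  # hypothetical narrow and wide screens
        lines = reflow(PARAGRAPH, width)
        print(f"--- {width} characters wide: {len(lines)} lines ---")
        print("\n".join(lines))
```

The text itself never changes; only the line breaks do, which is why a figure placed between two paragraphs can end up in a different spot on every device.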

The contents of an EPUB book are very straightforward. It is a set of files, most of them XML, in a zip-compressed folder. The following paragraphs look at a typical EPUB book, Leo Tolstoy’s The Cossacks, as distributed by epubBooks.

Unzipped EPUB file showing contents

Unzipped, the EPUB is seen to contain two folders, META-INF and OPS, plus a very small text file, mimetype, containing the single line:

application/epub+zip

The first folder, META-INF, contains a single file container.xml, which has the following content:

<?xml version="1.0" encoding="UTF-8"?>
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
  <rootfiles>
    <rootfile full-path="OPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>

If the book had Digital Rights Management (DRM) there would be a file named rights.xml in this folder.
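To make the role of mimetype and container.xml concrete, here is a minimal Python sketch of how reading software might locate the book inside the archive. It assumes the standard container layout, in which the rootfile element sits inside a rootfiles element. The function names are my own, and the sample archive is a tiny stand-in built in memory, not a complete book.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_package_document(epub) -> str:
    """Return the path of the OPF file named in META-INF/container.xml."""
    with zipfile.ZipFile(epub) as archive:
        mimetype = archive.read("mimetype").decode("ascii").strip()
        if mimetype != "application/epub+zip":
            raise ValueError(f"unexpected mimetype: {mimetype!r}")
        root = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml names no rootfile")
    return rootfile.attrib["full-path"]


def make_sample_epub() -> io.BytesIO:
    """Build a tiny stand-in archive in memory (hypothetical, not a full book)."""
    container = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">\n'
        "  <rootfiles>\n"
        '    <rootfile full-path="OPS/content.opf" media-type="application/oebps-package+xml"/>\n'
        "  </rootfiles>\n"
        "</container>\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", container)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(find_package_document(make_sample_epub()))
```

Passing the path of a real .epub file to find_package_document should work the same way, since zipfile accepts either a file name or a file-like object.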

The second folder, OPS, contains the actual book.

Contents of the OPS folder

Contents of the OPS folder (continued)

In this book there are two subfolders, css and images, a number of HTML files, an OPF file, and an NCX file.

The actual text of the book is contained in HTML (or XHTML) files. For this book, there is a file for the title page, one file for each chapter, a footnotes file, and an information file about the publisher. The OPF file (content.opf) tells the eReader which files make up the book, and the order they come in if the book is read sequentially, as well as metadata for the book: information such as the author, publisher, copyright, etc. The NCX file (toc.ncx) contains the Table of Contents for the book.
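The way the OPF file ties the book together can be sketched in a few lines of Python. The sample below is a cut-down, hypothetical content.opf (the file names and identifier are invented for illustration; the real one lists every file in the book). The manifest maps ids to files, and the spine gives the reading order by referring to those ids.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, cut-down package document for illustration only.
SAMPLE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>The Cossacks</dc:title>
    <dc:creator>Leo Tolstoy</dc:creator>
    <dc:identifier id="id">example-id</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="title" href="title.html" media-type="application/xhtml+xml"/>
    <item id="ch1" href="chapter-001.html" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="title"/>
    <itemref idref="ch1"/>
  </spine>
</package>
"""


def read_package(opf_xml: str) -> tuple[str, str, list[str]]:
    """Return (title, author, files in reading order) from an OPF document."""
    root = ET.fromstring(opf_xml.encode("utf-8"))
    title = root.findtext("opf:metadata/dc:title", default="", namespaces=NS)
    author = root.findtext("opf:metadata/dc:creator", default="", namespaces=NS)
    # The manifest maps each item id to the file it stands for.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in root.findall("opf:manifest/opf:item", NS)
    }
    # The spine lists item ids in the order the book is read.
    spine = [
        manifest[ref.attrib["idref"]]
        for ref in root.findall("opf:spine/opf:itemref", NS)
    ]
    return title, author, spine


if __name__ == "__main__":
    print(read_package(SAMPLE_OPF))
```

Note that toc.ncx appears in the manifest but not in the spine: it is part of the book's files, but it is not a page to be read in sequence.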

The CSS files

The css subfolder contains CSS style sheets. These are linked from the <head> element of the HTML files as necessary, and define the display styles. The images subfolder contains the figures and other images to be displayed in the book. They are referenced by <img> tags in the HTML files.
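To show how a chapter file points at its style sheets and images, this Python sketch scans a small, made-up chapter for its link and img tags. The chapter markup and file names are hypothetical, and the class name is my own; the parser comes from the Python standard library.

```python
from html.parser import HTMLParser

# Hypothetical chapter markup; the file names are invented for illustration.
SAMPLE_CHAPTER = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Chapter 1</title>
  <link rel="stylesheet" type="text/css" href="css/style.css"/>
</head>
<body>
  <p>Some text.</p>
  <img src="images/figure-1.png" alt="Figure 1"/>
</body>
</html>
"""


class ResourceFinder(HTMLParser):
    """Collect the style sheets and images that a chapter refers to."""

    def __init__(self):
        super().__init__()
        self.stylesheets = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and attributes.get("rel") == "stylesheet":
            self.stylesheets.append(attributes.get("href"))
        elif tag == "img":
            self.images.append(attributes.get("src"))


def find_resources(xhtml: str) -> tuple[list, list]:
    """Return (style sheet paths, image paths) referenced by the markup."""
    finder = ResourceFinder()
    finder.feed(xhtml)
    return finder.stylesheets, finder.images


if __name__ == "__main__":
    print(find_resources(SAMPLE_CHAPTER))
```

The paths are relative to the chapter file, which is why the css and images subfolders sit alongside the HTML files inside OPS.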

The image files

All of these files, except the image files, are text files which can be easily edited. So, at least in theory, anyone with a simple text editor and the necessary knowledge can convert a text file into an eBook.
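The packaging step itself is small. The Python sketch below zips a set of hand-written text files into an EPUB container, following the container rule that the mimetype entry comes first and is stored uncompressed. The function name and the sample file contents are hypothetical placeholders, so the result would still need a complete OPF file, NCX file, and chapters before it would pass a validator.

```python
import io
import zipfile


def package_epub(files: dict[str, str]) -> bytes:
    """Zip text files into an EPUB container and return the archive bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry goes first and is stored without compression.
        archive.writestr(
            "mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED
        )
        # Everything else can be compressed as usual.
        for path, text in files.items():
            archive.writestr(path, text, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder contents; a real book needs complete files here.
    data = package_epub(
        {
            "META-INF/container.xml": "<container/>",
            "OPS/content.opf": "<package/>",
            "OPS/chapter-001.html": "<html/>",
        }
    )
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        print(archive.namelist())
```

Writing the returned bytes to a file with an .epub extension gives something reading software can at least open as an archive.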

Fortunately, free tools are available to do the job for those without the necessary knowledge. The most useful of these is Sigil. This package not only allows complete editing of an EPUB document, but also includes validators for HTML and EPUB formats, and tools for generating the OPF and NCX files (both of which have rather abstruse formats and are intolerant of errors).

Once you have created an EPUB book using Sigil, you have accomplished the hardest part of converting a manuscript into an eBook. The only remaining problem is to put it into a form readable by a Kindle, the most popular eReader. Again, there is a free tool for doing this: Calibre. Just add your new EPUB book to the Calibre library on your computer, and convert it to MOBI format.
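For anyone who would rather script the conversion than click through the library window, Calibre also installs a command-line tool called ebook-convert, which takes an input file and an output file. The Python sketch below wraps that call. The helper names and the file name are my own, and the conversion only runs if Calibre is installed and ebook-convert is on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def mobi_command(epub_path: str) -> list[str]:
    """Build the ebook-convert command that turns an EPUB into a MOBI file."""
    source = Path(epub_path)
    return ["ebook-convert", str(source), str(source.with_suffix(".mobi"))]


def convert_to_mobi(epub_path: str) -> Path:
    """Run the conversion and return the path of the new MOBI file."""
    if shutil.which("ebook-convert") is None:
        raise FileNotFoundError("ebook-convert not found; is Calibre installed?")
    command = mobi_command(epub_path)
    subprocess.run(command, check=True)  # raises if the conversion fails
    return Path(command[-1])


if __name__ == "__main__":
    # Hypothetical file name; this prints the command without running it.
    print(" ".join(mobi_command("the-cossacks.epub")))
```

The output format is chosen from the extension of the output file name, so the same helper could target another format by changing the suffix.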