Comparing Unicode and JPEG

Tim Golden

2007-10-01 10:10

I recently needed to explain to a colleague of mine how Unicode and encodings worked. This guy’s a programmer — although from a non-technical background — and doesn’t usually have an issue with technical concepts, but the usual explanations weren’t really getting through. I found that the explanation below helped him out, so I reproduce it here for anyone else’s benefit:

When you have an image which you save as a JPEG file, you’re taking something you see on screen (whose internal format you neither know nor care about) and saving it, encoded in the JPEG way, in a file whose extension is .jpg. The extension is a hint to the image viewer application that the contents of the file are a picture encoded in a certain way. Of course an image viewer might be able to work that out by sniffing the header, but the extension saves some time.

You could have saved it in some black-and-white format if, for example, you wanted a smaller file size and were prepared to lose some detail (namely, the colours) or if you knew that the colours involved were only black and white anyway. Or you might have used GIF if you knew that it would be more space-efficient for this particular image or that someone you were sending it to could only read GIFs. And so on. Ultimately, you know that any application which claims to be able to read a file of the format you’ve specified will present on screen the image you started off with.

Encoding text is much the same. You start with text which looks like something on the screen. Often it’s conventional Western characters (the unaccented letters a-z in upper and lower case); sometimes there are accents on top or extra characters; you can even have entirely different scripts, such as Chinese or Linear B. Ultimately, you want that to appear on someone else’s screen (or printout or browser) as it appears on yours.

So you save it in a format which you both know, and you say what the encoding (format) is. Some encodings will only allow you to store characters in a certain range (say, Western characters only or Russian characters only). Others will store everything. You make the same kind of decisions about losing data and about how faithful the result will be. You could use an encoding such as UTF-8, which is guaranteed to encode every codepoint in the Unicode universe, but that might be more expensive in terms of space. (Although UTF-8 does its best to encode common Western characters in fewer bytes, which is helpful if you’re using common Western characters!) When someone at the other end loads the text into their editor or browser, their application decodes it back into Unicode and displays the appropriate characters.
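To make that trade-off concrete, here is a minimal Python 3 sketch. The sample string is my own choice for illustration; it simply encodes the same text three ways and compares the results.

```python
# A short piece of text with two accented characters: "naïve café"
text = "na\u00efve caf\u00e9"

# UTF-8 can encode anything; accented letters cost two bytes each here.
utf8_bytes = text.encode("utf-8")

# Latin-1 covers Western European characters at one byte apiece.
latin1_bytes = text.encode("latin-1")

# ASCII has no room for the accented letters, so a strict encode fails...
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ...unless you accept losing detail, rather like saving in black and white.
lossy = text.encode("ascii", errors="replace")

# Decoding with the same encoding gives back the text you started with.
roundtrip = utf8_bytes.decode("utf-8")

print(len(text), len(latin1_bytes), len(utf8_bytes))  # 10 10 12
print(ascii_ok, lossy)                                # False b'na?ve caf?'
print(roundtrip == text)                              # True
```

The plain letters take one byte each in all three encodings; only the accented ones differ, either costing an extra byte (UTF-8) or being lost altogether (ASCII with replacement).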

But how does that application know what encoding was used in the first place? Well, that’s the unfortunate difference: there’s no universal way to indicate what encoding (format) text is in. There are some conventions (an HTTP charset header, for example, or a first line which looks like this: # -*- coding: utf-8 -*-) but none is universal. You could agree with the people with whom you exchange text to replace the common .txt extension with an encoding name, but that would only work within that group. There are ways of sniffing the encoding (ranging from the simple recognition of an initial byte-order mark to statistical analysis of the contents) but none is foolproof.
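The simplest kind of sniffing mentioned above, looking for a byte-order mark at the start of the data, might be sketched like this in Python 3. The function name sniff_bom is hypothetical, not part of any library; the BOM constants come from the standard codecs module.

```python
import codecs

def sniff_bom(data):
    """Guess an encoding from a leading byte-order mark (hypothetical helper).

    Returns a Python codec name, or None if no BOM is present.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Check UTF-32 before UTF-16: the UTF-32 little-endian BOM
    # begins with the same two bytes as the UTF-16 little-endian one.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None

print(sniff_bom(codecs.BOM_UTF8 + b"abc"))    # utf-8-sig
print(sniff_bom("\u00e9".encode("utf-16")))   # utf-16
print(sniff_bom(b"plain text"))               # None

# Without a marker, the same bytes decode "successfully" in more than one way:
raw = b"caf\xc3\xa9"
print(raw.decode("utf-8"))    # café
print(raw.decode("latin-1"))  # cafÃ©
```

Most text carries no BOM at all, so a None result tells you nothing, and the last two lines show why guessing is risky: both decodes succeed, but only one is what the writer meant.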

Ultimately, though, it works in the same way in which you trust your image viewers to unpack image data from its format into some native Image type and possibly to pack it into another format elsewhere. Your applications have to unpack encoded data from a file into a native Unicode type and keep it that way until it needs to be written out to storage or sent to another system, at which point you encode it again, advertising the encoding as best you can.
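As a rough illustration of that decode-on-the-way-in, encode-on-the-way-out pattern, here is a small Python 3 sketch. The transcode function and the file names are made up for the example, and it assumes you already know the source encoding.

```python
import os
import tempfile

def transcode(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Read text in one encoding and write it out in another (illustrative)."""
    # Decode on the way in: bytes on disk become a native Unicode str.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()

    # ... any real work happens here, on Unicode text, not on bytes ...

    # Encode on the way out, using an encoding you can tell the reader about.
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")

    # Pretend someone sent us a Latin-1 file containing "café".
    with open(src, "wb") as f:
        f.write("caf\u00e9".encode("latin-1"))

    text = transcode(src, dst, "latin-1")
    with open(dst, "rb") as f:
        out_bytes = f.read()

print(text)       # café
print(out_bytes)  # b'caf\xc3\xa9'
```

The text in the middle is the same whichever encoding sits at either end, just as the picture is the same whether it arrived as a JPEG or leaves as a GIF.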