2011-07-17

Using gencodec to make a custom character mapping

One of the problems I face in the QTI migration tool is markup that looks like this:

<mattext>The circumference of a circle diameter 1 is given by the mathematical constant: </mattext>
<mattext charset="greek">p</mattext>

In XML, the character encoding of a document is detected according to various rules, starting from information available before the XML stream is parsed (such as transport headers or a byte order mark) and culminating in the encoding declaration in the XML declaration at the top of the file:

<?xml version = "1.0" encoding = "UTF-8"?>

For this reason, the charset parameter in QTI version 1 is of limited value. At best it might provide a hint about an appropriate font to use when rendering the element. This is not a huge problem these days, but when QTI v1 was written it was common for document renderings to be peppered with large squares indicating that the selected font had no glyph for the required character. These days renderers are smarter about selecting default fonts, which lets developers display arbitrary Unicode text.
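
As a rough illustration of this point, the sketch below parses a small made-up fragment with Python's standard xml.etree.ElementTree. The fragment, the wrapping material element and the ISO-8859-1 declaration are assumptions chosen for the demonstration, not taken from a real QTI file. The parser decodes the bytes using the encoding declaration alone; charset comes through as an ordinary attribute and the 'p' is still a plain 'p'.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration; declared (and encoded) as ISO-8859-1.
DOC = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<material>'
    b'<mattext>caf\xe9</mattext>'
    b'<mattext charset="greek">p</mattext>'
    b'</material>'
)

root = ET.fromstring(DOC)
plain, greek = root.findall("mattext")

# The encoding declaration decides how the bytes become characters.
print(ascii(plain.text))

# The charset attribute is just data; the parser does not act on it.
print(greek.get("charset"), ascii(greek.text))
```

Any conversion of the 'p' therefore has to happen in the migration code, after parsing.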

So you would think that charset is redundant, but there is one situation where we do need to take note: the Symbol font. The problem is explained well in this article: Symbol font – Unicode alternatives for Greek and special characters in HTML.  The use of 'greek' in the QTI v1 examples is clearly intended to indicate use of the Symbol font in a similar way, not the use of the Greek part of ISO-8859 (ISO-8859-7). The Symbol font is used a lot in older mathematical questions. You can play around with the codec on this neat little web page: Symbol font to Unicode converter.

According to the above article, the Unicode character for the lower-case letter 'p', when rendered in the Symbol font, actually appears to the user like this: π, known as Greek small letter pi.
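
In Python terms this amounts to a lookup table. The sketch below is only an illustration: its single entry comes from the example above, the name SYMBOL_EXCERPT is made up, and str.translate stands in for a real codec.

```python
import unicodedata

# One hand-written entry: the code point of 'p' maps to U+03C0.
SYMBOL_EXCERPT = {ord("p"): 0x03C0}

converted = "p".translate(SYMBOL_EXCERPT)

# Show the code point and its official Unicode name.
print(f"U+{ord(converted):04X}", unicodedata.name(converted))
```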

The problem for my Python script is that I need to map these characters to the target Unicode forms before writing them out to the QTI version 2 file. This is where the neat gencodec.py script comes in. I don't know where this is documented other than in the gencodec source file itself, but it is a very useful utility!

The synopsis of the tool is:

This script parses Unicode mapping files as available from the Unicode
site (ftp://ftp.unicode.org/Public/MAPPINGS/) and creates Python codec
modules from them.

So I downloaded the following mapping to a directory called 'codecs' on my laptop:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/SYMBOL.TXT
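
To give a feel for what gencodec has to digest, here is a minimal sketch of reading such a mapping file. It assumes the layout these files generally follow: a hexadecimal byte value, a hexadecimal Unicode code point, then a '#' comment. The sample lines are written by hand in that style rather than copied from SYMBOL.TXT, and parse_mapping is a hypothetical helper, not part of gencodec.

```python
SAMPLE = """\
# Hand-written lines in the style of a Unicode mapping file
0x61\t0x03B1\t# GREEK SMALL LETTER ALPHA
0x62\t0x03B2\t# GREEK SMALL LETTER BETA
0x70\t0x03C0\t# GREEK SMALL LETTER PI
"""


def parse_mapping(text):
    """Return {byte value: code point} for simple one-to-one lines."""
    table = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            table[int(fields[0], 16)] = int(fields[1], 16)
        except ValueError:
            # Skip anything that is not a plain pair of hex numbers.
            continue
    return table


table = parse_mapping(SAMPLE)
print(hex(table[0x70]))
```

The real script does considerably more than this (it writes out a complete codec module), so treat the sketch as a reading aid only.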

Then I ran the gencodec script:

$ python gencodec.py codecs pyslet
converting SYMBOL.TXT to pysletsymbol.py and pysletsymbol.mapping

And confirmed that the mapping was working using the interpreter:

$ python
Python 2.7.1 (r271:86882M, Nov 30 2010, 09:39:13) 
[GCC 4.0.1 (Apple Inc. build 5494)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('p','symbol')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: symbol
>>> import pysletsymbol
>>> reg=pysletsymbol.getregentry()
>>> import codecs
>>> def SymbolSearch(name):
...   if name=='symbol': return reg
...   else: return None
... 
>>> codecs.register(SymbolSearch)
>>> unicode('p','symbol')
u'\u03c0'
>>> print unicode('p','symbol')
π
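
The session above is Python 2. Python 3 has no unicode() built-in, so the equivalent check would decode a bytes object instead. The following is a minimal sketch under some assumptions: it does not import the generated pysletsymbol module, but builds a stand-in codec from a one-entry table using the standard charmap functions in the codecs module, and registers it under the made-up name symbol_demo.

```python
import codecs

# Stand-in for the generated table: just the entry for 'p'.
DECODING_MAP = {0x70: 0x03C0}
ENCODING_MAP = codecs.make_encoding_map(DECODING_MAP)


def _decode(data, errors="strict"):
    # Returns a (text, bytes consumed) tuple, as codecs expect.
    return codecs.charmap_decode(data, errors, DECODING_MAP)


def _encode(text, errors="strict"):
    # Returns a (bytes, characters consumed) tuple.
    return codecs.charmap_encode(text, errors, ENCODING_MAP)


def symbol_search(name):
    """Codec search function: answer only for our made-up name."""
    if name == "symbol_demo":
        return codecs.CodecInfo(name="symbol_demo", encode=_encode, decode=_decode)
    return None


codecs.register(symbol_search)

decoded = b"p".decode("symbol_demo")
print(f"U+{ord(decoded):04X}")
```

With the real generated module, the search function would simply return the result of getregentry(), as in the session above. Codec names are normalised to lower case before the search function is called, so the comparison should use a lower-case name.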

In previous versions of the migration tool I didn't include Symbol font mapping because I thought it would be too laborious to create the mapping. I was wrong; future versions will do this mapping automatically.