
2016-01-03

sphinx: ignoring sys.path?

I'm a big fan of using Sphinx for documentation and I use it for my Pyslet Python module. For some time now I've had an infuriating problem with generating documentation that I've been putting off solving on the assumption that it is caused by some weird and hard to discover configuration issue.

The problem is that whenever I run sphinx to build my package documentation it uses the currently installed version of the package and not the local copy I'm working on. This is particularly annoying because I have various versions of Python installed on my Mac (going back some years) and sphinx always seems to run with the native Python that came with Xcode (Python 2.7.10 at the time of writing) whereas typing "python" at the terminal uses one of my older custom builds. I guess I'd always thought there was something weird going on and just relied on the workaround, which was to run:

$ /usr/bin/python setup.py install

...every time I wanted to rebuild the docs. That forced the latest version of the package to be installed into the Python interpreter that was going to be run by sphinx.

I had already edited my sphinx conf.py file as follows:

sys.path.insert(0, os.path.abspath('..'))

That should have been enough to put the correct path to the working copy in the search order before anything that was installed in the site packages but, alas, it just didn't seem to work. I checked that sys.path was correct, I even started the interpreter on the command line to ensure the working copies were getting loaded after this modification. They were, it seemed inexplicable!

I finally solved the mystery today, and thought I'd blog the answer in case anyone else does the same stupid thing I did.

Earlier today I added a new module to my package and running the docs gave me an import error. Previously I'd assumed that the installed version was simply shadowing my working copy, but an ImportError suggested that sys.path was being ignored entirely. I got so frustrated that I edited the autodoc.py module to check that sys.path was set correctly just before it calls __import__, and then I set a breakpoint to see if I could see what was happening. I tried setting sys.path to a single directory, the one containing the working version of my package. Still the import brought up the version installed in site packages. How is this possible?

When you import a module with a name like "package.modA" you're actually (sort of) importing the package first and not the module. As a result, if the package has already been imported, then when you execute import package.modB Python will ignore sys.path because it already knows where the package is. This was exactly what was going wrong for me...
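
You can see this caching in action with a package from the standard library. Once the package object is in sys.modules, changing sys.path has no effect on where its submodules are loaded from (the '/nonexistent' entry is obviously illustrative, and your paths will differ):

>>> import sys
>>> import logging
>>> sys.modules['logging'].__file__
'/usr/lib/python2.7/logging/__init__.pyc'
>>> sys.path.insert(0, '/nonexistent')
>>> import logging.handlers
>>> logging.handlers.__file__
'/usr/lib/python2.7/logging/handlers.pyc'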

The Solution

My package contains a trivial module called 'info' which contains a few strings that I use to describe the package both within the package itself and also in setup.py. At some point I must have gone in to the conf.py I use for Sphinx and optimised away some of the redundant text by importing the strings directly from my package...

#
# conf.py
#
import sys, os

import pyslet.info as info

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('..'))

# ... [snip]

# General information about the project.
project = info.title_name
copyright = info.copyright

Notice that conf.py imports my 'info' module before I change sys.path; as a result, it gets imported from site packages and every future import from my pyslet package will come from site packages too. The solution is simple, and sanity has been restored:

#
# conf.py
#
import sys, os

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('..'))
import pyslet.info as info

# ... [snip]

# General information about the project.
project = info.title_name
copyright = info.copyright

And now my documentation builds correctly from the working copy. I cannot believe that I'm the first person to trip over this issue. A bit of Googling (after the fact) reveals this module, for example: CodeChat conf.py which uses a similar technique to mine.

Happy New Year!

2014-11-14

Basic Authentication, SSL and Pyslet's HTTP/OData client

Pyslet is my Python package for Standards in Learning, Education and Training and represents a packaging up of the core of my QTI migration script code in a form that makes it easier for other developers to use. Earlier this year I released Pyslet to PyPi and moved development to Github to make it easier for people to download, install and engage with the source code.

Note: this article updated 2017-05-24 with code correction (see comments for details).

Warning: The code in this article will work with the latest Pyslet master from Github, and with any distribution on or later than pyslet-0.5.20141113. At the time of writing the version on PyPi has not been updated!

A recent issue that came up concerns Pyslet's HTTP client. The client is the base class for Pyslet's OData client. In my own work I often use this client to access OData feeds protected with HTTP's basic authentication but I've never properly documented how to do it. There are two approaches...

The simplest way, and the way I used to do it, is to override the client object itself and add the Authorization header at the point where each request is queued.

from pyslet.http.client import Client

class MyAuthenticatedClient(Client):

    # add an __init__ method to set some credentials 
    # in the client

    def queue_request(self, request):
        # add in the authorization credentials
        if (self.credentials is not None and
                not request.has_header("Authorization")):
            request.set_header('Authorization',
                               str(self.credentials))
        super(MyAuthenticatedClient, self).queue_request(request)

This works OK but it forces the issue a bit and will result in the credentials being sent to all URLs, which you may not want. The credentials object should be an instance of pyslet.http.auth.BasicCredentials which takes care of correctly formatting the header. Here is some sample code to create that object:

from pyslet.http.auth import BasicCredentials
from pyslet.rfc2396 import URI

credentials = BasicCredentials()
credentials.userid = "user@example.com"
credentials.password = "secretPa$$word"
credentials.protectionSpace = URI.from_octets(
    'https://www.example.com/mypage').get_canonical_root()

With the above code, str(credentials) returns the string: 'Basic dXNlckBleGFtcGxlLmNvbTpzZWNyZXRQYSQkd29yZA==' which is what you'd expect to pass in the Authorization header.
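
There's no magic in that string: it is just the word "Basic" followed by the base64 encoding of "userid:password", something you can verify with the standard library alone:

>>> import base64
>>> "Basic " + base64.b64encode("user@example.com:secretPa$$word")
'Basic dXNlckBleGFtcGxlLmNvbTpzZWNyZXRQYSQkd29yZA=='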

To make this code play more nicely with the HTTP standard I added some core-support to the HTTP client itself, so you don't need to override the class anymore. The HTTP client now has a credential store and an add_credentials method. Once added, the following happens when a 401 response is received:

  1. The client iterates through any received challenges
  2. Each challenge is matched against the stored credentials
  3. If matching credentials are found then an Authorization header is added and the request resent
  4. If the request receives another 401 response the credentials are removed from the store and we go back to (1)

This process terminates when there are no more credentials that match any of the challenges or when a code other than 401 is received.

If the matching credentials are BasicCredentials (and that's the only type Pyslet supports out of the box!), then some additional logic gets activated on success. RFC 2617 says that for basic authentication, a challenge implies that all paths "at or deeper than the depth of the last symbolic element in the path field" fall into the same protection space. Therefore, when credentials are used successfully, Pyslet adds the path to the credentials using BasicCredentials.add_success_path. Next time a request is sent to a URL on the same server with a path that meets this criterion the Authorization header will be added pre-emptively.

If you want to pre-empt the 401 handling completely then you just need to add a suitable path to the credentials before you add them to the client. So if you know your credentials are good for everything in /website/~user/ you could continue the above code like this:

credentials.add_success_path('/website/~user/')

That last slash is really important: if you leave it off, it will add everything in '/website/' to your protection space, which is probably not what you want.
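
To make the difference concrete, using the credentials object from earlier:

credentials.add_success_path('/website/~user')   # protects all of /website/
credentials.add_success_path('/website/~user/')  # protects just /website/~user/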

SSL

If you're going to pass basic auth credentials around you really should be using HTTPS. Python makes it a bit tricky to use HTTPS and be sure that you are using a trusted connection. Pyslet tries to make this a little bit easier. Here's what I do.

  1. With Firefox, go to the site in question and check that SSL is working properly
  2. Export the certificate from the site in PEM format and save it to disk, e.g., www.example.com.crt
  3. Repeat for any other sites you want your python script to work with
  4. Concatenate the files together and save them to, say, 'certificates.pem'
  5. Pass this file name to the HTTP (or OData) client constructor:

from pyslet.http.client import Client

my_client = Client(ca_certs='certificates.pem')
my_client.add_credentials(credentials)

In this code, I've assumed that the credentials were created as above. To be really sure you are secure here, try grabbing a file from a different site or, even better, generate a self-signed certificate and use that instead. (The master version of Pyslet currently has such a certificate ready made in unittests/data_rfc2616/server.crt). Now pass that file for ca_certs and check that you get SSL errors! If you don't, something is broken and you should proceed with caution, or you may just be on a Mac (see notes in Is Python's SSL module correctly validating certificates... for details). And don't pass None for ca_certs as that tells the ssl module not to check at all!

If you don't like messing around with the certificates, and you are using a machine and network trustworthy enough that you would happily do your internet banking from it, then the following can be used to stand in for the browser method:

import ssl, string
import pyslet.rfc2396 as uri

certs = []
for s in ('https://www.example.com', 'https://www.example2.com', ):
    # add other sites to the above tuple as you like
    url = uri.URI.from_octets(s)
    certs.append(ssl.get_server_certificate(url.get_addr(),
                 ssl_version=ssl.PROTOCOL_TLSv1))
# write the bundle once, after all the certificates have been fetched
with open('certificates.pem', 'wb') as f:
    f.write(string.join(certs, ''))

Passing the ssl_version is optional above, but the default setting in many Python installations will use the discredited SSLv3 or worse and your server may refuse to serve you (I know mine does!). Set it to a protocol you trust.

Remember that you'll have to do this every so often because server certificates expire. You can always grab the certificate authority's certificate instead (and thereby trust a whole slew of sites at once) but if you're going that far then there are better recipes for finding and re-using the built-in machine certificate store anyway. The beauty of this method is that you can self-sign a server certificate you trust and connect to it securely with a Python client without having to mess around with certificate authorities at all, provided you can safely courier the certificate from your server to your client that is! If you are one of the growing number of people who think the whole trust thing is broken anyway since Snowden then this may be an attractive option.

With thanks to @bolhovsky on Github for bringing the need for this article to my attention.

2014-05-26

Adding OData support to Django with Pyslet: First Thoughts

A couple of weeks ago I got an interesting tweet from @d34dl0ck, here it is:

This got me thinking, but as I know very little about Django I had to do a bit of research first. Here's my read-back of what Django's data layer does in the form of a concept mapping from OData to Django. In this table the objects are listed in containment order and the use case of using OData to expose data managed by a Django-based website is assumed. (See below for thoughts on consuming OData in Django as if it were a data source.)

OData Concept: DataServices
Django Concept: The Django website itself; the purpose of OData is to provide access to your application's data-layer through a standard API for machine-to-machine communication rather than through an HTML-based web view for human consumption.
Pyslet Concept: An instance of the DataServices class, typically parsed from a metadata XML file.

OData Concept: Schema
Django Concept: No direct equivalent. In OData, the purpose of the schema is to provide a namespace in which definitions of the other elements take place. In Django this information will be spread around your Python source code in the form of class definitions that support the remaining concepts.
Pyslet Concept: An instance of the Schema class, typically parsed from a metadata XML file.

OData Concept: EntityContainer
Django Concept: The database. An OData service can define multiple containers but there is always a default container - something that corresponds closely with the way Django links to multiple databases. Most OData services probably only define a single container and I would expect that most Django applications use the default database. If you do define custom database routers to map different models to different databases then that information would need to be represented in the corresponding Schema(s).
Pyslet Concept: An EntityContainer is defined by an instance of the EntityContainer class but this instance is handed to a storage layer during application startup and this storage layer class binds concrete implementations of the data access API to the EntitySets it contains.

OData Concept: EntitySet
Django Concept: Your model class. A model class maps to a table in the Django database. In OData the metadata file contains the information about which container contains an EntitySet and the EntityType definition in that file contains the actual definitions of the types and field names. In contrast, in Django these are defined using class attributes in the Python code.
Pyslet Concept: Pyslet sticks closely to the OData API here and parses definitions from the metadata file. As a result an EntitySet instance is created that represents this part of the model and it is up to the object responsible for interfacing to the storage layer to provide concrete bindings.

OData Concept: Entity
Django Concept: An instance of a model class.
Pyslet Concept: An instance of the Entity object, typically instantiated by the storage object bound to the EntitySet.

Where do you start?

Step 1: As you can see from the above table, Pyslet depends fairly heavily on the metadata file so a good way to start would be to create a metadata file that corresponds to the parts of your Django data model you want to expose. You have some freedom here but if you are messing about with multiple databases in Django it makes sense to organise these as separate entity containers. You can't create relationships across containers in Pyslet which mirrors the equivalent restriction in Django.

Step 2: You now need to provide a storage object that maps Pyslet's DAL onto the Django DAL. This involves creating a sub-class of the EntityCollection object from Pyslet. To get a feel for the API my suggestion would be to create a class for a specific model initially and then, once this is working, consider how you might use Python's built-in introspection to write a more general object.

To start with, you don't need to do too much. EntityCollection objects are just like dictionaries but you only need to override itervalues and __getitem__ to get some sort of implementation going. There are simple wrappers that will (inefficiently) handle ordering and filtering for you to start with so itervalues can be very simple...

def itervalues(self):
    return self.OrderEntities(
        self.ExpandEntities(
            self.FilterEntities(
                self.entityGenerator())))

All you need to do is write the entityGenerator method (the name is up to you) and yield Entity instances from your Django model. This looks pretty simple in Django: something like Customer.objects.all(), where Customer is the name of a model class, would appear to return all customer instances. You need to yield an Entity object from Pyslet's DAL for each customer instance and populate the property values from the fields of the returned model instance.

Implementing __getitem__ is probably also very easy, especially when you are using simple keys. Something like Customer.objects.get(pk=1) followed by a similar mapping to the above seems like it would work for implementing basic resource look-up by key. Look at the in-memory collection class implementation for the details of how to check the filter and populate the field values; it's in pyslet/odata2/memds.py.
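
Putting those two ideas together, here's a rough sketch of how both methods might look for a hypothetical Customer model. The Pyslet calls (NewEntity, SetKey, SetFromValue) are the same ones used elsewhere in this series, but the hard-coded property list, and the assumption that the OData property names match the Django field names, are mine:

def entityGenerator(self):
    # one Pyslet Entity per Django model instance
    for obj in self.djangoModel.objects.all():
        e = self.NewEntity()
        e.SetKey(obj.pk)
        for name in ('CompanyName', 'Phone'):   # illustrative property names
            e[name].SetFromValue(getattr(obj, name))
        yield e

def __getitem__(self, key):
    # simple look-up by key; a full implementation would also check the filter
    try:
        obj = self.djangoModel.objects.get(pk=key)
    except self.djangoModel.DoesNotExist:
        raise KeyError(key)
    e = self.NewEntity()
    e.SetKey(obj.pk)
    for name in ('CompanyName', 'Phone'):
        e[name].SetFromValue(getattr(obj, name))
    return e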

Probably the hardest part of defining an EntityCollection object is getting the constructor right. You'll want to pass through the Model class from Django so that you can make calls like the above:

def __init__(self, djangoModel, **kwArgs):
    self.djangoModel = djangoModel
    super(DjangoCollection, self).__init__(**kwArgs)

Step 3: Load the metadata from a file, then bind your EntityCollection class or classes to the EntitySets. Something like this might work:

import pyslet.odata2.metadata as edmx
doc=edmx.Document()
with open('DjangoAppMetadata.xml','rb') as f:
    doc.Read(f)
customers=doc.root.DataServices['DjangoAppSchema.DjangoDatabase.Customers']
# customers is an EntitySet instance
customers.Bind(DjangoCollection,djangoModel=Customer)

The Customer object here is your Django model object for Customers and the DjangoCollection object is the EntityCollection object you created in Step 2. Each time someone opens the customers entity set a new DjangoCollection object will be created and Customer will be passed as the djangoModel parameter.

Step 4: Test that the model is working by using the interpreter or a simple script to open the customers object (the EntitySet) and make queries with the Pyslet DAL API. If it works, you can wrap it with an OData server class and just hook the resulting wsgi object to your web server and you have hacked something together.

Post hack

You'll want to look at Pyslet's expression objects and figure out how to map these onto the query objects used by Django. Although OData provides a rich query syntax you don't need to support it all, just reject stuff you don't want to implement. Simple queries look like they'd map to things you can pass to the filter method in Django fairly easily. In fact, one of the problems with OData is that it is very general - almost SQL over the web - and your application's data layer is probably optimised for some queries and not others. Do you want to allow people to search your zillion-record table using a query that forces a full table scan? Probably not.
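
For the simple comparisons the translation might end up as little more than a lookup table from OData operators to Django field lookups. The sketch below is illustrative only: Pyslet actually hands you a CommonExpression tree that you'd have to walk, and the operator and property names here are placeholders:

from django.db.models import Q

ODATA_TO_DJANGO = {
    'eq': '{0}__exact',
    'lt': '{0}__lt',
    'gt': '{0}__gt',
    'substringof': '{0}__contains',
}

def simple_filter_to_q(op, prop, value):
    # translate a pre-parsed (operator, property, value) triple into a Q object
    try:
        lookup = ODATA_TO_DJANGO[op].format(prop.lower())
    except KeyError:
        # reject anything you can't (or don't want to) do efficiently
        raise NotImplementedError(op)
    return Q(**{lookup: value})

# e.g. simple_filter_to_q('substringof', 'ProductName', 'one')
#      -> Q(productname__contains='one')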

You'll also want to look at navigation properties which map fairly neatly to the relationship fields. The Django DAL and Pyslet's DAL are not miles apart here so you should be able to create NavigationCollection objects (equivalent to the class you created in Step 2 above) for these. At this point, the power of OData will begin to come alive for you.

Making it Django-like

I'm not an expert on what is and is not Django like but I did notice that there is a Feed concept for exposing RSS in Django. If the post hack process has left you with a useful implementation then some sort of OData equivalent object might be a useful addition. Given that Django tends to do much of the heavy lifting you could think about providing an OData feed object. It probably isn't too hard to auto-generate the metadata from something like class attributes on such an object. Pyslet's OData server is a wsgi application so provided Django can route requests to it you'll probably end up with something that is fairly nicely integrated - even if it can't do that out of the box it should be trivial to provide a simple Django request handler that fakes a wsgi call.

Consuming OData

Normally you think of consuming OData as being easier than providing it but for Django you'd be tempted to consider exposing OData as a data source, perhaps as an auxiliary database containing some models that are externally stored. This would allow you to use the power of Django to create an application which mashed up data from OData sources as if that data were stored in a locally accessible database.

This appears to be a more ambitious project: Django non-rel appears to be a separate project and it isn't clear how easy it would be to intermingle data coming from an OData source with data coming from local databases. It is unlikely that you'd want to use OData for all the data in your application. The alternative might be to try and write a Python DB API interface for Pyslet's DAL and then get Django treating it like a proper database. That would mean parsing SQL, which is nasty, but it might be the lesser of two evils.

Of course, there's nothing stopping you using Pyslet's builtin OData client class directly in your code to augment your custom views with data pulled from an external source. One of the features of Pyslet's OData client is that it treats the remote server like a data source, keeping persistent HTTP connections open, managing multi-threaded access and pipelining requests to improve throughput. That should make it fairly easy to integrate into your Django application.

2014-05-12

A Dictionary-like Python interface for OData Part III: a SQL-backed OData Server

This is the third and last part of a series of three posts that introduce my OData framework for Python. To recap:

  1. In Part I I introduced a new data access layer I've written for Python that is modelled on the conventions of OData. In that post I validated the API by writing a concrete implementation in the form of an OData client.
  2. In Part II I used the same API and wrote a concrete implementation using a simple in-memory storage model. I also introduced the OData server functionality to expose the API via the OData protocol.
  3. In this part, I conclude this mini-series with a quick look at another concrete implementation of the API which wraps Python's DB API allowing you to store data in a SQL environment.

As before, you can download the source code from the QTIMigration Tool & Pyslet home page. I wrote a brief tutorial on using the SQL backed classes to take care of some of the technical details.

Rain or Shine?

To make this project a little more interesting I went looking for a real data set to play with. I'm a bit of a weather watcher at home and for almost 20 years I've enjoyed using a local weather station run by a research group at the University of Cambridge. The group is currently part of the Cambridge Computer Laboratory and the station has moved to the William Gates building.

The Database

The SQL implementation comes in two halves. The base classes are as close to standard SQL as I could get and then a small 'shim' sits over the top which binds to a specific database implementation. The Python DB API takes you most of the way, including helping out with the correct form of parameterisation to use. For this example project I used SQLite because the driver is typically available in Python implementations straight out of the box.

I wrote the OData-style metadata document first and used it to automatically generate the CREATE TABLE commands but in most cases you'll probably have an existing database or want to edit the generated scripts and run them by hand. The main table in my schema got created from this SQL:

CREATE TABLE "DataPoints" (
    "TimePoint" TIMESTAMP NOT NULL,
    "Temperature" REAL,
    "Humidity" SMALLINT,
    "DewPoint" REAL,
    "Pressure" SMALLINT,
    "WindSpeed" REAL,
    "WindDirection" TEXT,
    "WindSpeedMax" REAL,
    "SunRainStart" REAL,
    "Sun" REAL,
    "Rain" REAL,
    "DataPointNotes_ID" INTEGER,
    PRIMARY KEY ("TimePoint"),
    CONSTRAINT "DataPointNotes" FOREIGN KEY ("DataPointNotes_ID") REFERENCES "Notes"("ID"))
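
If you're building the database from scratch rather than wrapping an existing one, Pyslet can provision the tables for you from the same metadata model once you've created the container object (as in the session below). If I remember the API correctly the call is just this (check the SQL tutorial mentioned above if it has moved):

container.CreateAllTables()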

To expose the database via my new data-access-layer API you just load the XML metadata, create a SQL container object containing the concrete implementation and then you can access the data in exactly the same way as I did in Parts I and II. The code that consumes the API doesn't need to know if the data source is an OData client, an in-memory dummy source or a full-blown SQL database. Once I'd loaded the data, here is a simple session with the Python interpreter that shows you the API in action.

>>> import pyslet.odata2.metadata as edmx
>>> import pyslet.odata2.core as core
>>> doc=edmx.Document()
>>> with open('WeatherSchema.xml','rb') as f: doc.Read(f)
... 
>>> from pyslet.odata2.sqlds import SQLiteEntityContainer
>>> container=SQLiteEntityContainer(filePath='weather.db',containerDef=doc.root.DataServices['WeatherSchema.CambridgeWeather'])
>>> weatherData=doc.root.DataServices['WeatherSchema.CambridgeWeather.DataPoints']
>>> collection=weatherData.OpenCollection()
>>> collection.OrderBy(core.CommonExpression.OrderByFromString('WindSpeedMax desc'))
>>> collection.SetPage(5)
>>> for e in collection.iterpage(): print "%s: Max wind speed: %0.1f mph"%(unicode(e['TimePoint'].value),e['WindSpeedMax'].value*1.15078)
... 
2002-10-27T10:30:00: Max wind speed: 85.2 mph
2004-03-20T15:30:00: Max wind speed: 82.9 mph
2007-01-18T14:30:00: Max wind speed: 80.6 mph
2004-03-20T16:00:00: Max wind speed: 78.3 mph
2005-01-08T06:00:00: Max wind speed: 78.3 mph

Notice that the container itself isn't needed when accessing the data because the SQLiteEntityContainer __init__ method takes care of binding the appropriate collection classes to the model passed in. Unfortunately the dataset doesn't go all the way back to the great storm of 1987 which is a shame as at the time I was living in a 5th floor flat perched on top of what I was reliably informed was the highest building in Cambridge not to have some form of structural support. I woke up when the building shook so much my bed moved across the floor.

Setting up a Server

I used the same technique as I did in Part II to wrap the API with an OData server and then had some real fun getting it up and running on Amazon's EC2. Pyslet requires Python 2.7 but EC2 Linux comes with Python 2.6 out of the box. Thanks to this blog article for help with getting Python 2.7 installed. I also had to build mod_wsgi from scratch in order to get it to pick up the version I wanted. Essentially here's what I did:

# Python 2.7 install
sudo yum install make automake gcc gcc-c++ kernel-devel git-core -y
sudo yum install python27-devel -y
# Apache install
#  Thanks to http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-LAMP.html
sudo yum groupinstall -y "Web Server"
sudo service httpd start
sudo chkconfig httpd on

And to get mod_wsgi working with Python2.7...

sudo bash
cd
yum install httpd-devel -y
mkdir downloads
cd downloads
wget http://modwsgi.googlecode.com/files/mod_wsgi-3.4.tar.gz
tar -xzvf mod_wsgi-3.4.tar.gz
cd mod_wsgi-3.4
./configure --with-python=/usr/bin/python2.7
make
make install
# Optional check to ensure that we've got the correct Python linked
# you should see the 2.7 library linked
ldd /etc/httpd/modules/mod_wsgi.so
service httpd restart

To drive the server with mod_wsgi I used a script like this:

#! /usr/bin/env python

import logging, os.path
import pyslet.odata2.metadata as edmx
from pyslet.odata2.sqlds import SQLiteEntityContainer
from pyslet.odata2.server import ReadOnlyServer

HOME_DIR=os.path.split(os.path.abspath(__file__))[0]
SERVICE_ROOT="http://odata.pyslet.org/weather"

logging.basicConfig(filename='/var/www/wsgi-log/python.log',level=logging.INFO)

doc=edmx.Document()
with open(os.path.join(HOME_DIR,'WeatherSchema.xml'),'rb') as f:
    doc.Read(f)

container=SQLiteEntityContainer(filePath=os.path.join(HOME_DIR,'weather.db'),
    containerDef=doc.root.DataServices['WeatherSchema.CambridgeWeather'])

server=ReadOnlyServer(serviceRoot=SERVICE_ROOT)
server.SetModel(doc)

def application(environ, start_response):
    return server(environ, start_response)

I'm relying on the fact that Apache is configured to run Python internally and that my server object persists between calls. I think by default mod_wsgi serialises calls to the application method but a smarter configuration with a multi-threaded daemon would be OK because the server and container objects are thread safe. There are limits to the underlying SQLite module of course so you may not gain a lot of performance this way but a proper database would help.
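
For reference, a daemon mode configuration along these lines should do the trick; the process group name and paths here are illustrative, not from my actual setup:

WSGIDaemonProcess weather processes=2 threads=4
WSGIProcessGroup weather
WSGIScriptAlias /weather /home/ec2-user/weather/wsgi_server.py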

Try it out!

If you were watching carefully you'll see that the above script uses a public service root. So let's try the same query but this time using OData. Here it is in Firefox:

Notice that Firefox recognises that the OData feed is an Atom feed and displays the syndication title and updated date. I used the metadata document to map the temperature and the date of the observation to these (you can see they are the same data points as above by the matching dates). The windiest days are never particularly hot or cold in Cambridge because they are almost always associated with Atlantic storms and the sea temperature just doesn't change that much.

The server is hosted at http://odata.pyslet.org/weather

2014-02-24

A Dictionary-like Python interface for OData Part II: a Memory-backed OData Server

In my previous post, A Dictionary-like Python interface for OData I introduced a new sub-package I've added to Pyslet to implement support for OData version 2. You can download the latest version of the Pyslet package from the QTI Migration Tool & Pyslet home page.

To recap, I've decided to set about writing my own data access layer for Python that is modelled on the conventions of OData. I've validated the API by writing a concrete implementation in the form of an OData client. In this post I'll introduce the next step in the process which is a simple alternative implementation that uses a different underlying storage model, in other words, an implementation which uses something other than a remote OData server. I'll then expose this implementation as an OData server to validate that my data access layer API works from both perspectives.

Metadata

Unlike other frameworks for implementing OData services, Pyslet starts with the metadata model: it is not automatically generated from your code; you write it yourself. This differs from the object-first approach taken by other frameworks, illustrated here:

This picture is typical of a project using something like Microsoft's WCF. Essentially, there's a two-step process. You use something like Microsoft's entity framework to generate classes from a database schema, customise the classes a little and then the metadata model is auto-generated from your code model. Of course, you can go straight to code and implement your own code model that implements the appropriate queryable interface but this would typically be done for a specific model.

Contrast this with the approach taken by Pyslet where the entities are not model-specific classes. For example, when modelling the Northwind service there is no Python class called Product as there would be in the approach taken by other frameworks. Instead there is a generalised implementation of Entity which behaves like a dictionary. The main difference is probably that you'll use supplier['Phone'] instead of simply supplier.phone or, if you'd have gone down the getter/setter route, supplier.GetPhone(). In my opinion, this works better than a tighter binding for a number of reasons, but particularly because it makes the user more mindful of when data access is happening and when it isn't.

Using a looser binding also helps prevent the type of problems I had during the development of the QTI specification. Lots of people were using Java and JAXB to autogenerate classes from the XML specification (cf autogenerating classes from a database schema) but the QTI model contained a class attribute on most elements to allow for stylesheet support. This class attribute prevented auto-generation because class is a reserved word in the Java language. Trying to fix this up after auto-generation would be madness but fixing it up before turns out to be a little tricky and this glitch seriously damaged the specification's user-experience. We got over it, but I'm wary now and when modelling OData I stepped back from a tighter binding, in part, to prevent hard to fix glitches like the use of Python reserved words as property names.

Allocating Storage

For this blog post I'm using a lightweight in-memory data storage implementation which can be automatically provisioned from the metadata document and I'm going to cheat by making a copy of the metadata document used by the Northwind service. Exposing OData the Pyslet way is a little more work if you already have a SQL database containing your data because I don't have a tool that auto-generates the metadata document from the SQL database schema. Automating the other direction is easy, but more on that in Part III.

I used my web browser to grab a copy of http://services.odata.org/V2/Northwind/Northwind.svc/$metadata and saved it to a file called Northwind.xml. I can then load the model from the interpreter:

>>> import pyslet.odata2.metadata as edmx
>>> doc=edmx.Document()
>>> f=open('Northwind.xml')
>>> doc.Read(f)
>>> f.close()

This Document class ensures that the model is loaded with Pyslet's special element implementations. The Products entity set can be looked up directly but at the moment it's empty!

>>> productSet=doc.root.DataServices['ODataWeb.Northwind.Model.NorthwindEntities.Products']
>>> products=productSet.OpenCollection()
>>> len(products)
0
>>> products.close()

This isn't surprising: there is nothing in the metadata model itself which binds it to the data service at services.odata.org. The model isn't linked to any actual storage for the data. By default, the model behaves as if it is bound to an empty read-only data store.

To help me validate that my API can be used for something other than talking to real OData services I've created an object that provisions storage for an EntityContainer (that's like a database in OData) using standard Python dictionaries. By passing the definition of an EntityContainer to the object's constructor I create a binding between the model and this new data store.

>>> from pyslet.odata2.memds import InMemoryEntityContainer
>>> container=InMemoryEntityContainer(doc.root.DataServices['ODataWeb.Northwind.Model.NorthwindEntities'])
>>> collection=productSet.OpenCollection()
>>> len(collection)
0

The collection of products is still empty but it is now writeable. I'm going to cheat again to illustrate this by borrowing some code from the previous blog post to open an OData client connected to the real Northwind service.

>>> from pyslet.odata2.client import Client
>>> c=Client("http://services.odata.org/V2/Northwind/Northwind.svc/")
>>> nwProducts=c.feeds['Products'].OpenCollection()

Here's a simple loop to copy the products from the real service into my own collection. It's a bit clumsy in the interpreter but careful typing pays off:

>>> for nwProduct in nwProducts.itervalues():
...   product=collection.CopyEntity(nwProduct)
...   product.SetKey(nwProduct.Key())
...   collection.InsertEntity(product)
... 
>>> len(collection)
77

To emphasise the difference between my in-memory collection and the live OData service I'll add another record to my copy of this entity set. Fortunately most of the fields are marked as Nullable in the model so to save my fingers I'll just set those that aren't.

>>> product=collection.NewEntity()
>>> product.SetKey(100)
>>> product['ProductName'].SetFromValue("The one and only Pyslet")
>>> product['Discontinued'].SetFromValue(False)
>>> collection.InsertEntity(product)
>>> len(collection)
78

Now I can do everything I could with the OData client using my copy of the service. I'll filter the entities to make it easier to see:

>>> import pyslet.odata2.core as core
>>> filter=core.CommonExpression.FromString("substringof('one',ProductName)")
>>> collection.Filter(filter)
>>> for p in collection.itervalues(): print p.Key(), p['ProductName'].value
... 
21 Sir Rodney's Scones
32 Mascarpone Fabioli
100 The one and only Pyslet

I can access my own data store using the same API that I used to access a remote OData service in the previous post. In that post, I also claimed that it was easy to wrap my own implementations of this API to expose it as an OData service.

Exposing an OData Server

My OData server class implements the wsgi protocol so it is easy to link it up to a simple http server and tell it to handle a single request.

>>> from pyslet.odata2.server import Server
>>> server=Server("http://localhost:8081/")
>>> server.SetModel(doc)
>>> from wsgiref.simple_server import make_server
>>> httpServer=make_server('',8081,server)
>>> httpServer.handle_request()

My interpreter session is hanging at this point waiting for a single HTTP connection. The Northwind service doesn't have any feed customisations on the Products feed and, as we slavishly copied it, the Atom-view in the browser is a bit boring so I used the excellent JSONView plugin for Firefox and the following URL to hit my service:

http://localhost:8081/Products?$filter=substringof('one',ProductName)&$orderby=ProductID desc&$format=json

This is the same filter as I used in the interpreter before but I've added an ordering and specified my preference for JSON format. Here's the result.

As I did this, Python's simple server object logged the following output to my console:

127.0.0.1 - - [24/Feb/2014 11:17:05] "GET /Products?$filter=substringof(%27one%27,ProductName)&$orderby=ProductID%20desc&$format=json HTTP/1.1" 200 1701
>>>

The in-memory data store is a bit of a toy, though some more useful applications might be possible. In the OData documentation I go through a tutorial on how to create a lightweight memory-cache of key-value pairs exposed as an OData service. I'm not really suggesting using it in a production environment to replace memcached. What this implementation is really useful for is developing and testing applications that consume the DAL API without needing to be connected to the real data source. Also, it can be wrapped in the OData Server class as shown above and used to provide a more realistic mock of an actual service for testing that your consumer application still works when the data service is remote. I've used it in Pyslet's unit-tests this way.

In the third and final part of this Python and OData series I'll cover a more interesting implementation of the API using the SQLite database.

2014-02-12

A Dictionary-like Python interface for OData

Overview

This blog post introduces some new modules that I've added to the Pyslet package I wrote. Pyslet's purpose is providing support for Standards for Learning, Education and Training in Python. The new modules implement the OData protocol by providing a dictionary-like interface. You can download pyslet from the QTIMigration Tool & Pyslet home page. There is some documentation linked from the main Pyslet wiki. This blog article is as good a way as any to get you started.

The Problem

Python has a database API which does a good job but it is not the whole solution for data access. Embedding SQL statements in code, grappling with the complexities of parameterization and dealing with individual database quirks makes it useful to have some type of layer between your web app and the database API so that you can tweak your code as you move between data sources.

If SQL has failed to be a really interoperable standard then perhaps OData, the new kid on the block, can fill the vacuum. The standard is sometimes referred to as "ODBC over the web" so it is definitely in this space (after all, who runs their database on the same server as their web app these days?).

My Solution

To solve this problem I decided to set about writing my own data access layer that would be modeled on the conventions of OData but that used some simple concepts in Python. I decided to go down the dictionary-like route, rather than simulating objects with attributes, because I find the code more transparent that way. Implementing methods like __getitem__, __setitem__ and itervalues keeps the data layer abstraction at arms length from the basic python machinery. It is a matter of taste. See what you think.

The vision here is to write a single API (represented by a set of base classes) that can be implemented in different ways to access different data sources. There are three steps:

  1. An implementation that uses the OData protocol to talk to a remote OData service.
  2. An implementation that uses python dictionaries to create a transient in-memory data service for testing.
  3. An implementation that uses the python database API to access a real database.

This blog post is mainly about the first step, which should validate the API as being OData-like and set the groundwork for the others which I'll describe in subsequent blog posts. Incidentally, it turns out to be fairly easy to write an OData server that exposes a data service written to this API, more on that in future posts.

Quick Tutorial

The client implementation uses Python's logging module to provide logging. To make it easier to see what is going on during this walk through I'm going to turn logging up from the default "WARN" to "INFO":

>>> import logging
>>> logging.basicConfig(level=logging.INFO)

To create a new OData client you simply instantiate a Client object passing the URL of the OData service root. Notice that, during construction, the Client object downloads the list of feeds followed by the metadata document. The metadata document is used extensively by this module and is loaded into a DOM-like representation.

>>> from pyslet.odata2.client import Client
>>> c=Client("http://services.odata.org/V2/Northwind/Northwind.svc/")
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/ HTTP/1.1
INFO:root:Finished Response, status 200
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/$metadata HTTP/1.1
INFO:root:Finished Response, status 200

Client objects have a feeds attribute that is a plain dictionary mapping the exposed feeds (by name) onto EntitySet objects. These objects are part of the metadata model but serve a special purpose in the API as they can be opened (a bit like files or directories) to gain access to the (collections of) entities themselves. Collection objects can be used in the with statement and that's normally how you'd use them but I'm sticking with the interactive terminal for now.
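
Just to show the form, the with-statement version of the first loop below would look something like this, closing the collection automatically when the block exits:

from pyslet.odata2.client import Client

c = Client("http://services.odata.org/V2/Northwind/Northwind.svc/")
with c.feeds['Products'].OpenCollection() as products:
    for p in products:
        print p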

>>> products=c.feeds['Products'].OpenCollection()
>>> for p in products: print p
... 
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products HTTP/1.1
INFO:root:Finished Response, status 200
1
2
3
... [and so on]
...
20
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products?$skiptoken=20 HTTP/1.1
INFO:root:Finished Response, status 200
21
22
23
... [and so on]
...
76
77

The products collection behaves like a dictionary, iterating through it iterates through the keys in the dictionary. In this case these are the keys of the entities in the collection of products in Microsoft's sample Northwind data service. Notice that the client logs several requests to the server interspersed with the printed output. That's because the server is limiting the maximum page size and the client is following the page links provided. These calls are made as you iterate through the collection allowing you to iterate through very large collections without loading everything in to memory.

The keys alone are of limited interest, let's try a similar loop but this time we'll print the product names as well:

>>> for k,p in products.iteritems(): print k,p['ProductName'].value
... 
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products HTTP/1.1
INFO:root:Finished Response, status 200
1 Chai
2 Chang
3 Aniseed Syrup
...
...
20 Sir Rodney's Marmalade
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products?$skiptoken=20 HTTP/1.1
INFO:root:Finished Response, status 200
21 Sir Rodney's Scones
22 Gustaf's Knäckebröd
23 Tunnbröd
...
...
76 Lakkalikööri
77 Original Frankfurter grüne Soße

Sir Rodney's Scones sound interesting, we can grab an individual record just as we normally would from a dictionary, by using its key.

>>> scones=products[21]
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products(21) HTTP/1.1
INFO:root:Finished Response, status 200
>>> for k,v in scones.DataItems(): print k,v.value
... 
ProductID 21
ProductName Sir Rodney's Scones
SupplierID 8
CategoryID 3
QuantityPerUnit 24 pkgs. x 4 pieces
UnitPrice 10.0000
UnitsInStock 3
UnitsOnOrder 40
ReorderLevel 5
Discontinued False

The scones object is an Entity object. It too behaves like a dictionary. The keys are the property names and the values are one of SimpleValue, Complex or DeferredValue. In the snippet above I've used a variation of iteritems which iterates only through the data properties, excluding the navigation properties. In this model, there are no complex properties. The simple values have a value attribute which contains a python representation of the value.

Deferred values (navigation properties) can be used to navigate between Entities. Although deferred values can be opened just like EntitySets, if the model dictates that at most 1 entity can be linked a convenience method called GetEntity can be used to open the collection and read the entity in one call. In this case, a product can have at most one supplier.

>>> supplier=scones['Supplier'].GetEntity()
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products(21)/Supplier HTTP/1.1
INFO:root:Finished Response, status 200
>>> for k,v in supplier.DataItems(): print k,v.value
... 
SupplierID 8
CompanyName Specialty Biscuits, Ltd.
ContactName Peter Wilson
ContactTitle Sales Representative
Address 29 King's Way
City Manchester
Region None
PostalCode M14 GSD
Country UK
Phone (161) 555-4448
Fax None
HomePage None

Continuing with the dictionary-like theme, attempting to load a non-existent entity results in a KeyError:

>>> p=products[211]
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products(211) HTTP/1.1
INFO:root:Finished Response, status 404
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pyslet/odata2/client.py", line 165, in __getitem__
    raise KeyError(key)
KeyError: 211

Finally, when we're done, it is a good idea to close the open collection. If we'd used the with statement this step would have been done automatically for us of course.

>>> products.close()

Limitations

Currently the client only supports OData version 2. Version 3 has now been published and I do intend to update the classes to speak version 3 at some point. If you try and connect to a version 3 service the client will complain when it tries to load the metadata document. There are ways around this limitation, if you are interested add a comment to this post and I'll add some documentation.

The client only speaks XML so if your service only speaks JSON it won't work at the moment. Most of the JSON code is done and tested so adding it shouldn't be a big issue if you are interested.

The client can be used to both read and write to a service, and there are even ways of passing basic authentication credentials. However, if calling an https URL it doesn't do certificate validation at the moment so be warned as your security could be compromised. Python 2.7 does now support certificate validation using OpenSSL so this could change quite easily I think.

Moving to Python 3 is non-trivial - let me know if you are interested. I have taken the first steps (running unit tests with "python -3Wd" to force warnings) and, as much as possible, the code is ready for migration. I haven't tried it yet though and I know that some of the older code (we're talking 10-15 years here) is a bit sensitive to the raw/unicode string distinction.

The documentation is currently about 80% accurate and only about 50% useful. Trending upwards though.

Downloading and Installing Pyslet

Pyslet is pure-python. If you are only interested in OData you don't need any other modules, just Python 2.7 and a reasonable setuptools to help you install it. I just upgraded my machine to Mavericks which effectively reset my Python environment. Here's what I did to get Pyslet running.

  1. Installed setuptools
  2. Downloaded the pyslet package tgz and unpacked it (download from here)
  3. Ran python setup.py install

Why?

Some lessons are hard! Ten years or so ago I wrote a migration tool to convert QTI version 1 to QTI version 2 format. I wrote it as a Python script and used it to validate the work the project team were doing on the version 2 specification itself. Realising that most people holding QTI content weren't able to easily run a Python script (especially on Windows PCs) my co-chair Pierre Gorissen wrote a small Windows-wrapper for the script using the excellent wxPython and published an installer via his website. From then on, everyone referred to it as "Pierre's migration tool". I'm not bitter, the lesson was clear. No point in writing the tool if you don't package it up in the way people want to use it.

This sentiment brings me to the latest developments with the tool. A few years back I wrote (and blogged about) a module for writing Basic LTI tools in Python. I did this partly to prove that LTI really was simple (I wrote the entire module on a single flight to the US) but also because I believed that the LTI specification was really on to something useful. LTI has been a huge success and offers a quick route for tool developers to gain access to users of learning management systems. It seems obvious that the next version of the QTI Migration Tool should be an LTI tool but moving from a desktop app to a server-based web-app means that I need a data access layer that can persist data and be smarter about things like multiple threads and processes.

2012-12-05

Writing a stream to a zipfile in Python, harder than you think!

So here's the problem, you have a stream (a file-like object) in Python and you want to spool the contents of it into a zip archive. Sounds like a common requirement? It turns out to be very hard. I propose a solution here with hooks.

There are two methods for writing data to a zip file in the Python zipfile module.

ZipFile.write(filename[, arcname[, compress_type]])

and

ZipFile.writestr(zinfo_or_arcname, bytes[, compress_type])

The first takes the name of a file, opens it and spools the contents into the archive in 8K chunks. Sounds like a good fit for what I want except that I have a file-like object, not a file name, and ZipFile.write won't accept that. I could create a temporary file on disk and write my data to that, then pass the name of the file instead but that supposes (a) that I have access to the file system for writing and (b) I don't mind spooling the data twice, once to the disk and once back out again for storage in the archive.

Before you protest, the ZipFile object only requires a file-like object with support for seek and tell, it doesn't actually have to be a file in the file system so (a) is still a valid scenario. We will have to ditch any clever ideas of spooling a zip file directly over network connections though. A closer look at the implementation shows us that once the data has been compressed and written out to the archive the stream is wound back to the archive entry's header to update information about the compressed and uncompressed sizes. Still, even if you are buffering the output at least you are dealing with the smaller compressed data and not the original uncompressed source.

So if ZipFile.write doesn't work for streams what about using ZipFile.writestr instead? This takes the data as a string of bytes (in memory). For larger files this is unlikely to be practicable. I did wonder about tricking this method with a string-like object but even if I could do this the method will still attempt to create an ordinary string with the entire compressed data which won't work for large streams.

Solution 1

The first solution is taken from a suggestion on StackOverflow. The idea is to wrap the ZipFile object and write a new method. Clearly that would be something good for the module maintainers to consider but it requires considerable copying of code. If I'm going to be so dependent on the internals of the ZipFile object implementation I might as well look to see if there is a better way.

Solution 2

Looking at the ZipFile implementation the write method is clearly very close to what I want to do. If only it would accept a file-like object! A closer look reveals that it only does two things with the passed filename. It calls os.stat and then, shortly afterwards, calls open to get a file-like object.

This got me thinking about whether or not I could trick the write method into accepting something other than the name of a file. I created an object (which I called a VirtualFilePath) and gave it a stat and open method. The implementation is not important, but this object essentially wraps my file-like object simulating these two operating system functions.

Unfortunately, I can't pass a VirtualFilePath to the operating system open function. I'll get an error that it wasn't expecting an instance. The same goes for os.stat. However, I can write hooks to intercept these calls and redirect the calls to my methods if the argument is a VirtualFilePath. This is basically what my solution looks like:

import os, __builtin__

stat_pass = os.stat
open_pass = __builtin__.open

def stat_hook(path):
    if isinstance(path, VirtualFilePath):
        return path.stat()
    else:
        return stat_pass(path)

def open_hook(path, *params):
    if isinstance(path, VirtualFilePath):
        return path.open(*params)
    else:
        return open_pass(path, *params)

class ZipHooks(object):
    hookCount = 0

    def __init__(self):
        if not ZipHooks.hookCount:
            os.stat = stat_hook
            __builtin__.open = open_hook
        ZipHooks.hookCount += 1

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.Unhook()

    def Unhook(self):
        ZipHooks.hookCount -= 1
        if not ZipHooks.hookCount:
            os.stat = stat_pass
            __builtin__.open = open_pass

This code adds hooks which detect my VirtualFilePath object when it is passed to open or stat and redirects those calls. To make it easier to manage the hooks we create a ZipHooks object with __enter__ and __exit__ methods allowing it to be used in a 'with' statement like this:

with ZipHooks() as zh:
    # add stuff to an archive using VirtualFilePath here

There's one final detail to clear up. stat is supposed to return the size of the file but what if I don't know it because I'm reading data from a stream? In fact, closer inspection of the ZipFile.write method's implementation shows that it doesn't really rely on the size returned by stat as it monitors both compressed and uncompressed sizes and re-stuffs the header when it back-tracks.

The only other bits of stat that ZipFile.write is interested in are the modification date of the file and the mode (which it uses to determine if the file is really a directory). So if your file-like object isn't very file-like at all it won't matter too much because you only have to fake these fields in the stat result.
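
Pulling those observations together, here is a minimal sketch of the sort of VirtualFilePath that works; the exact stat fields follow the reasoning above and a real implementation can do a lot more:

import os, stat, time

class VirtualFilePath(object):

    def __init__(self, fileobj, size=0, mtime=None):
        self.fileobj = fileobj
        self.size = size    # a guess is fine, ZipFile re-stuffs the header anyway
        self.mtime = int(mtime if mtime is not None else time.time())

    def stat(self):
        # st_mode, st_size and st_mtime are the only fields ZipFile.write uses
        return os.stat_result((stat.S_IFREG | 0644, 0, 0, 1, 0, 0,
                               self.size, self.mtime, self.mtime, self.mtime))

    def open(self, *params):
        return self.fileobj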

2012-06-05

Lion, wxPython and py2app: targeting Carbon & Cocoa

About this time last year I wrote a blog entry on installing wxPython and py2app on my Mac running Snow Leopard. Well I have since upgraded to Lion and the same installation seemed to keep working just fine. This weekend I've actually upgraded to a new Mac and I thought this would be an excellent opportunity to grapple with this installation again and perhaps advance my understanding of what I'm doing.

Apple's Migration Assistant (a.k.a. Setup Assistant) had other ideas. You have to hand it to them, it took about 2-3 hours with the two machines plugged together and everything was copied across and working without any intervention. They really do make it easy to buy a new Mac and get productive straight away.

So is this blog post the Python equivalent of the "pass" statement? Just a no-op?

Well not quite. At the moment I'm building my QTI migration tool using the Carbon version of wxPython, which means forcing Python to work in 32-bit mode. That's getting a bit outdated now and means I can't take full advantage of my new 8GB Mac. I need to embrace 64-bit builds, which means figuring out how to build for the Cocoa version of wxPython while retaining the ability to create 32-bit builds for older hardware and older versions of Mac OS.

So here is how I now recommend doing this...

Step 1: Install Python

The lesson from last time was that you can't rely on the python versions installed by Apple to do these builds for you. If you run python on a clean Lion install you'll get a 64-bit version of python 2.7.1:

$ which python
/usr/bin/python
$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys;"%X"%sys.maxsize
'7FFFFFFFFFFFFFFF'

Thanks to How to tell if my python shell is executing in 32bit or 64bit mode? for the tip on printing maxsize.

I want to keep python pointing here because the command-line version of the migration tool and the supporting Pyslet package (which can be used independently) need to work 'out-of-the-box'. The Migration Assistant had helpfully copied over my .bash_profile, which included modifications made by my custom Mac binary install of Python 2.7 last year. The modifications are well commented and help to explain the different paths we'll be dealing with:

# Setting PATH for Python 2.7
# The orginal version is saved in .bash_profile.pysave
PATH="/Library/Frameworks/Python.framework/Versions/2.7/bin:${PATH}"
export PATH

Firstly, note that the Mac binaries install in /Library/Frameworks/ whereas the pre-loaded Apple installations are all in /System/Library/Frameworks/. This is a fairly subtle difference so a little bit of concentration is required to prevent mistakes. Anyway, as per the instructions above I restored my .bash_profile from .bash_profile.pysave and confirmed (as above) that I was getting the Apple python.

It seems like 2.7.3 is the latest version available as a Mac binary build from the main python download page. This will make it a bit easier to check I'm running the right interpreter! So I downloaded the dmg from the following link and ran the installer: http://www.python.org/ftp/python/2.7.3/python-2.7.3-macosx10.6.dmg. For me this was an upgrade rather than a clean install. The resulting binaries are put on the path in /usr/local/bin. By default, the interpreter runs in 64bit mode but it can be invoked in 32-bit mode too:

$ /usr/local/bin/python
Python 2.7.3 (v2.7.3:70274d53c1dd, Apr  9 2012, 20:52:43) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys;"%X"%sys.maxsize
'7FFFFFFFFFFFFFFF'
$ which python-32
/usr/local/bin/python-32
$ python-32
Python 2.7.3 (v2.7.3:70274d53c1dd, Apr  9 2012, 20:52:43) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys;"%X"%sys.maxsize
'7FFFFFFF'

This might be enough, but the wxPython home page does recommend that you use different python installations if you want to run both the Carbon and Cocoa versions, so I'll repeat the installation with a python 2.6 build. The current binary build is 2.6.6; this is missing some important security fixes that were included in 2.6.8, but it looks safe enough for the migration tool. I downloaded the 2.6 installer from here. When I ran the installer I made sure I only installed the framework; I don't want everything else getting in the way.

I now have a python 2.6 installation in /Library/Frameworks/Python.framework/Versions/2.6:

$ /Library/Frameworks/Python.framework/Versions/2.6/bin/python2.6
Python 2.6.6 (r266:84374, Aug 31 2010, 11:00:51) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys;"%X"%sys.maxsize
'7FFFFFFF'

Notice that this is a 32-bit build. Apple actually ship a 64-bit build of 2.6.7, so care will be needed: typing python2.6 at the terminal will not bring up this new installation.

To make life a little bit easier I always create a bin directory in my home directory and add it to the path in my .bash_profile using lines like these:

PATH=~/bin:${PATH}
export PATH

This is going to come in very handy in the next step.

Step 2: setuptools and easy_install

setuptools is required by lots of Python packages. It is designed to make your life very easy, but it takes a bit of fiddling to get it working with these custom installations. It's distributed as an egg which, in this case, can be run directly from the command line. I'll show you the process of installing it on python 2.6; the instructions for putting it in 2.7 are almost identical (it's just a different egg).

I downloaded the egg from here: http://pypi.python.org/packages/2.6/s/setuptools/setuptools-0.6c11-py2.6.egg#md5=bfa92100bd772d5a213eedd356d64086 and then took a peek at the top of the script:

$ head -n 8 setuptools-0.6c11-py2.6.egg 
#!/bin/sh
if [ `basename $0` = "setuptools-0.6c11-py2.6.egg" ]
then exec python2.6 -c "import sys, os; sys.path.insert(0, os.path.abspath('$0')); from setuptools.command.easy_install import bootstrap; sys.exit(bootstrap())" "$@"
else
  echo $0 is not the correct name for this egg file.
  echo Please rename it back to setuptools-0.6c11-py2.6.egg and try again.
  exec false
fi

I've only shown the top 8 lines here; the rest is binary-encoded gibberish. The thing to notice is that, on line 3, it invokes python2.6 directly, so if I want to control which python installation setuptools is installed into I need to ensure that python2.6 invokes the correct interpreter. That's where my local bin directory and path manipulation come in handy.

$ cd ~/bin
$ ln -s /Library/Frameworks/Python.framework/Versions/2.6/bin/python2.6 python2.6
$ ln -s /Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 python2.7

Now for me, and anyone who inherits my $PATH, invoking python2.6 will start my custom MacPython install.

$ sudo -l python2.6
/Users/swl10/bin/python2.6

Fortunately sudo is configured to inherit my environment. It was worth checking as this is configurable. I can now install setuptools from the egg:

$ sudo sh setuptools-0.6c11-py2.6.egg 
Password: [I had to type my root password here]
Processing setuptools-0.6c11-py2.6.egg
Copying setuptools-0.6c11-py2.6.egg to /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages
...

I did the same for python 2.7 (note that you need a different egg) and then added links to easy_install to my bin directory:

$ cd ~/bin
$ ln -s /Library/Frameworks/Python.framework/Versions/2.7/bin/easy_install easy_install-2.7
$ ln -s /Library/Frameworks/Python.framework/Versions/2.6/bin/easy_install easy_install-2.6

Step 3: wxPython

wxPython is a binary installer tied to a particular python version. However, I believe it uses a scatter-gun approach to search for python installations and will install itself everywhere with a single click. That is the reason why it is better to run completely separate python installations if you want completely different versions of wxPython. If you do find yourself with multiple installs there is a wiki page that explains how to switch between versions, but a little playing reveals that this refers to Major.Minor version numbers. It can't cope with the subtlety of switching between builds, or between Carbon and Cocoa, as far as I can tell, so it won't help us here.

My plan is to install the Carbon wxPython (which is 32bit only) for python 2.6 and the newer Cocoa wxPython for python 2.7. The wxPython download page has a stable and unstable version but to get Cocoa I'll need to use the unstable version. The stability referred to is that of the API, rather than the quality of the code. Being cautious I downloaded the stable 2.8 (Carbon) installer for python 2.6 and the unstable 2.9 Cocoa installer for python 2.7. Installation is easy but look out for a useful script on the disk image which allows you to review and manage your installations. To invoke the script you can just run it from the command line:

$ /Volumes/wxPython2.9-osx-2.9.3.1-cocoa-py2.7/uninstall_wxPython.py

When I was done with the installations it reported the following configurations as being present:

  1.  wxPython2.8-osx-unicode-universal-py2.6     2.8.12.1
  2.  wxPython2.9-osx-cocoa-py2.7                 2.9.3.1

(If, like me, you are upgrading from previous installations you may have to clean up older builds here.) At this point I tested my wxPython based programs and confirmed that they were working OK. I was impressed that the Cocoa version seems to work unchanged.

Step 4: py2app

With the groundwork done right, the last step is very simple. The symlinks we put in for easy_install make it easy to install py2app.

$ sudo easy_install-2.6 -U py2app
Password: [type your root password here]
Searching for py2app
Reading http://pypi.python.org/simple/py2app/
Reading http://undefined.org/python/#py2app
Reading http://bitbucket.org/ronaldoussoren/py2app
Best match: py2app 0.6.4

...[snip]...

Installed /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/altgraph-0.9-py2.6.egg
Finished processing dependencies for py2app

The plot is very similar for python 2.7.

I've now added the following to my setup script:

import wx
if "cocoa" in wx.version():
  suffix="-Cocoa"
else:
  suffix="-Carbon"

The suffix is then appended to the name passed to the setup call itself, as in the sketch below.
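
Continuing the snippet above, the setup call might look something like this (the application name and main script are placeholders, not my actual setup script):

from setuptools import setup  # py2app plugs into the standard setup machinery

setup(
    name="QTIMigration" + suffix,   # hypothetical name; suffix comes from the snippet above
    app=["qtimigration.py"],        # hypothetical main script
    setup_requires=["py2app"],
)

The following command results in a 32bit, Carbon binary, compatible with OS X 10.3 onwards.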

$ python2.6 setup.py py2app

This command creates a Cocoa-based 64-bit binary for 10.5 and later.

$ python2.7 setup.py py2app

And that is how to target both Carbon and Cocoa in your wxPython projects.

2012-05-22

Common Cartridge, Namespaces and Dependency Injection

This post is about coping with a significant change to the newer (public) versions of the IMS Common Cartridge specification.  This change won't affect everyone the same way; your implementation may just shrug it off.  However, I found I had to make an important change to the QTI migration tool code to make it possible to read QTI version 1 files from the newer form of cartridges.

There have been three versions of this specification now, versions 1.0, 1.1 and most recently version 1.2.  The significant change for me was between versions 1.0 (published October 2008) and 1.1 (revised May 2011).

Changing Namespaces

The key change between 1.0 and 1.1 was to the namespaces used in the XML files.  In version 1.0, the manifest file uses the default namespace defined for content packaging elements: http://www.imsglobal.org/xsd/imscp_v1p1.

Content Packaging has also been through several revisions.  The v1p1 namespace (above) was defined in the widely used Content Packaging 1.1 (now on revision 4).  The same namespace was used for most of the elements in the public draft of the newer IMS Content Packaging version 1.2 specification too.  In that case, the decision was made to augment the revised specification with a new schema containing definitions of the new elements only.  The existing elements stay in the 1.1 namespace to ensure that tools that recognise version 1.1 packages continue to work, ignoring the unrecognised extension elements.

Confusingly though, the schema definition provided with the content packaging specification is located here: http://www.imsglobal.org/xsd/imscp_v1p1.xsd whereas the schema definition provided with the common cartridge specification (1.0), for the same namespace, is located here: http://www.imsglobal.org/profile/cc/ccv1p0/derived_schema/imscp_v1p2.xsd.  That's two different definition files for the same namespace.  Given this discrepancy it is not surprising that newer revisions of common cartridge have chosen to use a new namespace entirely.  In the case of 1.1, the namespace used for the basic content packaging elements was changed to http://www.imsglobal.org/xsd/imsccv1p1/imscp_v1p1.

But this decision is not without consequences.  The decision to retain a consistent namespace across the various revisions of the Content Packaging specification enabled existing tools to continue working.  Conversely, the decision to change the namespace in Common Cartridge means that some tools will stop working, including the Python libraries used in the QTI migration tool.

From Parser to Python Class

In the early days of XML, you could identify an element within a document by its name, scoped perhaps by the PUBLIC identifier given in the document type definition.  The disadvantage being that all elements had to be defined in the same scope.  Namespace prefixes were used to help sort this mess out.  A namespace aware parser splits off the namespace prefix (everything up to the colon) from the element name and uses it to identify the element by a pair of strings: the namespace (a URI) and the remainder of the element name.

The XML parser at the heart of my python libraries uses these namespace/name pairs as keys into a dictionary, looking up the class object it should use to represent each element.  The advantage of this approach is that I can add behaviour to the XML elements when they are deserialized from their XML representations through the methods defined on the corresponding classes.  Furthermore, a rich class hierarchy can be defined, allowing concepts such as XHTML's organization of elements into groups like 'inline elements' to be represented directly in the class hierarchy.

If I need two different XML definitions to map to the same class I can easily do this by adding multiple entries to the dictionary that point to the same class.  So at first glance I seem to have avoided some of the problems inherent in tight coupling of classes.  The following two elements could be mapped to the same Manifest class in my program:

('http://www.imsglobal.org/xsd/imscp_v1p1', 'manifest')
('http://www.imsglobal.org/xsd/imsccv1p1/imscp_v1p1', 'manifest')
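
To give a feel for the idea, the dictionary might contain entries like this (a sketch only, not Pyslet's actual internals; XMLElement stands for a hypothetical generic fallback class):

classMap = {
    ('http://www.imsglobal.org/xsd/imscp_v1p1', 'manifest'): Manifest,
    ('http://www.imsglobal.org/xsd/imsccv1p1/imscp_v1p1', 'manifest'): Manifest,
}

def classFor(ns, name):
    # the parser looks up the (namespace, name) pair to pick a class;
    # unknown elements fall back to a generic element class
    return classMap.get((ns, name), XMLElement)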

This would work fine when reading the manifest from the XML stream but what about writing manifests?  How does my Manifest class know which namespace to use when I'm creating a new manifest?  The following code snippet from the python interpreter shows me creating an instance of a Manifest (I pass None as the element's parent).  The instance knows which namespace it should be in:

>>> import pyslet.imscpv1p2 as cp
>>> m=cp.Manifest(None)
>>> print m

<manifest xmlns="http://www.imsglobal.org/xsd/imscp_v1p1">
 <organizations/>
 <resources/>
</manifest>

This clearly won't work for the new common cartridges.  The Manifest class 'knows' which namespace it is supposed to be in because its canonical XML name is provided as a class attribute on its definition.  The obvious solution is to wrap the class with a special Common Cartridge Manifest that overrides this attribute.  That is relatively easy to do; here is the updated definition:

class Manifest(cp.Manifest):
    XMLNAME=("http://www.imsglobal.org/xsd/imsccv1p1/imscp_v1p1",'manifest')

Unfortunately, this doesn't do enough.  Continuing in the python interpreter...

>>> class Manifest(cp.Manifest):
...     XMLNAME=("http://www.imsglobal.org/xsd/imsccv1p1/imscp_v1p1",'manifest')
... 
>>> m=Manifest(None)
>>> print m

<manifest xmlns="http://www.imsglobal.org/xsd/imsccv1p1/imscp_v1p1">
    <organizations xmlns="http://www.imsglobal.org/xsd/imscp_v1p1"/>
    <resources xmlns="http://www.imsglobal.org/xsd/imscp_v1p1"/>
</manifest>

Now we've got the namespace correct on the manifest but the required organizations and resources elements are still created in the old namespace.

The Return of Tight Coupling

If I'm going to fix this issue I'm going to have to wrap the classes used for all the elements in the Content Packaging specification.  That sounds like a bit of a chore, but remember that the namespace changed because Common Cartridge adds some additional constraints to the specification, so we're likely to have to override at least some of the behaviours too.

Unfortunately, wrapping the classes still isn't enough.  In the above example the organizations and resources elements are required children of the manifest.  So when I created my instance of the Manifest class the Manifest's constructor needed to create instances of the related Organizations and Resources classes, and it does this using the default implementations, not the wrapped versions I've defined in my Common Cartridge module.  This is known as tight coupling, and the remedy is dependency injection.  For a more comprehensive primer on common solutions to this pattern you could do worse than read Martin Fowler's article Inversion of Control Containers and the Dependency Injection pattern.

The important point here is that the logic inside my Manifest class, including the logic that takes place during construction, needs to be decoupled from the decision to use a particular class object to instantiate the Organizations and Resources elements.  These dependencies need to be injected into the code somehow.

I must admit, I find the example solutions in Java frameworks confusing because the additional coding required to satisfy the compiler makes it harder to see what is really going on.  There aren't many good examples of how to solve the problem in python.  The python wiki points straight to an article called Dependency Injection The Python Way, but this article describes a full-blown feature broker (similar to the service locator pattern), which seems like overkill for my coupling problem.

A simpler solution is to pass dependencies in (in my case, on the constructor), following a pattern similar to the one in this blog post.  In fact, that poster is trying to solve a related problem of module-level dependency, but the basic idea is the same: I could pass the wrapped class objects to the constructor.
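
Applied to my case, constructor injection would look something like this sketch (the parameter names are illustrative):

class Manifest:
    def __init__(self, parent, organizationsClass=Organizations,
                 resourcesClass=Resources):
        # callers (or subclasses) can inject alternative classes,
        # defaulting to the base implementations
        self.Organizations = organizationsClass(self)
        self.Resources = resourcesClass(self)

This works, but every piece of code that creates a Manifest has to remember to pass the right classes down, which is why I settled on the variation below.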

Dependency Injection using Class Attributes

The spirit of the python language is certainly one of adopting the simplest solution that solves the problem.  So here is my dependency injection solution to this specific case of tight coupling.

I start by adding class attributes to set class dependencies.  My base Manifest class now looks something like this:

class Manifest:
    XMLNAME=("http://www.imsglobal.org/xsd/imscp_v1p1",'manifest')
    MetadataClass=Metadata
    OrganizationsClass=Organizations
    ResourcesClass=Resources

    # method definitions and other attributes follow...

And in my Common Cartridge module it is overridden like this:

class Manifest(cp.Manifest):
    XMLNAME=("http://www.imsglobal.org/xsd/imsccv1p1/imscp_v1p1",'manifest')
    MetadataClass=Metadata
    OrganizationsClass=Organizations
    ResourcesClass=Resources

Although these look similar, in the first case the Metadata, Organizations and Resources names refer to classes in the base Content Packaging module whereas in the second definition they refer to overrides in the Common Cartridge Module (note the use of cp.Manifest to select the base class from the original Content Packaging module).

Now the original Manifest's constructor is modified to use these class attributes to create the required child elements:

    def __init__(self,parent):
        self.Metadata=None
        # the class attributes determine which classes are instantiated,
        # so subclasses can substitute their own implementations
        self.Organizations=self.OrganizationsClass(self)
        self.Resources=self.ResourcesClass(self)

The upshot is that when I create an instance of the Common Cartridge Manifest I don't need to override the constructor just to solve the dependency problem. The base class constructor will now create the correct Organizations and Resources members using the overridden class attributes.
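
Assuming the overridden Metadata, Organizations and Resources classes declare their XMLNAME values in the new namespace, the earlier interpreter experiment should now give the hoped-for result:

>>> m=Manifest(None)
>>> print m

<manifest xmlns="http://www.imsglobal.org/xsd/imsccv1p1/imscp_v1p1">
    <organizations/>
    <resources/>
</manifest>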

I've abbreviated the code a bit, if you want to see the full implementation you can see it in the trunk of the pyslet framework.

2011-07-17

Using gencodec to make a custom character mapping

One of the problems I face in the QTI migration tool is markup that looks like this:

<mattext>The circumference of a circle diameter 1 is given by the mathematical constant: </mattext>
<mattext charset="greek">p</mattext>

In XML the charset used in a document is detected according to various rules, starting from information available before the XML stream is parsed and culminating in the encoding declaration in the XML declaration at the top of the file:

<?xml version = "1.0" encoding = "UTF-8">

For this reason, the use of the charset parameter in QTI version 1 is of limited value: at best it might provide a hint on an appropriate font to use when rendering the element.  This is not a huge problem these days, but when QTI v1 was written it was common for document renderings to be peppered with large squares indicating that the selected font had no glyph for the required character.  These days renderers are smarter about selecting default fonts, enabling developers to display arbitrary unicode text.

So you would think that charset is redundant but there is one situation where we do need to take note: the symbol font. The problem is explained well in this article: Symbol font – Unicode alternatives for Greek and special characters in HTML.  The use of 'greek' in the QTI v1 examples is clearly intended to indicate use of the symbol font in a similar way - not the use of the 'greek' codepage in ISO-8859. The Symbol font is used a lot in older mathematical questions, you can play around with the codec on this neat little web page: Symbol font to Unicode converter.

According to the above article, the character representing the lower-case letter 'p', when rendered in the Symbol font, actually appears to the user like this: π - known as Greek small letter pi.

The problem for my Python script is that I need to map these characters to the target unicode forms before writing them out to the QTI version 2 file.   This is where the neat gencodec.py script comes in.  I don't know where this is documented other than in the gencodec source file itself.  But this is a very useful utility!

The synopsis of the tool is:

This script parses Unicode mapping files as available from the Unicode site (ftp://ftp.unicode.org/Public/MAPPINGS/) and creates Python codec modules from them.

So I downloaded the following mapping to a directory called 'codecs' on my laptop:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/SYMBOL.TXT

Then I ran the gencodec script:

$ python gencodec.py codecs pyslet
converting SYMBOL.TXT to pysletsymbol.py and pysletsymbol.mapping

And confirmed that the mapping was working using the interpreter:

$ python
Python 2.7.1 (r271:86882M, Nov 30 2010, 09:39:13) 
[GCC 4.0.1 (Apple Inc. build 5494)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('p','symbol')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: symbol
>>> import pysletsymbol
>>> reg=pysletsymbol.getregentry()
>>> import codecs
>>> def SymbolSearch(name):
...   if name=='symbol': return reg;
...   else: return None
... 
>>> codecs.register(SymbolSearch)
>>> unicode('p','symbol')
u'\u03c0'
>>> print unicode('p','symbol')
π

In previous versions of the migration tool I didn't include symbol font mapping because I thought it would be too laborious to create the mapping. I was wrong; future versions will do this mapping automatically.

2011-06-27

Amount of profanity in git commit messages per programming language

I spotted this blog page via a list I subscribe to the other day. Those sensitive to profanity should look away now; everyone else can see the stats here...

Amount of profanity in git commit messages per programming language

Given that C# and Java are similar in many ways and are often used for the same things it is amusing that they both induce exactly equal levels of profanity in their developer communities.

The figures for different languages are significantly different (with C++ being, it seems, the most sweary language to work in), so I feel like this data is trying to tell us something.

And who are the nicest people to program with? PHP developers it seems (with Python not far behind).

2011-06-24

Visual C++ Redistributable Licensing: I'm just not seeing it

As part of putting together the latest builds of the QTI Migration tool I have had to repackage the updated tool into a new installer.

The migration tool is written in python and uses the py2exe tool to convert the Python script into a set of binaries that can be distributed to other Windows systems as a ready-to-run application without requiring Python (and various other packages, including wxPython: used for the GUI) to be installed first.

The output of py2exe is a folder containing the executable and all its supporting files ready to package up.  Originally this was all done by Pierre, my co-chair of the QTI working group.  I'm happy to report that updating the installation scripts went fine and I've been able to create a new Windows Installer using InnoSetup.

There is a recipe for using py2exe with wxPython published on pythonlibrary.org called "A py2exe tutorial".  However, I did have one problem with this recipe: I too had trouble with MSVCP90.dll and needed the help of stackoverflow (thread: py2exe fails to generate an executable) to actually get the build going. Once done, I was concerned by the warning messages about the need to have a license to redistribute the DLL in my installer.  I found another blog post on distributing python apps for the windows platform which spelt out my options.  As I don't personally own a Visual Studio license, it seems I need to use the redistributable package which can be downloaded from Microsoft.

Unfortunately, when I download this file the license in the resulting installer does not appear compatible with packaging it into my installer for distribution with my tool.

Several people on the net seem to suggest that copying the DLL directly is off-limits but that the 'redistributable' package does exactly what it says on the tin.  Indeed, if you don't run the package it isn't clear what license you signed up to by downloading it, but once you run the installer it clearly says that "You may make one backup copy of the software.  You may use it only to reinstall the software." and that you may not "publish the software for others to copy".  So I've played safe and am crossing my fingers that my users will have already installed these wretched DLLs on their systems before they try the migration tool.

Previous versions of the migration tool installer were built by Pierre and he did have a Visual Studio license so could do the build and redistribute the software.

My experience, and the time I wasted trying to find an answer to this question, eventually turned up one discussion thread in which the complex issues faced by the team within Microsoft are exposed: see VC++ 2005 redistributable.  Although this thread is a little old now, the replies from Nikola Dudar are helpful in providing deeper insight into the issue and the conflict that having a chargeable development platform creates.  On one hand Microsoft would like it to be easy for people to create software for their platform, but they also have a paid-for development tool chain in Visual Studio.  Visual Studio Express edition (a free, lightweight development environment) appears to be suitable only for personal hobbyists and not for anyone wanting to build software for redistribution.  There are lots of replies to the above article but if you search down for "release team" there is a reply that emphasises the difficulty of finding the balance between paid and express editions, and a link to a blog post relating to the creation of the free-to-download redistributable packages.  I like these types of forum discussions as they show that even 'evil empires' like Microsoft are full of ordinary people just trying to do their jobs.

2011-06-07

Reports of Mono's Death are Greatly Exaggerated...

This post was provoked by fears that following the acquisition of Novell by Attachmate the Mono project faces an uncertain future.  I've documented my thoughts on the Java/C# schism and what it might mean for my attempts to get my own Python modules working in these environments.

Mono is an open source project that implements C# and the supporting .Net framework allowing this language to be used on platforms other than Microsoft's Windows operating systems.

The schism between C# and Java is (was?) very harmful in my opinion and represents a huge failure of the technology industry back in the 1990s when the key commercial players were unable or unwilling to reach an agreement over Java and Microsoft redirected their efforts to developing C#.  (Just imagine where we would be if the C language had suffered the same fate!)

Since then, Java programmers have smugly assumed that their code would always work on a wider variety of platforms and represented the more "open" choice.  I always felt that Mono did just enough to enable the C# community to retain its credibility even though it would be hard to argue that it was a more open choice, especially given the absence of any standard in versions 3 and 4 of the language.  However, Oracle's acquisition of Sun has created a sense of uncertainty in the Java community too.

In both cases it seems natural to use the word 'community' because programming languages do tend to foster a community of users who interact and share knowledge.  In the case of open source communities they also share code by contributing to the frameworks that support the language's users.  This latter point is critical to me, the Java community goes way beyond the core framework.  Java without the work of the Apache foundation would be significantly less useful for programming web applications.

That said, there is a new community of Java developers emerging because of its use on the Android mobile platform.  This programming community may share the same syntax but could easily become quite distinct.  In some ways it is a return to Java's roots.  Java was invented as a language for embedded devices where the types of programming errors C/C++ developers were making could be fatal.  The sandbox was a key part of this, ensuring a higher level of security for the system and protecting it from rogue applications.  These are just the qualities you need on a mobile phone or consumer electronics device where the cost of bricking your customers' favourite toys is an expensive repair and replace programme.  C# is also in this space, in this recent article on The Death of Mono notice that the knight in shining armour is driven by a mobile-based business case.

So if you want to use C# and .Net to develop web applications it seems to me that you are better sticking with Microsoft's technology stack and playing in that community because running your code on other platforms is likely to get harder, not easier.  And so the Java/C# schism lives on in the web app world.

Python and the Java/C# Schism

Given that the C# and Java communities seem to be a playing out an each to their own strategy it got me wondering about the Python community and how IronPython and Jython fit in.  Python started out as a scripting language implemented in C/C++.  There is typically no virtual machine or sandbox, it is just a pleasant and convenient way to spend a few days writing programs that you would have previously wasted years of your life trying to implement in C++.  The Python framework is a blend of modules written purely in Python with some bare-to-the-metal modules that wrap underlying C++ code.


Given that both Java and C# provide C/C++-like environments with the added safety of a sandbox and garbage collection, implementing Python on these platforms was a logical step; Jython (Python on Java) and IronPython (Python on C#) have even caused the word CPython to enter the vocabulary as the name for the original Python interpreter.

In an earlier blog post I described my first steps with IronPython and how previous attempts to implement PyAssess and my QTI migration tool had failed on Jython.  With hindsight, I shouldn't be too surprised to see that the IronPython developers have made the same decisions and that my code fails on IronPython for the same reasons it fails on Jython.  The technical issue I'm having is described in this discussion thread, which raises concerns of a schism in the Python community itself!

Actually, the trajectory of CPython towards Python 3 should solve this problem and Jython, IronPython and CPython should converge again on the unicode vs string issue, though when that will be is anyone's guess because Python 3 is not backwards compatible.  Not only will code need to be reviewed and, in some cases, rewritten but the conversion process will effectively fork most projects into separate source trees which will make maintenance tricky.

As with the Java and C# communities, the framework is just as important as the language, and probably more so in defining the community.  Even if the basic language converges on the three platforms it seems likely that the C#/Java schism will mean that most IronPython projects will exist as a more pleasant and convenient way of implementing a C#/.Net project rather than as a target platform for cross-platform projects.  For example, Python frameworks like wxPython (a GUI toolkit for desktop apps) rely on the commonality of an underlying framework (the C++ wxWindows), so versions for these platforms are unlikely to emerge while the Java/C# schism remains.