2014-05-26

Adding OData support to Django with Pyslet: First Thoughts

A couple of weeks ago I got an interesting tweet from @d34dl0ck, here it is:

This got me thinking, but as I know very little about Django I had to do a bit of research first. Here's my read-back of what Django's data layer does in the form of a concept mapping from OData to Django. In this table the objects are listed in containment order and the use case of using OData to expose data managed by a Django-based website is assumed. (See below for thoughts on consuming OData in Django as if it were a data source.)

OData ConceptDjango ConceptPyslet Concept
DataServices Django website: the purpose of OData is to provide access to your application's data-layer through a standard API for machine-to-machine communication rather than through an HTML-based web view for human consumption. Instance of the DataServices class, typically parsed from a metadata XML file.
Schema No direct equivalent. In OData, the purpose of the schema is to provide a namespace in which definitions of the other elements take place. In Django this information will be spread around your Python source code in the form of class definitions that support the remaining concepts. Instance of the Schema class, typically parsed from a metadata XML file.
EntityContainer The database. An OData service can define multiple containers but there is always a default container - something that corresponds closely with the way Django links to multiple databases. Most OData services probably only define a single container and I would expect that most Django applications use the default database. If you do define custom database routers to map different models to different databases then that information would need to be represented in the corresponding Schema(s). In Pyslet, an EntityContainer is defined by an instance of the EntityContainer class but this instance is handed to a storage layer during application startup and this storage layer class binds concrete implementations of the data access API to the EntitySets it contains.
EntitySet Your model class. A model class maps to a table in the Django database. In OData the metadata file contains the information about which container contains an EntitySet and the EntityType definition in that file contains the actual definitions of the types and field names. In contrast, in Django these are defined using class attributes in the Python code. Pyslet sticks closely to the OData API here and parses definitions from the metadata file. As a result an EntitySet instance is created that represents this part of the model and it is up to the object responsible for interfacing to the storage layer to provide concrete bindings.
Entity An instance of a model class. An instance of the Entity object, typically instantiated by the storage object bound to the EntitySet.

Where do you start?

Step 1: As you can see from the above table, Pyslet depends fairly heavily on the metadata file so a good way to start would be to create a metadata file that corresponds to the parts of your Django data model you want to expose. You have some freedom here but if you are messing about with multiple databases in Django it makes sense to organise these as separate entity containers. You can't create relationships across containers in Pyslet which mirrors the equivalent restriction in Django.

Step 2: You now need to provide a storage object that maps Pyslet's DAL onto the Django DAL. This involves creating a sub-class of the EntityCollection object from Pyslet. To get a feel for the API my suggestion would be to create a class for a specific model initially and then, once this is working, consider how you might use Python's built-in introspection to write a more general object.

To start with, you don't need to do too much. EntityCollection objects are just like dictionaries but you only need to override itervalues and __getitem__ to get some sort of implementation going. There are simple wrappers that will (inefficiently) handle ordering and filtering for you to start with so itervalues can be very simple...

def itervalues(self):
    return self.OrderEntities(
        self.ExpandEntities(
        self.FilterEntities(
        self.entityGenerator())))

All you need to do is write the entityGenerator method (the name is up to you) and yield Entity instances from your Django model. This looks pretty simple in Django, something like Customer.objects.all() where Customer is the name of a model class would appear to return all customer instances. You need to yield an Entity object from Pyslet's DAL for each customer instance and populate the property values from the fields of the returned model instance.

Implementing __getitem__ is probably also very easy, especially when you are using simple keys. Something like Customer.objects.get(pk=1) and then a similar mapping to the above seems like it would work for implementing basic resource look up by key. Look at the in-memory collection class implementation for the details of how to check the filter and populate the field values, it's in pyslet/odata2/memds.py.

Probably the hardest part of defining an EntityCollection object is getting the constructor right. You'll want to pass through the Model class from Django so that you can make calls like the above:

def __init__(self,djangoModel,**kwArgs):
    self.djangoModel=djangoModel
    super(DjangoCollection,self).__init__(**kwArgs)

Step 3: Load the metadata from a file, then bind your EntityCollection class or classes to the EntitySets. Something like this might work:

import pyslet.odata2.metadata as edmx
doc=edmx.Document()
with open('DjangoAppMetadata.xml','rb') as f:
    doc.Read(f)
customers=doc.root.DataServices['DjangoAppSchema.DjangoDatabase.Customers']
# customers is an EntitySet instance
customers.Bind(DjangoCollection,djangoModel=Customer)

The Customer object here is your Django model object for Customers and the DjangoCollection object is the EntityCollection object you created in Step 2. Each time someone opens the customers entity set a new DjangoCollection object will be created and Customer will be passed as the djangoModel parameter.

Step 4: Test that the model is working by using the interpreter or a simple script to open the customers object (the EntitySet) and make queries with the Pyslet DAL API. If it works, you can wrap it with an OData server class and just hook the resulting wsgi object to your web server and you have hacked something together.

Post hack

You'll want to look at Pyslet's expression objects and figure out how to map these onto the query objects used by Django. Although OData provides a rich query syntax you don't need to support it all, just reject stuff you don't want to implement. Simple queries look like they'd map to things you can pass to the filter method in Django fairly easily. In fact, one of the problems with OData is that it is very general - almost SQL over the web - and your application's data layer is probably optimised for some queries and not others. Do you want to allow people to search your zillion-record table using a query that forces a full table scan? Probably not.

You'll also want to look at navigation properties which map fairly neatly to the relationship fields. The Django DAL and Pyslet's DAL are not miles apart here so you should be able to create NavigationCollection objects (equivalent to the class you created in Step 2 above) for these. At this point, the power of OData will begin to come alive for you.

Making it Django-like

I'm not an expert on what is and is not Django like but I did notice that there is a Feed concept for exposing RSS in Django. If the post hack process has left you with a useful implementation then some sort of OData equivalent object might be a useful addition. Given that Django tends to do much of the heavy lifting you could think about providing an OData feed object. It probably isn't too hard to auto-generate the metadata from something like class attributes on such an object. Pyslet's OData server is a wsgi application so provided Django can route requests to it you'll probably end up with something that is fairly nicely integrated - even if it can't do that out of the box it should be trivial to provide a simple Django request handler that fakes a wsgi call.

Consuming OData

Normally you think of consuming OData as being easier than providing it but for Django you'd be tempted to consider exposing OData as a data source, perhaps as an auxiliary database containing some models that are externally stored. This would allow you to use the power of Django to create an application which mashed up data from OData sources as if that data were stored in a locally accessible database.

This appears to be a more ambitious project: Django non-rel appears to be a separate project and it isn't clear how easy it would be to intermingle data coming form an OData source with data coming from local databases. It is unlikely that you'd want to use OData for all data in your application. The alternative might be to try and write a Python DB API interface for Pyslet's DAL and then get Django treating it like a proper database. That would mean parsing SQL, which is nasty, but it might be the lesser of two evils.

Of course, there's nothing stopping you using Pyslet's builtin OData client class directly in your code to augment your custom views with data pulled from an external source. One of the features of Pyslet's OData client is that it treats the remote server like a data source, keeping persistent HTTP connections open, managing multi-threaded access and and pipelining requests to improve throughput. That should make it fairly easy to integrate into your Django application.

2014-05-12

A Dictionary-like Python interface for OData Part III: a SQL-backed OData Server

This is the third and last part of a series of three posts that introduce my OData framework for Python. To recap:

  1. In Part I I introduced a new data access layer I've written for Python that is modelled on the conventions of OData. In that post I validated the API by writing a concrete implementation in the form of an OData client.
  2. In Part II I used the same API and wrote a concrete implementation using a simple in-memory storage model. I also introduced the OData server functionality to expose the API via the OData protocol.
  3. In this part, I conclude this mini-series with a quick look at another concrete implementation of the API which wraps Python's DB API allowing you to store data in a SQL environment.

As before, you can download the source code from the QTIMigration Tool & Pyslet home page. I wrote a brief tutorial on using the SQL backed classes to take care of some of the technical details.

Rain or Shine?

To make this project a little more interesting I went looking for a real data set to play with. I'm a bit of a weather watcher at home and for almost 20 years I've enjoyed using a local weather station run by a research group at the University of Cambridge. The group is currently part of the Cambridge Computer Laboratory and the station has moved to the William Gates building.

The Database

The SQL implementation comes in two halves. The base classes are as close to standard SQL as I could get and then a small 'shim' sits over the top which binds to a specific database implementation. The Python DB API takes you most of the way, including helping out with the correct form of parameterisation to use. For this example project I used SQLite because the driver is typically available in Python implementations straight out of the box.

I wrote the OData-style metadata document first and used it to automatically generate the CREATE TABLE commands but in most cases you'll probably have an existing database or want to edit the generated scripts and run them by hand. The main table in my schema got created from this SQL:

CREATE TABLE "DataPoints" (
    "TimePoint" TIMESTAMP NOT NULL,
    "Temperature" REAL,
    "Humidity" SMALLINT,
    "DewPoint" REAL,
    "Pressure" SMALLINT,
    "WindSpeed" REAL,
    "WindDirection" TEXT,
    "WindSpeedMax" REAL,
    "SunRainStart" REAL,
    "Sun" REAL,
    "Rain" REAL,
    "DataPointNotes_ID" INTEGER,
    PRIMARY KEY ("TimePoint"),
    CONSTRAINT "DataPointNotes" FOREIGN KEY ("DataPointNotes_ID") REFERENCES "Notes"("ID"))

To expose the database via my new data-access-layer API you just load the XML metadata, create a SQL container object containing the concrete implementation and then you can access the data in exactly the same way as I did in Part's I and II. The code that consumes the API doesn't need to know if the data source is an OData client, an in memory dummy source or a full-blown SQL database. Once I'd loaded the data, here is a simple session with the Python interpreter that shows you the API in action.

>>> import pyslet.odata2.metadata as edmx
>>> import pyslet.odata2.core as core
>>> doc=edmx.Document()
>>> with open('WeatherSchema.xml','rb') as f: doc.Read(f)
... 
>>> from pyslet.odata2.sqlds import SQLiteEntityContainer
>>> container=SQLiteEntityContainer(filePath='weather.db',containerDef=doc.root.DataServices['WeatherSchema.CambridgeWeather'])
>>> weatherData=doc.root.DataServices['WeatherSchema.CambridgeWeather.DataPoints']
>>> collection=weatherData.OpenCollection()
>>> collection.OrderBy(core.CommonExpression.OrderByFromString('WindSpeedMax desc'))
>>> collection.SetPage(5)
>>> for e in collection.iterpage(): print "%s: Max wind speed: %0.1f mph"%(unicode(e['TimePoint'].value),e['WindSpeedMax'].value*1.15078)
... 
2002-10-27T10:30:00: Max wind speed: 85.2 mph
2004-03-20T15:30:00: Max wind speed: 82.9 mph
2007-01-18T14:30:00: Max wind speed: 80.6 mph
2004-03-20T16:00:00: Max wind speed: 78.3 mph
2005-01-08T06:00:00: Max wind speed: 78.3 mph

Notice that the container itself isn't needed when accessing the data because the SQLiteEntityContainer __init__ method takes care of binding the appropriate collection classes to the model passed in. Unfortunately the dataset doesn't go all the way back to the great storm of 1987 which is a shame as at the time I was living in a 5th floor flat perched on top of what I was reliably informed was the highest building in Cambridge not to have some form of structural support. I woke up when the building shook so much my bed moved across the floor.

Setting up a Server

I used the same technique as I did in Part II to wrap the API with an OData server and then had some real fun getting it up and running on Amazon's EC2. Pyslet requires Python 2.7 but EC2 Linux comes with Python 2.6 out of the box. Thanks to this blog article for help with getting Python 2.7 installed. I also had to build mod_wsgi from scratch in order to get it to pick up the version I wanted. Essentially here's what I did:

# Python 2.7 install
sudo yum install make automake gcc gcc-c++ kernel-devel git-core -y
sudo yum install python27-devel -y
# Apache install
#  Thanks to http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-LAMP.html
sudo yum groupinstall -y "Web Server"
sudo service httpd start
sudo chkconfig httpd on

And to get mod_wsgi working with Python2.7...

sudo bash
cd
yum install httpd-devel -y
mkdir downloads
cd downloads
wget http://modwsgi.googlecode.com/files/mod_wsgi-3.4.tar.gz
tar -xzvf mod_wsgi-3.4.tar.gz
cd mod_wsgi-3.4
./configure --with-python=/usr/bin/python2.7
make
make install
# Optional check to ensure that we've got the correct Python linked
# you should see the 2.7 library linked
ldd /etc/httpd/modules/mod_wsgi.so
service httpd restart

To drive the server with mod_wsgi I used a script like this:

#! /usr/bin/env python

import logging, os.path
import pyslet.odata2.metadata as edmx
from pyslet.odata2.sqlds import SQLiteEntityContainer
from pyslet.odata2.server import ReadOnlyServer

HOME_DIR=os.path.split(os.path.abspath(__file__))[0]
SERVICE_ROOT="http://odata.pyslet.org/weather"

logging.basicConfig(filename='/var/www/wsgi-log/python.log',level=logging.INFO)

doc=edmx.Document()
with open(os.path.join(HOME_DIR,'WeatherSchema.xml'),'rb') as f:
    doc.Read(f)

container=SQLiteEntityContainer(filePath=os.path.join(HOME_DIR,'weather.db'),
    containerDef=doc.root.DataServices['WeatherSchema.CambridgeWeather'])

server=ReadOnlyServer(serviceRoot=SERVICE_ROOT)
server.SetModel(doc)

def application(environ, start_response):
 return server(environ,start_response)

I'm relying on the fact that Apache is configured to run Python internally and that my server object persists between calls. I think by default mod_wsgi serialises calls to the application method but a smarter configuration with a multi-threaded daemon would be OK because the server and container objects are thread safe. There are limits to the underlying SQLite module of course so you may not gain a lot of performance this way but a proper database would help.

Try it out!

If you were watching carefully you'll see that the above script uses a public service root. So let's try the same query but this time using OData. Here it is in Firefox:

Notice that Firefox recognises that the OData feed is an Atom feed and displays the syndication title and updated date. I used the metadata document to map the temperature and the date of the observation to these (you can see they are the same data points as above by the matching dates). The windiest days are never particularly hot or cold in Cambridge because they are almost always associated with Atlantic storms and the sea temperature just doesn't change that much.

The server is hosted at http://odata.pyslet.org/weather