2015-02-16

Accessing the ESA Sentinel Mission Data with Python and OData

I've had a couple of enquiries now about how to access the OData feeds on the ESA Sentinel mission science data hub. Sentinel 1 is the first of a new group of satellites in the Copernicus programme to monitor the Earth. That's about all I know I'm afraid. This data is not pretty desktop pictures (though doubtless there are some pretty pictures buried in there somewhere) but raw scientific data from instruments currently orbiting the Earth.

The source code described here is available in the samples directory on GitHub. You must be using the latest Pyslet from master, as the script relies on the metadata override technique described below.


The data hub advertises access to the data through OData (version 1) but my Python library, Pyslet, was not able to access the feeds properly: hence the enquiries.

It turns out that the data feeds use a concept called containment in OData. The model of OData is one of entity sets (think SQL tables) with relations between them modelled by navigation properties. There's one particular use case that doesn't work very well in this model but seems popular: given an entity (think table row or record), people want to add arbitrary key-value pairs. The ESA's data model does this by creating 'sub-tables' which define collections of attributes that hang off of each entity. The attribute name is the key in these collections. This doesn't really work in OData v1 (or v2) because these attribute values should still be entities in their own right and therefore they need a unique key and an entity set definition to contain them.

This isn't the only schema I've seen that attempts to do something like this either; SAP have published a similar schema, suggesting that some early Java tools exposed OData this way.

The upshot is that you get nasty errors when you try and load these services with Pyslet. It complains of a rather obscure condition concerning (possibly multiple) unbound principals. When I wrote that error message, I didn't expect anyone to ever actually see it.

There's a proper way to do containment in earlier versions of OData, described in Containment is Coming with OData v4, which explains how to use composite keys. As the name of the article suggests though, this is written with hindsight after a better solution has been found for this use case in OData v4.

The fix for the ESA data feed is to download and edit a copy of the advertised metadata to get around the errors reported by Pyslet and then to initialise your OData client using this modified schema instead. It isn't a perfect fix: as far as Pyslet knows, those attributes really are unique and do reside in their own entity set, but it doesn't really matter for the purposes of using the OData client. You can navigate and formulate queries without tripping over data inconsistencies.
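
To make that concrete, here is a minimal sketch of the override technique. The file name dhus_metadata.xml is hypothetical, and the metadata keyword argument is my shorthand for the override hook in the latest master, so check the sample script described below for the exact invocation:

import pyslet.odata2.metadata as edmx
from pyslet.odata2.client import Client

# load the locally edited copy of the service metadata
# ('dhus_metadata.xml' is a hypothetical file name)
doc = edmx.Document()
with open('dhus_metadata.xml', 'rb') as f:
    doc.Read(f)

# initialise the client from the fixed-up model instead of letting it
# download and parse the advertised $metadata document (assumption:
# the exact keyword may differ, see the sample script on GitHub)
c = Client("https://scihub.esa.int/dhus/odata/v1/", metadata=doc)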

I've written a little script that I've added to Pyslet's sample code directory to illustrate the technique, along with a fixed up metadata file. The result is a little UNIX-style utility for downloading products from the ESA data hub:

$ ./download.py --help
Usage: download.py [options]

Options:
  -h, --help            show this help message and exit
  -u USER, --user=USER  user name for basic auth credentials
  -p PASSWORD, --password=PASSWORD
                        password for basic auth credentials
  -v                    increase verbosity of output up to 3x
  -c, --cert            download and trust site certificate

The data is available via HTTPS and requires a user name and password (you'll have to register on the data hub site, but it's free to do so). To make it easier to set up the trust aspect I've added a -c option to download the site certificate and store it. If you don't have the site certificate you'll get an error like this:

ERROR:root:scihub.esa.int: closing connection after error failed to build secure connection to scihub.esa.int

Subsequent downloads verify that the site certificate hasn't changed: a bit like the way ssh offers to store a fingerprint the first time you connect to a remote host. Only use the -c option if you trust the network you are running on (you can use Firefox or some other 'trusted' browser to download the certificate too of course).

The password is optional; if you don't provide it you'll be prompted to enter it using Python's getpass function for privacy.

You pass the product identifiers as command line arguments, here is an example of a successful first-time run:

$ ./download.py -c -u swl10 8bf64ff9-f310-4027-b31f-8e95dd9bbf82
Password: 
ERROR:root:Entity set Attributes has more than one unbound principal
dropping multiplicity of Attribute_Node to 0..1.  Continuing
ERROR:root:Entity set Attributes has more than one unbound principal
dropping multiplicity of Attribute_Product to 0..1.  Continuing
S1A_EW_GRDM_1SDH_20150207T084156_20150207T084218_004515_0058AE_3051 150751068

After running this command I had a scihub.esa.int.crt file (from the -c option) and a 150MB zip file downloaded to the current directory.

If you run with -vv to provide a bit more information you can see the OData magic in operation:

./download.py -vv -u swl10 8bf64ff9-f310-4027-b31f-8e95dd9bbf82
Password: 
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/ HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 401
INFO:root:Resending request to: https://scihub.esa.int/dhus/odata/v1/
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/ HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 200
WARNING:root:Entity set Attributes has an unbound principal: Nodes
WARNING:root:Entity set Attributes has an unbound principal: Products
ERROR:root:Entity set Attributes has more than one unbound principal
dropping multiplicity of Attribute_Node to 0..1.  Continuing
ERROR:root:Entity set Attributes has more than one unbound principal
dropping multiplicity of Attribute_Product to 0..1.  Continuing
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/Products('8bf64ff9-f310-4027-b31f-8e95dd9bbf82') HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 200
S1A_EW_GRDM_1SDH_20150207T084156_20150207T084218_004515_0058AE_3051 150751068
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/Products('8bf64ff9-f310-4027-b31f-8e95dd9bbf82')/$value HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 200

As you can see, the fixed up metadata still generates error messages but these are no longer critical and the client is able to interact with the service.

I was given this product identifier as an example of something small to test with. I haven't researched what the data actually represents but the resulting zip file does contain a 'quick_look' image.

2014-11-14

Basic Authentication, SSL and Pyslet's HTTP/OData client

Pyslet is my Python package for Standards in Learning, Education and Training and represents a packaging up of the core of my QTI migration script code in a form that makes it easier for other developers to use. Earlier this year I released Pyslet to PyPi and moved development to GitHub to make it easier for people to download, install and engage with the source code.

Note: this article updated 2017-05-24 with code correction (see comments for details).

Warning: The code in this article will work with the latest Pyslet master from Github, and with any distribution on or later than pyslet-0.5.20141113. At the time of writing the version on PyPi has not been updated!

A recent issue that came up concerns Pyslet's HTTP client. The client is the base class for Pyslet's OData client. In my own work I often use this client to access OData feeds protected with HTTP's basic authentication but I've never properly documented how to do it. There are two approaches...

The simplest way, and the way I used to do it, is to override the client object itself and add the Authorization header at the point where each request is queued.

from pyslet.http.client import Client

class MyAuthenticatedClient(Client):

    # add an __init__ method to set some credentials 
    # in the client

    def queue_request(self, request):
        # add in the authorization credentials
        if (self.credentials is not None and
                not request.has_header("Authorization")):
            request.set_header('Authorization',
                               str(self.credentials))
        # queue every request, whether or not we added credentials
        super(MyAuthenticatedClient, self).queue_request(request)

This works OK but it forces the issue a bit and will result in the credentials being sent to all URLs, which you may not want. The credentials object should be an instance of pyslet.http.auth.BasicCredentials which takes care of correctly formatting the header. Here is some sample code to create that object:

from pyslet.http.auth import BasicCredentials
from pyslet.rfc2396 import URI

credentials = BasicCredentials()
credentials.userid = "user@example.com"
credentials.password = "secretPa$$word"
credentials.protectionSpace = URI.from_octets(
    'https://www.example.com/mypage').get_canonical_root()

With the above code, str(credentials) returns the string: 'Basic dXNlckBleGFtcGxlLmNvbTpzZWNyZXRQYSQkd29yZA==' which is what you'd expect to pass in the Authorization header.
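
If you want to check that value for yourself, it is just the "Basic " prefix followed by the Base64 encoding of the colon-separated userid and password; for example:

import base64

# reproduce the header value by hand
print "Basic " + base64.b64encode("user@example.com:secretPa$$word")
# prints: Basic dXNlckBleGFtcGxlLmNvbTpzZWNyZXRQYSQkd29yZA==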

To make this code play more nicely with the HTTP standard I added some core-support to the HTTP client itself, so you don't need to override the class anymore. The HTTP client now has a credential store and an add_credentials method. Once added, the following happens when a 401 response is received:

  1. The client iterates through any received challenges
  2. Each challenge is matched against the stored credentials
  3. If matching credentials are found then an Authorization header is added and the request resent
  4. If the request receives another 401 response the credentials are removed from the store and we go back to (1)

This process terminates when there are no more credentials that match any of the challenges or when a code other than 401 is received.
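
In other words, once you've registered the credentials created above there is nothing more to do; a minimal sketch:

from pyslet.http.client import Client

client = Client()
# register the BasicCredentials object built earlier; the client
# will now answer matching 401 challenges automatically
client.add_credentials(credentials)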

If the matching credentials are BasicCredentials (and that's the only type Pyslet supports out of the box!), then some additional logic gets activated on success. RFC 2617 says that for basic authentication, a challenge implies that all paths "at or deeper than the depth of the last symbolic element in the path field" fall into the same protection space. Therefore, when credentials are used successfully, Pyslet adds the path to the credentials using BasicCredentials.add_success_path. Next time a request is sent to a URL on the same server with a path that meets this criterion the Authorization header will be added pre-emptively.

If you want to pre-empt the 401 handling completely then you just need to add a suitable path to the credentials before you add them to the client. So if you know your credentials are good for everything in /website/~user/ you could continue the above code like this:

credentials.add_success_path('/website/~user/')

That last slash is really important: if you leave it off, everything in '/website/' will be added to your protection space, which is probably not what you want.

SSL

If you're going to pass basic auth credentials around you really should be using HTTPS. Python makes it a bit tricky to use HTTPS and be sure that you are using a trusted connection. Pyslet tries to make this a little bit easier. Here's what I do:

  1. With Firefox, go to the site in question and check that SSL is working properly.
  2. Export the certificate from the site in PEM format and save it to disk, e.g., www.example.com.crt.
  3. Repeat for any other sites you want your Python script to work with.
  4. Concatenate the files together and save them to, say, 'certificates.pem'.
  5. Pass this file name to the HTTP (or OData) client constructor:

from pyslet.http.client import Client

my_client = Client(ca_certs='certificates.pem')
my_client.add_credentials(credentials)

In this code, I've assumed that the credentials were created as above. To be really sure you are secure here, try grabbing a file from a different site or, even better, generate a self-signed certificate and use that instead. (The master version of Pyslet currently has such a certificate ready made in unittests/data_rfc2616/server.crt). Now pass that file for ca_certs and check that you get SSL errors! If you don't, something is broken and you should proceed with caution, or you may just be on a Mac (see notes in Is Python's SSL module correctly validating certificates... for details). And don't pass None for ca_certs as that tells the ssl module not to check at all!

If you don't like messing around with the certificates, and you are using a machine and network that is pretty trustworthy and from which you would happily do your internet banking then the following can be used to proxy for the browser method:

import ssl, string
import pyslet.rfc2396 as uri

certs = []
for s in ('https://www.example.com', 'https://www.example2.com', ):
    # add other sites to the above tuple as you like
    url = uri.URI.from_octets(s)
    certs.append(ssl.get_server_certificate(url.get_addr(),
                 ssl_version=ssl.PROTOCOL_TLSv1))
# write the concatenated certificates once, after the loop
with open('certificates.pem', 'wb') as f:
    f.write(string.join(certs, ''))

Passing the ssl_version is optional above, but the default setting in many Python installations will use the discredited SSLv3 or worse, and your server may refuse to serve you (I know mine does!). Set it to a protocol you trust.

Remember that you'll have to do this every so often because server certificates expire. You can always grab the certificate authority's certificate instead (and thereby trust a whole slew of sites at once) but if you're going that far then there are better recipes for finding and re-using the built-in machine certificate store anyway. The beauty of this method is that you can self-sign a server certificate you trust and connect to it securely with a Python client without having to mess around with certificate authorities at all, provided you can safely courier the certificate from your server to your client that is! If you are one of the growing number of people who think the whole trust thing is broken anyway since Snowden then this may be an attractive option.

With thanks to @bolhovsky on Github for bringing the need for this article to my attention.

2014-05-26

Adding OData support to Django with Pyslet: First Thoughts

A couple of weeks ago I got an interesting tweet from @d34dl0ck; here it is:

This got me thinking, but as I know very little about Django I had to do a bit of research first. Here's my read-back of what Django's data layer does in the form of a concept mapping from OData to Django. In the mapping below the objects are listed in containment order and the use case of using OData to expose data managed by a Django-based website is assumed. (See below for thoughts on consuming OData in Django as if it were a data source.)

DataServices
  Django: the website itself. The purpose of OData is to provide access to your application's data-layer through a standard API for machine-to-machine communication rather than through an HTML-based web view for human consumption.
  Pyslet: an instance of the DataServices class, typically parsed from a metadata XML file.

Schema
  Django: no direct equivalent. In OData, the purpose of the schema is to provide a namespace in which definitions of the other elements take place. In Django this information will be spread around your Python source code in the form of class definitions that support the remaining concepts.
  Pyslet: an instance of the Schema class, typically parsed from a metadata XML file.

EntityContainer
  Django: the database. An OData service can define multiple containers but there is always a default container, something that corresponds closely with the way Django links to multiple databases. Most OData services probably only define a single container and I would expect that most Django applications use the default database. If you do define custom database routers to map different models to different databases then that information would need to be represented in the corresponding Schema(s).
  Pyslet: an EntityContainer is defined by an instance of the EntityContainer class, but this instance is handed to a storage layer during application startup and the storage layer class binds concrete implementations of the data access API to the EntitySets it contains.

EntitySet
  Django: your model class. A model class maps to a table in the Django database. In OData the metadata file contains the information about which container contains an EntitySet and the EntityType definition in that file contains the actual definitions of the types and field names. In contrast, in Django these are defined using class attributes in the Python code.
  Pyslet: sticks closely to the OData API here and parses definitions from the metadata file. As a result an EntitySet instance is created that represents this part of the model and it is up to the object responsible for interfacing to the storage layer to provide concrete bindings.

Entity
  Django: an instance of a model class.
  Pyslet: an instance of the Entity object, typically instantiated by the storage object bound to the EntitySet.

Where do you start?

Step 1: As you can see from the above table, Pyslet depends fairly heavily on the metadata file so a good way to start would be to create a metadata file that corresponds to the parts of your Django data model you want to expose. You have some freedom here but if you are messing about with multiple databases in Django it makes sense to organise these as separate entity containers. You can't create relationships across containers in Pyslet which mirrors the equivalent restriction in Django.

Step 2: You now need to provide a storage object that maps Pyslet's DAL onto the Django DAL. This involves creating a sub-class of the EntityCollection object from Pyslet. To get a feel for the API my suggestion would be to create a class for a specific model initially and then, once this is working, consider how you might use Python's built-in introspection to write a more general object.

To start with, you don't need to do too much. EntityCollection objects are just like dictionaries but you only need to override itervalues and __getitem__ to get some sort of implementation going. There are simple wrappers that will (inefficiently) handle ordering and filtering for you to start with so itervalues can be very simple...

def itervalues(self):
    return self.OrderEntities(
        self.ExpandEntities(
        self.FilterEntities(
        self.entityGenerator())))

All you need to do is write the entityGenerator method (the name is up to you) and yield Entity instances from your Django model. This looks pretty simple in Django: something like Customer.objects.all(), where Customer is the name of a model class, would appear to return all customer instances. You need to yield an Entity object from Pyslet's DAL for each customer instance and populate the property values from the fields of the returned model instance.
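
Something along these lines might work as a starting point. It is an untested sketch: the model fields and the property names are placeholders, and there will be details to take care of such as flagging that the entities already exist in storage:

def entityGenerator(self):
    # hypothetical sketch: iterate over the Django model instances
    # bound to this collection, converting each one to an Entity
    for obj in self.djangoModel.objects.all():
        e = self.NewEntity()
        e.SetKey(obj.pk)
        # map each exposed field to the matching OData property
        # ('CompanyName' and obj.company_name are placeholder names)
        e['CompanyName'].SetFromValue(obj.company_name)
        yield e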

Implementing __getitem__ is probably also very easy, especially when you are using simple keys. Something like Customer.objects.get(pk=1), followed by a similar mapping to the above, seems like it would work for implementing basic resource look-up by key. Look at the in-memory collection class implementation for the details of how to check the filter and populate the field values; it's in pyslet/odata2/memds.py.
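
A similarly hedged sketch of the key look-up, translating Django's DoesNotExist into the KeyError that the dictionary-like API expects:

def __getitem__(self, key):
    try:
        obj = self.djangoModel.objects.get(pk=key)
    except self.djangoModel.DoesNotExist:
        raise KeyError(key)
    e = self.NewEntity()
    e.SetKey(obj.pk)
    # ...populate the property values as in entityGenerator and
    # check any active filter before returning (see memds.py)...
    return e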

Probably the hardest part of defining an EntityCollection object is getting the constructor right. You'll want to pass through the Model class from Django so that you can make calls like the above:

def __init__(self,djangoModel,**kwArgs):
    self.djangoModel=djangoModel
    super(DjangoCollection,self).__init__(**kwArgs)

Step 3: Load the metadata from a file, then bind your EntityCollection class or classes to the EntitySets. Something like this might work:

import pyslet.odata2.metadata as edmx
doc=edmx.Document()
with open('DjangoAppMetadata.xml','rb') as f:
    doc.Read(f)
customers=doc.root.DataServices['DjangoAppSchema.DjangoDatabase.Customers']
# customers is an EntitySet instance
customers.Bind(DjangoCollection,djangoModel=Customer)

The Customer object here is your Django model object for Customers and the DjangoCollection object is the EntityCollection object you created in Step 2. Each time someone opens the customers entity set a new DjangoCollection object will be created and Customer will be passed as the djangoModel parameter.

Step 4: Test that the model is working by using the interpreter or a simple script to open the customers object (the EntitySet) and make queries with the Pyslet DAL API. If it works, you can wrap it with an OData server class and just hook the resulting wsgi object to your web server and you have hacked something together.
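
The wrapping step can be as small as the following sketch; ReadOnlyServer is the class I use for read-only services elsewhere, and the service root and port here are just examples:

from wsgiref.simple_server import make_server
from pyslet.odata2.server import ReadOnlyServer

# expose the model loaded in Step 3 as a read-only OData service
server = ReadOnlyServer(serviceRoot="http://localhost:8081/")
server.SetModel(doc)
make_server('', 8081, server).serve_forever()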

Post hack

You'll want to look at Pyslet's expression objects and figure out how to map these onto the query objects used by Django. Although OData provides a rich query syntax you don't need to support it all; just reject stuff you don't want to implement. Simple queries look like they'd map to things you can pass to the filter method in Django fairly easily. In fact, one of the problems with OData is that it is very general - almost SQL over the web - and your application's data layer is probably optimised for some queries and not others. Do you want to allow people to search your zillion-record table using a query that forces a full table scan? Probably not.

You'll also want to look at navigation properties which map fairly neatly to the relationship fields. The Django DAL and Pyslet's DAL are not miles apart here so you should be able to create NavigationCollection objects (equivalent to the class you created in Step 2 above) for these. At this point, the power of OData will begin to come alive for you.

Making it Django-like

I'm not an expert on what is and is not Django like but I did notice that there is a Feed concept for exposing RSS in Django. If the post hack process has left you with a useful implementation then some sort of OData equivalent object might be a useful addition. Given that Django tends to do much of the heavy lifting you could think about providing an OData feed object. It probably isn't too hard to auto-generate the metadata from something like class attributes on such an object. Pyslet's OData server is a wsgi application so provided Django can route requests to it you'll probably end up with something that is fairly nicely integrated - even if it can't do that out of the box it should be trivial to provide a simple Django request handler that fakes a wsgi call.

Consuming OData

Normally you think of consuming OData as being easier than providing it but for Django you'd be tempted to consider exposing OData as a data source, perhaps as an auxiliary database containing some models that are externally stored. This would allow you to use the power of Django to create an application which mashed up data from OData sources as if that data were stored in a locally accessible database.

This appears to be a more ambitious project: Django non-rel appears to be a separate project and it isn't clear how easy it would be to intermingle data coming from an OData source with data coming from local databases. It is unlikely that you'd want to use OData for all data in your application. The alternative might be to try and write a Python DB API interface for Pyslet's DAL and then get Django treating it like a proper database. That would mean parsing SQL, which is nasty, but it might be the lesser of two evils.

Of course, there's nothing stopping you using Pyslet's builtin OData client class directly in your code to augment your custom views with data pulled from an external source. One of the features of Pyslet's OData client is that it treats the remote server like a data source, keeping persistent HTTP connections open, managing multi-threaded access and pipelining requests to improve throughput. That should make it fairly easy to integrate into your Django application.

2014-05-12

A Dictionary-like Python interface for OData Part III: a SQL-backed OData Server

This is the third and last part of a series of three posts that introduce my OData framework for Python. To recap:

  1. In Part I I introduced a new data access layer I've written for Python that is modelled on the conventions of OData. In that post I validated the API by writing a concrete implementation in the form of an OData client.
  2. In Part II I used the same API and wrote a concrete implementation using a simple in-memory storage model. I also introduced the OData server functionality to expose the API via the OData protocol.
  3. In this part, I conclude this mini-series with a quick look at another concrete implementation of the API which wraps Python's DB API allowing you to store data in a SQL environment.

As before, you can download the source code from the QTI Migration Tool & Pyslet home page. I wrote a brief tutorial on using the SQL backed classes to take care of some of the technical details.

Rain or Shine?

To make this project a little more interesting I went looking for a real data set to play with. I'm a bit of a weather watcher at home and for almost 20 years I've enjoyed using a local weather station run by a research group at the University of Cambridge. The group is currently part of the Cambridge Computer Laboratory and the station has moved to the William Gates building.

The Database

The SQL implementation comes in two halves. The base classes are as close to standard SQL as I could get and then a small 'shim' sits over the top which binds to a specific database implementation. The Python DB API takes you most of the way, including helping out with the correct form of parameterisation to use. For this example project I used SQLite because the driver is typically available in Python implementations straight out of the box.

I wrote the OData-style metadata document first and used it to automatically generate the CREATE TABLE commands but in most cases you'll probably have an existing database or want to edit the generated scripts and run them by hand. The main table in my schema got created from this SQL:

CREATE TABLE "DataPoints" (
    "TimePoint" TIMESTAMP NOT NULL,
    "Temperature" REAL,
    "Humidity" SMALLINT,
    "DewPoint" REAL,
    "Pressure" SMALLINT,
    "WindSpeed" REAL,
    "WindDirection" TEXT,
    "WindSpeedMax" REAL,
    "SunRainStart" REAL,
    "Sun" REAL,
    "Rain" REAL,
    "DataPointNotes_ID" INTEGER,
    PRIMARY KEY ("TimePoint"),
    CONSTRAINT "DataPointNotes" FOREIGN KEY ("DataPointNotes_ID") REFERENCES "Notes"("ID"))
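
If you do want Pyslet to create the tables for you, the container can provision them straight from the model once it has been constructed (the session below shows how the container gets made). The method name here is from memory, so treat it as an assumption and check the tutorial linked above:

# hedged: ask the bound container to issue CREATE TABLE statements
# for every entity set declared in the metadata model
container.CreateAllTables()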

To expose the database via my new data-access-layer API you just load the XML metadata, create a SQL container object containing the concrete implementation and then you can access the data in exactly the same way as I did in Parts I and II. The code that consumes the API doesn't need to know if the data source is an OData client, an in-memory dummy source or a full-blown SQL database. Once I'd loaded the data, here is a simple session with the Python interpreter that shows you the API in action.

>>> import pyslet.odata2.metadata as edmx
>>> import pyslet.odata2.core as core
>>> doc=edmx.Document()
>>> with open('WeatherSchema.xml','rb') as f: doc.Read(f)
... 
>>> from pyslet.odata2.sqlds import SQLiteEntityContainer
>>> container=SQLiteEntityContainer(filePath='weather.db',containerDef=doc.root.DataServices['WeatherSchema.CambridgeWeather'])
>>> weatherData=doc.root.DataServices['WeatherSchema.CambridgeWeather.DataPoints']
>>> collection=weatherData.OpenCollection()
>>> collection.OrderBy(core.CommonExpression.OrderByFromString('WindSpeedMax desc'))
>>> collection.SetPage(5)
>>> for e in collection.iterpage(): print "%s: Max wind speed: %0.1f mph"%(unicode(e['TimePoint'].value),e['WindSpeedMax'].value*1.15078)
... 
2002-10-27T10:30:00: Max wind speed: 85.2 mph
2004-03-20T15:30:00: Max wind speed: 82.9 mph
2007-01-18T14:30:00: Max wind speed: 80.6 mph
2004-03-20T16:00:00: Max wind speed: 78.3 mph
2005-01-08T06:00:00: Max wind speed: 78.3 mph

Notice that the container itself isn't needed when accessing the data because the SQLiteEntityContainer __init__ method takes care of binding the appropriate collection classes to the model passed in. Unfortunately the dataset doesn't go all the way back to the great storm of 1987, which is a shame as at the time I was living in a 5th floor flat perched on top of what I was reliably informed was the highest building in Cambridge not to have some form of structural support. I woke up when the building shook so much my bed moved across the floor.

Setting up a Server

I used the same technique as I did in Part II to wrap the API with an OData server and then had some real fun getting it up and running on Amazon's EC2. Pyslet requires Python 2.7 but EC2 Linux comes with Python 2.6 out of the box. Thanks to this blog article for help with getting Python 2.7 installed. I also had to build mod_wsgi from scratch in order to get it to pick up the version I wanted. Essentially here's what I did:

# Python 2.7 install
sudo yum install make automake gcc gcc-c++ kernel-devel git-core -y
sudo yum install python27-devel -y
# Apache install
#  Thanks to http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-LAMP.html
sudo yum groupinstall -y "Web Server"
sudo service httpd start
sudo chkconfig httpd on

And to get mod_wsgi working with Python2.7...

sudo bash
cd
yum install httpd-devel -y
mkdir downloads
cd downloads
wget http://modwsgi.googlecode.com/files/mod_wsgi-3.4.tar.gz
tar -xzvf mod_wsgi-3.4.tar.gz
cd mod_wsgi-3.4
./configure --with-python=/usr/bin/python2.7
make
make install
# Optional check to ensure that we've got the correct Python linked
# you should see the 2.7 library linked
ldd /etc/httpd/modules/mod_wsgi.so
service httpd restart

To drive the server with mod_wsgi I used a script like this:

#! /usr/bin/env python

import logging, os.path
import pyslet.odata2.metadata as edmx
from pyslet.odata2.sqlds import SQLiteEntityContainer
from pyslet.odata2.server import ReadOnlyServer

HOME_DIR=os.path.split(os.path.abspath(__file__))[0]
SERVICE_ROOT="http://odata.pyslet.org/weather"

logging.basicConfig(filename='/var/www/wsgi-log/python.log',level=logging.INFO)

doc=edmx.Document()
with open(os.path.join(HOME_DIR,'WeatherSchema.xml'),'rb') as f:
    doc.Read(f)

container=SQLiteEntityContainer(filePath=os.path.join(HOME_DIR,'weather.db'),
    containerDef=doc.root.DataServices['WeatherSchema.CambridgeWeather'])

server=ReadOnlyServer(serviceRoot=SERVICE_ROOT)
server.SetModel(doc)

def application(environ, start_response):
    return server(environ, start_response)

I'm relying on the fact that Apache is configured to run Python internally and that my server object persists between calls. I think by default mod_wsgi serialises calls to the application method but a smarter configuration with a multi-threaded daemon would be OK because the server and container objects are thread safe. There are limits to the underlying SQLite module of course so you may not gain a lot of performance this way but a proper database would help.

Try it out!

If you were watching carefully you'll see that the above script uses a public service root. So let's try the same query but this time using OData. Here it is in Firefox:

Notice that Firefox recognises that the OData feed is an Atom feed and displays the syndication title and updated date. I used the metadata document to map the temperature and the date of the observation to these (you can see they are the same data points as above by the matching dates). The windiest days are never particularly hot or cold in Cambridge because they are almost always associated with Atlantic storms and the sea temperature just doesn't change that much.

The server is hosted at http://odata.pyslet.org/weather

2014-02-24

A Dictionary-like Python interface for OData Part II: a Memory-backed OData Server

In my previous post, A Dictionary-like Python interface for OData I introduced a new sub-package I've added to Pyslet to implement support for OData version 2. You can download the latest version of the Pyslet package from the QTI Migration Tool & Pyslet home page.

To recap, I've decided to set about writing my own data access layer for Python that is modelled on the conventions of OData. I've validated the API by writing a concrete implementation in the form of an OData client. In this post I'll introduce the next step in the process which is a simple alternative implementation that uses a different underlying storage model, in other words, an implementation which uses something other than a remote OData server. I'll then expose this implementation as an OData server to validate that my data access layer API works from both perspectives.

Metadata

Unlike other frameworks for implementing OData services, Pyslet starts with the metadata model: it is not automatically generated from your code, you must write it yourself. This differs from the object-first approach taken by other frameworks, illustrated here:

This picture is typical of a project using something like Microsoft's WCF. Essentially, there's a two-step process. You use something like Microsoft's entity framework to generate classes from a database schema, customise the classes a little and then the metadata model is auto-generated from your code model. Of course, you can go straight to code and implement your own code model that implements the appropriate queryable interface but this would typically be done for a specific model.

Contrast this with the approach taken by Pyslet where the entities are not model-specific classes. For example, when modelling the Northwind service there is no Python class called Product as there would be in the approach taken by other frameworks. Instead there is a generalised implementation of Entity which behaves like a dictionary. The main difference is probably that you'll use supplier['Phone'] instead of simply supplier.phone or, if you'd have gone down the getter/setter route, supplier.GetPhone(). In my opinion, this works better than a tighter binding for a number of reasons, but particularly because it makes the user more mindful of when data access is happening and when it isn't.

Using a looser binding also helps prevent the type of problems I had during the development of the QTI specification. Lots of people were using Java and JAXB to autogenerate classes from the XML specification (cf. autogenerating classes from a database schema) but the QTI model contained a class attribute on most elements to allow for stylesheet support. This class attribute prevented auto-generation because class is a reserved word in the Java language. Trying to fix this up after auto-generation would be madness but fixing it up before turns out to be a little tricky and this glitch seriously damaged the specification's user-experience. We got over it, but I'm wary now and when modelling OData I stepped back from a tighter binding, in part, to prevent hard-to-fix glitches like the use of Python reserved words as property names.

Allocating Storage

For this blog post I'm using a lightweight in-memory data storage implementation which can be automatically provisioned from the metadata document and I'm going to cheat by making a copy of the metadata document used by the Northwind service. Exposing OData the Pyslet way is a little more work if you already have a SQL database containing your data because I don't have a tool that auto-generates the metadata document from the SQL database schema. Automating the other direction is easy, but more on that in Part III.

I used my web browser to grab a copy of http://services.odata.org/V2/Northwind/Northwind.svc/$metadata and saved it to a file called Northwind.xml. I can then load the model from the interpreter:

>>> import pyslet.odata2.metadata as edmx
>>> doc=edmx.Document()
>>> f=open('Northwind.xml')
>>> doc.Read(f)
>>> f.close()

This Document class ensures that the model is loaded with the special Pyslet element implementations. The Products entity set can be looked up directly but at the moment it's empty!

>>> productSet=doc.root.DataServices['ODataWeb.Northwind.Model.NorthwindEntities.Products']
>>> products=productSet.OpenCollection()
>>> len(products)
0
>>> products.close()

This isn't surprising: there is nothing in the metadata model itself which binds it to the data service at services.odata.org. The model isn't linked to any actual storage for the data. By default, the model behaves as if it is bound to an empty read-only data store.

To help me validate that my API can be used for something other than talking to real OData services I've created an object that provisions storage for an EntityContainer (that's like a database in OData) using standard Python dictionaries. By passing the definition of an EntityContainer to the object's constructor I create a binding between the model and this new data store.

>>> from pyslet.odata2.memds import InMemoryEntityContainer
>>> container=InMemoryEntityContainer(doc.root.DataServices['ODataWeb.Northwind.Model.NorthwindEntities'])
>>> collection=productSet.OpenCollection()
>>> len(collection)
0

The collection of products is still empty but it is now writeable. I'm going to cheat again to illustrate this by borrowing some code from the previous blog post to open an OData client connected to the real Northwind service.

>>> from pyslet.odata2.client import Client
>>> c=Client("http://services.odata.org/V2/Northwind/Northwind.svc/")
>>> nwProducts=c.feeds['Products'].OpenCollection()

Here's a simple loop to copy the products from the real service into my own collection. It's a bit clumsy in the interpreter but careful typing pays off:

>>> for nwProduct in nwProducts.itervalues():
...   product=collection.CopyEntity(nwProduct)
...   product.SetKey(nwProduct.Key())
...   collection.InsertEntity(product)
... 
>>> len(collection)
77

To emphasise the difference between my in-memory collection and the live OData service I'll add another record to my copy of this entity set. Fortunately most of the fields are marked as Nullable in the model so to save my fingers I'll just set those that aren't.

>>> product=collection.NewEntity()
>>> product.SetKey(100)
>>> product['ProductName'].SetFromValue("The one and only Pyslet")
>>> product['Discontinued'].SetFromValue(False)
>>> collection.InsertEntity(product)
>>> len(collection)
78

Now I can do everything I can with the OData client using my copy of the service. I'll filter the entities to make it easier to see:

>>> import pyslet.odata2.core as core
>>> filter=core.CommonExpression.FromString("substringof('one',ProductName)")
>>> collection.Filter(filter)
>>> for p in collection.itervalues(): print p.Key(), p['ProductName'].value
... 
21 Sir Rodney's Scones
32 Mascarpone Fabioli
100 The one and only Pyslet

I can access my own data store using the same API that I used to access a remote OData service in the previous post. In that post, I also claimed that it was easy to wrap my own implementations of this API to expose it as an OData service.

Exposing an OData Server

My OData server class implements the wsgi protocol so it is easy to link it up to a simple http server and tell it to handle a single request.

>>> from pyslet.odata2.server import Server
>>> server=Server("http://localhost:8081/")
>>> server.SetModel(doc)
>>> from wsgiref.simple_server import make_server
>>> httpServer=make_server('',8081,server)
>>> httpServer.handle_request()

My interpreter session is hanging at this point waiting for a single HTTP connection. The Northwind service doesn't have any feed customisations on the Products feed and, as we slavishly copied it, the Atom-view in the browser is a bit boring so I used the excellent JSONView plugin for Firefox and the following URL to hit my service:

http://localhost:8081/Products?$filter=substringof('one',ProductName)&$orderby=ProductID desc&$format=json

This is the same filter as I used in the interpreter before but I've added an ordering and specified my preference for JSON format. Here's the result.

As I did this, Python's simple server object logged the following output to my console:

127.0.0.1 - - [24/Feb/2014 11:17:05] "GET /Products?$filter=substringof(%27one%27,ProductName)&$orderby=ProductID%20desc&$format=json HTTP/1.1" 200 1701
>>>

The in-memory data store is a bit of a toy, though some more useful applications might be possible. In the OData documentation I go through a tutorial on how to create a lightweight memory-cache of key-value pairs exposed as an OData service. I'm not really suggesting using it in a production environment to replace memcached. What this implementation is really useful for is developing and testing applications that consume the DAL API without needing to be connected to the real data source. Also, it can be wrapped in the OData Server class as shown above and used to provide a more realistic mock of an actual service for testing that your consumer application still works when the data service is remote. I've used it in Pyslet's unit tests this way.

In the third and final part of this Python and OData series I'll cover a more interesting implementation of the API using the SQLite database.

2014-02-12

A Dictionary-like Python interface for OData

Overview

This blog post introduces some new modules that I've added to the Pyslet package I wrote. Pyslet's purpose is providing support for Standards for Learning, Education and Training in Python. The new modules implement the OData protocol by providing a dictionary-like interface. You can download Pyslet from the QTI Migration Tool & Pyslet home page. There is some documentation linked from the main Pyslet wiki. This blog article is as good a way as any to get you started.

The Problem

Python has a database API which does a good job but it is not the whole solution for data access. Embedding SQL statements in code, grappling with the complexities of parameterisation and dealing with individual database quirks make it useful to have some type of layer between your web app and the database API so that you can tweak your code as you move between data sources.

If SQL has failed to be a really interoperable standard then perhaps OData, the new kid on the block, can fill the vacuum. The standard is sometimes referred to as "ODBC over the web" so it is definitely in this space (after all, who runs their database on the same server as their web app these days?).

My Solution

To solve this problem I decided to set about writing my own data access layer that would be modelled on the conventions of OData but that used some simple concepts in Python. I decided to go down the dictionary-like route, rather than simulating objects with attributes, because I find the code more transparent that way. Implementing methods like __getitem__, __setitem__ and itervalues keeps the data layer abstraction at arm's length from the basic Python machinery. It is a matter of taste. See what you think.

The vision here is to write a single API (represented by a set of base classes) that can be implemented in different ways to access different data sources. There are three steps:

  1. An implementation that uses the OData protocol to talk to a remote OData service.
  2. An implementation that uses python dictionaries to create a transient in-memory data service for testing.
  3. An implementation that uses the python database API to access a real database.

This blog post is mainly about the first step, which should validate the API as being OData-like and set the groundwork for the others which I'll describe in subsequent blog posts. Incidentally, it turns out to be fairly easy to write an OData server that exposes a data service written to this API, more on that in future posts.

Quick Tutorial

The client implementation uses Python's logging module to provide logging. To make it easier to see what is going on during this walk through I'm going to turn logging up from the default "WARN" to "INFO":

>>> import logging
>>> logging.basicConfig(level=logging.INFO)

To create a new OData client you simply instantiate a Client object passing the URL of the OData service root. Notice that, during construction, the Client object downloads the list of feeds followed by the metadata document. The metadata document is used extensively by this module and is loaded into a DOM-like representation.

>>> from pyslet.odata2.client import Client
>>> c=Client("http://services.odata.org/V2/Northwind/Northwind.svc/")
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/ HTTP/1.1
INFO:root:Finished Response, status 200
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/$metadata HTTP/1.1
INFO:root:Finished Response, status 200

Client objects have a feeds attribute that is a plain dictionary mapping the exposed feeds (by name) onto EntitySet objects. These objects are part of the metadata model but serve a special purpose in the API as they can be opened (a bit like files or directories) to gain access to the (collections of) entities themselves. Collection objects can be used in the with statement and that's normally how you'd use them but I'm sticking with the interactive terminal for now.
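
Although we'll stick with the interactive terminal below, the with form would look like this:

with c.feeds['Products'].OpenCollection() as products:
    # the collection is closed automatically on exit
    print len(products)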

>>> products=c.feeds['Products'].OpenCollection()
>>> for p in products: print p
... 
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products HTTP/1.1
INFO:root:Finished Response, status 200
1
2
3
... [and so on]
...
20
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products?$skiptoken=20 HTTP/1.1
INFO:root:Finished Response, status 200
21
22
23
... [and so on]
...
76
77

The products collection behaves like a dictionary: iterating through it iterates through the keys in the dictionary. In this case these are the keys of the entities in the collection of products in Microsoft's sample Northwind data service. Notice that the client logs several requests to the server interspersed with the printed output. That's because the server is limiting the maximum page size and the client is following the page links provided. These calls are made as you iterate through the collection, allowing you to iterate through very large collections without loading everything into memory.

The keys alone are of limited interest, let's try a similar loop but this time we'll print the product names as well:

>>> for k,p in products.iteritems(): print k,p['ProductName'].value
... 
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products HTTP/1.1
INFO:root:Finished Response, status 200
1 Chai
2 Chang
3 Aniseed Syrup
...
...
20 Sir Rodney's Marmalade
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products?$skiptoken=20 HTTP/1.1
INFO:root:Finished Response, status 200
21 Sir Rodney's Scones
22 Gustaf's Knäckebröd
23 Tunnbröd
...
...
76 Lakkalikööri
77 Original Frankfurter grüne Soße

Sir Rodney's Scones sound interesting, we can grab an individual record just as we normally would from a dictionary, by using its key.

>>> scones=products[21]
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products(21) HTTP/1.1
INFO:root:Finished Response, status 200
>>> for k,v in scones.DataItems(): print k,v.value
... 
ProductID 21
ProductName Sir Rodney's Scones
SupplierID 8
CategoryID 3
QuantityPerUnit 24 pkgs. x 4 pieces
UnitPrice 10.0000
UnitsInStock 3
UnitsOnOrder 40
ReorderLevel 5
Discontinued False

The scones object is an Entity object. It too behaves like a dictionary. The keys are the property names and the values are one of SimpleValue, Complex or DeferredValue. In the snippet above I've used a variation of iteritems which iterates only through the data properties, excluding the navigation properties. In this model, there are no complex properties. The simple values have a value attribute which contains a python representation of the value.

Deferred values (navigation properties) can be used to navigate between Entities. Although deferred values can be opened just like EntitySets, if the model dictates that at most one entity can be linked, a convenience method called GetEntity can be used to open the collection and read the entity in one call. In this case, a product can have at most one supplier.

>>> supplier=scones['Supplier'].GetEntity()
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products(21)/Supplier HTTP/1.1
INFO:root:Finished Response, status 200
>>> for k,v in supplier.DataItems(): print k,v.value
... 
SupplierID 8
CompanyName Specialty Biscuits, Ltd.
ContactName Peter Wilson
ContactTitle Sales Representative
Address 29 King's Way
City Manchester
Region None
PostalCode M14 GSD
Country UK
Phone (161) 555-4448
Fax None
HomePage None

Continuing with the dictionary-like theme, attempting to load a non-existent entity results in a KeyError:

>>> p=products[211]
INFO:root:Sending request to services.odata.org
INFO:root:GET /V2/Northwind/Northwind.svc/Products(211) HTTP/1.1
INFO:root:Finished Response, status 404
Traceback (most recent call last):
  File "", line 1, in 
  File "/Library/Python/2.7/site-packages/pyslet/odata2/client.py", line 165, in __getitem__
 raise KeyError(key)
KeyError: 211

Finally, when we're done, it is a good idea to close the open collection. If we'd used the with statement this step would have been done automatically for us of course.

>>> products.close()

Limitations

Currently the client only supports OData version 2. Version 3 has now been published and I do intend to update the classes to speak version 3 at some point. If you try and connect to a version 3 service the client will complain when it tries to load the metadata document. There are ways around this limitation; if you are interested, add a comment to this post and I'll add some documentation.

The client only speaks XML so if your service only speaks JSON it won't work at the moment. Most of the JSON code is done and tested so adding it shouldn't be a big issue if you are interested.

The client can be used to both read and write to a service, and there are even ways of passing basic authentication credentials. However, if calling an https URL it doesn't do certificate validation at the moment so be warned as your security could be compromised. Python 2.7 does now support certificate validation using OpenSSL so this could change quite easily I think.

Moving to Python 3 is non-trivial - let me know if you are interested. I have taken the first steps (running unit tests with "python -3Wd" to force warnings) and, as much as possible, the code is ready for migration. I haven't tried it yet though and I know that some of the older code (we're talking 10-15 years here) is a bit sensitive to the raw/unicode string distinction.

The documentation is currently about 80% accurate and only about 50% useful. Trending upwards though.

Downloading and Installing Pyslet

Pyslet is pure-python. If you are only interested in OData you don't need any other modules, just Python 2.7 and a reasonable setuptools to help you install it. I just upgraded my machine to Mavericks which effectively reset my Python environment. Here's what I did to get Pyslet running.

  1. Installed setuptools
  2. Downloaded the pyslet package tgz and unpacked it (download from here)
  3. Ran python setup.py install

Why?

Some lessons are hard! Ten years or so ago I wrote a migration tool to convert QTI version 1 to QTI version 2 format. I wrote it as a Python script and used it to validate the work the project team were doing on the version 2 specification itself. Realising that most people holding QTI content weren't able to easily run a Python script (especially on Windows PCs) my co-chair Pierre Gorissen wrote a small Windows-wrapper for the script using the excellent wxPython and published an installer via his website. From then on, everyone referred to it as "Pierre's migration tool". I'm not bitter, the lesson was clear. No point in writing the tool if you don't package it up in the way people want to use it.

This sentiment brings me to the latest developments with the tool. A few years back I wrote (and blogged about) a module for writing Basic LTI tools in Python. I did this partly to prove that LTI really was simple (I wrote the entire module on a single flight to the US) but also because I believed that the LTI specification was really on to something useful. LTI has been a huge success and offers a quick route for tool developers to gain access to users of learning management systems. It seems obvious that the next version of the QTI Migration Tool should be an LTI tool but moving from a desktop app to a server-based web-app means that I need a data access layer that can persist data and be smarter about things like multiple threads and processes.

2013-03-15

RSS Readers: in the dog house

So farewell Google Reader, I will miss you.

This week's announcement of the demise of Google Reader as part of the Second Spring of Cleaning seems to be an important milestone for the internet.

There's a lot of new blog articles lamenting its demise (to some extent, this is one of them) but we shouldn't be too shocked. The original concept behind RSS has been under threat for a while: in fact, if you Google "War on RSS" you'll find a well-established idea that companies with a powerful influence on the way we use the internet have been deprecating RSS for some time.

Perhaps the most interesting of these contributions comes from @vambenepe, who wrote The war on RSS in February last year. It's a good overview of the way RSS reading features are going missing in systems we use to access the internet and contains this worrying quote:

Google has done a lot for RSS, but as a result it has put itself in position to kill it, either accidentally or on purpose. [...snip...] [... If] Google closed Reader, would RSS survive? Doubtful.

This particular commentator is interesting because since writing this he has moved on to become "Product Manager on Google Cloud Platform". Don't expect a follow up article but he did tweet yesterday:

"1 year ago, I asked: "If Google closed Reader, would RSS survive?" http://stage.vambenepe.com/archives/1932 We'll now find out but I won't be able to comment."

One of the takeaways here is that we're not just talking about RSS specifically. When we say RSS we can include Atom and readers of this blog will know that I'm a fan of Atom and the emerging OData standard that is based upon it. But let's not get carried away. This war is not on the protocol but on the use of RSS as a way of end users discovering content on the internet. The emergence of OData (based on the Atom Publishing Protocol, not the read-only RSS) as a protocol that sits between the web app and the data source is likely to get even stronger.

Even HTTP has changed. This blog post uses HTTP in an old-fashioned way. I'm writing an article, inserting anchors that form hypertext links to other resources on the internet. I'm banking on the idea that these resources won't go away and that this article will join a persistent web of information. If you're reading this you're probably thinking, duh, that's what the internet is. In the early days this was true but the internet is no longer like this for the majority of users. HTTP sits as a protocol behind the web apps we use to check Twitter, Facebook and iTunes but the concept behind the way most people consume information on the internet bears no relation to the classic hypertext visions we used to cite when we were all researchers working in universities in the early 90s.

Go back and read the seminal As we may think or review the goals of Ted Nelson's Xanadu Project and you won't recognise the origins of iTunes, on-demand TV, micro-blogging or ad-supported social networks. From a UK point of view, we didn't even have commercial broadcast television until 1955 (when ITV was launched) which is 10 years after As we may think was published. The existence of these modern uses of the internet do not preclude the research use envisaged by these information scientists, it just relegates it to a niche.

The problem for people like you and me, who occupy this niche, is that the divergence of consumer internet technology from the original research-oriented web is eventually going to make it more expensive. There's no law that says that Google has to provide an RSS reading tool for free (or a blogging service, for that matter). In fact, the withdrawal of this service may actually provide a shot in the arm for the makers of RSS readers, who have been starved by people like me using the freebie Google Reader instead of their more tailored offerings. Yes, I would be prepared to pay to have something like Google Reader that stays in sync across my tablet, phone and laptop.

Ad, ad, ad...

While I'm on the subject of money, I do want to draw your attention to Xanadu's rule 9:

Every document can contain a royalty mechanism at any desired degree of granularity to ensure payment on any portion accessed, including virtual copies ("transclusions") of all or part of the document.

I really think it is time that technology providers started to look again at this goal. In the early days of the internet this was considered unrealistic. In fact, I remember sitting through meetings in which people responsible for creating the infrastructure that made the internet possible were highly doubtful that traffic accounting would ever be possible: the growth in internet traffic would always outpace the ability of switching gear and routers to count bits and report on usage. That prediction turned out to be wrong. I think they underestimated the strength of the business case behind bit-counting, which is now routine on mobile platforms. My cheap router counts my own internet usage and I know my service provider has real-time stats too, if only to enforce their acceptable usage policy.

Charging based on the consumption of bits has attracted a lot of haters and this, in my opinion, has distorted the business models available to service providers towards ad-based services and away from Xanadu-like micro-payments.

Most of the rhetoric about the demise of Google Reader is written from the point of view of the consumer, not the information publisher. Of course I want to consume content for free, using free technology, over an unlimited internet connection. But none of these things are really free. We've all heard the adage that if something is free then you're the product. As an RSS consumer, I've just seen my costs outstrip my marketable value to Google. I'm not a cash cow anymore; I'm a dog.

From Reader to Blogger

But as I type, I'm not just consuming the content I used to research this article; I'm also publishing content of my own, at the moment for free. I don't want to enable ads on this blog, but the technology doesn't yet make it easy for me, or anyone between me and you, to collect revenue and experiment with pricing. It's more complicated than you might think.

Rule 8 of Xanadu reads "Permission to link to a document is explicitly granted by the act of publication." Early internet sites seriously considered violating this principle. Content providers considered themselves to be so valuable that someone creating a site that aggregated links to their gems was somehow cheating the system. This has been turned completely on its head now: these days information providers are hungry for links and, when those links result in product sales, they are prepared to pay real money to the aggregator. This is the basis on which all the market comparison sites are run.

If content publishers got revenue from people viewing their materials (Xanadu style) then linking to someone's content becomes a valuable lead. How would payments trickle back to the owner of the <a> tag?

We know that the ad model works. YouTube generates huge revenues for people like PSY. But those of us outside the mainstream who occupy this niche, typified by users of Google Reader, need another way to solve the money problem. Perhaps the new technology that emerges to take the place of Reader will come up with a creative way to address this issue. Especially if they start getting paid by their users.

2013-01-30

OData: Open for Comments at OASIS

Browsing around this morning I noticed that on Friday (25th Jan) there was a test posting to a new mailing list set up by the OASIS technical committee that is taking forward the OData specification.

To recap, OData is a specification that extends the popular Atom Publishing Protocol (APP) with conventions that make it easy to expose data sources (think relational databases) in a standard way. OData has been driven by Microsoft and is now at version 3, but it is making the transition to a work item at OASIS, where it seems likely that a more open specification process will be observed.

I've written about OData before but the best way to play with it is to look at some sample feeds; the Netflix database is the one I tend to use for my examples because the data is real and widely understood.

With the work now at a more formal standards body I hope that some of the rough edges of the existing specification can be knocked off. This type of thing is important if OData is to make the transition from a specification that works well when you have client and server libraries from the same vendor to one that is truly interoperable.

For example, the current specification makes a mess of defining the simple concept of a string literal parsed from a URL. As a result, it is impossible to construct a conforming URI that will get you information about an actor like Peter O'Toole. Here's the URL a naive user might construct:

http://odata.netflix.com/catalog/People?$filter=Name%20eq%20'Peter%20O'Toole'

Notice that the single-quote character in O'Toole terminates the literal and, sure enough, Netflix returns an error.

Syntax error at position 22.

In fact, there is an undocumented way to get around the problem, using the SQL convention of doubling the quote character:

http://odata.netflix.com/catalog/People?$filter=Name%20eq%20'Peter%20O''Toole'
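If you're constructing these URLs programmatically it's worth wrapping the convention up in a helper rather than trying to remember it each time. Here's a minimal sketch in Python (the odata_literal name is my own invention, not part of any library):

import urllib.parse

def odata_literal(value):
    # OData string literals follow the SQL convention: embedded
    # single quotes are escaped by doubling them
    return "'" + value.replace("'", "''") + "'"

base = "http://odata.netflix.com/catalog/People"
expr = "Name eq " + odata_literal("Peter O'Toole")
# quote() percent-encodes the spaces but leaves the quotes alone
print(base + "?$filter=" + urllib.parse.quote(expr, safe="'"))
# http://odata.netflix.com/catalog/People?$filter=Name%20eq%20'Peter%20O''Toole'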

I've posted a comment to highlight this issue to the new OData comment list; let's see what happens! It's a public forum so anyone can join, though the work of the technical committee itself is behind closed doors (OASIS is a subscription-based membership organization).

I'm a fan of what the basic OData specification is trying to do so getting things like this fixed is important. Just looking at the XML file you get back from the above URI immediately opens up the wonderful world of linked data, giving me relative links like People(69540)/TitlesActedIn from which you can see details of all the films Peter O'Toole has acted in. Don't like XML? Just add ?$format=json to the URL and you can consume the list directly into your web-page.
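If JSON is your thing, consuming that feed from Python is just as easy. A rough sketch using the requests library; note that the shape of the payload varies with the OData version (version 1 services return the list directly under 'd', version 2 wraps it in 'results'), so I hedge my bets here:

import requests

url = "http://odata.netflix.com/catalog/People(69540)/TitlesActedIn"
d = requests.get(url, params={"$format": "json"}).json()["d"]
# v1-style payloads put the list directly under "d";
# v2-style payloads wrap it in a "results" member
titles = d["results"] if isinstance(d, dict) else d
for title in titles:
    print(title["Name"])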

Last year I gave a lightning talk at a CETIS event in which I encouraged people who were creating REST-based protocols as part of their technical standards development process to have a really close look at OData. Building new specifications on existing protocols can dramatically save time when drafting and make it much easier for people to implement afterwards. And even if OData is not for you, if your application is a good fit for a REST-based approach, why not just use APP as it is? Forget the additional complications of things like WADL; you don't need them. What's more, if you use APP then you can take advantage of existing implementations in web browsers to provide basic, easy-to-consume views of your data.

2012-06-15

Viewing OData $metadata with XSLT

In a recent post I talked about work I've been doing on OData. See Atom, OData and Binary Blobs for a primer on Atom and some examples of OData feeds.

In a nutshell, OData adds a definition language for database-like Entities (think SQL tables) to Atom and provides conventions for representing their properties (think SQL columns) in XML. The definition language is based on ADO.NET; yes, they could have chosen different bindings, which would have made more work for their engineers, less work for the rest of us, and improved the chances of widespread adoption. But it is what it is (a phrase I seem to be hearing a lot recently).

One of the OData-defined conventions is that data services can publish a metadata document which describes the entities that you are likely to encounter in the Atom feeds it publishes. This can be useful if you want to POST a new entry and you don't know what type of data property X is supposed to have. To get the metadata document you just GET a special URL; for the sample data service published by Microsoft it is:

http://services.odata.org/OData/OData.svc/$metadata

To see what this might look like in the real world you can also look at the $metadata document published as part of the Netflix OData service I used as the source of my examples last time.

http://odata.netflix.com/v2/Catalog/$metadata

Wouldn't it be nice to have a simple documented form of this information? The schema on which it is based even allows annotations and Documentation elements. I browsed the web a bit but couldn't find anyone who had done this, so I wrote a little XSLT myself. Here is a picture of part of the output from transforming the Netflix feed.

Now there is an issue here. One of the things I've commented on before is the annoying habit specification writers have of changing the namespace they use when a new version is published. I can see why some people might do this, but when 90% of the spec is the same it just causes frustration: tools that look for one namespace need significant revisions just to cope with the minor addition of some optional elements.

As a result, I've published two versions of the XSLT that I used to create the above picture. If somebody out there knows how I can do all of this in one XSL file without lots of copying and pasting I'd love to know.

The first XSLT uses the namespace from CSDL 1.1 and can be used to transform the metadata from the sample OData service published by Microsoft. The second XSLT uses the namespace from CSDL 2.0 and must be used to transform the metadata from Netflix. If you get a blank HTML file with just a heading, try the other one. When clicking on these links your browser may attempt to render them as HTML; use "View Source" to see the XSLT as plain text.

Here is how I've used these files on my Mac:

$ curl http://services.odata.org/OData/OData.svc/\$metadata > sample.xml
$ xsltproc odata2html-v1p1.xsl sample.xml > sample.html

$ curl http://odata.netflix.com/v2/Catalog/\$metadata > netflix.xml
$ xsltproc odata2html-v2.xsl netflix.xml > netflix.html

This transform could obviously be improved a lot; it only shows you the Entities, Complex Types and Associations, though I am rather proud of the hyperlinking between them.
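In the meantime, a small wrapper script can at least hide the two-stylesheet annoyance by sniffing the Schema namespace and picking the right XSLT automatically. Here's a sketch using lxml, with the stylesheet file names as above; the CSDL namespace URIs are quoted from memory, so do check them against your own metadata documents:

from lxml import etree

# map the CSDL Schema namespace to the matching stylesheet
STYLESHEETS = {
    "http://schemas.microsoft.com/ado/2007/05/edm": "odata2html-v1p1.xsl",
    "http://schemas.microsoft.com/ado/2008/09/edm": "odata2html-v2.xsl",
}

def metadata_to_html(src_path, dst_path):
    doc = etree.parse(src_path)
    # the Schema element sits inside edmx:Edmx/edmx:DataServices;
    # match it by local name so we don't have to guess the version
    schema = doc.xpath("//*[local-name()='Schema']")[0]
    xslt = etree.XSLT(etree.parse(STYLESHEETS[etree.QName(schema).namespace]))
    with open(dst_path, "wb") as f:
        f.write(etree.tostring(xslt(doc), pretty_print=True))

metadata_to_html("netflix.xml", "netflix.html")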

2012-06-01

Atom, OData and Binary Blobs

I've been doing a lot of work on Atom and OData recently. I'm a real fan of Atom and the related Atom Publishing Protocol (APP for short). OData is a specification from Microsoft which builds on these two basic building blocks of the internet to provide standard conventions for querying feeds and representing properties using a SQL-like model.

Given that OData can be used to expose data currently residing in SQL databases, it is not surprising that the issue of binary blobs takes a little research to figure out. At first sight it isn't obvious how OData deals with them; in fact, it isn't even obvious how APP deals with them!

Atom Primer

Most of us are familiar with the idea of an RSS feed for following news articles and blogs like this one (this article prompted me to add the gadget to my blogger templates to make it easier to subscribe). Atom is a slightly more formal definition of the same concept and is available as an option for subscribing to this blog too. Understanding the origins of Atom helps when trying to understand the Atom data model, especially if you are coming to Atom from a SQL/OData point of view.

Atom is all about feeds (lists) of entries. The data you want, be it a news article, blog post or a row in your database table is an entry. A feed might be everything, such as all the articles in your blog or all the rows in your database table, or it may be a filtered subset such as all the articles in your blog with a particular tag or all the rows in your table that match a certain query.

Atom adheres closely to the REST-based service concept. Each entry has its own unique URI. Feeds also have their own URIs. For example, the Atom feed URL for this blog is:

http://swl10.blogspot.com/feeds/posts/default

But if you are only interested in the Python language then you might want to use a different feed:

http://swl10.blogspot.com/feeds/posts/default/-/Python

Obviously the first feed contains all the entries in the second feed too!

Atom is XML-based, so an entry is represented by an <entry> element and the content of an entry is represented by a <content> child element. Here's an abbreviated example from this blog's Atom feed. Note that the Atom-defined metadata elements appear as siblings of the content...

<entry>
  <id>tag:blogger.com,1999:blog-8659912959976079554.post-4875480159917130568</id>
  <published>2011-07-17T16:00:00.000+01:00</published>
  <updated>2011-07-17T16:00:06.090+01:00</updated>
  <category scheme="http://www.blogger.com/atom/ns#" term="QTI"/>
  <category scheme="http://www.blogger.com/atom/ns#" term="Python"/>
  <title type="text">Using gencodec to make a custom character mapping</title>
  <content type="html">One of the problems I face...</content>
  <link rel="edit" type="application/atom+xml"
    href="http://www.blogger.com/feeds/8659912959976079554/posts/default/4875480159917130568"/>
  <link rel="self" type="application/atom+xml"
    href="http://www.blogger.com/feeds/8659912959976079554/posts/default/4875480159917130568"/>
</entry>

For blog articles, this content is typically HTML text (yes, horribly escaped to allow it to pass through XML parsers). Atom actually defines three types of native content: 'html', 'text' and 'xhtml'. It also allows the content element to contain a single child element corresponding to other XML media types. OData uses this method to represent the property name/value pairs that might correspond to the column names and values for a row in the database table you are exposing.

Here's another abbreviated example taken from the Netflix OData People feed:

<entry>
  <id>http://odata.netflix.com/v2/Catalog/People(189)</id>
  <title type="text">Bruce Abbott</title>
  <updated>2012-06-01T07:55:17Z</updated>
  <category term="Netflix.Catalog.v2.Person" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
  <content type="application/xml">
    <m:properties>
   <d:Id m:type="Edm.Int32">189</d:Id>
   <d:Name>Bruce Abbott</d:Name>
    </m:properties>
  </content>
</entry>

Notice the application/xml content type and the single properties element from Microsoft's metadata schema.

Any other type of content is considered to be external media. But Atom can still describe it, it can still associate metadata with it and it can still organize it into feeds...

Binary Blobs as Media

There is nothing stopping the content of an entry from being a non-text binary blob of data. You just change the type attribute to your favourite blob format and either add a src attribute to point to an external file or base-64 encode the data and include it in the entry itself (this second method is rarely used, I think).

Obviously the URL of the entry (the XML document containing the <entry> tag) is not the same as the URL of the media resource, but they are closely related. The entry is referred to as a media link entry because it contains the metadata about the media file (such as the title, updated date, etc.) and it links to it. The media file itself is known as a media resource.

There's a problem with OData though. OData requires the child of the content element to be the properties element (see example above) and the type attribute to be application/xml. But Atom says there can only be one content element per entry. So how can OData be used for binary blobs?

The answer is a bit of a hack. When the entry is a media link entry the properties move into the metadata area of the entry. Here's another abbreviated example from Netflix which illustrates the technique:

<entry>
  <id>http://odata.netflix.com/v2/Catalog/Titles('13aly')</id>
  <title type="text">Red Hot Chili Peppers: Funky Monks</title>
  <summary type="html">Lead singer Anthony Kiedis...</summary>
  <updated>2012-01-31T09:45:16Z</updated>
  <author>
    <name />
  </author>
  <category term="Netflix.Catalog.v2.Title"
    scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
  <content type="image/jpeg" src="http://cdn-0.nflximg.com/en_us/boxshots/large/5632678.jpg" />
  <m:properties xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
    xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">
    <d:Id>13aly</d:Id>
    <d:Name>Red Hot Chili Peppers: Funky Monks</d:Name>
    <d:ShortName>Red Hot Chili Peppers: Funky Monks</d:ShortName>
    <d:Synopsis>Lead singer Anthony Kiedis...</d:Synopsis>
    <d:ReleaseYear m:type="Edm.Int32">1991</d:ReleaseYear>
    <d:Url>http://www.netflix.com/Movie/Red_Hot_Chili_Peppers_Funky_Monks/5632678</d:Url>
    <!-- more properties.... -->
  </m:properties>
</entry>

This entry is taken from the Titles feed; notice that the entry is a media link to the large box graphic for the film.

Binary Blobs and APP

APP adds a protocol for publishing information to Atom feeds, and OData builds on APP to allow data feeds to be writable, not just read-only streams. You can't upload your own titles to Netflix as far as I know, so I don't have a live example here. The details are all in section 9.6 of RFC 5023 but, in a nutshell, if you POST a binary blob to a feed the server should store the blob and create a media link entry that points to it (populated with a minimal set of metadata). Once created, you can update the metadata directly with HTTP's PUT method on the media link entry's edit URI, or update the binary blob itself by using HTTP's PUT method on the edit-media URI of the media resource. (These links are given in the <link> elements of the entries; see the first example above.)
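I don't have a writable service to demonstrate with, so here's a hedged sketch of what that exchange might look like from Python using the requests library; the collection URI and credentials are made up:

import requests

COLLECTION = "https://example.com/app/media"  # hypothetical APP collection
auth = ("user", "password")

# POST the blob; per RFC 5023 the server stores it as a media resource,
# creates a media link entry and returns that entry's URI in Location
with open("boxshot.jpg", "rb") as f:
    response = requests.post(
        COLLECTION, data=f, auth=auth,
        headers={"Content-Type": "image/jpeg", "Slug": "boxshot"})
response.raise_for_status()
entry_uri = response.headers["Location"]

# update the metadata by PUTting a revised entry to the entry's edit
# URI, or replace the blob by PUTting new bytes to the edit-media URI
# found in the entry's <link rel="edit-media"> element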

There is no reason why binary blobs can't be XML files of course. Many of the technical standards for education that I work with are very data-centric. They define the format of XML documents such as QTI, which are designed to be opaque to management systems like item banks (an item bank is essentially a special-purpose content management system for questions used in assessment).

So publishing feeds using OData or APP from an item bank would most likely use these techniques for making the underlying content available to third party systems. Questions often contain media resources (e.g., images) of course but even the question content itself is typically marked up using XML, as it is in QTI. This data is not easy to represent as a simple list of property values and would typically be stored as a blob in a database or as a file in a repository. Therefore, it is probably better to think of this data as a media resource when exposing it via APP/OData.