Accessing the ESA Sentinel Mission Data with Python and OData

I've had a couple of enquiries now about how to access the OData feeds on the ESA Sentinel mission science data hub. Sentinel 1 is the first of a new group of satellites in the Copernicus programme to monitor the Earth. That's about all I know, I'm afraid. This data is not pretty desktop pictures (though doubtless there are some pretty pictures buried in there somewhere) but raw scientific data from instruments currently orbiting the Earth.

The source code described here is available in the samples directory on GitHub; you must be using the latest Pyslet from master because this script relies on the metadata override technique described here.

The data hub advertises access to the data through OData (version 1) but my Python library, Pyslet, was not able to access the feeds properly: hence the enquiries.

Turns out that the data feeds use a concept called containment in OData. The model of OData is one of entity sets (think SQL tables) with relations between them modelled by navigation properties. There's one particular use case that this model doesn't handle very well but which seems popular: given an entity (think table row or record), people want to add arbitrary key-value pairs. The ESA's data model does this by creating 'sub-tables' which define collections of attributes that hang off each entity, with the attribute name acting as the key in these collections. This doesn't really work in OData v1 (or v2) because those attribute values should still be entities in their own right and therefore need a unique key and an entity set definition to contain them.

This isn't the only schema I've seen that attempts to do something like this either; SAP have published a similar schema, suggesting that some early Java tools exposed OData this way.

The upshot is that you get nasty errors when you try and load these services with Pyslet. It complains of a rather obscure condition concerning (possibly multiple) unbound principals. When I wrote that error message, I didn't expect anyone to ever actually see it.

There's a proper way to do containment in earlier versions of OData, described in Containment is Coming with OData v4, which explains how to use composite keys. As the name of the article suggests though, it is written with hindsight, after a better solution had been found for this use case in OData v4.

The fix for the ESA data feed is to download and edit a copy of the advertised metadata to get around the errors reported by Pyslet and then to initialise your OData client using this modified schema instead. It isn't a perfect fix: as far as Pyslet knows those attributes really are unique and do reside in their own entity set, but it doesn't really matter for the purposes of using the OData client. You can navigate and formulate queries without tripping over data inconsistencies.
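
In outline, the override looks something like this. This is a sketch only: the metadata file name is made up and passing a pre-parsed metadata document to LoadService is the override hook I'm assuming here; the sample script in the repository is the definitive version.

import pyslet.odata2.metadata as edmx
from pyslet.odata2.client import Client

# parse the locally fixed-up copy of the advertised metadata
doc = edmx.Document()
with open('scihub_metadata_fixed.xml', 'rb') as f:  # hypothetical file name
    doc.Read(f)

# initialise the client from the modified schema instead of letting it
# read the problematic $metadata document from the service itself
client = Client()
client.LoadService('https://scihub.esa.int/dhus/odata/v1/', metadata=doc)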

I've written a little script that I've added to Pyslet's sample code directory to illustrate the technique, along with a fixed up metadata file. The result is a little UNIX-style utility for downloading products from the ESA data hub:

$ ./download.py --help
Usage: download.py [options]

  -h, --help            show this help message and exit
  -u USER, --user=USER  user name for basic auth credentials
  -p PASSWORD, --password=PASSWORD
                        password for basic auth credentials
  -v                    increase verbosity of output up to 3x
  -c, --cert            download and trust site certificate

The data is available via https and requires a user name and password (you'll have to register on the data hub site but it's free to do so). To make it easier to set up the trust aspect I've added a -c option to download the site certificate and store it. If you don't have the site certificate you'll get an error like this:

ERROR:root:scihub.esa.int: closing connection after error failed to build secure connection to scihub.esa.int

Subsequent downloads verify that the site certificate hasn't changed: a bit like the way ssh offers to store a fingerprint the first time you connect to a remote host. Only use the -c option if you trust the network you are running on (you can use Firefox or some other 'trusted' browser to download the certificate too of course).

The password is optional; if you don't provide it you'll be prompted to enter it using Python's getpass function for privacy.

You pass the product identifiers as command line arguments, here is an example of a successful first-time run:

$ ./download.py -c -u swl10 8bf64ff9-f310-4027-b31f-8e95dd9bbf82
ERROR:root:Entity set Attributes has more than one unbound principal
dropping multiplicity of Attribute_Node to 0..1.  Continuing
ERROR:root:Entity set Attributes has more than one unbound principal
dropping multiplicity of Attribute_Product to 0..1.  Continuing
S1A_EW_GRDM_1SDH_20150207T084156_20150207T084218_004515_0058AE_3051 150751068

After running this command I had a scihub.esa.int.crt file (from the -c option) and a 150MB zip file downloaded to the current directory.

If you run with -vv to provide a bit more information you can see the OData magic in operation:

./download.py -vv -u swl10 8bf64ff9-f310-4027-b31f-8e95dd9bbf82
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/ HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 401
INFO:root:Resending request to: https://scihub.esa.int/dhus/odata/v1/
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/ HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 200
WARNING:root:Entity set Attributes has an unbound principal: Nodes
WARNING:root:Entity set Attributes has an unbound principal: Products
ERROR:root:Entity set Attributes has more than one unbound principal
dropping multiplicity of Attribute_Node to 0..1.  Continuing
ERROR:root:Entity set Attributes has more than one unbound principal
dropping multiplicity of Attribute_Product to 0..1.  Continuing
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/Products('8bf64ff9-f310-4027-b31f-8e95dd9bbf82') HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 200
S1A_EW_GRDM_1SDH_20150207T084156_20150207T084218_004515_0058AE_3051 150751068
INFO:root:Sending request to scihub.esa.int
INFO:root:GET /dhus/odata/v1/Products('8bf64ff9-f310-4027-b31f-8e95dd9bbf82')/$value HTTP/1.1
INFO:root:Connected to scihub.esa.int with DHE-RSA-AES256-SHA, TLSv1/SSLv3, key length 256
INFO:root:Finished Response, status 200

As you can see, the fixed up metadata still generates error messages but these are no longer critical and the client is able to interact with the service.

I was given this product identifier as an example of something small to test with. I haven't researched what the data actually represents but the resulting zip file does contain a 'quick_look' image.


Yosemite Spotlight issues with HP drivers: check your console

I recently imported a bunch of email into Outlook for OS X and was disappointed that I was unable to search its contents. Outlook uses Apple's native Spotlight search so, in theory, all I needed to do was wait for Spotlight to churn through the new material and I should be done. Hours passed and nothing seemed to happen.

The first thing I tried was simply forcing a re-index of my hard disk. I just added my main drive to the list of places to exclude from Spotlight searching and then, after waiting a minute or two (for superstitious reasons), I removed that item again and sat back waiting for the inevitable slowdown as mdworker launches into life and starts scanning all my data.


The next step I took was to try to figure out how to see if Spotlight was actually doing any indexing at all. There's no simple control panel or dashboard view of the indexing process. The only way I could find was to press command-space and search for something. It should show the indexing progress-bar if it is indexing (but if it is complete you'll see nothing). I was still getting nowhere and now I'd lost the ability to search for anything.

Check the Console

In these situations it is always worth checking the console. I don't mean the terminal, just the Console utility app that spools system messages onto your screen and allows you to see what is happening on your Mac. There's a handy search box (which doesn't use Spotlight!) at the top which filters the current day's logs. Putting just 'md' in that box was enough to filter out all the other stuff, enabling me to see a constant stream of output from Spotlight's indexing application: mdworker.

Here's a sample:

Jan 12 23:06:45 LernaBookPro com.apple.xpc.launchd[1] (com.apple.mdworker.bundles[24329]): Could not find uid associated with service: 0: Undefined error: 0 1422
Jan 12 23:06:45 LernaBookPro com.apple.xpc.launchd[1] (com.apple.mdworker.bundles): Service only ran for 0 seconds. Pushing respawn out by 10 seconds.

You don't need to be an expert to see that this is some type of unexpected condition, and the second line tells me that the resulting process exited straight away. True to its word, every 10 seconds I got a pair of lines like this in my console log. Interestingly, the problem had been going on for some time: when I searched back through my logs I realised that Spotlight had probably not been indexing properly for ages. I have a feeling that things like new mail arriving in your inbox get indexed through some other mechanism, and this can mask the fact that the long-running indexer has not made a proper index of your hard disk. But that's just a hunch.

Check the web

Armed with this information I was able to find a thread on the internet that clearly dealt with the problem I was having: [...] com.apple.mdworker.bundles pollute logs with errors. Clearly the person who posted this thread didn't think (or perhaps realise) that Spotlight had actually failed and was chasing up unexpected slowness.

Interestingly, from this thread it is clear that the last number in the log entry is a numeric user id. In the case of the poster this was 502, which is typical of the range Apple uses for real users you add to your machine from the control panel (I think they start at 501). If you delete a user from your machine but leave lots of data lying around that is owned by that user then Yosemite seems to have trouble and kills the Spotlight indexer.

The user id causing trouble for me is 1422, which is outside this range, so although the remedy might be similar the origin of my problem is different. I put 1422 into my search and found this thread: 27 inch iMac suddenly running very slowly, where the person erases their disk and re-installs (yeah, that fixed it!).

Now use the Terminal

With the clues in the first thread it seems like I need to find some files on my disk owned by user id 1422. The terminal can do that using the Unix find command:

cd /
sudo find . -user 1422

This type of thing takes ages, especially if you have a large backup drive.

The Culprit

Turns out that the files owned by user 1422 were all part of the HP printer drivers. They may have been inherited from a previous installation, I'm not sure, but either way there was information from HP in /Library, and inside the application bundles themselves, that was identified this way. I had to use chown -R to change those files to root ownership instead (I'm not sure what they are supposed to be).

You'll also find a few files in /private/var/ with paths similar to:


The exact names will be different but the critical thing is the end part, which is a directory created just for Spotlight. It is named after the missing user and it has its ownership set to this non-existent user. It is actually these latter files which cause the problem: Spotlight tries to create an index for that user and is surprised when it finds the user doesn't exist. Just removing this directory isn't enough though, because Spotlight will re-index your disk and, as soon as it finds a file owned by 1422 again, it will recreate this folder in /private/var and grind to a halt again. You must remove or re-own everything that Spotlight might see: a real hassle if you have a large backup drive, because of the way Time Machine works. I solved that problem by just excluding my backup drive from Spotlight.

FWIW, unix systems are usually very tolerant of non-existent user ids. Many archive programs will restore files from other machines and systems and, if run as root, will update their ownership to match the original ownership before the files were archived. On networks of Unix workstations that share a user directory this is useful because you can tar up files on one machine and transfer them to another and all the ownership information comes across too. On a personal machine this is less useful and perhaps even dangerous, hence the '--insecure' option on tar.

I considered removing and reinstalling the HP software but I'm not convinced that it isn't a problem with the installer itself. It works fine on the machine I upgraded from 10.8, through 10.9, to 10.10 (after these fixes). However, on a different machine that came with Mavericks and was upgraded to Yosemite, I had to re-install the driver even though I used the migration assistant to set it up from its predecessor (running 10.8), which did have the HP drivers installed. I struggled to find the right download for my HP Officejet 6310 and perhaps now I know why!


I couldn't find any way of telling mdworker to give up on user id 1422. Removing its special directory didn't seem to help, so I assume that somewhere a process has a cache of that information; the only way I could figure to get it going again was a restart.


If you are using Yosemite and have an HP printer or scanner check your console just in case Spotlight has died for you too. Do battle with the terminal. Restart. Enjoy Spotlight indexing and smoother performance from your Mac.


LTI Tools: Putting Cookies in the Frame

Last year I ran into a problem with IMS LTI in some versions of Internet Explorer. It turns out that my LTI tool was assuming it was OK to save cookies in the browser but IE was assuming it wasn't. Chuck has written about this briefly on his blog and the advice given there is basically now best practice and documented in the specification itself. See LTI, Frames and Cookies – Oh MY!

So I just spent the day trying to implement something like this, ready for my QTI migration tool to become an LTI tool rather than a desktop application, and it was much harder than I thought it would be.

Firstly, why is this happening? Well, if your tool is launched in a frame inside the tool consumer then you are probably falling into the definition of 'third party' as far as cookies are concerned. There is clearly a balance to be struck between privacy and utility here. If you think about it, framed pages probably know who framed them, so something innocuous like an advert for an online store could use cookies to piece together a browser history of every site you visited that included their ad. That's why, when you visit your favourite online store after browsing some informational site to do product research, the store seems to already know what you want to buy.

Cynics will see a battle between companies like Google and Amazon who make money by encouraging you to find products online (and buy them) and Microsoft who are more into the business of selling software and services, especially to businesses who may not appreciate having their purchasing department gamed. Perhaps it is no wonder that Amazon found themselves in court in Seattle having to defend their technical implementation of P3P.

It turns out that IE can/could be persuaded to accept your tool's cookie if your tool publishes a privacy policy which IE isn't able to parse. I don't want to appear too dismissive but I simply cannot understand how decisions about privacy can be codified into policies that computers can then read and execute with any degree of usefulness. Stackoverflow has some advice on how to do it 'properly' if you are that way inclined, however.

For the rest of us, we're going to have to get used to the idea that cookies may not be accepted and follow Chuck's advice to work around the issue by opening a new window. It isn't just IE now anyway; by default Safari will block cookies in this situation too. Yes, the solution is horrible, but how horrible?

Some time ago I wrote some Python classes to help implement basic LTI. That felt like a good starting point for my new LTI tool. Here's Safari, pointing at the IMS LTI test harness which I've just primed with the URL of my tool: http://localhost:8081/launch

In the next shot, I've scrolled down to where the action is going to take place. I've opened Safari's developer tools so we can see something of what is happening with cookies.

So what happens when I hit "Press to Launch"? It all happens rather quickly but the frame POSTs to my launch page, which then redirects to a cookie test page. That page realises that the cookies didn't stick and shows a form which auto-submits, popping up a new window with my tool content in it. The content is not very interesting but here's how my browser looked immediately after.

There's a real risk that this window will be blocked by a pop-up blocker. Indeed, Chrome does seem to block this type of pop-up window. In fact, Chrome allows the original frame to use cookies, so on that browser the pop-up doesn't need to fire; I only tripped the blocker during simulated tests. Still, it is clear that you need to consider that LTI tools may neither have access to cookies nor be able to pop up a new window automatically. To cover this case my app has a button that allows you to open the window manually instead. If I change tabs back to the Tool Consumer page you'll see how my pop-up sequence has left the original page/frame.

Notice that Safari now shows my two cookies. These cookies were set by the new window, not by the content of this frame, even though they show up in this view. They show me that my tool has not only displayed its content but has successfully created a cookie-trackable session.

The code required to make this happen is more complex than I thought. I've tried to represent my session construction logic in diagrammatic form. Each box represents a request. The top line contains the incoming cookie values ('-' means missing) and the page name. The next line indicates the values of some key query parameters used to pass Session information, Return URL and whether or not a pop-up Window has been opened. The final line shows the out-going cookies. If you follow the right hand path from top to bottom you see a hit to the home page of the tool (aka launch page) with no cookies. It sets the values 0/A and redirects to the test page which confirms that cookies work, outputs a new session identifier (B) and redirects to the content. The left-turns in the sequence show the paths taken when cookies that were set are blocked by the browser.
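
In code, the heart of that cookie test might look something like this. This is a much-simplified sketch, not the code from Pyslet: the URLs, parameter names and page structure are all made up, and a real implementation must validate and escape the session identifier before echoing it back.

import urlparse
from Cookie import SimpleCookie

AUTO_POST = """<html><body onload="document.forms[0].submit();">
<form action="/content" method="POST" target="_blank">
<input type="hidden" name="sid" value="%s"/>
<input type="submit" value="Press this button to launch the tool"/>
</form></body></html>"""

def cookie_test(environ, start_response):
    # the launch page set a test cookie and redirected here, passing
    # the session id through the query string as a fallback
    query = urlparse.parse_qs(environ.get('QUERY_STRING', ''))
    sid = query.get('sid', [''])[0]
    cookies = SimpleCookie(environ.get('HTTP_COOKIE', ''))
    if 'sid' in cookies:
        # the cookie stuck: we can stay in the frame
        start_response('303 See Other', [('Location', '/content')])
        return []
    # cookies are blocked in the frame: auto-submit a form that opens
    # the content in a new, top-level window where cookies are allowed
    start_response('200 OK', [('Content-Type', 'text/html')])
    return [AUTO_POST % sid]

The submit button in the form doubles as the manual fallback for pop-up blockers mentioned above.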

Why does the session get rewritten from value A to B during the sequence? Just a bit of paranoia. Exposing session identifiers in URLs is considered bad form because they can be easily leaked. Having taken the risk of passing the initial session identifier through the query string we regenerate it to prevent any form of session hijacking or fixation attacks. This is quite a complex issue and is related to other weaknesses such as cross-site request forgery. I found Cross-Site Request Forgery (CSRF) Prevention Cheat Sheet very useful reading!

I will check the code for this into my Pyslet GitHub project shortly; if you want a preview please get in touch or comment on this post. The code is too complex to show in the blog post itself.


Basic Authentication, SSL and Pyslet's HTTP/OData client

Pyslet is my Python package for Standards in Learning, Education and Training and represents a packaging up of the core of my QTI migration script code in a form that makes it easier for other developers to use. Earlier this year I released Pyslet to PyPi and moved development to Github to make it easier for people to download, install and engage with the source code.

Warning: The code in this article will work with the latest Pyslet master from Github, and with any distribution on or later than pyslet-0.5.20141113. At the time of writing the version on PyPi has not been updated!

A recent issue that came up concerns Pyslet's HTTP client. The client is the base class for Pyslet's OData client. In my own work I often use this client to access OData feeds protected with HTTP's basic authentication but I've never properly documented how to do it. There are two approaches...

The simplest way, and the way I used to do it, is to override the client object itself and add the Authorization header at the point where each request is queued.

from pyslet.http.client import Client

class MyAuthenticatedClient(Client):

    # add an __init__ method to set some credentials 
    # in the client

    def queue_request(self, request):
        # add in the authorization credentials before queueing
        if (self.credentials is not None and
                not request.has_header("Authorization")):
            request.set_header("Authorization", str(self.credentials))
        super(MyAuthenticatedClient, self).queue_request(request)

This works OK but it forces the issue a bit and will result in the credentials being sent to all URLs, which you may not want. The credentials object should be an instance of pyslet.http.auth.BasicCredentials which takes care of correctly formatting the header. Here is some sample code to create that object:

from pyslet.http.auth import BasicCredentials
from pyslet.rfc2396 import URI

credentials = BasicCredentials()
credentials.userid = "user@example.com"
credentials.password = "secretPa$$word"
credentials.protectionSpace = URI.from_octets(
    'https://www.example.com/mypage').get_canonical_root()

With the above code, str(credentials) returns the string: 'Basic dXNlckBleGFtcGxlLmNvbTpzZWNyZXRQYSQkd29yZA==' which is what you'd expect to pass in the Authorization header.

To make this code play more nicely with the HTTP standard I added some core-support to the HTTP client itself, so you don't need to override the class anymore. The HTTP client now has a credential store and an add_credentials method. Once added, the following happens when a 401 response is received:

  1. The client iterates through any received challenges
  2. Each challenge is matched against the stored credentials
  3. If matching credentials are found then an Authorization header is added and the request resent
  4. If the request receives another 401 response the credentials are removed from the store and we go back to (1)

This process terminates when there are no more credentials that match any of the challenges or when a code other than 401 is received.
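
Putting that together, usage looks something like this (a sketch, assuming the credentials object from the example above):

from pyslet.http.client import Client

client = Client()
client.add_credentials(credentials)
# requests that hit a matching challenge will now be retried with an
# Authorization header added automatically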

If the matching credentials are BasicCredentials (and that's the only type Pyslet supports out of the box!), then some additional logic gets activated on success. RFC 2617 says that for basic authentication, a challenge implies that all paths "at or deeper than the depth of the last symbolic element in the path field" fall into the same protection space. Therefore, when credentials are used successfully, Pyslet adds the path to the credentials using BasicCredentials.add_success_path. Next time a request is sent to a URL on the same server with a path that meets this criterion the Authorization header will be added pre-emptively.

If you want to pre-empt the 401 handling completely then you just need to add a suitable path to the credentials before you add them to the client. So if you know your credentials are good for everything in /website/~user/ you could continue the above code like this:

credentials.add_success_path('/website/~user/')

That last slash is really important, if you leave it off it will add everything in '/website/' to your protection space which is probably not what you want.


If you're going to pass basic auth credentials around you really should be using https. Python makes it a bit tricky to use HTTPS and be sure that you are using a trusted connection. Pyslet tries to make this a little bit easier. Here's what I do:

  1. With Firefox, go to the site in question and check that SSL is working properly
  2. Export the certificate from the site in PEM format and save to disk, e.g., www.example.com.crt
  3. Repeat for any other sites you want your Python script to work with
  4. Concatenate the files together and save them to, say, 'certificates.pem'
  5. Pass this file name to the HTTP (or OData) client constructor:

from pyslet.http.client import Client

my_client = Client(ca_certs='certificates.pem')
my_client.add_credentials(credentials)

In this code, I've assumed that the credentials were created as above. To be really sure you are secure here, try grabbing a file from a different site or, even better, generate a self-signed certificate and use that instead. (The master version of Pyslet currently has such a certificate ready made in unittests/data_rfc2616/server.crt). Now pass that file for ca_certs and check that you get SSL errors! If you don't, something is broken and you should proceed with caution, or you may just be on a Mac (see notes in Is Python's SSL module correctly validating certificates... for details). And don't pass None for ca_certs as that tells the ssl module not to check at all!

If you don't like messing around with the certificates, and you are using a machine and network that is pretty trustworthy and from which you would happily do your internet banking then the following can be used to proxy for the browser method:

import ssl, string
import pyslet.rfc2396 as uri

certs = []
for s in ('https://www.example.com', 'https://www.example2.com', ):
    # add other sites to the above tuple as you like
    url = uri.URI.from_octets(s)
    # assumption: https URLs parse to objects exposing the (host, port)
    # address tuple via get_addr()
    certs.append(ssl.get_server_certificate(url.get_addr(),
                 ssl_version=ssl.PROTOCOL_TLSv1))
with open('certificates.pem', 'wb') as f:
    f.write(string.join(certs, ''))
Passing the ssl_version is optional above, but the default setting in many Python installations will use the discredited SSLv3 or worse and your server may refuse to serve you (I know mine does!). Set it to a protocol you trust.

Remember that you'll have to do this every so often because server certificates expire. You can always grab the certificate authority's certificate instead (and thereby trust a whole slew of sites at once) but if you're going that far then there are better recipes for finding and re-using the built-in machine certificate store anyway. The beauty of this method is that you can self-sign a server certificate you trust and connect to it securely with a Python client without having to mess around with certificate authorities at all, provided you can safely courier the certificate from your server to your client that is! If you are one of the growing number of people who think the whole trust thing is broken anyway since Snowden then this may be an attractive option.

With thanks to @bolhovsky on Github for bringing the need for this article to my attention.


VMWare Fusion, Windows 8 and British Keyboards

I've been running VMWare Fusion for years to run a dual monitor setup with one screen Mac and the other screen 'PC'.  I recently upgraded to VMWare Fusion 7 and my Windows 7 virtual machine kept working just fine.  However, I recently had to start from scratch with a new VM to install Windows 8.1.  I realise that there may be upgrade routes via Windows 8 but, after two hours on the phone with Microsoft technical support for an unrelated issue, I personally decided that a clean install would be the way to go.

The problem is, the keyboard just doesn't work properly in the default setup.  It is years since I had to play with these things and it took me a while to figure out how to put mappings in place that make my British keyboard do the right thing.  There may actually be a problem with Fusion 7 if this forum thread is anything to go by.

The symptoms were that the '@' symbol, which is typed with shift-2 on the Mac keyboard, was coming out as a double-quote like it would on a PC keyboard, and at first, try as I might, I could not figure out how to type a backslash at all.

The solution is to go to the VM's settings, duplicate the profile you are using (VMWare created a Windows 8 profile for me) and then add some custom mappings. I found this a bit counterintuitive because you need to start by opening something like Notepad in the VM and experimenting until you find the characters you want, note down the key combination you pressed to get them, and then switch to the profile screen and add a mapping from the keys you want to press to the key combinations you just discovered work. A picture will help...

  1. This is the double quote on my keyboard but without the custom profile it types '@': a simple switch with shift-2 that almost all Mac users are probably already familiar with.
  2. The reverse mapping of the above.
  3. Turns out that the backslash key types a hash/number/sharp sign on the PC, so we map Option-3 to that if, like me, you have become habituated to the Apple way.
  4. Turns out that the backslash is now typed by the section sign (that's the curly thing that looks like a tiny spiral-arm galaxy).
  5. Lastly, you may have to hunt for the '~', but it gets typed when you hit shift-backslash on the PC. On the Mac it appears over the back-quote or back-tick symbol.

Those are the most important ones. Note that shift-3 correctly types the pound-sterling sign '£', so I didn't have to touch that at all. I have no idea how one types a section sign or the combined plus-minus '±' sign on the PC but in my line of work there is very little call for them. I warmly invite readers to add their own suggested mappings as comments to this post!


Adding OData support to Django with Pyslet: First Thoughts

A couple of weeks ago I got an interesting tweet from @d34dl0ck, here it is:

This got me thinking, but as I know very little about Django I had to do a bit of research first. Here's my read-back of what Django's data layer does in the form of a concept mapping from OData to Django. In this table the objects are listed in containment order and the use case of using OData to expose data managed by a Django-based website is assumed. (See below for thoughts on consuming OData in Django as if it were a data source.)

OData Concept: DataServices
Django Concept: The Django website itself: the purpose of OData is to provide access to your application's data-layer through a standard API for machine-to-machine communication rather than through an HTML-based web view for human consumption.
Pyslet Concept: Instance of the DataServices class, typically parsed from a metadata XML file.

OData Concept: Schema
Django Concept: No direct equivalent. In OData, the purpose of the schema is to provide a namespace in which definitions of the other elements take place. In Django this information will be spread around your Python source code in the form of class definitions that support the remaining concepts.
Pyslet Concept: Instance of the Schema class, typically parsed from a metadata XML file.

OData Concept: EntityContainer
Django Concept: The database. An OData service can define multiple containers but there is always a default container - something that corresponds closely with the way Django links to multiple databases. Most OData services probably only define a single container and I would expect that most Django applications use the default database. If you do define custom database routers to map different models to different databases then that information would need to be represented in the corresponding Schema(s).
Pyslet Concept: An EntityContainer is defined by an instance of the EntityContainer class, but this instance is handed to a storage layer during application startup and this storage layer class binds concrete implementations of the data access API to the EntitySets it contains.

OData Concept: EntitySet
Django Concept: Your model class. A model class maps to a table in the Django database. In OData the metadata file contains the information about which container contains an EntitySet, and the EntityType definition in that file contains the actual definitions of the types and field names. In contrast, in Django these are defined using class attributes in the Python code.
Pyslet Concept: Pyslet sticks closely to the OData API here and parses definitions from the metadata file. As a result an EntitySet instance is created that represents this part of the model and it is up to the object responsible for interfacing to the storage layer to provide concrete bindings.

OData Concept: Entity
Django Concept: An instance of a model class.
Pyslet Concept: An instance of the Entity object, typically instantiated by the storage object bound to the EntitySet.

Where do you start?

Step 1: As you can see from the above table, Pyslet depends fairly heavily on the metadata file so a good way to start would be to create a metadata file that corresponds to the parts of your Django data model you want to expose. You have some freedom here but if you are messing about with multiple databases in Django it makes sense to organise these as separate entity containers. You can't create relationships across containers in Pyslet which mirrors the equivalent restriction in Django.

Step 2: You now need to provide a storage object that maps Pyslet's DAL onto the Django DAL. This involves creating a sub-class of the EntityCollection object from Pyslet. To get a feel for the API my suggestion would be to create a class for a specific model initially and then, once this is working, consider how you might use Python's built-in introspection to write a more general object.

To start with, you don't need to do too much. EntityCollection objects are just like dictionaries but you only need to override itervalues and __getitem__ to get some sort of implementation going. There are simple wrappers that will (inefficiently) handle ordering and filtering for you to start with so itervalues can be very simple...

def itervalues(self):
    return self.OrderEntities(
        self.ExpandEntities(
            self.FilterEntities(
                self.entityGenerator())))
All you need to do is write the entityGenerator method (the name is up to you) and yield Entity instances from your Django model. This looks pretty simple in Django: something like Customer.objects.all(), where Customer is the name of a model class, would appear to return all customer instances. You need to yield an Entity object from Pyslet's DAL for each customer instance and populate the property values from the fields of the returned model instance.
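
As a sketch, with a hypothetical Customer model whose property names match the EntityType in the metadata file:

def entityGenerator(self):
    # hypothetical model and property names
    for obj in Customer.objects.all():
        # convert each Django model instance into a Pyslet Entity
        e = self.NewEntity()
        e['CustomerID'].SetFromValue(obj.pk)
        e['CompanyName'].SetFromValue(obj.company_name)
        e.exists = True
        yield e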

Implementing __getitem__ is probably also very easy, especially when you are using simple keys. Something like Customer.objects.get(pk=1), followed by a similar mapping to the above, seems like it would work for implementing basic resource look-up by key. Look at the in-memory collection class implementation for the details of how to check the filter and populate the field values; it's in pyslet/odata2/memds.py.
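
Again as a sketch, with the same hypothetical model and with the filter check omitted for brevity:

def __getitem__(self, key):
    try:
        obj = Customer.objects.get(pk=key)
    except Customer.DoesNotExist:
        raise KeyError(key)
    e = self.NewEntity()
    e['CustomerID'].SetFromValue(obj.pk)
    e['CompanyName'].SetFromValue(obj.company_name)
    e.exists = True
    return e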

Probably the hardest part of defining an EntityCollection object is getting the constructor right. You'll want to pass through the Model class from Django so that you can make calls like the above:

def __init__(self, djangoModel, **kwArgs):
    super(DjangoCollection, self).__init__(**kwArgs)
    self.djangoModel = djangoModel

Step 3: Load the metadata from a file, then bind your EntityCollection class or classes to the EntitySets. Something like this might work:

import pyslet.odata2.metadata as edmx

doc = edmx.Document()
with open('DjangoAppMetadata.xml', 'rb') as f:
    doc.Read(f)
# the entity set name here is hypothetical: use the name from your
# own metadata file
customers = doc.root.DataServices['DjangoAppSchema.DjangoDatabase.Customers']
# customers is an EntitySet instance
customers.Bind(DjangoCollection, djangoModel=Customer)

The Customer object here is your Django model object for Customers and the DjangoCollection object is the EntityCollection object you created in Step 2. Each time someone opens the customers entity set a new DjangoCollection object will be created and Customer will be passed as the djangoModel parameter.

Step 4: Test that the model is working by using the interpreter or a simple script to open the customers object (the EntitySet) and make queries with the Pyslet DAL API. If it works, you can wrap it with an OData server class and just hook the resulting wsgi object to your web server and you have hacked something together.
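
That last wrapping step might look something like this (a sketch: I'm assuming the read-only server variant here and the service root is illustrative):

from pyslet.odata2.server import ReadOnlyServer

# doc is the metadata document with the bound entity sets from Step 3
server = ReadOnlyServer(serviceRoot='http://localhost:8080/')
server.SetModel(doc)
# server is now a wsgi application object ready to hook up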

Post hack

You'll want to look at Pyslet's expression objects and figure out how to map these onto the query objects used by Django. Although OData provides a rich query syntax you don't need to support it all, just reject stuff you don't want to implement. Simple queries look like they'd map to things you can pass to the filter method in Django fairly easily. In fact, one of the problems with OData is that it is very general - almost SQL over the web - and your application's data layer is probably optimised for some queries and not others. Do you want to allow people to search your zillion-record table using a query that forces a full table scan? Probably not.

You'll also want to look at navigation properties which map fairly neatly to the relationship fields. The Django DAL and Pyslet's DAL are not miles apart here so you should be able to create NavigationCollection objects (equivalent to the class you created in Step 2 above) for these. At this point, the power of OData will begin to come alive for you.

Making it Django-like

I'm not an expert on what is and is not Django like but I did notice that there is a Feed concept for exposing RSS in Django. If the post hack process has left you with a useful implementation then some sort of OData equivalent object might be a useful addition. Given that Django tends to do much of the heavy lifting you could think about providing an OData feed object. It probably isn't too hard to auto-generate the metadata from something like class attributes on such an object. Pyslet's OData server is a wsgi application so provided Django can route requests to it you'll probably end up with something that is fairly nicely integrated - even if it can't do that out of the box it should be trivial to provide a simple Django request handler that fakes a wsgi call.
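
For example, a minimal handler along these lines might do the faking (a sketch, assuming the site is served via WSGI so that request.environ is available and that server is the wsgi application object created earlier):

from django.http import HttpResponse

def odata_view(request):
    # replay Django's WSGI environ into the Pyslet wsgi application
    state = {}

    def start_response(status, headers):
        state['status'] = int(status.split()[0])
        state['headers'] = headers

    body = ''.join(server(request.environ, start_response))
    response = HttpResponse(body, status=state['status'])
    for name, value in state['headers']:
        response[name] = value
    return response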

Consuming OData

Normally you think of consuming OData as being easier than providing it but for Django you'd be tempted to consider exposing OData as a data source, perhaps as an auxiliary database containing some models that are externally stored. This would allow you to use the power of Django to create an application which mashed up data from OData sources as if that data were stored in a locally accessible database.

This appears to be a more ambitious project: Django non-rel appears to be a separate project and it isn't clear how easy it would be to intermingle data coming from an OData source with data coming from local databases. It is unlikely that you'd want to use OData for all data in your application. The alternative might be to try and write a Python DB API interface for Pyslet's DAL and then get Django treating it like a proper database. That would mean parsing SQL, which is nasty, but it might be the lesser of two evils.

Of course, there's nothing stopping you using Pyslet's builtin OData client class directly in your code to augment your custom views with data pulled from an external source. One of the features of Pyslet's OData client is that it treats the remote server like a data source, keeping persistent HTTP connections open, managing multi-threaded access and pipelining requests to improve throughput. That should make it fairly easy to integrate into your Django application.


A Dictionary-like Python interface for OData Part III: a SQL-backed OData Server

This is the third and last part of a series of three posts that introduce my OData framework for Python. To recap:

  1. In Part I I introduced a new data access layer I've written for Python that is modelled on the conventions of OData. In that post I validated the API by writing a concrete implementation in the form of an OData client.
  2. In Part II I used the same API and wrote a concrete implementation using a simple in-memory storage model. I also introduced the OData server functionality to expose the API via the OData protocol.
  3. In this part, I conclude this mini-series with a quick look at another concrete implementation of the API which wraps Python's DB API allowing you to store data in a SQL environment.

As before, you can download the source code from the QTIMigration Tool & Pyslet home page. I wrote a brief tutorial on using the SQL backed classes to take care of some of the technical details.

Rain or Shine?

To make this project a little more interesting I went looking for a real data set to play with. I'm a bit of a weather watcher at home and for almost 20 years I've enjoyed using a local weather station run by a research group at the University of Cambridge. The group is currently part of the Cambridge Computer Laboratory and the station has moved to the William Gates building.

The Database

The SQL implementation comes in two halves. The base classes are as close to standard SQL as I could get and then a small 'shim' sits over the top which binds to a specific database implementation. The Python DB API takes you most of the way, including helping out with the correct form of parameterisation to use. For this example project I used SQLite because the driver is typically available in Python implementations straight out of the box.

I wrote the OData-style metadata document first and used it to automatically generate the CREATE TABLE commands but in most cases you'll probably have an existing database or want to edit the generated scripts and run them by hand. The main table in my schema got created from this SQL:

CREATE TABLE "DataPoints" (
    "TimePoint" TIMESTAMP NOT NULL,
    "Temperature" REAL,
    "Humidity" SMALLINT,
    "DewPoint" REAL,
    "Pressure" SMALLINT,
    "WindSpeed" REAL,
    "WindDirection" TEXT,
    "WindSpeedMax" REAL,
    "SunRainStart" REAL,
    "Sun" REAL,
    "Rain" REAL,
    "DataPointNotes_ID" INTEGER,
    PRIMARY KEY ("TimePoint"),
    CONSTRAINT "DataPointNotes" FOREIGN KEY ("DataPointNotes_ID") REFERENCES "Notes"("ID"))
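
For the record, if you do want Pyslet to create the tables for you, the generation is a one-off call on the container. A sketch, assuming the container's CreateAllTables method (destructive if the tables already exist, so use with care):

import pyslet.odata2.metadata as edmx
from pyslet.odata2.sqlds import SQLiteEntityContainer

doc = edmx.Document()
with open('WeatherSchema.xml', 'rb') as f:
    doc.Read(f)
container = SQLiteEntityContainer(
    filePath='weather.db',
    containerDef=doc.root.DataServices['WeatherSchema.CambridgeWeather'])
container.CreateAllTables()  # run the generated CREATE TABLE statements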

To expose the database via my new data-access-layer API you just load the XML metadata, create a SQL container object containing the concrete implementation and then you can access the data in exactly the same way as I did in Parts I and II. The code that consumes the API doesn't need to know if the data source is an OData client, an in-memory dummy source or a full-blown SQL database. Once I'd loaded the data, here is a simple session with the Python interpreter that shows you the API in action.

>>> import pyslet.odata2.metadata as edmx
>>> import pyslet.odata2.core as core
>>> doc=edmx.Document()
>>> with open('WeatherSchema.xml','rb') as f: doc.Read(f)
>>> from pyslet.odata2.sqlds import SQLiteEntityContainer
>>> container=SQLiteEntityContainer(filePath='weather.db',containerDef=doc.root.DataServices['WeatherSchema.CambridgeWeather'])
>>> weatherData=doc.root.DataServices['WeatherSchema.CambridgeWeather.DataPoints']
>>> collection=weatherData.OpenCollection()
>>> collection.OrderBy(core.CommonExpression.OrderByFromString('WindSpeedMax desc'))
>>> collection.SetPage(5)
>>> for e in collection.iterpage(): print "%s: Max wind speed: %0.1f mph"%(unicode(e['TimePoint'].value),e['WindSpeedMax'].value*1.15078)
2002-10-27T10:30:00: Max wind speed: 85.2 mph
2004-03-20T15:30:00: Max wind speed: 82.9 mph
2007-01-18T14:30:00: Max wind speed: 80.6 mph
2004-03-20T16:00:00: Max wind speed: 78.3 mph
2005-01-08T06:00:00: Max wind speed: 78.3 mph

Notice that the container itself isn't needed when accessing the data because the SQLiteEntityContainer __init__ method takes care of binding the appropriate collection classes to the model passed in. Unfortunately the dataset doesn't go all the way back to the great storm of 1987 which is a shame as at the time I was living in a 5th floor flat perched on top of what I was reliably informed was the highest building in Cambridge not to have some form of structural support. I woke up when the building shook so much my bed moved across the floor.

Setting up a Server

I used the same technique as I did in Part II to wrap the API with an OData server and then had some real fun getting it up and running on Amazon's EC2. Pyslet requires Python 2.7 but EC2 Linux comes with Python 2.6 out of the box. Thanks to this blog article for help with getting Python 2.7 installed. I also had to build mod_wsgi from scratch in order to get it to pick up the version I wanted. Essentially here's what I did:

# Python 2.7 install
sudo yum install make automake gcc gcc-c++ kernel-devel git-core -y
sudo yum install python27-devel -y
# Apache install
#  Thanks to http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-LAMP.html
sudo yum groupinstall -y "Web Server"
sudo service httpd start
sudo chkconfig httpd on

And to get mod_wsgi working with Python2.7...

sudo bash
yum install httpd-devel -y
mkdir downloads
cd downloads
wget http://modwsgi.googlecode.com/files/mod_wsgi-3.4.tar.gz
tar -xzvf mod_wsgi-3.4.tar.gz
cd mod_wsgi-3.4
./configure --with-python=/usr/bin/python2.7
make install
# Optional check to ensure that we've got the correct Python linked
# you should see the 2.7 library linked
ldd /etc/httpd/modules/mod_wsgi.so
service httpd restart

To drive the server with mod_wsgi I used a script like this:

#! /usr/bin/env python

import logging, os.path
import pyslet.odata2.metadata as edmx
from pyslet.odata2.sqlds import SQLiteEntityContainer
from pyslet.odata2.server import ReadOnlyServer

HOME_DIR = '/var/www/scripts'  # hypothetical install location
SERVICE_ROOT = 'http://odata.pyslet.org/weather'

logging.basicConfig(filename=os.path.join(HOME_DIR, 'weather.log'),
                    level=logging.INFO)

doc = edmx.Document()
with open(os.path.join(HOME_DIR, 'WeatherSchema.xml'), 'rb') as f:
    doc.Read(f)

# bind the SQLite implementation to the model, as in the session above
container = SQLiteEntityContainer(
    filePath=os.path.join(HOME_DIR, 'weather.db'),
    containerDef=doc.root.DataServices['WeatherSchema.CambridgeWeather'])

server = ReadOnlyServer(serviceRoot=SERVICE_ROOT)
server.SetModel(doc)

def application(environ, start_response):
    return server(environ, start_response)

I'm relying on the fact that Apache is configured to run Python internally and that my server object persists between calls. I think by default mod_wsgi serialises calls to the application method but a smarter configuration with a multi-threaded daemon would be OK because the server and container objects are thread safe. There are limits to the underlying SQLite module of course so you may not gain a lot of performance this way but a proper database would help.

Try it out!

If you were watching carefully you'll see that the above script uses a public service root. So let's try the same query but this time using OData. Here it is in Firefox:

Notice that Firefox recognises that the OData feed is an Atom feed and displays the syndication title and updated date. I used the metadata document to map the temperature and the date of the observation to these (you can see they are the same data points as above by the matching dates). The windiest days are never particularly hot or cold in Cambridge because they are almost always associated with Atlantic storms and the sea temperature just doesn't change that much.

The server is hosted at http://odata.pyslet.org/weather