2012-06-01

Atom, OData and Binary Blobs

I've been doing a lot of work on Atom and OData recently. I'm a real fan of Atom and the related Atom Publishing Protocol (APP for short). OData is a specification from Microsoft which builds on these two basic building blocks of the internet to provide standard conventions for querying feeds and representing properties using a SQL-like model.

Given that OData can be used to easily expose data currently residing in SQL databases, it is not surprising that the question of binary blobs comes up, though it takes a little research to figure out the answer. At first sight it isn't obvious how OData deals with them. In fact, it isn't even obvious how APP deals with them!

Atom Primer

Most of us are familiar with the idea of an RSS feed for following news articles and blogs like this one (this article prompted me to add the gadget to my blogger templates to make it easier to subscribe). Atom is a slightly more formal definition of the same concept and is available as an option for subscribing to this blog too. Understanding the origins of Atom helps when trying to understand the Atom data model, especially if you are coming to Atom from a SQL/OData point of view.

Atom is all about feeds (lists) of entries. The data you want, be it a news article, blog post or a row in your database table is an entry. A feed might be everything, such as all the articles in your blog or all the rows in your database table, or it may be a filtered subset such as all the articles in your blog with a particular tag or all the rows in your table that match a certain query.

Atom adheres closely to the REST-based service concept. Each entry has its own unique URI. Feeds also have their own URIs. For example, the Atom feed URL for this blog is:

http://swl10.blogspot.com/feeds/posts/default

But if you are only interested in the Python language then you might want to use a different feed:

http://swl10.blogspot.com/feeds/posts/default/-/Python

Obviously the first feed contains all the entries in the second feed too!

Atom is XML-based, so an entry is represented by an <entry> element and the content of an entry is represented by a <content> child element. Here's an abbreviated example from this blog's Atom feed. Note that the Atom-defined metadata elements appear as siblings of the content...

<entry>
  <id>tag:blogger.com,1999:blog-8659912959976079554.post-4875480159917130568</id>
  <published>2011-07-17T16:00:00.000+01:00</published>
  <updated>2011-07-17T16:00:06.090+01:00</updated>
  <category scheme="http://www.blogger.com/atom/ns#" term="QTI"/>
  <category scheme="http://www.blogger.com/atom/ns#" term="Python"/>
  <title type="text">Using gencodec to make a custom character mapping</title>
  <content type="html">One of the problems I face...</content>
  <link rel="edit" type="application/atom+xml"
    href="http://www.blogger.com/feeds/8659912959976079554/posts/default/4875480159917130568"/>
  <link rel="self" type="application/atom+xml"
    href="http://www.blogger.com/feeds/8659912959976079554/posts/default/4875480159917130568"/>
</entry>
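
As a rough illustration of how a client might read an entry like this one, here is a minimal Python sketch using only the standard library. The describe_entry function is a hypothetical helper, not part of any Atom library, and the Atom namespace declaration has been added to the sample because the abbreviated example above omits it.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# A cut-down copy of the entry above. The xmlns declaration is an addition:
# the abbreviated example leaves it out.
ENTRY_XML = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>tag:blogger.com,1999:blog-8659912959976079554.post-4875480159917130568</id>
  <title type="text">Using gencodec to make a custom character mapping</title>
  <content type="html">One of the problems I face...</content>
  <link rel="edit" type="application/atom+xml"
    href="http://www.blogger.com/feeds/8659912959976079554/posts/default/4875480159917130568"/>
</entry>"""


def describe_entry(entry_xml):
    """Return the title, content type, content text and links of an entry."""
    entry = ET.fromstring(entry_xml)
    content = entry.find("{%s}content" % ATOM_NS)
    # Map each link relation (edit, self, ...) to its target URI.
    links = {
        link.get("rel"): link.get("href")
        for link in entry.findall("{%s}link" % ATOM_NS)
    }
    return {
        "title": entry.findtext("{%s}title" % ATOM_NS),
        "content_type": content.get("type"),
        "content": content.text,
        "links": links,
    }


info = describe_entry(ENTRY_XML)
print(info["content_type"], info["links"]["edit"])
```

The metadata (title, links) and the content are read from sibling elements, which mirrors the layout of the entry itself.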

For blog articles, this content is typically html text (yes, horribly escaped to allow it to pass through XML parsers). Atom actually defines three types of native content: 'html', 'text' and 'xhtml'. It also allows the content element to contain a single child element corresponding to other XML media types. OData uses this method to represent the property name/value pairs that might correspond to the column names and values for a row in the database table you are exposing.

Here's another abbreviated example taken from the Netflix OData People feed:

<entry>
  <id>http://odata.netflix.com/v2/Catalog/People(189)</id>
  <title type="text">Bruce Abbott</title>
  <updated>2012-06-01T07:55:17Z</updated>
  <category term="Netflix.Catalog.v2.Person" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
  <content type="application/xml">
    <m:properties>
      <d:Id m:type="Edm.Int32">189</d:Id>
      <d:Name>Bruce Abbott</d:Name>
    </m:properties>
  </content>
</entry>

Notice the application/xml content type and the single properties element from Microsoft's metadata schema.
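
To show how those name/value pairs might be pulled back out, here is a small Python sketch. It assumes the m: and d: prefixes are bound to the namespace URIs that appear in the Netflix Titles example later in this article, and that a property with no m:type attribute is a string. The read_properties helper is hypothetical and only converts Edm.Int32, the one non-string type used here.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
M = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
D = "http://schemas.microsoft.com/ado/2007/08/dataservices"

# The People entry from above, with the namespace declarations filled in.
PERSON_XML = """<entry xmlns="%s" xmlns:m="%s" xmlns:d="%s">
  <id>http://odata.netflix.com/v2/Catalog/People(189)</id>
  <title type="text">Bruce Abbott</title>
  <content type="application/xml">
    <m:properties>
      <d:Id m:type="Edm.Int32">189</d:Id>
      <d:Name>Bruce Abbott</d:Name>
    </m:properties>
  </content>
</entry>""" % (ATOM, M, D)


def read_properties(properties):
    """Turn an m:properties element into a dict of name -> value."""
    result = {}
    for prop in properties:
        name = prop.tag.split("}", 1)[1]  # strip the {namespace} part
        # Assumption: a missing m:type attribute means Edm.String.
        edm_type = prop.get("{%s}type" % M, "Edm.String")
        value = prop.text
        if edm_type == "Edm.Int32" and value is not None:
            value = int(value)
        result[name] = value
    return result


entry = ET.fromstring(PERSON_XML)
props = read_properties(entry.find("{%s}content/{%s}properties" % (ATOM, M)))
print(props)
```

A fuller client would need a conversion for each Edm type it expects to meet; this sketch leaves everything except Edm.Int32 as text.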

Any other type of content is considered to be external media. But Atom can still describe it, associate metadata with it and organize it into feeds...

Binary Blobs as Media

There is nothing stopping the content of an entry from being a non-text binary blob of data. You just change the type attribute to the media type of your blob and then either add a src attribute that points to an external file, or base64-encode the data and include it in the entry itself (I think this second method is rarely used).

Obviously the URL of the entry (the XML document containing the <entry> tag) is not the same as the URL of the media resource, but they are closely related. The entry is referred to as a media link entry because it contains the metadata about the media file (such as the title, updated date etc.) and it links to it. The media file itself is known as a media resource.
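
The two ways of carrying a blob (by reference with src, or inline as base64) can be sketched in Python as follows. Both function names are hypothetical, and the example.com URL and the byte string are placeholders rather than real data.

```python
import base64
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"


def content_by_reference(media_type, src):
    """Build a <content> element that points at an external media resource."""
    return ET.Element("{%s}content" % ATOM, {"type": media_type, "src": src})


def content_inline(media_type, blob):
    """Build a <content> element that carries the blob itself, base64-encoded."""
    element = ET.Element("{%s}content" % ATOM, {"type": media_type})
    element.text = base64.b64encode(blob).decode("ascii")
    return element


# Placeholder values, for illustration only.
linked = content_by_reference("image/jpeg", "http://example.com/media/box.jpg")
inline = content_inline("image/jpeg", b"\xff\xd8\xff\xe0 not a real image")

print(linked.get("src"))
print(inline.text)
```

Only the by-reference form gives the blob a URL of its own, which is what makes the entry a media link entry.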

There's a problem with OData though. OData requires the child of the content element to be the properties element (see example above) and the type attribute to be application/xml. But Atom says there can only be one content element per entry. So how can OData be used for binary blobs?

The answer is a bit of a hack. When the entry is a media link entry the properties move into the metadata area of the entry. Here's another abbreviated example from Netflix which illustrates the technique:

<entry>
  <id>http://odata.netflix.com/v2/Catalog/Titles('13aly')</id>
  <title type="text">Red Hot Chili Peppers: Funky Monks</title>
  <summary type="html">Lead singer Anthony Kiedis...</summary>
  <updated>2012-01-31T09:45:16Z</updated>
  <author>
    <name />
  </author>
  <category term="Netflix.Catalog.v2.Title"
    scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
  <content type="image/jpeg" src="http://cdn-0.nflximg.com/en_us/boxshots/large/5632678.jpg" />
  <m:properties xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
    xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">
    <d:Id>13aly</d:Id>
    <d:Name>Red Hot Chili Peppers: Funky Monks</d:Name>
    <d:ShortName>Red Hot Chili Peppers: Funky Monks</d:ShortName>
    <d:Synopsis>Lead singer Anthony Kiedis...</d:Synopsis>
    <d:ReleaseYear m:type="Edm.Int32">1991</d:ReleaseYear>
    <d:Url>http://www.netflix.com/Movie/Red_Hot_Chili_Peppers_Funky_Monks/5632678</d:Url>
    <!-- more properties.... -->
  </m:properties>
</entry>

This entry is taken from the Titles feed. Notice that the entry is a media link entry that points to the large box graphic for the film.
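
A client therefore has to look for the properties in two places. The following Python sketch shows one way to do that, treating the presence of a src attribute on the content element as the sign of a media link entry. The locate_properties helper is hypothetical, and the sample is a cut-down copy of the Titles entry above with the Atom namespace declaration added.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
M = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"

TITLE_XML = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://odata.netflix.com/v2/Catalog/Titles('13aly')</id>
  <content type="image/jpeg" src="http://cdn-0.nflximg.com/en_us/boxshots/large/5632678.jpg" />
  <m:properties xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
    xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">
    <d:Id>13aly</d:Id>
    <d:ReleaseYear m:type="Edm.Int32">1991</d:ReleaseYear>
  </m:properties>
</entry>"""


def locate_properties(entry):
    """Return (m:properties element or None, media URL or None) for an entry."""
    content = entry.find("{%s}content" % ATOM)
    if content is None:
        return None, None
    src = content.get("src")
    if src is not None:
        # Media link entry: the properties sit beside <content>.
        return entry.find("{%s}properties" % M), src
    # Ordinary entry: the properties are the single child of <content>.
    return content.find("{%s}properties" % M), None


entry = ET.fromstring(TITLE_XML)
properties, media_url = locate_properties(entry)
names = [prop.tag.split("}", 1)[1] for prop in properties]
print(media_url, names)
```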

Binary Blobs and APP

APP adds a protocol for publishing information to Atom feeds and OData builds on APP to allow data feeds to be writable, not just read-only streams. You can't upload your own titles to Netflix as far as I know so I don't have an example here. The details are all in section 9.6 of RFC 5023 but in a nutshell, if you POST a binary blob to a feed the server should store the blob and create a media link entry that points to it (populated with a minimal set of metadata). Once created, you can then update the metadata with HTTP's PUT method on the media link entry's edit URI, or update the binary blob by using HTTP's PUT method on the edit-media URI of the media resource. (These URIs are given by the <link> elements in the entry: the first example shows a rel="edit" link, and a media link entry would normally also carry a rel="edit-media" link.)
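
As a sketch of what those two requests might look like, the Python below builds them with urllib.request but does not send them. The URLs and the image bytes are placeholders, and media_post and media_put are hypothetical helper names. The Slug header is the optional hint that RFC 5023 defines for suggesting a name for the new resource.

```python
import urllib.request


def media_post(collection_url, blob, media_type, slug=None):
    """Build the POST that asks the server to store a new media resource."""
    headers = {"Content-Type": media_type}
    if slug is not None:
        headers["Slug"] = slug  # optional naming hint from RFC 5023
    return urllib.request.Request(
        collection_url, data=blob, headers=headers, method="POST")


def media_put(edit_media_url, blob, media_type):
    """Build the PUT that replaces the blob behind an edit-media URI."""
    return urllib.request.Request(
        edit_media_url, data=blob,
        headers={"Content-Type": media_type}, method="PUT")


# Placeholder values: no real service is assumed to exist at these URLs.
create = media_post("http://example.com/feeds/images", b"fake image bytes",
                    "image/jpeg", slug="box-art")
replace = media_put("http://example.com/media/1", b"new image bytes",
                    "image/jpeg")

print(create.get_method(), create.full_url)
print(replace.get_method(), replace.full_url)
```

Passing either request to urllib.request.urlopen would send it. A server that accepts the POST would be expected to answer with 201 Created and the location of the new media link entry, whose link elements then supply the URIs for later PUT requests.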

There is no reason why binary blobs can't be XML files of course. Many of the technical standards for education that I work with are very data-centric. They define the format of XML documents such as QTI, which are designed to be opaque to management systems like item banks (an item bank is essentially a special-purpose content management system for questions used in assessment).

So publishing feeds using OData or APP from an item bank would most likely use these techniques for making the underlying content available to third party systems. Questions often contain media resources (e.g., images) of course but even the question content itself is typically marked up using XML, as it is in QTI. This data is not easy to represent as a simple list of property values and would typically be stored as a blob in a database or as a file in a repository. Therefore, it is probably better to think of this data as a media resource when exposing it via APP/OData.