How to Build an SQL Storage Adapter for RDF Data with Ruby

2010/04/06 by Ben

RDF.rb is approaching two thousand downloads on RubyGems, and while it has good documentation it could still use some more tutorials. I recently needed to get RDF.rb working with a PostgreSQL storage backend in order to work with RDF data in a Rails 3.0 application hosted on Heroku. I thought I'd keep track of what I did so that I could discuss the notable parts.

In this tutorial we'll be implementing an RDF.rb storage adapter called RDF::DataObjects::Repository, which is a simplified version of what I eventually ended up with. If you want the real thing, check it out on GitHub and read the docs. This tutorial will only cover the SQLite backend and won't concern itself with database indexes, performance tweaks, or any other distractions from the essential RDF.rb interfaces we'll focus on. There's a copy of the simplified code used in the tutorial at the tutorial's project page. And should you be inspired to build something similar of your own, I have set up an RDF.rb storage adapter skeleton at GitHub. Click fork, grep for lines containing a TODO comment, and dive right in.

I'll mention, briefly, that I chose DataObjects as the database abstraction layer, but I don't want to dwell on that -- this post is about RDF. DataObjects is just a way to use common methods to talk to different databases at the SQL level. It's a leaky abstraction, because we'll want to be using some SQL constraints to enforce statement uniqueness but those constraints need to be done differently for different databases. That means we still have to get down to the level of database-specific SQL, distasteful as that may be in this day and age. However, given that I wanted to be able to target PostgreSQL and SQLite both, DataObjects is still helpful.

Requirements

You just need a few gems for the example repository. This ought to get you going. Even if you have these, make sure you have the latest; RDF.rb gets updated frequently.

$ sudo gem install rdf rdf-spec rspec do_sqlite3

Testing First

So where do we start? Tests, of course. RDF.rb has factored out its mixin specs to the RDF::Spec gem, which provides the RSpec shared example groups that are also used by RDF.rb for its own tests. Thus, here is the complete spec file for the in-memory reference implementation of RDF::Repository:

require File.join(File.dirname(__FILE__), 'spec_helper')
require 'rdf/spec/repository'

describe RDF::Repository do
  before :each do
    @repository = RDF::Repository.new
  end

  # @see lib/rdf/spec/repository.rb
  it_should_behave_like RDF_Repository
end

If you haven't seen something like this before, that's an RSpec shared example group, and it's awesome. Anything can use the same specs as RDF.rb itself to verify that it conforms to the interfaces defined by RDF.rb, and that's exactly what we'll be doing in this tutorial. Let's implement that for our repository:

# spec/sqlite3.spec
$:.unshift File.dirname(__FILE__) + "/../lib/"

require 'rdf'
require 'rdf/do'
require 'rdf/spec/repository'
require 'do_sqlite3'

describe RDF::DataObjects::Repository do
  context "The SQLite adapter" do
    before :each do
      @repository = RDF::DataObjects::Repository.new "sqlite3::memory:"
    end

    after :each do
      # DataObjects pools connections, and only allows 8 at once.  We have
      # more than 60 tests.
      DataObjects::Sqlite3::Connection.__pools.clear
    end

    it_should_behave_like RDF_Repository
  end
end

If you're new to RSpec, run the tests with the spec command:

$ spec -cfn spec/sqlite3.spec

These fail miserably right now, of course, since we don't have an implementation. So let's make one.

Initial implementation

RDF.rb's interface for an RDF store is RDF::Repository. That interface is itself composed of a number of mixins: RDF::Enumerable, RDF::Queryable, RDF::Mutable, and RDF::Durable.

RDF::Queryable has a base implementation that works on anything which implements RDF::Enumerable. And RDF::Durable only provides boolean methods for clients to ask if it is durable? or not; the default is that a repository reports that it is indeed durable, so we don't need to do anything there.

The takeaway is that to create an RDF.rb storage adapter, we need to implement RDF::Enumerable and RDF::Mutable, and the rest will fall into place. Indeed, the reference implementation is little more than an array which implements these interfaces.

It turns out we can get away with just three methods to implement those two interfaces: RDF::Enumerable#each, RDF::Mutable#insert_statement, and RDF::Mutable#delete_statement. The default implementations will use these to build up any missing methods. That means we need to implement those first, so that we have a base to pass our tests. Then we can iterate further, replacing methods which iterate over every statement with methods more appropriate for our backend.

Here's a repository which doesn't implement much more than those three methods. We'll use it as a starting point.

# lib/rdf/do.rb

require 'rdf'
require 'rdf/ntriples'
require 'data_objects'
require 'do_sqlite3'
require 'enumerator'

module RDF
  module DataObjects
    class Repository < ::RDF::Repository

      def initialize(options)
        @db = ::DataObjects::Connection.new(options)
        exec('CREATE TABLE IF NOT EXISTS quads (
              `subject` varchar(255), 
              `predicate` varchar(255),
              `object` varchar(255), 
              `context` varchar(255), 
              UNIQUE (`subject`, `predicate`, `object`, `context`))')
      end

      # @see RDF::Enumerable#each.
      def each(&block)
        if block_given?
          reader = result('SELECT * FROM quads')
          while reader.next!
            block.call(RDF::Statement.new(
                  :subject   => unserialize(reader.values[0]),
                  :predicate => unserialize(reader.values[1]),
                  :object    => unserialize(reader.values[2]),
                  :context   => unserialize(reader.values[3])))

          end
        else
          ::Enumerable::Enumerator.new(self,:each)
        end
      end

      # @see RDF::Mutable#insert_statement
      def insert_statement(statement)
        sql = 'REPLACE INTO `quads` (subject, predicate, object, context) VALUES (?, ?, ?, ?)'
        exec(sql,serialize(statement.subject),serialize(statement.predicate), 
                 serialize(statement.object), serialize(statement.context)) 
      end

      # @see RDF::Mutable#delete_statement
      def delete_statement(statement)
        sql = 'DELETE FROM `quads` where (subject = ? AND predicate = ? AND object = ? AND context = ?)'
        exec(sql,serialize(statement.subject),serialize(statement.predicate), 
                 serialize(statement.object), serialize(statement.context)) 
      end

      ## These are simple helpers to serialize and unserialize component
      # fields.  We use an explicit empty string for null values for clarity in
      # this example; we cannot use NULL, as SQLite considers NULLs as
      # distinct from each other when using the uniqueness constraint we
      # added when we created the table.  It would let us insert duplicate
      # with a NULL context.
      def serialize(value)
        RDF::NTriples::Writer.serialize(value) || ''
      end
      def unserialize(value)
        value == '' ? nil : RDF::NTriples::Reader.unserialize(value)
      end

      ## These are simple helpers for DataObjects
      def exec(sql, *args)
        @db.create_command(sql).execute_non_query(*args)
      end
      def result(sql, *args)
        @db.create_command(sql).execute_reader(*args)
      end

    end
  end
end

And we have a repository. Poof, done, that's it. You can get a copy of this intermediate repository at the tutorial page and run the specs for yourself. It's not very efficient for SQL yet, but this is all it takes, strictly speaking.

Since they are so important, the three main methods deserve a little more attention:

each

Each is the only thing we have to implement to get information out after we've put it in. RDF::Enumerable will provide us tons of things like each_subject, has_subject?, each_predicate, has_predicate?, etc. If you were watching the spec output, you'll notice we ran tests for RDF::Queryable. The default implementation will use RDF::Enumerable's methods to implement basic querying. This means we can already do things like:

# Note that #load actually comes from insert_statement, see below
repo.load('http://datagraph.org/jhacker/foaf.nt')
repo.query(:subject => RDF::URI.new('http://datagraph.org/jhacker/foaf'))
#=> RDF::Enumerable of statements with given URI as subject

Note that if a block is not sent, it's defined to return an Enumerable::Enumerator.

RDF::Queryable, which defines #query, is probably the thing we can improve the most on with SQL as opposed to the reference implementation. We'll revisit it below.

insert_statement

insert_statement inserts an RDF::Statement into the repository. It's pretty straightforward. It gives us access to default implementations of things like RDF::Mutable#load, which will load a file by name or import a remote resource:

repo.load('http://datagraph.org/jhacker/foaf.nt')
repo.count
#=> 10

delete_statement

delete_statement deletes an RDF::Statement. Again, it's straightforward, and it's used to implement things like RDF::Mutable#clear, which empties the repository:

repo.load('http://datagraph.org/jhacker/foaf.nt')
repo.clear
repo.count
#=> 0

Iterate and Improve

Since we already have a nice test suite that we can pass, we can add functionality incrementally. For example, let's implement RDF::Enumerable#count in a fashion that does not require us to enumerate each statement, which is clearly not ideal for a SQL-based system:

# lib/rdf/do.rb

def count
  result = result('SELECT COUNT(*) FROM quads')
  result.next!
  result.values.first
end

The tests still pass, we can move on. Wash, rinse, repeat; probably every method in RDF::Enumerable and RDF::Mutable can be done more efficiently with SQL.

RDF::Queryable

RDF::Queryable is worth mentioning on its own, because the interface takes a lot of options. Specifically, it can take a Hash, a smashed Array, an RDF::Statement, or a Query object. Fortunately, we can call super to defer to the reference implementation if we get arguments we don't understand, so we can again be iterative here.

We can start by implementing the hash version, which is the most convienent for doing the actual SQL query later. The hash version takes a hash which may have keys for :subject, :predicate, :object, and :context, and returns an RDF::Enumerable which contains all statements matching those parameters

# lib/rdf/do.rb

      def query(pattern, &block)
        case pattern
          when Hash
            statements = []
            reader = query_hash(pattern)
            while reader.next!
              statements << RDF::Statement.new(
                      :subject   => unserialize(reader.values[0]),
                      :predicate => unserialize(reader.values[1]),
                      :object    => unserialize(reader.values[2]),
                      :context   => unserialize(reader.values[3]))
            end
            case block_given?
              when true
                statements.each(&block)
              else
                statements.extend(RDF::Enumerable, RDF::Queryable)
            end
          else
            super(pattern) 
        end
      end

      def query_hash(hash)
        conditions = []
        params = []
        [:subject, :predicate, :object, :context].each do |resource|
          unless hash[resource].nil?
            conditions << "#{resource.to_s} = ?"
            params     << serialize(hash[resource])
          end
        end
        where = conditions.empty? ? "" : "WHERE "
        where << conditions.join(' AND ')
        result('SELECT * FROM quads ' + where, *params)
      end

Our specs still pass. Note this trick:

statements.extend(RDF::Enumerable, RDF::Queryable)

RDF::Queryable is defined to return something which implements RDF::Enumerable and RDF::Queryable. Since the only thing we need to implement RDF::Enumerable is #each, and Array already implements that, we can simply extend this Array instance with the mixins and return it.

Note also that while we have taken care of the hard part, we're still calling the reference implementation if we don't know how to handle our arguments. Now we can start adding those other query arguments:

# lib/rdf/do.rb

      def query(pattern, &block)
        case pattern
          when RDF::Statement
            query(pattern.to_hash)
          when Array
            query(RDF::Statement.new(*pattern))
          when Hash
      .
      .
      .

Our specs still pass! Moving on, there's a lot more we can implement. And once we have implemented it in a straightforward way, we can still implement things like multiple inserts, paging, and more, all transparant to the user. You can see the full list of methods to implement in the docs, but don't be afraid to dive into the code.

If you do, don't forget that RDF.rb is completely public domain, so if you want to copy-paste to bootstrap your implementation, feel free.

Any questions?

Hopefully this is enough to get you started. Remember, the code is at the tutorial page, and don't forget to check out the storage adapter skeleton. The RDF.rb documentation have a lot of information on the APIs you'll be using.

And last but not least, a good place to ask questions or leave a comment is on the W3C RDF-Ruby mailing list.


blog comments powered by Disqus