Monday, June 15, 2009

librdf Python API summary

I reviewed the code I wrote with the Redland librdf Python API and put together a brief summary as a memo to myself. For more advanced and powerful parsing, I'm turning to Java libraries such as Jena, owlapi, Pellet, ...

Update: in the end I have kept working with python-librdf all along and put my Java stuff away...

** RDF.Model object

import RDF

model = RDF.Model(RDF.MemoryStorage())
Model using in-memory storage

model = RDF.Model(RDF.HashStorage(bdb_location,options="hash-type='bdb'"))
Model using Berkeley DB storage

p = RDF.Parser('raptor')

file_uri = RDF.Uri('file:/path/to/rdf_file')
Create a URI indicating a local file

p.parse_into_model(model,file_uri)
Parse an RDF file into the model. Returns a boolean indicating whether the operation succeeded. Multiple files can be parsed into one model.
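Since several files can go into one model and each call reports success as a boolean, a small loop can keep track of which files failed. The sketch below is only an illustration: parse_all and FakeParser are hypothetical names, and FakeParser stands in for RDF.Parser so that the example runs without librdf installed. With the real library, errors may also surface as exceptions rather than a false return value.

```python
def parse_all(parser, model, uris):
    """Parse every URI into the same model; return the URIs that failed."""
    failed = []
    for uri in uris:
        # parse_into_model() is described above as returning a boolean.
        if not parser.parse_into_model(model, uri):
            failed.append(uri)
    return failed


class FakeParser(object):
    """Hypothetical stand-in for RDF.Parser (assumption, not the real class)."""

    def parse_into_model(self, model, uri):
        # Pretend that anything ending in '.broken' cannot be parsed.
        if uri.endswith('.broken'):
            return False
        model.append(uri)
        return True


store = []  # a plain list stands in for RDF.Model here
failed = parse_all(FakeParser(), store,
                   ['file:/a.rdf', 'file:/b.broken', 'file:/c.rdf'])
print(failed)
```

With the real bindings, the same function would be called with an RDF.Parser, an RDF.Model and a list of RDF.Uri objects.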

len(model)
Returns the number of statements in the model. This only works for models with in-memory storage; it does not work with Berkeley DB storage.

** RDF.Node object type

The node.type attribute is an integer indicating the type of the node:
1: Resource node; the RDF.Uri object is available as node.uri
2: Literal node; its value can be extracted as node.literal_value['string']
4: Blank node; in OWL it usually appears as the object of rdfs:subClassOf, expressing a restriction on properties.

It is better to use node.is_literal(), node.is_resource() and node.is_blank() to test node types, which avoids confusion over the integer codes.
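As a rough sketch of how the three predicates can be used together, the function below turns a node into a (kind, value) pair. describe_node and FakeLiteral are hypothetical names introduced only for this example; FakeLiteral imitates a literal RDF.Node so the code runs without librdf.

```python
def describe_node(node):
    """Return a (kind, value) pair for an RDF.Node-like object."""
    if node.is_resource():
        return ('resource', str(node.uri))
    if node.is_literal():
        return ('literal', node.literal_value['string'])
    if node.is_blank():
        return ('blank', node.blank_identifier)
    raise ValueError('unknown node type')


class FakeLiteral(object):
    """Hypothetical stand-in for a literal RDF.Node (assumption)."""
    literal_value = {'string': 'hello'}

    def is_resource(self):
        return False

    def is_literal(self):
        return True

    def is_blank(self):
        return False


kind, value = describe_node(FakeLiteral())
print(kind, value)
```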

** Simple query methods (bound to RDF.Model object)

All simple query methods of the RDF.Model object accept RDF.Node objects as arguments, and they also return RDF.Node objects.

(model is an RDF.Model object)

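result = model.get_target(a,b)     # a: source (subject), b: predicate
result = model.get_predicate(a,b)  # a: source (subject), b: target (object)
result = model.get_source(a,b)     # a: predicate, b: target (object)
Each returns an RDF.Node object, or None upon failure.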

results = model.get_targets(a,b)
results = model.get_predicates(a,b)
results = model.get_sources(a,b)
These always return an RDF.Iterator object, which yields a sequence of RDF.Node objects.

Iteration:
for result in results:
    print(result)  # each result is an RDF.Node

Check for end:
results.end() # returns 0 or 1 depending on whether the iterator is exhausted

Membership test:
my_node in results # returns a boolean

** Not simple query methods

Create a RDF.Query object:
query = RDF.Query(query_string,query_language='xxx')

The query language is either rdql or sparql; the default is rdql.

SPARQL query built with the new str.format() syntax (literal braces have to be doubled):
query = RDF.Query('SELECT ?s WHERE {{ ?s <{0}> <{1}> }}'.format(...),query_language='sparql')

results = query.execute(model)
results is an RDF.QueryResults object, which is also an iterator.

Check for end:
results.finished()

Iteration:
for this_re in results:
    print(this_re['s'])
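To make the brace doubling in the str.format() query concrete, here is a minimal sketch that only builds the query string, so it runs without librdf. build_select_query is a hypothetical helper name, and the two URIs (rdf:type and owl:Class) are just example values.

```python
def build_select_query(predicate_uri, object_uri):
    """Build a SPARQL SELECT string for subjects with the given predicate and object."""
    # Doubled braces {{ }} come out as literal braces;
    # {0} and {1} are replaced by the two URIs.
    template = 'SELECT ?s WHERE {{ ?s <{0}> <{1}> }}'
    return template.format(predicate_uri, object_uri)


query_string = build_select_query(
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
    'http://www.w3.org/2002/07/owl#Class')
print(query_string)
```

The resulting string can then be passed to RDF.Query(query_string,query_language='sparql') as shown above.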

** Blank node

Blank nodes get a special note here because they are frequently used in collection-type domain/range declarations and in property restrictions on classes, both of which occur in OWL.

A blank node does not have a uri attribute and cannot be converted to an RDF.Uri object. It can easily be used in the queries bound to the RDF.Model object, as they readily accept node objects as arguments.

To use a blank node in a ``not-simple'' query, SPARQL syntax (not rdql) has to be used:

# node is a blank node
node_str = '_:'+node.blank_identifier
q = RDF.Query('SELECT ?predicate ?object WHERE {{ {0} ?predicate ?object }}'.format(node_str),query_language='sparql')
results = q.execute(model)

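Problem: when working with the Uniprot OWL, such a query retrieves all the blank nodes. For example, a SPARQL query on a blank node with predicate owl:unionOf retrieves 26 blank nodes as objects, whereas a get_targets() query correctly retrieves only 1 blank node. I will check whether this is a bug; note that the SPARQL specification treats a blank node label in a query pattern as a variable rather than as a reference to a specific node in the data, which would explain the result.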

Nodes with the Collection parseType are frequently used, e.g. in domain/range specifications.
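A collection is stored as a chain of blank nodes linked by rdf:first and rdf:rest and terminated by rdf:nil, so its members can be gathered with repeated get_target() calls. The sketch below is an illustration under assumptions: collection_items and FakeModel are hypothetical names, FakeModel imitates only get_target() of RDF.Model with plain strings as nodes, and with real RDF.Node objects the code relies on nodes supporting the != comparison.

```python
RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'


def collection_items(model, head, first, rest, nil, limit=10000):
    """Follow rdf:first / rdf:rest links from head and return the members."""
    items = []
    node = head
    steps = 0
    # The step limit guards against malformed, cyclic lists.
    while node is not None and node != nil and steps < limit:
        item = model.get_target(node, first)
        if item is not None:
            items.append(item)
        node = model.get_target(node, rest)
        steps += 1
    return items


class FakeModel(object):
    """Hypothetical stand-in for RDF.Model: maps (source, predicate) to a target."""

    def __init__(self, triples):
        self._targets = dict(((s, p), o) for (s, p, o) in triples)

    def get_target(self, source, predicate):
        return self._targets.get((source, predicate))


FIRST, REST, NIL = RDF_NS + 'first', RDF_NS + 'rest', RDF_NS + 'nil'
fake_model = FakeModel([
    ('_:b1', FIRST, 'ex:A'), ('_:b1', REST, '_:b2'),
    ('_:b2', FIRST, 'ex:B'), ('_:b2', REST, NIL),
])
members = collection_items(fake_model, '_:b1', FIRST, REST, NIL)
print(members)
```

With the real bindings, head would be the blank node found as the object of, say, owl:unionOf, and first, rest and nil would be RDF.Node objects built from the corresponding URIs.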
