The W3C SPARQL working group (previously the Data Access Working Group) has recently released their first versions of the updated SPARQL standards, or SPARQL 1.1. The group's roadmap has these finalized a year from now, but they have asked for comments and I suppose these are mine.
I believe that these documents are a step further down a wrong path for SPARQL and, to a lesser degree, for RDF in general.
The latest round of drafts makes a number of changes to SPARQL: aggregate functions, subqueries, projection expressions, negations, updates and deletions, more specific HTTP protocol bindings, service discovery, entailment regimes, and a RESTful protocol for managing RDF graphs (the last one is not really just SPARQL, but it's in the updates).
So I'll start with my comments, which are mostly critical.
To start, an RDF-specific complaint, not really related to the rest of the post. Why would the one mandated format to be supported in the new RESTful RDF graph management interface be RDF/XML? What would it take for the semweb community to move on from this failed standard, which has had known issues for more than 5 years? (those two issues were raised in 2001 and are currently marked 'postponed') Why should such an increasingly irrelevant standard as RDF/XML be chosen instead of the widely-supported and easy to implement N3, N-Triples, or Turtle?
As for SPARQL, the 1.1 standards continue to give named graphs first class citizen status, both in the web APIs and in more SPARQL syntax than they had before. It's not so much triples as quads these days. Other meta-metadata, such as time of assertion or validity time, are not covered. While named graphs are admittedly a particularly often-found case, why do they need to invade the syntax of SPARQL? Not every use case needs named graphs, but every SPARQL implementor must support them. The 1.1 standard now includes precedence rules for named graph and base URIs when the HTTP query options conflict with the query itself, attempting to solve this self-created problem.
How about subqueries? What about variables during insertions? What about subqueries during insertions? Do we really need implementors to consider these kinds of things for every SPARQL endpoint on the web?
None of these things is really all that bad by itself, but one must consider the bigger picture. SPARQL 1.0 was released in January of 2008 (with some comment period before that) and there is still no implementation of a SPARQL engine in PHP or Ruby (exceptions apply, see [1]). One does not increase the participation of that ecosystem by adding a selection of entailment regimes to the standard.
While a SPARQL implementation exists for the excellent RDFLib in Python, it's only one of the current big 3 (with Ruby and PHP) in web development, and there's only one. The fact that no SPARQL engines exist for Ruby or PHP should be considered a failure of the standard. Why are we adding complexity when there is no SQLite for SPARQL? Why are there at least 3 monolithic Java implementations (Jena, Sesame, Boca), all financially sponsored to some degree or another, but so little 'in the wild'? How long can RDFLib herd 16 cats as committers on the project? While I don't have a lot of direct experience with RDFLib, I pity the project 'leads' (I cannot find evidence that the project is sponsored or that anyone is 'in charge') trying to look towards the future of implementing 6 working papers of new standards.
One of the biggest success stories for semweb in widespread use is the Drupal RDF module, which has found wide acceptance in the Drupal community and started an ecosystem of modules. Drupal 7 will output RDFa by default and Drupal 6 supports a ton of wonderful features, including reversing the RSS 1.0 to 2.0 downgrade back to RDF. But Drupal remains a producer of simple triples and a consumer of SPARQL queries generated by other endpoints. Data in those sites remains locked down. Why? Because implementing SPARQL in PHP is nontrivial, and in a chicken-egg problem, nobody's paying for it before someone has a need for SPARQL.
I could go on, but these are symptoms (well, not that RDF/XML thing, I don't think there's a good reason for that). I feel that the working group is attempting to solve the wrong problem. Namely, it is attempting to define a somewhat-human-readable query language, SPARQL, that works for almost all use cases. But why must the whole 'kitchen sink' be well-defined? Such a standards body should be attempting to define the easiest possible thing to implement and extend, not the last tool anyone would ever use.
The SPARQL 1.0 standard's grammar was well-defined as a context free grammar. It also had extension functions, which were uniquely defined by URIs. Why the distinction between CFG elements and extension functions? Why not make syntax elements like named graphs and aggregate functions as discoverable as extensions? Well, the reason is that it's hard to write a parser of a human-readable format and make those things optional and discoverable. (Here's a SPARQL parser implementation in Scala, a language with powerful pattern matching features for good parsing, and it's 500 lines of code. It compiles to S-expressions, the parsing of which is about 30 lines. Hmm.)
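For a sense of how small that second kind of parser can be, here is a minimal sketch of an S-expression reader in Python. It is not the Scala code mentioned above; the `tokenize` and `parse` names and the sample query are illustrative assumptions, and the sample only loosely resembles the algebra notation some engines print.

```python
import re

# One token is "(", ")", a double-quoted string, or a run of other characters.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()]+')


def tokenize(text):
    """Split S-expression source text into a flat list of tokens."""
    return TOKEN.findall(text)


def parse(text):
    """Parse one S-expression into nested Python lists of string atoms."""
    tokens = tokenize(text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise SyntaxError("missing ')'")
            pos += 1  # consume the closing parenthesis
            return items
        if tok == ")":
            raise SyntaxError("unexpected ')'")
        return tok  # atoms are kept as plain strings

    expr = read()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after expression")
    return expr


# Hypothetical query, written as an S-expression.
query = parse('(project (?s) (bgp (triple ?s <http://example.org/p> "hello world")))')
```

Variables, URIs and literals all come back as uninterpreted strings here; a real engine would still need to type them, but that work is the same whichever surface syntax is used.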
If the protocol had been defined as S-expressions, the distinction would not exist and the syntax could be as expandable as the current functions (the current syntax would just be more functions). The new 1.1 service discovery mechanism is excellent and extendible and would allow the standard to grow dynamically instead of becoming bogged down in features for particular use cases. New baseline implementations of SPARQL would be easy to implement and grow incrementally, and the current human-readable format can be implemented in terms of these expressions.
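To make the "syntax is just more functions" idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the `http://example.org/op#` namespace, the `operator` decorator, and the convention that an expression is a nested list whose first element is an operator URI. None of it comes from the SPARQL drafts.

```python
EX = "http://example.org/op#"  # hypothetical operator namespace

OPS = {}  # operator URI -> Python callable


def operator(uri):
    """Register a function as the implementation of the operator named by uri."""
    def register(fn):
        OPS[uri] = fn
        return fn
    return register


@operator(EX + "add")
def add(*args):
    return sum(args)


@operator(EX + "gt")
def gt(a, b):
    return a > b


def evaluate(expr, bindings):
    """Evaluate a nested-list expression against a dict of variable bindings."""
    if isinstance(expr, list):
        head, *args = expr
        if head not in OPS:
            # An engine is free to refuse anything it has not implemented.
            raise NotImplementedError("unsupported operator: " + head)
        return OPS[head](*(evaluate(arg, bindings) for arg in args))
    if isinstance(expr, str) and expr.startswith("?"):
        return bindings[expr]  # variable lookup
    return expr  # constants evaluate to themselves


def supported_operators():
    """What a service description could advertise for this engine."""
    return sorted(OPS)


# (gt (add ?x 2) 4) with ?x bound to 3
result = evaluate([EX + "gt", [EX + "add", "?x", 2], 4], {"?x": 3})
```

In this shape, adding a new capability means registering one more URI, and a baseline engine can start with a handful of operators and grow from there.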
The web of ontologies has grown with ad-hoc definitions created by people to fill their own needs. Standards grow organically around the ones that are needed most, others languish. Why should SPARQL functions have this kind of flexibility, but not the syntax? The distinction makes implementation overly difficult and is slowing the expansion of the Semantic Web.
In fact, it turns out that Jena has been parsing to S-expressions for some time. If you're an implementor, why would you do it any other way, especially when the standard can change as much as it does in 1.1? Any implementation will have to come up with something equivalent to S-expressions if you are going to be able to upgrade your engine implementation to meet standards like this when they are finalized. If people are doing it anyway, why not just make it the standard?
The SPARQL Working Group should be working on a definition for a function list and discovery protocol for S-expressions, and not for what we currently call SPARQL. What we call SPARQL is something that should compile to a simpler standard if various vendors want to implement it. S-expressions allow maximally simple parsing, maximally simple serialization, and the ability to do feature discovery on core features of the language, not just portions which are blessed with the ability to be extended. S-expressions are easier for machines to generate for a wide variety of automated use cases, far wider, I would venture, than the set of use cases for the human-readable queries.
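As a sketch of what the serialization and discovery halves of that claim could look like for a client, the Python below writes a query out and checks it against what an endpoint says it supports. The operator names (`project`, `vars`, `bgp`, `triple`, `group`), the convention that every list starts with an operator name, and the `advertised` set are all hypothetical; no actual service description format is being reproduced.

```python
def serialize(expr):
    """Write a nested-list expression back out as S-expression text."""
    if isinstance(expr, list):
        return "(" + " ".join(serialize(item) for item in expr) + ")"
    return str(expr)


def operators_used(expr):
    """Collect the head of every list in the expression (assumed to be an operator name)."""
    if not isinstance(expr, list) or not expr:
        return set()
    used = {expr[0]}
    for arg in expr[1:]:
        used |= operators_used(arg)
    return used


def missing_features(expr, advertised):
    """Operators the query needs that the endpoint did not advertise."""
    return operators_used(expr) - set(advertised)


# Hypothetical operator set advertised by a small endpoint.
advertised = {"project", "vars", "bgp", "triple"}

simple = ["project", ["vars", "?s"], ["bgp", ["triple", "?s", "ex:p", "?o"]]]
grouped = ["group", ["vars", "?s"], simple]

wire_text = serialize(simple)
unsupported = missing_features(grouped, advertised)
```

A client built this way can decide before sending anything whether an endpoint can answer a query, which is the kind of check that is hard to do when core features live in the grammar rather than in a list of named operators.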
Please, please, please do not doom the world to write the SPARQL equivalent of SQLAlchemy and ActiveRecord for the next 20 years! We can define a standard that machines can use natively. Now's the time.
At any rate, that's my beef in a nutshell. The working group won't come up with a successful standard until it's easy enough to implement that workable implementations appear in the languages that are defining the web today, and until people can use those languages to implement it without an army of VC-funded engineers.
The SPARQL 1.1 proposals make the standard better than before, but it's not the standard we need. The SPARQL algebra is what needed expansion and specification, not the syntax.
[1]: The PHP ARC project has an implementation, but it attempts to directly convert SPARQL to an SQL query on a particular table layout in MySQL, and is difficult to convert to general use. Despite SPARQL's complexity, ARC managed to implement this in just 6400 lines of code. The parser alone is 2000 lines and the engine another 4400. The serialization/parsing libraries, however, are fine, and were integrated successfully into the Drupal RDF module. The PHP RAP project has also done some good work and is perhaps more wrappable than ARC, but implements only a subset of SPARQL.
We finished the video for Dries' keynote just under the wire, as pretty much all such events need to be. Arto, Miglius and I had stayed up until past sunup for the last few days to make it happen. First Dan left, and then Miglius left on Saturday morning so that he could get stuck in Frankfurt for 24 hours. Once he got in to Boston, he logged on quick like a bunny and went back at it. Arto and I worked another 30-odd hours during Saturday and Sunday. Sometime during Monday, which I largely slept through, some of the office folk sent out a message noting that our pile of pizza remains, chicken bones and coffee stains was not particularly helpful to the kitchen's ambiance. I don't think they know who did it, and I'm kind of afraid to fess up. Sorry, ladies.
Unlike most demo work, a ton of what went into this will be useful later. If our organizations were not keen on using RDF, we'd not have worked on this so hard. Arto's module stuff is anything but smoke and mirrors, and we figured out a lot of limitations to Exhibit and Potluck that will be important to understand later. These are now posted in our internal wiki and I will go and post them on the Simile project's site if I ever get a chance. It's worth a whole post in and of itself.
While Arto busied himself turning Drupal into the world's easiest to use RDF endpoint, Miglius and I combed datasets that would make for a decent demo and messed with Exhibit views. There's a lot of RDF data out there, but it doesn't all lend itself to being shown on a map, and people can only read so much on a video screen during a presentation. At the end of the day, I'm the only one with Leopard (and thus Screenflow), so I ended up doing the actual screencast.
Screencasting is an interesting thing. It's easier to script than a regular movie, but difficult to properly realize. There's a fine line between too little and too much data, without having awkward pauses and without skipping over too much. You have to take into account that different viewers have different levels of experience with the material, different reading speeds, whatever. I made a detailed narration that was a bit too fast paced for the keynote; that wasn't a problem, as Dries had already communicated that he'd prefer to do the narration himself.
On Monday, Arto and I woke up about a half hour before the talk and got on IM. As the talk began, we realized that we really needed to have this data up where people could get it. And we really wanted them to be able to get it--we'd worked ridiculous hours on this thing. So that's when we decided the site needed to be public.
We started to make that happen. There was a fair bit of configuration to be done to make it useful; Arto got the video onto s3 while I messed about with some permissions and redirects. I typoed just about everything I did related to that--I don't think I did a single thing once. Halfway through the whole thing I realized I had stage fright; I couldn't type because my hands were shaking. The video I had worked so hard on was about to be placed up to awe or bore a sizable number of people, on whom much depends. And there was still a possibility that Dries would use my narration, in my mind, as we'd given him the final cut of the video with extremely little time to rehearse anything he wanted to say. So there I was, still in bed, with the door shut and the window blocking out what passes for sunshine in Stuttgart, and I was nervous as hell about being up in front of a crowd.
Stage frightened of nobody at all. What a cool world we live in, that such a feeling can now be transferred over the wire.
Anyways, we did a good job (well, mostly Arto did a good job) of getting the video out there for anyone who wanted it, and at least a couple of people did. Here's another copy, if you're curious: