Saturday 23 April 2011

Simple SPARQL Query Examples Part 1: SELECT

Some understanding of the RDF triple format will help you understand these basic examples of SPARQL queries.

Again, when using Snorql, the following namespaces are included with all SPARQL queries:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
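To make it clearer what these prefixes actually do, here is a rough Python sketch (not part of Snorql; the `expand` helper is a made-up name for illustration). A prefixed name is just shorthand: the part before the colon is swapped for the namespace URI, and a prefix that was never declared cannot be expanded at all.

```python
# Prefix table copied from the Snorql defaults listed above.
PREFIXES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "": "http://dbpedia.org/resource/",
    "dbpedia2": "http://dbpedia.org/property/",
    "dbpedia": "http://dbpedia.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as ':Corydoras' into a full URI (hypothetical helper)."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return prefixes[prefix] + local


print(expand(":Corydoras"))         # http://dbpedia.org/resource/Corydoras
print(expand("dbpedia2:binomial"))  # http://dbpedia.org/property/binomial
```

Calling `expand("dbpedia-owl:genus")` with this table raises a `KeyError`, because dbpedia-owl is not among the defaults.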


fig 1

I've centered the following examples around the Corydoras genus of armored catfish.

The scientific taxonomy for these fish is shown in the screenshot from Wikipedia's infobox on the Corydoras aeneus species page (fig 1).

This will help you understand the correlation between the variables I've chosen in my example.

Example 1

Which predicates (properties) are used to refer to :Corydoras:

SELECT DISTINCT ?p
WHERE { ?s ?p :Corydoras  }

This query translates to:
Show the predicates that link some subject to the Corydoras resource (the object).

The most notable of the result set is: dbpedia:ontology/genus, which says there's at least one subject that is of genus Corydoras.

This example demonstrates that SPARQL query triples and matching result graphs are always in the form subject, predicate, object.
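If it helps to see the matching spelled out, here is a minimal Python sketch of how one triple pattern is matched against a set of triples. The triples are toy data made up for illustration (not real DBpedia output), and `match` is a hypothetical helper, not how any real SPARQL engine is implemented.

```python
# Toy triples (made up for illustration, not real DBpedia output).
TRIPLES = [
    ("Corydoras_aeneus", "genus", "Corydoras"),
    ("Corydoras_paleatus", "genus", "Corydoras"),
    ("Corydoras_aeneus", "family", "Callichthyidae"),
    ("Catfish_article", "wikiPageLink", "Corydoras"),
]


def match(pattern, triples=TRIPLES):
    """Yield one {variable: value} binding per triple that fits the pattern.

    Terms starting with '?' are variables; anything else must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break  # same variable already bound to a different value
            elif term != value:
                break  # constant term does not match this triple
        else:
            yield binding


# Equivalent of: SELECT DISTINCT ?p WHERE { ?s ?p :Corydoras }
predicates = sorted({b["?p"] for b in match(("?s", "?p", "Corydoras"))})
print(predicates)  # ['genus', 'wikiPageLink']
```

The set comprehension plays the role of DISTINCT: "genus" matches twice but is reported once.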


Example 2

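Find all the species that have genus :Corydoras. The dbpedia-owl: prefix used from here on is not one of Snorql's defaults, so add PREFIX dbpedia-owl: <http://dbpedia.org/ontology/> above the query: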

SELECT DISTINCT ?s
WHERE { ?s dbpedia-owl:genus :Corydoras }

The subjects that have Corydoras as their genus are all the Corydoras species.


Example 3

Which family (or families) does the :Corydoras genus belong to?

SELECT DISTINCT ?f
WHERE { ?s dbpedia-owl:genus :Corydoras.
?s dbpedia-owl:family ?f }


This query matches every object ?f for which some subject has Corydoras as its genus AND that same subject has ?f as its family.

Or, more succinctly, if something is in the Corydoras genus, what is its family?

The result set: Callichthyidae and Corydoradinae.

The shorthand version of this query:

SELECT DISTINCT ?f
WHERE { ?s dbpedia-owl:genus :Corydoras;
dbpedia-owl:family ?f }
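Both forms of the query do the same thing: two triple patterns joined on the shared ?s variable. A rough Python sketch of that join, using toy triples made up for illustration (not real DBpedia output):

```python
# Toy triples (made up for illustration, not real DBpedia output).
TRIPLES = {
    ("Corydoras_aeneus", "genus", "Corydoras"),
    ("Corydoras_aeneus", "family", "Callichthyidae"),
    ("Corydoras_paleatus", "genus", "Corydoras"),
    ("Corydoras_paleatus", "family", "Callichthyidae"),
    ("Neon_tetra", "genus", "Paracheirodon"),
    ("Neon_tetra", "family", "Characidae"),
}

# Pattern 1: ?s genus Corydoras
in_genus = {s for (s, p, o) in TRIPLES if p == "genus" and o == "Corydoras"}

# Pattern 2: ?s family ?f, keeping only subjects that also satisfied pattern 1.
# The shared ?s is what joins the two patterns; the set gives us DISTINCT.
families = {o for (s, p, o) in TRIPLES if p == "family" and s in in_genus}

print(sorted(families))  # ['Callichthyidae']
```

The Neon_tetra family never shows up, because that subject fails the first pattern.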


Example 4

List all species of the Characidae family along with their genus and binomial nomenclature. Order the result set by genus name:

SELECT DISTINCT ?species ?binomial ?genus
WHERE {?species dbpedia-owl:family :Characidae.
?species dbpedia-owl:genus ?genus.
?species dbpedia2:binomial ?binomial }
ORDER BY ?genus

Again, the shorthand version is less redundant:

SELECT DISTINCT ?species ?binomial ?genus
WHERE { ?species dbpedia-owl:family :Characidae;
dbpedia-owl:genus ?genus;
dbpedia2:binomial ?binomial }
ORDER BY ?genus

See the result set.


Example 5

Use the COUNT function to count the number of species in the preceding result set:

SELECT DISTINCT (COUNT(?species) AS ?num_species)
WHERE { ?species dbpedia-owl:family :Characidae;
dbpedia-owl:genus ?genus;
dbpedia2:binomial ?binomial }

The result is 77.
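One thing worth keeping in mind: COUNT(?species) counts matched rows, and the DISTINCT after SELECT applies to the single result row rather than to the species. If a species carried two genus or binomial values it would be counted twice; SPARQL 1.1's COUNT(DISTINCT ?species) counts each species once. Whether that changes the figure above depends on the data. A small Python sketch of the difference, with placeholder rows made up for illustration:

```python
# Toy solution rows as (species, genus, binomial); placeholders, not real data.
rows = [
    ("Fish_A", "Genus_X", "Genus_x fish_a"),
    ("Fish_B", "Genus_X", "Genus_x fish_b"),
    # Same species matched twice because it carries two binomial values.
    ("Fish_C", "Genus_Y", "Genus_y fish_c"),
    ("Fish_C", "Genus_Y", "Old_genus fish_c"),
]

# Like COUNT(?species): one per matched row.
row_count = len(rows)

# Like COUNT(DISTINCT ?species): one per different species.
species_count = len({species for species, _, _ in rows})

print(row_count, species_count)  # 4 3
```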

Thursday 21 April 2011

A simple SPARQL query

And oh so simple it is too.

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>


PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT * WHERE { ?s dbpedia-owl:genus :Corydoras }

All but the last prefix are included in Snorql queries automatically, and it was this matter of prefixes that had me confused all last night through to this morning.
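Outside Snorql nothing is added for you, so every prefix the query uses has to travel with it. Here is a hedged Python sketch of packaging the query above into a GET request URL. It only builds the URL and makes no network call; `build_request_url` is a made-up helper name, the endpoint address is not given in this post, and the `format` parameter is an assumption about what the server accepts.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public DBpedia endpoint; the `format` parameter is an assumption about what
# the server accepts, so check the endpoint's documentation before relying on it.
ENDPOINT = "http://dbpedia.org/sparql"

# Only the prefixes this query actually uses need to be declared.
QUERY = """\
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT * WHERE { ?s dbpedia-owl:genus :Corydoras }
"""


def build_request_url(query, result_format="text/csv"):
    """Return a GET URL carrying the query (hypothetical helper, no network call)."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": result_format})


url = build_request_url(QUERY)
# Round-trip check: decoding the URL gives back the exact query text.
assert parse_qs(urlsplit(url).query)["query"] == [QUERY]
print(url)
```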

A big frustration for me has been that many of the examples of SPARQL queries I've found online don't work for one reason or another.

I finally broke down and asked in Freenode's Semantic Web Interest Group IRC channel (#swig) what I was doing wrong, and was quickly enlightened.

Thank you lawd! (And William Waites AKA ww on Freenode!)

DBpedia facts and figures

Content of DBpedia's data set

Every resource in DBpedia (URLs in the form http://dbpedia.org/page/Name) is tied directly to an English Wikipedia page (in the form http://en.wikipedia.org/wiki/Name).
  • DBpedia derives its ontology from Wikipedia.
  • DBpedia's data set describes over 3.5 million "things" with over half a billion "facts" as of January 2010.
  • 1.67 million of these "things" are classified in a consistent ontology, including:
    • 364,000 persons
    • 462,000 places (including 340,000 populated places)
    • 99,000 music albums
    • 54,000 films
    • 17,000 video games
    • 148,000 organizations (including 35,000 companies and 34,000 educational institutions)
    • 169,000 species
    • 5,200 diseases
  • DBpedia uses the Resource Description Framework (RDF) to publish data extracted from Wikipedia.
  • Development toolkits are available for many programming languages to process DBpedia data.
  • The DBpedia data set features
    • labels and abstracts for these 3.5 million things in up to 97 different languages
    • 1,850,000 links to images
    • 5,900,000 links to external web pages
    • 6,500,000 external links into other RDF datasets
    • 633,000 Wikipedia categories
    • 2,900,000 YAGO categories
  • The knowledge base consists of over 672 million pieces of information (RDF triples), where
    • 286 million were extracted from the English edition of Wikipedia
    • 386 million were extracted from other language editions
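The one-to-one tie between DBpedia pages and English Wikipedia pages mentioned at the top of this list can be sketched in a few lines of Python. This assumes the two URL forms given there and nothing else; `wikipedia_url` is a made-up helper name.

```python
# The two URL forms quoted above.
DBPEDIA_PAGE = "http://dbpedia.org/page/"
WIKIPEDIA_PAGE = "http://en.wikipedia.org/wiki/"


def wikipedia_url(dbpedia_url):
    """Map a DBpedia page URL to its English Wikipedia counterpart (hypothetical helper)."""
    if not dbpedia_url.startswith(DBPEDIA_PAGE):
        raise ValueError(f"not a DBpedia page URL: {dbpedia_url}")
    # The page name is whatever follows the DBpedia prefix; reuse it unchanged.
    return WIKIPEDIA_PAGE + dbpedia_url[len(DBPEDIA_PAGE):]


print(wikipedia_url("http://dbpedia.org/page/Corydoras"))
# http://en.wikipedia.org/wiki/Corydoras
```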

The differences between URL, URI and URN

The W3C's specs are great, and I love them, but they can be overkill, so I'm reluctantly resorting to quoting Wikipedia:
URI
A Uniform Resource Identifier is a string of characters used to identify a name or a resource on the Internet... One can classify URIs as locators (URLs), or as names (URNs), or as both.

Simply put, a URI is a string of characters that identifies a resource on the Internet by location, by name, or by both.

URN
A Uniform Resource Name (URN) functions like a person's name, while a Uniform Resource Locator (URL) resembles that person's street address. In other words: the URN defines an item's identity, while the URL provides a method for finding it.

Here are two examples of URNs:

urn:isbn:0451450523
The URN for "The Last Unicorn", identified by its book number.
urn:mpeg:mpeg7:schema:2001
Default Namespace Rules for MPEG-7 video metadata.

URL
A Uniform Resource Locator is a Uniform Resource Identifier (URI) that specifies where an identified resource is available and the mechanism for retrieving it.

In other words, a URL includes an access mechanism, or network location, such as http:// or ftp://. Examples:

  • http://www.springboardseo.com/
  • telnet://192.0.1.22:80/
  • mailto:mail@springboardseo.com
  • ftp://springboardseo/robots.txt

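Note that subdomain.springboardseo.com on its own is not a URL: it names no access mechanism, and the same address might respond to different protocols, such as ftp://subdomain.springboardseo.com or http://subdomain.springboardseo.com. (Strictly speaking, without a scheme it isn't a complete URI either, just a host name.)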
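As a rough illustration of the definitions quoted above, here is a Python sketch that sorts identifiers by their scheme. It is a simplification: anything with the urn scheme is called a URN, anything else with a scheme is treated as a URL, and a string with no scheme at all is flagged separately. `classify` is a made-up helper name.

```python
from urllib.parse import urlsplit


def classify(identifier):
    """Rough classification following the definitions quoted above (a simplification)."""
    scheme = urlsplit(identifier).scheme
    if not scheme:
        return "no scheme"
    if scheme == "urn":
        return "URI (URN)"
    return "URI (URL)"


for example in (
    "urn:isbn:0451450523",
    "http://www.springboardseo.com/",
    "mailto:mail@springboardseo.com",
    "subdomain.springboardseo.com",
):
    print(f"{example} -> {classify(example)}")
```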

tl;dr

When describing web addresses, the proper term to use is URI; it will always be technically correct, whichever web address you're referring to. In fact it's possible that URL is near deprecation. That being said, you may be labelled pretentious in less geeky circles for nitpicking, so unless you enjoy questioning grimaces of the less informed, keep URL in your vocabulary.

Tuesday 19 April 2011

Next stop on the XML express is RDF station. Toot toot.

(I'm on my way to meet my new friend, SPARQL)
The Semantic Web is not a very fast growing technology.
One of the reasons for that is the learning curve. RDF was developed by people with academic background in logic and artificial intelligence. For traditional developers it is not very easy to understand.
That's from the first page of W3Schools' Semantic Web tutorial. My knowledge of the Semantic Web goes a little beyond what that page covers, but I was brushing up on my XML there anyway, so...

Note: W3Schools is notorious for inaccuracies—you've been warned.

...but in this case,
<?xml version="1.0"?>
<words type="famous last">I deemed it adequate to brush up on a pretty simple markup language.</words>

Now I'm not a traditional developer. In fact, I'm not really a developer at all. I'm great with well-formed HTML and CSS, and I know obviously bad PHP scripting when I see it, but I couldn't OOP my way out of a wet paper bag with a machete. It would probably take me 3 hours just to connect to one of my databases and perform the simplest queries. So that passage from W3Schools seems a little bit ominous.

Ah, what the hell. I think I know the XML "train system" well enough to get to RDF "station". Ah, wait, here we are. I'm getting off, I heard there's lots to do at this stop.

On to RDF!

Monday 18 April 2011

Back to Basics: XML

So I began my day today with more reading on XML. This led me to look closer at DTDs. In fact I just looked at the HTML 4 doctype closely for the first time. Kind of funny if you consider that it's the first line added to well-formed markup and that I've been using it religiously for years.

I never realized CDATA and PCDATA were considered building blocks of XML (and HTML) documents. Apparently there are more than just elements ;)
  • Elements
  • Attributes
  • Entities
  • CDATA (character data)
  • PCDATA (parsed character data)
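To see most of those building blocks in one place, here is a small Python sketch using the standard library's ElementTree parser. The sample document is made up for illustration: one element with one attribute, whose content mixes parsed character data, the predefined &amp; entity, and a CDATA section.

```python
import xml.etree.ElementTree as ET

# One element with one attribute; its content mixes parsed character data,
# a predefined entity (&amp;), and a CDATA section the parser does not scan for markup.
DOC = '<words type="famous last">XML &amp; friends: <![CDATA[<not-a-tag>]]></words>'

root = ET.fromstring(DOC)
print(root.tag)     # words
print(root.attrib)  # {'type': 'famous last'}
print(root.text)    # XML & friends: <not-a-tag>
```

The entity comes back as a plain ampersand, while the angle brackets inside the CDATA section survive as ordinary text instead of being read as a tag.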
I originally only wanted to learn how to scrape data sets from Wikipedia, but I've realized now that to get a firm understanding of information retrieval, the logical order of study (in this case) is:
  • XML (Come on, you know this one)
  • RDF (Resource Description Framework)
  • SPARQL (SPARQL Protocol and RDF Query Language)
I've always been interested in the semantic Web. Finally, I have some goals that will keep me motivated to dig into web 3.0!
If HTML and the Web made all the online documents look like one huge book, RDF, schema, and inference languages will make all the data in the world look like one huge database.
Tim Berners-Lee, Weaving the Web, 1999

Learning to Scrape Data from Wikipedia

So I’m learning a bit about scraping data.
Not the kind of lame data scraping that script-kiddie black hats use to duplicate sites for crappy made-for-AdSense sites; more along the lines of data sets from infoboxes: data sets that I can grab, throw in my own databases and use with my own site architecture and content.
Sites like Wikipedia have infobox templates on the top right of most of their pages, and dbpedia.org compiles and offers the data to the public. Now I just have to learn how to access the data I need from selected pages (in the thousands or more) and import it into .csv or similar.
I’ll keep you updated…
Update:
SPARQL (an SQL-like language for querying RDF) is used to communicate with dbpedia’s endpoint. Nice.
Semantic Web, here I come.