Wednesday, 9 April 2014

Creating HTML Web Pages from Linked Data Using PHP, SPARQL, and ARC2

Here you'll learn how to output the results of SPARQL queries directly to an HTML web page, instead of using a public endpoint client such as SNORQL.

I'll expand on the SPARQL query I described 3 years ago, which outputs all members of the Characidae family of freshwater fish, with their corresponding genus and binomial name (ordered by genus):


SELECT DISTINCT ?species ?binomial ?genus
WHERE { ?species dbpedia-owl:family :Characidae;
dbpedia-owl:genus ?genus;
dbpedia2:binomial ?binomial }
ORDER BY ?genus


The first thing we'll need is a PHP library that queries SPARQL endpoints. We'll be using Semsol's ARC2 library. Huge thanks to Gilles Falquet (associate professor at the University of Geneva), and his tutorial on Creating web pages from linked data with PHP and SPARQL for getting us past this initial hump. His post is one of the few of its kind out there, and he was even generous enough to clarify some of the points I had difficulty with.

  1. Download Semsol's ARC2 library from their GitHub repository
  2. Create a directory on your web server for your project. Let's name the directory proj
  3. Decompress the downloaded archive, and place it into your proj directory. Rename it from semsol-arc2-bc67abe to something simpler. Let's make it semsol
  4. Create a PHP file in your project directory, and add the following:
    <html>
      <body>
     
      <?php
      /* ARC2 static class inclusion */ 
      include_once('semsol/ARC2.php');  
     
      $dbpconfig = array(
      "remote_store_endpoint" => "http://dbpedia.org/sparql",
       );
     
      $store = ARC2::getRemoteStore($dbpconfig); 
     
      if ($errs = $store->getErrors()) {
         echo "<h1>getRemoteStore error</h1>";
      }
     
      $query = '
      PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
      PREFIX owl: <http://www.w3.org/2002/07/owl#>
      PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      PREFIX : <http://dbpedia.org/resource/>
      PREFIX dbpedia2: <http://dbpedia.org/property/>
      PREFIX dbpedia: <http://dbpedia.org/>
      PREFIX dbpprop: <http://dbpedia.org/property/>
    
      SELECT DISTINCT ?species ?binomial ?genus ?label
      WHERE { ?species dbpedia-owl:family :Characidae;
            dbpprop:genus ?genus;
            rdfs:label ?label;
            dbpedia2:binomial ?binomial.
            filter ( langMatches(lang(?label), "en") ) }
      ORDER BY ?genus';
      
      /* execute the query */
      $rows = $store->query($query, 'rows'); 
     
        if ($errs = $store->getErrors()) {
           echo "Query errors" ;
           print_r($errs);
        }
     
        /* display the results in an HTML table */
        echo "<table border='1'>
        <thead>
          <tr>
            <th>#</th>
            <th>Species (Label)</th>
            <th>Binomial</th>
            <th>Genus</th>
          </tr>
        </thead>";
    
        /* loop for each returned row */
        $id = 0; /* row counter for the # column */
        foreach ( $rows as $row ) {
        print "<tr><td>" . ++$id . "</td>
        <td><a href='" . $row['species'] . "'>" .
        $row['label'] . "</a></td><td>" .
        $row['binomial'] . "</td><td>" .
        $row['genus'] . "</td></tr>";
        }
        echo "</table>";
    
      ?>
      </body>
    </html>
    

In this SPARQL query, I went a couple of steps further than I did with my previous 'Characidae family' query, just to make things a bit more interesting:

  • I queried genus by entity type Property (prefix dbpprop:, as in dbpprop:genus), instead of entity type ObjectProperty (prefix dbpedia-owl:, as in dbpedia-owl:genus)
  • I added rdfs:label to the query and its ?label variable to the SELECT clause. rdfs:label is an instance of rdf:Property that provides a human-readable version of a resource name
  • I linked to the DBpedia resource page for each species using its rdfs:label as the link's anchor text. This originally resulted in duplicate rows, because labels are available in multiple languages. This was resolved using filter ( langMatches(lang(?label), "en") )
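The loop in the script prints each value exactly as the endpoint returns it. A more defensive variant, sketched below, escapes every value before it reaches the page. render_row is a hypothetical helper, not part of ARC2, and the sample row is made up; only the keys (species, label, binomial, genus) come from the query above.

```php
<?php
// Hypothetical helper (not part of ARC2): builds one HTML table row
// from a result row keyed by the query's variable names.
function render_row(array $row, int $n): string
{
    // Escape every value so markup in the data cannot break the page.
    $esc = function ($value) {
        return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    };

    return '<tr><td>' . $n . '</td>'
        . '<td><a href="' . $esc($row['species']) . '">' . $esc($row['label']) . '</a></td>'
        . '<td>' . $esc($row['binomial']) . '</td>'
        . '<td>' . $esc($row['genus']) . '</td></tr>';
}

// Illustrative row only; these values are invented, not DBpedia data.
$sample = array(
    'species'  => 'http://dbpedia.org/resource/Example_fish',
    'label'    => 'Example <fish>',
    'binomial' => 'Examplus fishus',
    'genus'    => 'http://dbpedia.org/resource/Examplus',
);

echo render_row($sample, 1), "\n";
```

Inside the foreach loop this would be called as print render_row($row, ++$id);.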

Output:


More to come!

Saturday, 23 April 2011

Simple SPARQL Query Examples Part 1: SELECT

Some understanding of the RDF triple format will help you understand these basic examples of SPARQL queries.

Again, when using Snorql, the following namespaces are included with all SPARQL queries:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 

The examples below also use the dbpedia-owl: prefix, which is not in that list, so declare it yourself:

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>


fig 1

I've centered the following examples around the Corydoras genus of armored catfish.

The scientific taxonomy for these fish is shown in the screenshot from Wikipedia's infobox on the Corydoras aeneus species page (fig 1).

This will help you understand the correlation between the variables I've chosen in my example.

Example 1

Which predicates (properties) are used to refer to :Corydoras:

SELECT DISTINCT ?p
WHERE { ?s ?p :Corydoras  }

This query translates to:
Show the predicates that link any subject to the Corydoras resource (the object).

The most notable of the result set is: dbpedia:ontology/genus, which says there's at least one subject that is of genus Corydoras.

This example demonstrates that SPARQL query triples and matching result graphs are always in the form subject, predicate, object.
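To make the subject, predicate, object idea concrete, here is a minimal PHP sketch that matches a triple pattern against a small in-memory list of triples. It is an illustration of the matching idea only, not how a SPARQL endpoint is implemented. match_pattern is a hypothetical helper, and the three triples are hand-written sample data rather than a DBpedia extract.

```php
<?php
// Hypothetical helper: returns the triples that match a pattern.
// A term starting with "?" is a variable and matches anything.
// Simplification: a variable used twice is not forced to bind the same value.
function match_pattern(array $triples, array $pattern): array
{
    $matches = array();
    foreach ($triples as $triple) {
        $ok = true;
        foreach ($pattern as $i => $term) {
            $isVariable = (strpos($term, '?') === 0);
            if (!$isVariable && $term !== $triple[$i]) {
                $ok = false;
                break;
            }
        }
        if ($ok) {
            $matches[] = $triple;
        }
    }
    return $matches;
}

// Illustrative data only: each entry is array(subject, predicate, object).
$triples = array(
    array(':Corydoras_aeneus', 'dbpedia-owl:genus', ':Corydoras'),
    array(':Corydoras_aeneus', 'dbpedia-owl:family', ':Callichthyidae'),
    array(':Corydoras_paleatus', 'dbpedia-owl:genus', ':Corydoras'),
);

// Same shape as Example 1: ?s ?p :Corydoras
$predicates = array();
foreach (match_pattern($triples, array('?s', '?p', ':Corydoras')) as $t) {
    $predicates[$t[1]] = true; // using array keys mimics DISTINCT
}
print_r(array_keys($predicates));
```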


Example 2

Find all the species that have genus :Corydoras:

SELECT DISTINCT ?s
WHERE { ?s dbpedia-owl:genus :Corydoras }

The subjects that have Corydoras as their genus are all the Corydoras species.


Example 3

Which family (or families) does the :Corydoras genus belong to?

SELECT DISTINCT ?f
WHERE { ?s dbpedia-owl:genus :Corydoras.
?s dbpedia-owl:family ?f }


This query is asking to match all objects where some subject has as its genus Corydoras AND where that subject has as its family the variable we're looking for.

Or, more succinctly, if something is in the Corydoras genus, what is its family?

The result set: Callichthyidae and Corydoradinae.

The shorthand version of this query:

SELECT DISTINCT ?f
WHERE { ?s dbpedia-owl:genus :Corydoras;
dbpedia-owl:family ?f }
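
The semicolon simply repeats the subject for each following predicate and object pair. As a rough illustration, the PHP sketch below writes the long form back out from a subject and a list of pairs. expand_shorthand is a hypothetical helper written for this post, not a function of ARC2 or any SPARQL library.

```php
<?php
// Hypothetical helper: writes out the long form of "subject p1 o1; p2 o2".
function expand_shorthand(string $subject, array $pairs): string
{
    $lines = array();
    foreach ($pairs as $pair) {
        list($predicate, $object) = $pair;
        // Each pair becomes a full triple pattern with the subject repeated.
        $lines[] = $subject . ' ' . $predicate . ' ' . $object . ' .';
    }
    return implode("\n", $lines);
}

echo expand_shorthand('?s', array(
    array('dbpedia-owl:genus', ':Corydoras'),
    array('dbpedia-owl:family', '?f'),
)), "\n";
```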


Example 4

List all species of the Characidae family along with their genus and binomial nomenclature. Order the result set by genus name:

SELECT DISTINCT ?species ?binomial ?genus
WHERE {?species dbpedia-owl:family :Characidae.
?species dbpedia-owl:genus ?genus.
?species dbpedia2:binomial ?binomial }
ORDER BY ?genus

Again, the shorthand version is less redundant:

SELECT DISTINCT ?species ?binomial ?genus
WHERE { ?species dbpedia-owl:family :Characidae;
dbpedia-owl:genus ?genus;
dbpedia2:binomial ?binomial }
ORDER BY ?genus

See the result set.


Example 5

Use the COUNT function to count the number of species in the preceding result set:

SELECT DISTINCT (COUNT(?species) AS ?num_species)
WHERE { ?species dbpedia-owl:family :Characidae;
dbpedia-owl:genus ?genus;
dbpedia2:binomial ?binomial }

The result is 77.
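
One caveat: in this query DISTINCT applies to the single row that comes back, not to the values being counted, so a species that matches more than one genus or binomial value is counted once per match. SPARQL 1.1 has COUNT(DISTINCT ?species) for counting each species once. The PHP sketch below shows the same difference on the client side; the rows are invented for illustration and are not DBpedia data.

```php
<?php
// Illustrative rows: :Fish_A appears twice because it has two binomial values.
$rows = array(
    array('species' => ':Fish_A', 'genus' => ':Genus_X', 'binomial' => 'Genus x a'),
    array('species' => ':Fish_A', 'genus' => ':Genus_X', 'binomial' => 'Genus x a (alt.)'),
    array('species' => ':Fish_B', 'genus' => ':Genus_Y', 'binomial' => 'Genus y b'),
);

// Counting rows counts every match.
$num_rows = count($rows);

// Counting unique values of the species column counts each species once.
$num_species = count(array_unique(array_column($rows, 'species')));

echo "rows: $num_rows, distinct species: $num_species\n";
```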

Thursday, 21 April 2011

A simple SPARQL query

And oh so simple it is too.

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>


PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT * WHERE { ?s dbpedia-owl:genus :Corydoras }

All but the last prefix are included in SNORQL queries, and it was the matter of prefixes that was actually confusing me all last night through to this morning.
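
If you send queries from your own script rather than through SNORQL, nothing is added for you, so every prefix has to be declared. A minimal PHP sketch for keeping the declarations in one place follows. with_prefixes is a hypothetical helper, not part of any library; the two prefix URIs are the ones used above.

```php
<?php
// Hypothetical helper: prepends PREFIX declarations to a query body.
function with_prefixes(array $prefixes, string $body): string
{
    $lines = array();
    foreach ($prefixes as $name => $uri) {
        // An empty name produces the default prefix, "PREFIX : <...>".
        $lines[] = 'PREFIX ' . $name . ': <' . $uri . '>';
    }
    $lines[] = $body;
    return implode("\n", $lines);
}

$prefixes = array(
    ''            => 'http://dbpedia.org/resource/',
    'dbpedia-owl' => 'http://dbpedia.org/ontology/',
);

echo with_prefixes($prefixes, 'SELECT * WHERE { ?s dbpedia-owl:genus :Corydoras }'), "\n";
```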

A big frustration for me has been that many of the examples of SPARQL queries I've found online don't work for one reason or another.

I finally broke down and asked in Freenode's Semantic Web Interest Group IRC channel (#swig) what I was doing wrong, and was quickly enlightened.

Thank you lawd! (And William Waites AKA ww on Freenode!)

DBpedia facts and figures

Content of DBpedia's data set

Every resource in DBpedia (URLs in the form http://dbpedia.org/page/Name) is tied directly to an English Wikipedia page (in the form http://en.wikipedia.org/wiki/Name).
  • DBpedia derives its ontology from Wikipedia.
  • DBpedia's data set describes over 3.5 million "things" with over half a billion "facts" as of January 2010.
  • 1.67 million of these "things" are classified in a consistent Ontology, including:
    • 364,000 persons
    • 462,000 places (including 340,000 populated places)
    • 99,000 music albums
    • 54,000 films
    • 17,000 video games
    • 148,000 organizations (including 35,000 companies and 34,000 educational institutions)
    • 169,000 species
    • 5,200 diseases
  • DBpedia uses the Resource Description Framework (RDF) to publish data extracted from Wikipedia.
  • Development toolkits are available for many programming languages to process DBpedia data.
  • The DBpedia data set features
    • labels and abstracts for these 3.5 million things in up to 97 different languages
    • 1,850,000 links to images
    • 5,900,000 links to external web pages
    • 6,500,000 external links into other RDF datasets
    • 633,000 Wikipedia categories
    • 2,900,000 YAGO categories
  • The knowledge base consists of over 672 million pieces of information (RDF triples), where
    • 286 million were extracted from the English edition of Wikipedia
    • 386 million were extracted from other language editions

The differences between URL, URI and URN

The W3C's specs are great, and I love them, but they can be overkill, so I'm reluctantly resorting to quoting Wikipedia:
URI
A Uniform Resource Identifier is a string of characters used to identify a name or a resource on the Internet... One can classify URIs as locators (URLs), or as names (URNs), or as both.

Simply put, a URI is a string of characters that identifies a resource on the Internet, either by location, a name, or by both.

URN
A Uniform Resource Name (URN) functions like a person's name, while a Uniform Resource Locator (URL) resembles that person's street address. In other words: the URN defines an item's identity, while the URL provides a method for finding it.

Here are 2 examples of URNs:

urn:isbn:0451450523
The URN for "The Last Unicorn", identified by its book number.
urn:mpeg:mpeg7:schema:2001
Default Namespace Rules for MPEG-7 video metadata.

URL
A Uniform Resource Locator is a Uniform Resource Identifier (URI) that specifies where an identified resource is available and the mechanism for retrieving it.

In other words, a URL includes an access mechanism, or network location, such as http:// or ftp://. Examples:

  • http://www.springboardseo.com/
  • telnet://192.0.1.22:80/
  • mailto:mail@springboardseo.com
  • ftp://springboardseo/robots.txt

Note that subdomain.springboardseo.com is not a URL; it's a URI, as this address might respond to different protocols, such as ftp://subdomain.springboardseo.com or http://subdomain.springboardseo.com.
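
The access mechanism is easy to check in code. This short PHP sketch uses the built-in parse_url function to pull out the scheme of an address, if there is one. scheme_of is a hypothetical wrapper written for this post; the addresses are the ones used in the examples above.

```php
<?php
// Returns the scheme (access mechanism) of an address, or null when none is given.
function scheme_of(string $address): ?string
{
    $scheme = parse_url($address, PHP_URL_SCHEME);
    return is_string($scheme) ? $scheme : null;
}

$addresses = array(
    'http://www.springboardseo.com/',
    'ftp://subdomain.springboardseo.com',
    'subdomain.springboardseo.com',
);

foreach ($addresses as $address) {
    $scheme = scheme_of($address);
    echo $address, ' => ', ($scheme === null ? 'no scheme' : $scheme), "\n";
}
```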

tl;dr

When describing web addresses, the proper term to use is URI; it will always be technically correct, whichever web address you're referring to. In fact it's possible that URL is near deprecation. That being said, you may be labelled pretentious in less geeky circles for nitpicking, so unless you enjoy questioning grimaces of the less informed, keep URL in your vocabulary.

Tuesday, 19 April 2011

Next stop on the XML express is RDF station. Toot toot.

(I'm on my way to meet my new friend, SPARQL)
The Semantic Web is not a very fast growing technology.
One of the reasons for that is the learning curve. RDF was developed by people with academic background in logic and artificial intelligence. For traditional developers it is not very easy to understand.
That's from the first page of W3Schools' semantic Web tutorial. My knowledge of the semantic Web goes a little beyond what that page covers, but I was brushing up on my XML there anyway, so...

Note: W3Schools is notorious for inaccuracies—you've been warned.

...but in this case,
<?xml version="1.0"?>
<words type="famous last">I deemed it adequate to brush up on a pretty simple markup language.</words>

Now I'm not a traditional developer. In fact, I'm not really a developer at all. I'm great with well-formed HTML and CSS, and I know obviously bad PHP scripting when I see it, but I couldn't OOP my way out of a wet paper bag with a machete. It would probably take me 3 hours just to connect to one of my databases and perform the simplest queries. So that passage from W3Schools seems a little bit ominous.

Ah, what the hell. I think I know the XML "train system" well enough to get to RDF "station". Ah, wait, here we are. I'm getting off, I heard there's lots to do at this stop.

On to RDF!

Monday, 18 April 2011

Back to Basics: XML

So I began my day today with more reading on XML. This led me to look closer at DTDs. In fact I just looked at the HTML 4 doctype closely for the first time. Kind of funny if you consider that it's the first line added to well-formed markup and that I've been using it religiously for years.

I never realized CDATA and PCDATA were considered building blocks of XML (and HTML) documents. Apparently there are more than just elements ;)
  • Elements
  • Attributes
  • Entities
  • CDATA (character data)
  • PCDATA (parsed character data)
I originally only wanted to learn how to scrape data sets from Wikipedia, but I've realized now that to get a firm understanding of information retrieval, the logical order of study (in this case) is:
  • XML (Come on, you know this one)
  • RDF (Resource Description Framework)
  • SPARQL (SPARQL Protocol and RDF Query Language)
I've always been interested in the semantic Web. Finally, I have some goals that will keep me motivated to dig into web 3.0!
If HTML and the Web made all the online documents look like one huge book, RDF, schema, and inference languages will make all the data in the world look like one huge database.
Tim Berners-Lee, Weaving the Web, 1999