Monday, 18 April 2011

Learning to Scrape Data from Wikipedia

So I’m learning a bit about scraping data.
Not the kind of lame data scraping that script-kiddie black hats use to duplicate sites for crappy made-for-AdSense pages; more along the lines of data sets from infoboxes: datasets I can grab, throw into my own databases, and use with my own site architecture and content.
Sites like Wikipedia have infobox templates at the top right of most of their pages, and dbpedia.org compiles that data and offers it to the public. Now I just have to learn how to access the data I need from selected pages (in the thousands or more) and import it into .csv or similar.
I’ll keep you updated…
Update:
SPARQL (an SQL-like language for querying RDF) is used to communicate with DBpedia's endpoint. Nice.
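For anyone curious what that looks like in practice, here's a rough sketch of how you could build a request against DBpedia's public endpoint (http://dbpedia.org/sparql) and ask for CSV back. The query itself is just an example I made up (cities and their populations); the class and property names are assumptions, so check them against the DBpedia ontology before relying on them.

```python
# Sketch: construct a SPARQL request URL for DBpedia's public endpoint,
# asking for CSV output. The dbo:City / dbo:populationTotal names are
# illustrative assumptions, not a verified schema.
import urllib.parse

ENDPOINT = "http://dbpedia.org/sparql"

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
} LIMIT 10
"""

# Percent-encode the query and ask the endpoint for CSV results
url = ENDPOINT + "?" + urllib.parse.urlencode(
    {"query": query, "format": "text/csv"}
)
print(url)
```

Fetching that URL (with urllib or whatever HTTP library you like) should hand back rows you can drop straight into a .csv file, which is exactly the import path I was after.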
Semantic Web, here I come.