View on GitHub

benchmark-leapfrog

Code

The code for our leapfrog implementation for Apache Jena is available here

Repeating the experiments

Prerequisites

any x64 linux distribution with glib support
java 8
python (both 2 or 3 works)
bzip2
- On a debian-based distro: sudo apt install bzip2
pip
SPARQLWrapper

Some of the following steps can take hours to complete, so we recommend using tmux to execute them.

Getting the repo and the dataset

Clone this repository.
- git clone git@github.com:GQgH5wFgzT/benchmark-leapfrog.git if you use ssh keys
or
- git clone https://github.com/GQgH5wFgzT/benchmark-leapfrog.git if you don’t.
Download the dataset used and move it to the benchmark folder
Extract it bzip2 -d wikidata-filtered.nt.bz2
Or you can construct the dataset from the latest truthy wikidata dump

Create the database for Jena and leapfrog

Download the files apache-jena-3.9.0.tar.gz from Apache Jena downloads page or here and move it into jena folder
Change directory into jena folder
Extract it tar -xf apache-jena-3.9.0.tar.gz
Create the database for jena apache-jena-3.9.0/bin/tdbloader2 --loc=db/jena ../wikidata-filtered.nt

Edit the file apache-jena-3.9.0/bin/tdbloader2index with any text editor. After the line 389

  generate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP

add the following lines:

  generate_index "$K1 $K3 $K2" "$DATA_TRIPLES" SOP
  generate_index "$K2 $K1 $K3" "$DATA_TRIPLES" PSO
  generate_index "$K3 $K2 $K1" "$DATA_TRIPLES" OPS

then save and exit.

Create the database for the leapfrog impementation apache-jena-3.9.0/bin/tdbloader2 --loc=db/leapfrog ../wikidata-filtered.nt

Create the database for Blazegraph

Download Blazegraph jar from its sourceforge page or from here and move it into blazegraph folder
Change directory into blazegraph folder
java -Xmx20g -cp blazegraph.jar com.bigdata.rdf.store.DataLoader load.properties ../wikidata-filtered.nt

Create the database for Virtuoso Opensource

Download the file from Virtuoso Open Source Edition v7.2.5.1 from its github releases page or from here and move it into virtuoso folder
Change directory into virtuoso folder
Extract it tar -xf virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
Init the server virtuoso-opensource/bin/virtuoso-t -c virtuoso.ini
The server can take some time to start, wait a minute and start the interactive sql: virtuoso-opensource/bin/isql localhost:1111 and enter the following commands:
- ld_dir('..', '*.nt', 'http://wikidata.org');
- rdf_loader_run();
- exit();
Shut down the server virtuoso-opensource/bin/isql localhost:1111 -K

Run the benchmark

Change directory into benchmark folder
bash run-benchmark.sh queries/bgps
bash run-benchmark.sh queries/optionals

Now the results are available in the folders queries/bgps/output and queries/optionals/output

For each query pattern you will find a folder containing four files, one for each database. Each line of a file contains three values separated by a semicolon: queryNumber;numberOfResutls;executionTimeInNanoseconds

Building the dataset

Download the file latest-truthy.nt.bz2 from here
Extract it bzip2 -d latest-truthy.nt.bz2
Move it to wikidata-transformation folder and change directory to that folder
Execute python transform_wikidata1 to remove labels and descriptions from wikidata, along with strings having other language than english
Execute python transform_wikidata2 to remove all properies listed in removed_properties.txt in our case we removed all properties that appeared more than 1.000.000 and less than 1.000 times.

Getting random queries for the benchmark

For each query pattern we created a java program that will find 50 random sets of properties with at least 1 result. The jars are in the find-queries folder. To find a query, you need to execute java -jar find_XYZ.jar [jena-database-location] properties_wikidata.txt, where properties_wikidata.txt is a file with the properties that can be chosen.

Results

You can find our results in the results folder. For each query pattern you will find a folder containing four files, one for each database. Each line of a file contains three values separated by a semicolon: queryNumber;numberOfResutls;executionTimeInNanoseconds