Code
The code for our leapfrog implementation for Apache Jena is available here
Repeating the experiments
Prerequisites
- any x64 Linux distribution with glibc support
- Java 8
- Python (either 2 or 3 works)
- bzip2
- On a Debian-based distro:
sudo apt install bzip2
- pip
Some of the following steps can take hours to complete, so we recommend using tmux to execute them.
Getting the repo and the dataset
- Clone this repository.
git clone git@github.com:GQgH5wFgzT/benchmark-leapfrog.git
if you use ssh keys
or
git clone https://github.com/GQgH5wFgzT/benchmark-leapfrog.git
if you don’t.
- Download the dataset used and move it to the benchmark folder
- Extract it
bzip2 -d wikidata-filtered.nt.bz2
- Alternatively, construct the dataset from the latest truthy Wikidata dump (see Building the dataset)
Create the database for Jena and leapfrog
- Download the file apache-jena-3.9.0.tar.gz from the Apache Jena downloads page or here and move it into the jena folder
- Change directory into the jena folder
- Extract it
tar -xf apache-jena-3.9.0.tar.gz
- Create the database for Jena
apache-jena-3.9.0/bin/tdbloader2 --loc=db/jena ../wikidata-filtered.nt
- Edit the file apache-jena-3.9.0/bin/tdbloader2index with any text editor. After line 389, which reads
generate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP
add the following lines:
generate_index "$K1 $K3 $K2" "$DATA_TRIPLES" SOP
generate_index "$K2 $K1 $K3" "$DATA_TRIPLES" PSO
generate_index "$K3 $K2 $K1" "$DATA_TRIPLES" OPS
then save and exit.
- Create the database for the leapfrog implementation
apache-jena-3.9.0/bin/tdbloader2 --loc=db/leapfrog ../wikidata-filtered.nt
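The three lines added to tdbloader2index appear to complete the set of triple index orderings. Assuming TDB's default indexes are SPO, POS and OSP, adding SOP, PSO and OPS gives all six permutations of subject, predicate and object. The sketch below enumerates the missing orderings and builds the matching generate_index lines. The mapping K1 = subject, K2 = predicate, K3 = object is inferred from the existing OSP line, and the helper names are illustrative, not part of Jena.

```python
from itertools import permutations

# Inferred from the existing line: OSP is written as "$K3 $K1 $K2".
KEY_VAR = {"S": "$K1", "P": "$K2", "O": "$K3"}

# Assumption: the indexes built by an unmodified tdbloader2index.
DEFAULT_INDEXES = ["SPO", "POS", "OSP"]


def missing_orderings(existing):
    """Return the S/P/O permutations that are not in `existing`."""
    all_orders = ["".join(p) for p in permutations("SPO")]
    return [order for order in all_orders if order not in existing]


def generate_index_line(order):
    """Build the shell line that would create the index named `order`."""
    keys = " ".join(KEY_VAR[component] for component in order)
    return 'generate_index "{}" "$DATA_TRIPLES" {}'.format(keys, order)


if __name__ == "__main__":
    for order in missing_orderings(DEFAULT_INDEXES):
        print(generate_index_line(order))
```

Run as a script, this prints one generate_index line per missing ordering, which can be compared against the lines listed in the step above.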
Create the database for Blazegraph
- Download the Blazegraph jar from its SourceForge page or from here and move it into the blazegraph folder
- Change directory into the blazegraph folder
- Create the database
java -Xmx20g -cp blazegraph.jar com.bigdata.rdf.store.DataLoader load.properties ../wikidata-filtered.nt
Create the database for Virtuoso Opensource
- Download Virtuoso Open Source Edition v7.2.5.1 from its GitHub releases page or from here and move it into the virtuoso folder
- Change directory into the virtuoso folder
- Extract it
tar -xf virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
- Start the server
virtuoso-opensource/bin/virtuoso-t -c virtuoso.ini
- The server can take some time to start. Wait a minute, then start the interactive SQL client:
virtuoso-opensource/bin/isql localhost:1111
and enter the following commands:
ld_dir('..', '*.nt', 'http://wikidata.org');
rdf_loader_run();
exit();
- Shut down the server
virtuoso-opensource/bin/isql localhost:1111 -K
Run the benchmark
- Change directory into the benchmark folder
- bash run-benchmark.sh queries/bgps
- bash run-benchmark.sh queries/optionals
The results are now available in the folders queries/bgps/output and queries/optionals/output
For each query pattern you will find a folder containing four files, one for each database. Each line of a file contains three values separated by a semicolon:
queryNumber;numberOfResults;executionTimeInNanoseconds
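To inspect a result file, each line can be split on the semicolon and the time converted from nanoseconds to milliseconds. The following is a minimal sketch that assumes all three fields are integers. The names Result, parse_result_line and summarize are illustrative and not part of the repository.

```python
from collections import namedtuple

Result = namedtuple("Result", ["query_number", "result_count", "time_ns"])


def parse_result_line(line):
    """Parse one 'query;results;nanoseconds' line into a Result."""
    query_number, result_count, time_ns = line.strip().split(";")
    return Result(int(query_number), int(result_count), int(time_ns))


def summarize(lines):
    """Return (number of queries, median execution time in milliseconds)."""
    results = [parse_result_line(line) for line in lines if line.strip()]
    times_ms = sorted(r.time_ns / 1e6 for r in results)
    if not times_ms:
        return 0, None
    mid = len(times_ms) // 2
    if len(times_ms) % 2:
        median = times_ms[mid]
    else:
        median = (times_ms[mid - 1] + times_ms[mid]) / 2.0
    return len(times_ms), median
```

An open file object can be passed directly as `lines`, for example `summarize(open(path))` for one of the files in an output folder.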
Building the dataset
- Download the file latest-truthy.nt.bz2 from here
- Extract it
bzip2 -d latest-truthy.nt.bz2
- Move it to the wikidata-transformation folder and change directory to that folder
- Execute
python transform_wikidata1
to remove labels and descriptions from Wikidata, along with strings in languages other than English
- Execute
python transform_wikidata2
to remove all properties listed in removed_properties.txt
(in our case, we removed all properties that appeared more than 1,000,000 times or fewer than 1,000 times).
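The two transformation steps can be pictured roughly as follows. This is a sketch of the idea, not the code of the actual scripts: the list of label and description predicates, the treatment of tags such as en-gb as English, and all function names are assumptions. Each input line is expected to be a well-formed N-Triples statement.

```python
from collections import Counter

# Assumed label/description predicates; the real scripts may use another list.
TEXT_PREDICATES = {
    "<http://www.w3.org/2000/01/rdf-schema#label>",
    "<http://www.w3.org/2004/02/skos/core#prefLabel>",
    "<http://www.w3.org/2004/02/skos/core#altLabel>",
    "<http://schema.org/name>",
    "<http://schema.org/description>",
}


def split_triple(line):
    """Split an N-Triples line into (subject, predicate, object)."""
    subject, predicate, rest = line.strip().split(" ", 2)
    obj = rest.rstrip()
    if obj.endswith("."):  # drop the statement terminator
        obj = obj[:-1].rstrip()
    return subject, predicate, obj


def is_non_english_literal(obj):
    """True for language-tagged literals whose tag is not English."""
    if not obj.startswith('"'):
        return False
    suffix = obj[obj.rfind('"') + 1:]
    if not suffix.startswith("@"):
        return False
    lang = suffix[1:].lower()
    return not (lang == "en" or lang.startswith("en-"))


def keep_in_step1(line):
    """Step 1: drop labels, descriptions and non-English strings."""
    _, predicate, obj = split_triple(line)
    return predicate not in TEXT_PREDICATES and not is_non_english_literal(obj)


def properties_to_remove(lines, low=1000, high=1000000):
    """Step 2: properties used fewer than `low` or more than `high` times."""
    counts = Counter(split_triple(line)[1] for line in lines if line.strip())
    return {prop for prop, n in counts.items() if n < low or n > high}
```

The default thresholds mirror the ones stated above. A set like the one returned by properties_to_remove corresponds to what removed_properties.txt would contain.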
Getting random queries for the benchmark
For each query pattern we created a Java program that finds 50 random sets of properties with at least 1 result.
The jars are in the find-queries folder.
To find a query, execute
java -jar find_XYZ.jar [jena-database-location] properties_wikidata.txt
where properties_wikidata.txt is a file with the properties that can be chosen.
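The selection procedure can be sketched as rejection sampling: draw a random set of properties, check whether the query built from it has at least one result, and repeat until 50 sets are found. The sketch below is an illustration under assumptions, not the logic of the actual jars. In particular, has_results is a hypothetical callback that would run the query against the Jena database, and the seed and max_tries parameters are added here for reproducibility and termination.

```python
import random


def find_property_sets(properties, set_size, has_results,
                       wanted=50, seed=None, max_tries=100000):
    """Sample distinct property tuples until `wanted` of them have results.

    `properties` is a list of candidate properties, `set_size` is the number
    of properties the query pattern needs, and `has_results` is a callback
    that returns True when the query for a candidate tuple is non-empty.
    """
    rng = random.Random(seed)
    found = []
    seen = set()
    for _ in range(max_tries):
        if len(found) == wanted:
            break
        # Order is kept because each position may fill a different slot
        # of the query pattern.
        candidate = tuple(rng.sample(properties, set_size))
        if candidate in seen:
            continue
        seen.add(candidate)
        if has_results(candidate):
            found.append(candidate)
    return found
```

If fewer than `wanted` sets are returned, the candidate properties were exhausted or max_tries was reached before enough non-empty queries were found.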
Results
You can find our results in the results folder. For each query pattern you will find a folder containing four files, one for each database. Each line of a file contains three values separated by a semicolon: queryNumber;numberOfResults;executionTimeInNanoseconds