This package provides R script developers with utilities for tracking meaningful provenance through inline R code.
Find the Python version of this package at https://github.com/GeoinformationSystems/provo.
This package was written because existing provenance tracking packages tend either to produce lengthy documents with a lot of superfluous information (typical of automatic provenance tracing) or to rule out fine-grained tracking (e.g. qualified comments used to track provenance cannot interpret conditions or loops). In short: this package leaves control over what is tracked entirely to the user, while simplifying the task as much as possible.
This package produces provenance graphs that adhere to the PROV-Ontology (https://www.w3.org/TR/prov-o/). Currently the package only implements the "starting point terms" (https://www.w3.org/TR/2013/REC-prov-o-20130430/#cross-reference-starting-point-terms) of the PROV-Ontology. These starting point terms consist of the three classes Entity, Activity and Agent, as well as a set of allowed relations between those classes (see Fig. 1).
Fig. 1 - Starting Point Terms of PROV-O (time related relations excluded). Notice the direction of the relations.
In terms of data processing, Entities can be viewed as data and Activities as processes that use existing data and generate new data. Agents stand for persons, organizations or even software that carry out these processing steps. An example where a process uses some input data to produce some output data is shown in Fig. 2.
Fig. 2 - PROV-O description of a process that uses some input data and produces some output data, controlled by some agent.
A provenance graph is generated by concatenating multiple activities and entities, e.g. Fig. 3.
Fig. 3 - Simple PROV-O provenance graph without agents.
As a first step, the graph has to be set up:
Listing 1
init_provenance_graph(namespace = "https://www.yournamespace.com/script#")
PROV-O is RDF-based, which means that every Entity, Activity and Agent needs a unique ID that is provided as an IRI. Every ID that you give to an Entity, Activity or Agent is appended to the namespace you defined on graph initialization to build the IRI. See Hints / Pitfalls for more detailed information on namespaces.
The package implements the three PROV-O node types Entity, Activity and Agent as R environments with class-like behavior. Each node must have an ID and can have a label and a description. If no label is given on instantiation, the ID is used as the label.
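The concatenation described above can be pictured with a few lines of base R. This is only a sketch of the idea; `build_iri` is a hypothetical helper introduced for illustration and is not a function of this package.

```r
# Hypothetical helper (not part of provr): append an ID to a namespace.
build_iri <- function(namespace, id) {
  paste0(namespace, id)
}

namespace <- "https://www.yournamespace.com/script#"
iri <- build_iri(namespace, "input_raster")
print(iri)
```

Because the namespace ends in `#`, the ID becomes the fragment of the resulting IRI.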
E.g. creating an Entity with the ID 'input_raster', the label 'Input Raster' and the description 'A 10 x 10 raster with values ranging from 0 to 100' would look like this:
Listing 2
in_raster_entity <- Entity(
  id = "input_raster",
  label = "Input Raster",
  description = "A 10 x 10 raster with values ranging from 0 to 100"
)
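The label fallback and the environment-based design can be illustrated without the package. The following is a minimal sketch under the assumption that a node is an environment holding its fields; `new_node` is a hypothetical stand-in, and the actual constructors in this package may be implemented differently.

```r
# Hypothetical stand-in for a node constructor (not provr's implementation).
new_node <- function(id, label = NULL, description = NULL) {
  node <- new.env()
  node$id <- id
  # Fall back to the ID when no label is supplied
  node$label <- if (is.null(label)) id else label
  node$description <- description
  # Tag the environment so it can be told apart from plain environments
  class(node) <- c("Node", class(node))
  node
}

labelled <- new_node("input_raster", "Input Raster")
unlabelled <- new_node("other_raster")
print(labelled$label)
print(unlabelled$label)
```

Since environments have reference semantics, a change made to a node through one variable is visible through every other variable that refers to the same node.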
Now let's assume we have a script in which a 10 x 10 input raster with values ranging from 0 to 100 is converted to a mask raster by applying a threshold operation that sets values of 70 or greater to 1 and all other values to 0. The resulting mask is written to a new raster.
Listing 3
in_raster <- array(val <- sample(0:100), dim = c(10, 10))
out_raster <- array(NA, dim = c(10, 10))
for (i in seq_along(in_raster)) {
  if (in_raster[i] >= 70) {
    out_raster[i] <- 1
  } else {
    out_raster[i] <- 0
  }
}
To track the provenance of this example we can adjust the script to:
Listing 4
library("provr")
init_provenance_graph(namespace = "https://www.provr.com/10x10raster_ex#")
in_raster <- array(val <- sample(0:100), dim = c(10, 10))
# build input entity
in_raster_entity <- Entity("in_raster", "Input Raster")
out_raster <- array(NA, dim = c(10, 10))
# build output entity
out_raster_entity <- Entity("out_raster", "Mask Raster")
for (i in seq_along(in_raster)) {
  if (in_raster[i] >= 70) {
    out_raster[i] <- 1
  } else {
    out_raster[i] <- 0
  }
}
# build mask activity
mask_activity <- Activity("mask", "Mask", "Generate mask by setting
every value in the input raster that is 70 or greater
to 1 and each other value to 0.")
# set inputs for mask activity
mask_activity$used(in_raster_entity)
# set the activity that generated the output entity
out_raster_entity$wasGeneratedBy(mask_activity)
# write graph to file
serialize_provenance_graph(name = "10x10raster_ex.ttl")
Visualized with the PROV-Viewer application (https://github.com/GeoinformationSystems/ProvViewer) you get Fig. 4.
Fig. 4 - Visualization of Listing 4. (Notice that ProvViewer shows the PROV-O properties wasUsedBy and generated, instead of used and wasGeneratedBy; i.e. the arrows point in the opposite direction.)
The resulting RDF graph file in Turtle, 10x10raster_ex.ttl, looks as follows:
Listing 5
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rscript: <https://www.provr.com/10x10raster_ex#> .
rscript:in_raster
    a prov:Entity ;
    rdfs:label "Input Raster" .
rscript:mask
    a prov:Activity ;
    rdfs:comment """Generate mask by setting
every value in the input raster that is 70 or greater
to 1 and each other value to 0.""" ;
    rdfs:label "Mask" ;
    prov:used rscript:in_raster .
rscript:out_raster
    a prov:Entity ;
    rdfs:label "Mask Raster" ;
    prov:wasGeneratedBy rscript:mask .
What we just did was track provenance at a coarse-grained level, i.e. the resulting graph mainly conveys the meaning of the processing steps. But there are also cases where users might want to track provenance at a fine-grained level. In our example that would mean tracking the provenance pixel-wise. That requires us to set up an entity for each pixel of both the input and the output raster. We also define two 'classes' of activities: 'set_to_1' and 'set_to_0'. On each iteration of our loop we set up an 'instance' of one or the other. By 'classes' and 'instances' I mean that each activity gets a unique ID, but we only use two different labels. For this example we reduce the raster size to 3x3. We furthermore define two more (more or less arbitrary) activities called 'enter_loop' and 'leave_loop' to make the resulting graph easier to understand.
Listing 6
library("provr")
library("uuid")
init_provenance_graph(namespace = "https://www.provr.com/10x10raster_ex#")
in_raster <- array(val <- sample(0:100), dim = c(3, 3))
in_raster_entity <- Entity("in_raster", "Input Raster")
out_raster <- array(NA, dim = c(3, 3))
out_raster_entity <- Entity("out_raster", "Mask Raster")
enter_loop_activity <- Activity("enter_loop", "Enter Loop")
enter_loop_activity$used(in_raster_entity)
leave_loop_activity <- Activity("leave_loop", "Leave Loop")
for (i in seq_along(in_raster)) {
  # build input px entity
  id <- paste("in_px", toString(i), sep = "_")
  label <- toString(in_raster[i])
  in_px_entity <- Entity(id, label)
  in_px_entity$wasGeneratedBy(enter_loop_activity)
  out_px_entity <- NA
  if (in_raster[i] >= 70) {
    out_raster[i] <- 1
    # build set to 1 activity
    id <- paste("set_to_one", UUIDgenerate(), sep = "_")
    set_to_one_activity <- Activity(id, "Set Pixel to 1")
    set_to_one_activity$used(in_px_entity)
    # build output px entity
    id <- paste("out_px", toString(i), sep = "_")
    out_px_entity <- Entity(id, "1")
    out_px_entity$wasGeneratedBy(set_to_one_activity)
  } else {
    out_raster[i] <- 0
    # build set to 0 activity
    id <- paste("set_to_zero", UUIDgenerate(), sep = "_")
    set_to_zero_activity <- Activity(id, "Set Pixel to 0")
    set_to_zero_activity$used(in_px_entity)
    # build output px entity
    id <- paste("out_px", toString(i), sep = "_")
    out_px_entity <- Entity(id, "0")
    out_px_entity$wasGeneratedBy(set_to_zero_activity)
  }
  leave_loop_activity$used(out_px_entity)
}
out_raster_entity$wasGeneratedBy(leave_loop_activity)
# write graph to file
serialize_provenance_graph(name = "10x10raster_ex_fine.ttl")
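A rough way to see why the raster is reduced to 3x3 is to count the nodes and relations that Listing 6 sets up. The sketch below is plain arithmetic derived from reading the listing, not a feature of the package; `count_nodes` and `count_relations` are hypothetical helper names.

```r
# Nodes created by Listing 6 for a raster with n_px pixels.
count_nodes <- function(n_px) {
  fixed_nodes <- 4      # in_raster, out_raster, enter_loop, leave_loop
  nodes_per_pixel <- 3  # input pixel entity, set_to_* activity, output pixel entity
  fixed_nodes + nodes_per_pixel * n_px
}

# Relations (used / wasGeneratedBy calls) made by Listing 6.
count_relations <- function(n_px) {
  fixed_relations <- 2      # enter_loop used in_raster, out_raster wasGeneratedBy leave_loop
  relations_per_pixel <- 4  # two inside the branch, plus enter_loop and leave_loop links
  fixed_relations + relations_per_pixel * n_px
}

print(count_nodes(3 * 3))       # 31
print(count_relations(3 * 3))   # 38
print(count_nodes(10 * 10))     # 304
```

Both counts grow linearly with the number of pixels, so the 10 x 10 raster would already yield a graph with several hundred nodes.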
Fig. 5 shows the visualization of the resulting provenance graph.
Fig. 5 - Visualization of Listing 6.
Say you want to track provenance information across distributed scripts (or sessions). To do so, you need to load the provenance graph you saved previously. Loading an existing provenance document is achieved by passing the corresponding file name to the init_provenance_graph() function:
Listing 7
init_provenance_graph(
  namespace = "https://www.provr.com/10x10raster_ex#",
  file = "10x10raster_ex.ttl")
To access nodes from the loaded graph, you need to use the load = TRUE option when initializing the node. After loading a node like this, you can attach further provenance information:
Listing 8
in_raster_entity <- Entity(id = 'in_raster', load = TRUE)
init_activity <- Activity(
  id = 'initial_process',
  label = 'initial process',
  description = 'the process that generated the input raster')
in_raster_entity$wasGeneratedBy(init_activity)
See Hints / Pitfalls: Beware Namespaces for further information on using different namespaces for different scripts.
The package enables script developers to build concise provenance graphs that fit their needs. The obvious drawback compared to fully automated approaches is the typing required to set up the nodes and relations. The fine-grained graph example showed that the user can automate provenance generation to a certain degree: control structures can be leveraged, and values of variables can be used to build IDs, labels and descriptions.
- When creating an Entity, Activity or Agent, I advise adding the corresponding suffix *_entity, *_activity or *_agent to your variable name to prevent confusion.
- The package prevents you from passing the wrong class as an argument to the methods of the classes:
agent <- Agent("agent")
entity <- Entity("entity")
entity$wasGeneratedBy(agent)
> Error in entity$wasGeneratedBy(agent) : argument has to be of the class "Activity"!
- The package prevents you from setting up a node with the same IRI (namespace + ID) twice:
entity <- Entity("in_raster")
# ...
other_entity <- Entity("in_raster")
> Error in Entity(id = "in_raster") : A resource with the IRI https://www.provr.com/10x10raster_ex#in_raster already exists (as subject), please use the 'load = TRUE' option.
- Beware Namespaces: A node in the provenance graph is identified by its IRI. The IRI is the combination of a namespace and an identifier within this namespace. That means the same ID in different namespaces results in different nodes. An example:
In Listing 4 we initialized the provenance graph with the namespace https://www.provr.com/10x10raster_ex#. Every Entity, Activity or Agent we subsequently defined in this script gets its unique IRI constructed by concatenating this namespace with the ID we provide on instantiation; e.g.:
in_raster_entity <- Entity("in_raster", "Input Raster")
results in https://www.provr.com/10x10raster_ex#in_raster as the node's IRI.
In Listing 7 and Listing 8 we loaded the graph from Listing 4 and accessed an Entity from the loaded graph by its ID. Notice that, on initializing the graph in Listing 7, we used the same namespace as in Listing 4. But what if we wanted to distinguish the nodes that were constructed in Listing 4 from those that were constructed in Listing 8? In this case we would need to set up the graph in Listing 7 with a different namespace:
init_provenance_graph(
  namespace = "https://www.provr.com/another_namespace#",
  file = "10x10raster_ex.ttl")
If we now proceed as in Listing 8 we get an error:
in_raster_entity <- Entity(id = 'in_raster', load = TRUE)
> Error in Entity(id = "in_raster", load = TRUE) : A resource with the IRI <https://www.provr.com/another_namespace#in_raster> does not exists (as subject).
This is because, on node initialization, the namespace that is concatenated with the provided ID defaults to the namespace we gave at graph initialization. If we want to create or load nodes with namespaces that differ from this "default" namespace, we need to provide them explicitly:
init_provenance_graph(
  namespace = "https://www.provr.com/another_namespace#",
  file = "10x10raster_ex.ttl")
in_raster_entity <- Entity(
  id = 'in_raster',
  namespace = "https://www.provr.com/10x10raster_ex#",
  load = TRUE)
# -> IRI: <https://www.provr.com/10x10raster_ex#in_raster>
# now we can proceed as in Listing 8:
init_activity <- Activity('initial_process')
# -> IRI: <https://www.provr.com/another_namespace#initial_process>
in_raster_entity$wasGeneratedBy(init_activity)
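The namespace pitfall can be reproduced in base R without the package. The sketch below assumes that a lookup simply compares the full IRI against the IRIs present in the loaded graph; `resolve_iri`, `node_exists`, `default_namespace` and `known_iris` are hypothetical names chosen for illustration, and the package's internal lookup may work differently.

```r
# Hypothetical sketch (not provr's implementation).
# Namespace given at graph initialization:
default_namespace <- "https://www.provr.com/another_namespace#"

# IRIs of the subjects contained in the file written by Listing 4:
known_iris <- c(
  "https://www.provr.com/10x10raster_ex#in_raster",
  "https://www.provr.com/10x10raster_ex#mask",
  "https://www.provr.com/10x10raster_ex#out_raster"
)

# The namespace argument falls back to the default namespace.
resolve_iri <- function(id, namespace = default_namespace) {
  paste0(namespace, id)
}

node_exists <- function(id, namespace = default_namespace) {
  resolve_iri(id, namespace) %in% known_iris
}

# Default namespace: the IRI does not match any loaded subject.
found_default <- node_exists("in_raster")
# Explicit namespace: the IRI matches the node from Listing 4.
found_explicit <- node_exists(
  "in_raster",
  namespace = "https://www.provr.com/10x10raster_ex#"
)
print(c(found_default, found_explicit))
```

The same ID resolves to two different IRIs here, which is why the lookup only succeeds once the original namespace is passed explicitly.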
- add time tracking for activities
- add automatic id generation toggle
- implement full PROV-O
- rdflib (https://cran.r-project.org/package=rdflib)
- redland (https://cran.r-project.org/package=redland)
- uuid (https://cran.r-project.org/package=uuid)
- magrittr (https://cran.r-project.org/package=magrittr)
GNU General Public License 3
https://www.gnu.org/licenses/gpl-3.0.de.html
Arne Rümmler ([email protected])