AI engineers spend a lot of time building, training, and iterating on models. But as pipelines grow more complex, it becomes difficult to answer simple but crucial questions:
- Which dataset version trained this model?
- Which parameters were used?
- Who triggered this training job?
- Can I reproduce this run six months later?
Without structured provenance tracking, reproducibility and compliance become almost impossible. In regulated domains, this is not optional — it’s mandatory.
In this article, we’ll show how to integrate W3C PROV-O (a standard for provenance modeling) with MLflow (a popular experiment tracking framework) in a PyTorch pipeline. The result: every training run not only logs metrics and artifacts but also generates a machine-readable provenance graph for accountability, auditability, and governance.
🔎 Background: Why PROV-O + MLflow?
- MLflow is widely used for experiment tracking. It records metrics, parameters, and artifacts like models and logs. However, MLflow’s logs are application-specific and not standardized for knowledge sharing across systems.
- W3C PROV-O is a semantic ontology (built on RDF/OWL 2) that provides a standardized vocabulary for describing provenance: Entities, Activities, and Agents, plus the relationships between them (prov:used, prov:wasGeneratedBy, prov:wasAttributedTo).
By combining the two:
- MLflow provides the data source of truth for training runs.
- PROV-O provides the interoperable representation of lineage, useful for audits, governance, and integration into knowledge graphs.
🏗️ Architecture Overview
Our integration maps MLflow concepts to PROV-O concepts:
| MLflow Concept | PROV-O Equivalent | Example |
| --- | --- | --- |
| MLflow Run | prov:Activity | Training job with run ID f4a22 |
| MLflow Artifact (model) | prov:Entity | model_v1.pth |
| Dataset (input) | prov:Entity | dataset.csv |
| Metrics (loss, accuracy) | prov:Entity | metrics.json |
| MLflow User/System | prov:Agent | Engineer triggering the run |
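Each item on the MLflow side of this table can be read back through the tracking API. A minimal sketch, assuming an existing run whose ID you substitute for the placeholder below:
import mlflow

# Sketch: where the mapped MLflow-side information lives.
# "f4a22" is the placeholder run ID from the table above; substitute a real one.
run = mlflow.get_run("f4a22")
print(run.info.run_id)        # the run itself -> prov:Activity
print(run.info.user_id)       # who triggered it -> prov:Agent
print(run.data.params)        # parameters used by the activity
print(run.data.metrics)       # metric values -> metrics entity
print(run.info.artifact_uri)  # where model/metrics artifacts (prov:Entity) are stored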
⚙️ Step 1: Setup
We need a combination of MLflow (for tracking) and rdflib (for provenance graph generation).
pip install mlflow torch rdflib prov
- mlflow → tracks experiments, models, metrics, and artifacts.
- torch → used for building the PyTorch model.
- rdflib → builds and serializes RDF/PROV-O graphs.
- prov → utilities for working with the W3C PROV specification (not used directly in the minimal example below).
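By default, MLflow writes runs to a local mlruns/ directory. If you use a tracking server instead, you can point the client at it and pick an experiment name before training. A small optional sketch; the URI and experiment name are placeholders:
import mlflow

# Optional setup: point MLflow at a tracking server and name the experiment.
# Both values below are placeholders for your own environment.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("provenance-demo")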
🧑💻 Step 2: PyTorch Training with MLflow Logging
We start with a simple PyTorch script that trains a small neural network while logging to MLflow.
import torch
import torch.nn as nn
import torch.optim as optim
import mlflow
import mlflow.pytorch
# Fake dataset
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))
# Simple NN
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
with mlflow.start_run() as run:
    for epoch in range(5):
        optimizer.zero_grad()
        preds = model(X)
        loss = loss_fn(preds, y)
        loss.backward()
        optimizer.step()
        mlflow.log_metric("loss", loss.item(), step=epoch)

    mlflow.log_param("lr", 0.001)
    mlflow.pytorch.log_model(model, "model")
At this point, MLflow is recording metrics (loss), params (lr), and the trained model artifact. But it doesn't capture semantic provenance: which dataset was used, who ran the job, and how the results are connected.
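A partial workaround is to attach that context as MLflow tags, but tags remain application-specific key/value pairs rather than a standardized, queryable provenance model. A minimal sketch (in the real script these calls would sit inside the training run from Step 2; the tag names and values are illustrative):
import mlflow

with mlflow.start_run() as run:
    # Record context as plain MLflow tags (illustrative names/values).
    # These are app-specific key/value pairs, not standardized provenance.
    mlflow.set_tag("dataset", "dataset.csv")
    mlflow.set_tag("triggered_by", "AI Engineer")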
🔗 Step 3: Provenance Tracker for MLflow
Here’s where PROV-O comes in. We build a Provenance Tracker that:
- Defines entities (datasets, models, metrics).
- Defines activities (the MLflow run).
- Defines agents (engineer, system).
- Links them using PROV-O relations.
- Serializes into Turtle (.ttl) or JSON-LD.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, FOAF
import mlflow

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

def log_provenance(run):
    g = Graph()
    g.bind("prov", PROV)
    g.bind("ex", EX)

    # Agent (engineer/system)
    user = EX["engineer"]
    g.add((user, RDF.type, PROV.Agent))
    g.add((user, FOAF.name, Literal("AI Engineer")))

    # Activity (the MLflow run)
    activity = EX[f"run_{run.info.run_id}"]
    g.add((activity, RDF.type, PROV.Activity))

    # Input dataset
    dataset = EX["dataset.csv"]
    g.add((dataset, RDF.type, PROV.Entity))
    g.add((activity, PROV.used, dataset))

    # Model entity
    model = EX[f"model_{run.info.run_id}.pth"]
    g.add((model, RDF.type, PROV.Entity))
    g.add((model, PROV.wasGeneratedBy, activity))
    g.add((model, PROV.wasAttributedTo, user))

    # Metrics entity
    metrics = EX[f"metrics_{run.info.run_id}.json"]
    g.add((metrics, RDF.type, PROV.Entity))
    g.add((metrics, PROV.wasGeneratedBy, activity))
    g.add((metrics, PROV.wasAttributedTo, user))

    # Serialize + store
    prov_file = f"prov_{run.info.run_id}.ttl"
    g.serialize(prov_file, format="turtle")
    mlflow.log_artifact(prov_file, artifact_path="provenance")
    print(f"✅ Provenance logged in {prov_file}")
📦 Step 4: Integrate Tracker
Modify the training script to call log_provenance(run) after training completes.
with mlflow.start_run() as run:
    # Training loop (as above) ...
    mlflow.pytorch.log_model(model, "model")

    # Capture provenance
    log_provenance(run)
Now every MLflow run will automatically create a provenance graph and store it alongside model artifacts.
Final script train-small-nn-pytorch.py:
import torch
import torch.nn as nn
import torch.optim as optim
import mlflow
import mlflow.pytorch
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, FOAF

# Fake dataset
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))

# Simple NN
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Provenance Tracker for MLflow
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

def log_provenance(run):
    g = Graph()
    g.bind("prov", PROV)
    g.bind("ex", EX)

    # Agent (engineer/system)
    user = EX["engineer"]
    g.add((user, RDF.type, PROV.Agent))
    g.add((user, FOAF.name, Literal("AI Engineer")))

    # Activity (the MLflow run)
    activity = EX[f"run_{run.info.run_id}"]
    g.add((activity, RDF.type, PROV.Activity))

    # Input dataset
    dataset = EX["dataset.csv"]
    g.add((dataset, RDF.type, PROV.Entity))
    g.add((activity, PROV.used, dataset))

    # Model entity
    model = EX[f"model_{run.info.run_id}.pth"]
    g.add((model, RDF.type, PROV.Entity))
    g.add((model, PROV.wasGeneratedBy, activity))
    g.add((model, PROV.wasAttributedTo, user))

    # Metrics entity
    metrics = EX[f"metrics_{run.info.run_id}.json"]
    g.add((metrics, RDF.type, PROV.Entity))
    g.add((metrics, PROV.wasGeneratedBy, activity))
    g.add((metrics, PROV.wasAttributedTo, user))

    # Serialize + store
    prov_file = f"prov_{run.info.run_id}.ttl"
    g.serialize(prov_file, format="turtle")
    mlflow.log_artifact(prov_file, artifact_path="provenance")
    print(f"✅ Provenance logged in {prov_file}")

# MLflow run: train, log, and capture provenance
with mlflow.start_run() as run:
    # Training loop
    for epoch in range(5):
        optimizer.zero_grad()
        preds = model(X)
        loss = loss_fn(preds, y)
        loss.backward()
        optimizer.step()
        mlflow.log_metric("loss", loss.item(), step=epoch)

    mlflow.log_param("lr", 0.001)
    mlflow.pytorch.log_model(model, "model")

    # Capture provenance
    log_provenance(run)
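After a run finishes, the Turtle file appears under the run's artifacts in the provenance/ folder. To pull it back from the tracking server later (for example on another machine), something like the following should work with MLflow 2.x; the run ID is the example one from Step 5:
import mlflow

# Sketch: download the provenance artifact logged for a given run.
# The run ID is a placeholder; mlflow.artifacts.download_artifacts is
# available in MLflow 2.x.
local_dir = mlflow.artifacts.download_artifacts(
    run_id="70d8b46c6451416d92a0ae7cac4c8602",
    artifact_path="provenance",
)
print(f"Provenance files downloaded to {local_dir}")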
📂 Step 5: Example Output
Provenance graph (Turtle format) prov_70d8b46c6451416d92a0ae7cac4c8602.ttl:
@prefix ex: <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix prov: <http://www.w3.org/ns/prov#> .

ex:metrics_70d8b46c6451416d92a0ae7cac4c8602.json a prov:Entity ;
    prov:wasAttributedTo ex:engineer ;
    prov:wasGeneratedBy ex:run_70d8b46c6451416d92a0ae7cac4c8602 .

ex:model_70d8b46c6451416d92a0ae7cac4c8602.pth a prov:Entity ;
    prov:wasAttributedTo ex:engineer ;
    prov:wasGeneratedBy ex:run_70d8b46c6451416d92a0ae7cac4c8602 .

ex:dataset.csv a prov:Entity .

ex:engineer a prov:Agent ;
    foaf:name "AI Engineer" .

ex:run_70d8b46c6451416d92a0ae7cac4c8602 a prov:Activity ;
    prov:used ex:dataset.csv .
This graph is machine-readable and interoperable with semantic web tools, knowledge graphs, and governance platforms.
🔍 Step 6: Query Provenance
Since PROV-O is RDF-based, we can load graphs into a triple store and query them with SPARQL. The following are a few example queries; they all assume the ex: and prov: prefix declarations shown in the first query (query 3 additionally uses the standard xsd: prefix for dateTime literals).
1️⃣ Which dataset was used to generate a given model?
PREFIX ex: <http://example.org/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?dataset WHERE {
  ex:model_70d8b46c6451416d92a0ae7cac4c8602.pth prov:wasGeneratedBy ?activity .
  ?activity prov:used ?dataset .
}
This query returns dataset.csv as the dataset used by the run that generated model_70d8b46c6451416d92a0ae7cac4c8602.pth.
The SPARQL queries can be run using the following Python script:
import rdflib

# Create a Graph object
g = rdflib.Graph()

# Parse the TTL file into the graph
g.parse("prov_70d8b46c6451416d92a0ae7cac4c8602.ttl", format="turtle")

# Define the SPARQL query. rdflib reuses the prefixes bound in the parsed
# graph, but declaring them explicitly keeps the query portable to other stores.
sparql_query = """
PREFIX ex: <http://example.org/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?dataset WHERE {
  ex:model_70d8b46c6451416d92a0ae7cac4c8602.pth prov:wasGeneratedBy ?activity .
  ?activity prov:used ?dataset .
}
"""

# Execute the query
results = g.query(sparql_query)

# Process the results
for row in results:
    print(row)
2️⃣ All models generated by a given engineer
SELECT ?model
WHERE {
  ?model a prov:Entity ;
         prov:wasAttributedTo ex:engineer .
}
👉 Returns every entity attributed to ex:engineer. In this simple graph that includes both the model and the metrics files; to restrict the results to models only, you would need to type models with a dedicated class (for example, a hypothetical ex:Model alongside prov:Entity).
3️⃣ All datasets used in the last month
If your provenance tracker adds prov:generatedAtTime or similar timestamps on entities/activities, you can filter by date (a sketch for adding such timestamps to the tracker follows this query). Example:
SELECT ?dataset ?time
WHERE {
  ?activity a prov:Activity ;
            prov:used ?dataset ;
            prov:endedAtTime ?time .
  ?dataset a prov:Entity .
  FILTER (?time >= "2025-07-28T00:00:00Z"^^xsd:dateTime &&
          ?time <= "2025-08-28T23:59:59Z"^^xsd:dateTime)
}
👉 This finds all prov:Entity datasets used by any activity that ended in the last month.
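The tracker from Step 3 does not add timestamps yet. A minimal sketch of a helper it could call, assuming you want an end time on the activity and a generation time on each entity (the helper and its name are illustrative, not part of the original tracker):
from datetime import datetime, timezone

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

PROV = Namespace("http://www.w3.org/ns/prov#")

def add_timestamps(g: Graph, activity: URIRef, entities: list) -> None:
    # Attach prov:endedAtTime to the activity and prov:generatedAtTime
    # to each generated entity, using the current UTC time.
    now = Literal(datetime.now(timezone.utc).isoformat(), datatype=XSD.dateTime)
    g.add((activity, PROV.endedAtTime, now))
    for entity in entities:
        g.add((entity, PROV.generatedAtTime, now))
Inside log_provenance, a call like add_timestamps(g, activity, [model, metrics]) just before serialization would make this query answerable.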
4️⃣ Provenance chains across multiple runs (for auditing)
Here we want to trace lineage from dataset → activity → model → metrics.
SELECT ?dataset ?activity ?model ?metrics
WHERE {
  ?dataset a prov:Entity .
  ?activity a prov:Activity ;
            prov:used ?dataset .
  ?model a prov:Entity ;
         prov:wasGeneratedBy ?activity .
  ?metrics a prov:Entity ;
           prov:wasGeneratedBy ?activity .
}
👉 This gives a table of full provenance chains, so you can audit multiple runs together.
5️⃣ Find all runs that reused the same dataset
Useful for detecting data reuse:
SELECT ?dataset (GROUP_CONCAT(STR(?model); separator=", ") AS ?models)
WHERE {
  ?activity prov:used ?dataset .
  ?model prov:wasGeneratedBy ?activity .
}
GROUP BY ?dataset
HAVING (COUNT(?model) > 1)
👉 Returns datasets that were reused in multiple model generations.
⚡ Query 3 additionally assumes your TTL logs carry timestamps (prov:endedAtTime or prov:generatedAtTime), which the Step 3 tracker does not add by default; the prov:used, prov:wasGeneratedBy, and prov:wasAttributedTo triples used throughout are already produced by the tracker.
✅ Why This Matters
By extending MLflow with PROV-O, AI engineers gain:
- Reproducibility → Every model is linked to the exact data and parameters that generated it.
- Auditability → Regulators and compliance teams can trace how outputs were produced.
- Transparency → Business stakeholders can understand lineage without relying on tribal knowledge.
- Interoperability → Since PROV-O is a W3C standard, provenance metadata can be integrated into external governance, data catalog, and knowledge graph systems.
🚀 What We Learnt
We’ve seen how to:
- Train a PyTorch model with MLflow.
- Capture provenance automatically using PROV-O.
- Serialize provenance graphs as RDF/Turtle.
- Query lineage with SPARQL.