In modern AI pipelines, provenance — the lineage of datasets, models, and inferences — is becoming as important as accuracy metrics. Regulators, auditors, and even downstream consumers increasingly demand answers to questions like:
- Which dataset was this model trained on?
- What code commit produced this artifact?
- How do we know logs weren’t tampered with after training?
To learn more about provenance in AI, read my previous article: Provenance in AI: Why It Matters for AI Engineers – Part 1
To answer these questions, let’s walk through a Python-based provenance tracker that logs lineage events, cryptographically signs them, and maintains schema versioning for forward compatibility.
1. The Provenance Tracker: Key Features
The ProvenanceTracker implements three important ideas:
- Multiple dataset support
  - Models often train on more than one dataset (train + validation + test).
  - The tracker keeps a list of dataset hashes (dataset_hashes) and auto-links them to model logs.
- Signed JSONL envelopes (a quick illustration follows this list)
  - Each log entry is wrapped in an envelope:
    {
      "schema_version": "1.1",
      "signed_data": "{…}",
      "signature": "…"
    }
  - signed_data is serialized with stable JSON (sort_keys=True).
  - A digital signature (RSA + PSS padding + SHA-256) is generated using the private key.
- Schema versioning
  - schema_version = "1.1" is embedded in every record.
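As a quick illustration of why the stable JSON encoding matters (a minimal standalone sketch, not part of the tracker code that follows): sort_keys=True gives a canonical serialization, so the same record always signs to the same bytes regardless of key insertion order.
import json

a = {"path": "data/training.csv", "type": "dataset"}
b = {"type": "dataset", "path": "data/training.csv"}

# Default encoding follows insertion order, so the two strings differ
print(json.dumps(a) == json.dumps(b))  # False
# sort_keys=True produces a canonical string, so signatures stay reproducible
print(json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True))  # True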
2. The Provenance Tracker: Source Code
Before we get to the provenance tracker code, let’s look at a companion script, generate_keys.py, which creates the RSA keypair (private_key.pem, public_key.pem). ProvenanceTracker.py uses the private key to sign the JSONL logs.
# generate_keys.py
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.primitives import serialization

# Generate RSA private key (2048 bits)
private_key = rsa.generate_private_key(
    public_exponent=65537,
    key_size=2048,
)

# Save private key (PEM)
with open("private_key.pem", "wb") as f:
    f.write(
        private_key.private_bytes(
            encoding=serialization.Encoding.PEM,
            format=serialization.PrivateFormat.PKCS8,
            encryption_algorithm=serialization.NoEncryption(),
        )
    )

# Save public key (PEM)
public_key = private_key.public_key()
with open("public_key.pem", "wb") as f:
    f.write(
        public_key.public_bytes(
            encoding=serialization.Encoding.PEM,
            format=serialization.PublicFormat.SubjectPublicKeyInfo,
        )
    )

print("✅ RSA keypair generated: private_key.pem & public_key.pem")
Run once to create your keypair:
python generate_keys.py
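Optionally, you can sanity-check that the generated PEM files load cleanly before wiring them into the tracker. The snippet below is a small sketch (check_keys.py is just a suggested name, not one of the article's scripts) that reloads both keys with the same cryptography primitives:
# check_keys.py: optional sanity check for the generated keypair
from cryptography.hazmat.primitives.serialization import (
    load_pem_private_key,
    load_pem_public_key,
)

with open("private_key.pem", "rb") as f:
    private_key = load_pem_private_key(f.read(), password=None)
with open("public_key.pem", "rb") as f:
    public_key = load_pem_public_key(f.read())

# Both should report the 2048-bit size chosen in generate_keys.py
print("Private key size:", private_key.key_size)
print("Public key size:", public_key.key_size)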
Here’s a secure ProvenanceTracker (schema version 1.1) that:
- Supports multiple datasets
- Includes schema version
- Signs JSONL using RSA private key
# ProvenanceTracker.py
import hashlib
import json
import os
import platform
import socket
import subprocess
import base64
from datetime import datetime
from typing import Any, Dict, List

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding


class ProvenanceTracker:
    SCHEMA_VERSION = "1.1"

    def __init__(self, storage_path: str = "provenance_logs.jsonl", private_key_path: str = "private_key.pem"):
        self.storage_path = storage_path
        self._dataset_hashes: List[str] = []  # track datasets used
        self.private_key = self._load_private_key(private_key_path)

    def _load_private_key(self, path: str):
        with open(path, "rb") as f:
            return serialization.load_pem_private_key(f.read(), password=None)

    def _get_git_commit(self) -> str:
        try:
            return subprocess.check_output(
                ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
            ).decode("utf-8").strip()
        except Exception:
            return "N/A"

    def _hash_file(self, file_path: str) -> str:
        h = hashlib.sha256()
        with open(file_path, "rb") as f:
            while chunk := f.read(8192):
                h.update(chunk)
        return h.hexdigest()

    def _sign(self, payload: str) -> str:
        signature = self.private_key.sign(
            payload.encode("utf-8"),
            padding.PSS(
                mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH,
            ),
            hashes.SHA256(),
        )
        return base64.b64encode(signature).decode("utf-8")

    def _log(self, record: Dict[str, Any]):
        record["timestamp"] = datetime.utcnow().isoformat()
        record["schema_version"] = self.SCHEMA_VERSION
        # Serialize signed_data separately (stable JSON encoding)
        signed_data = json.dumps(record, sort_keys=True)
        signature = self._sign(signed_data)
        envelope = {
            "schema_version": self.SCHEMA_VERSION,
            "signed_data": signed_data,
            "signature": signature,
        }
        with open(self.storage_path, "a") as f:
            f.write(json.dumps(envelope) + "\n")

    def log_dataset(self, dataset_path: str, description: str = ""):
        dataset_hash = self._hash_file(dataset_path)
        self._dataset_hashes.append(dataset_hash)
        record = {
            "type": "dataset",
            "path": dataset_path,
            "hash": dataset_hash,
            "description": description,
        }
        self._log(record)
        return dataset_hash

    def log_model(self, model_name: str, hyperparams: Dict[str, Any]):
        record = {
            "type": "model",
            "model_name": model_name,
            "hyperparameters": hyperparams,
            "git_commit": self._get_git_commit(),
            "environment": {
                "python_version": platform.python_version(),
                "platform": platform.system(),
                "hostname": socket.gethostname(),
            },
            "dataset_hashes": self._dataset_hashes,  # link all datasets
        }
        self._log(record)

    def log_inference(self, model_name: str, input_data: Any, output_data: Any):
        record = {
            "type": "inference",
            # Deterministic ID: stable JSON encoding keeps the ID reproducible for the same input
            "id": f"inf-{hashlib.sha1(json.dumps(input_data, sort_keys=True).encode()).hexdigest()[:12]}",
            "model_name": model_name,
            "input": input_data,
            "output": output_data,
        }
        self._log(record)


if __name__ == "__main__":
    tracker = ProvenanceTracker()

    # 1. Log datasets
    ds1 = tracker.log_dataset("data/training.csv", "Customer churn dataset")
    ds2 = tracker.log_dataset("data/validation.csv", "Validation set")

    # 2. Log model (linked to all datasets seen so far)
    tracker.log_model("churn-predictor-v2", {
        "algorithm": "XGBoost",
        "n_estimators": 200,
        "max_depth": 12,
    })

    # 3. Log inference
    tracker.log_inference(
        "churn-predictor-v2",
        {"customer_id": 54321, "features": [0.4, 1.7, 0.2]},
        {"churn_risk": 0.42}
    )

    print("✅ Signed provenance logs recorded in provenance_logs.jsonl")
3. Under the Hood
3.1 Datasets
Datasets are logged with a SHA-256 file hash, ensuring that even if file names change, the integrity check remains stable.
ds1 = tracker.log_dataset("data/training.csv", "Customer churn dataset")
ds2 = tracker.log_dataset("data/validation.csv", "Validation set")
Resulting record (inside signed_data):
{
  "type": "dataset",
  "path": "data/training.csv",
  "hash": "a41be7b96f...",
  "description": "Customer churn dataset",
  "timestamp": "2025-08-28T10:12:34.123456",
  "schema_version": "1.1"
}
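To spot-check a logged dataset hash independently of the tracker, recompute the SHA-256 of the file and compare it with the hash field above (a minimal sketch; sha256_of is just a helper name used here):
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(8192):
            h.update(chunk)
    return h.hexdigest()

# Should print the same value as the "hash" field in the dataset record
print(sha256_of("data/training.csv"))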
3.2 Models
When logging a model, the tracker attaches:
- Model metadata (name, hyperparameters)
- Git commit hash (if available)
- Runtime environment (Python version, OS, hostname)
- All dataset hashes seen so far
tracker.log_model("churn-predictor-v2", {
    "algorithm": "XGBoost",
    "n_estimators": 200,
    "max_depth": 12,
})
This creates a strong lineage link: Dataset(s) → Model
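Because each model record carries dataset_hashes, the dataset → model edges can be reconstructed straight from the log file. Here is a minimal sketch that reads the signed_data payloads (without verifying signatures, which validate_logs.py handles later):
import json

datasets = {}  # dataset hash -> file path
with open("provenance_logs.jsonl") as f:
    for line in f:
        record = json.loads(json.loads(line)["signed_data"])
        if record["type"] == "dataset":
            datasets[record["hash"]] = record["path"]
        elif record["type"] == "model":
            for h in record["dataset_hashes"]:
                # e.g. data/training.csv -> churn-predictor-v2
                print(f"{datasets.get(h, h)} -> {record['model_name']}")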
3.3 Inferences
Every inference is logged with a deterministic ID, computed as a SHA-1 hash of the input payload. This ensures repeat queries generate the same inference ID (helpful for deduplication).
tracker.log_inference(
    "churn-predictor-v2",
    {"customer_id": 54321, "features": [0.4, 1.7, 0.2]},
    {"churn_risk": 0.42}
)
Graphically: Model → Inference
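The inference ID is simply the first 12 hex characters of a SHA-1 digest over the JSON-encoded input, so anyone holding the same payload can recompute it offline (a quick sketch mirroring log_inference):
import hashlib
import json

payload = {"customer_id": 54321, "features": [0.4, 1.7, 0.2]}
inference_id = "inf-" + hashlib.sha1(
    json.dumps(payload, sort_keys=True).encode()
).hexdigest()[:12]

# Should reproduce the "id" logged for the same input payload
print(inference_id)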
4. Signed Envelopes for Tamper-Proofing
Each record is not stored raw but wrapped in a signed envelope:
{
  "schema_version": "1.1",
  "signed_data": "{\"description\": \"Validation set\", \"hash\": \"c62...\"}",
  "signature": "MEUCIQDgtd...xyz..."
}
To verify:
- Load the public key.
- Verify the signature against the serialized signed_data.
- If the data was modified, verification fails → tampering detected.
This is exactly the mechanism PKI systems and blockchain protocols use for immutability.
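To see the tamper-proofing in action, the sketch below (using the same cryptography calls as validate_logs.py in the next section) verifies the first envelope as written, then flips part of signed_data and shows that verification fails:
import base64
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.serialization import load_pem_public_key

with open("public_key.pem", "rb") as f:
    public_key = load_pem_public_key(f.read())

with open("provenance_logs.jsonl") as f:
    envelope = json.loads(f.readline())

def verifies(data: str) -> bool:
    try:
        public_key.verify(
            base64.b64decode(envelope["signature"]),
            data.encode("utf-8"),
            padding.PSS(
                mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH,
            ),
            hashes.SHA256(),
        )
        return True
    except InvalidSignature:
        return False

print(verifies(envelope["signed_data"]))                                # True: intact record
print(verifies(envelope["signed_data"].replace("dataset", "DATASET")))  # False: tampered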
5. Example End-to-End Run
When running ProvenanceTracker.py:
$ python ProvenanceTracker.py
✅ Signed provenance logs recorded in provenance_logs.jsonl
The log file (provenance_logs.jsonl) will contain four signed envelopes: two for the datasets, one for the model, and one for the inference.
Following is provenance_logs.jsonl after the run:
{"schema_version": "1.1", "signed_data": "{\"description\": \"Customer churn dataset\", \"hash\": \"a41be7b96fb85110521bf03d1530879e9ca94b9f5de19866757f6d184300fff7\", \"path\": \"data/training.csv\", \"schema_version\": \"1.1\", \"timestamp\": \"2025-08-28T01:06:31.062695\", \"type\": \"dataset\"}", "signature": "MnCRJ+Acg0F1UledjnMwQMp24wAIPmLPaZonI7hvdFvdi7d8CaZDOIamNq0KnRgcZgttJnI1L675tqT1O1M5N2FRNuy/Wj6elzpyM9w56Kd2mBcQLFumhVHiGZHtwKj2wQtXND0SCqWo5jxQPLPl0dSFClA+FKzpxfazwMtrHAE7aoUmyt2cv1Wiv9uZxsE+Il226J91rBk03lpLcArqqxTtfstkayOK5AON9ETXs65ERf26oURngR/0HS9jnO0IH1DxZOcHcfWZMrLwGqdjRF1sSDYcH70XV61yeYzSeIb8KDODttuxxfzsIlb0897tv/ZZ/X4tv/FFICei7LeAuw=="}
{"schema_version": "1.1", "signed_data": "{\"description\": \"Validation set\", \"hash\": \"330af932f2dc1cae917f3bd0fb29395c4021319dd906189b7dc257d0ad58a617\", \"path\": \"data/validation.csv\", \"schema_version\": \"1.1\", \"timestamp\": \"2025-08-28T01:06:31.070827\", \"type\": \"dataset\"}", "signature": "pu8IvzPriN6eP9HTQGlIog8nfXV0FOEw818aw6uJS8oPKiQPjN3odzbP9zaeB+ZW4Nu9bBL5fm1btiiOSm9ziWUJWUzFRoHwlYTv2rgp/IXR0oWfTpXsdVeBj7NYVjUywLPofTeEE1C4J7XzZmusuCU9ZiKJzXU442E6Gsrj6tjRJxZoylONuekxegdTot4LwIcmCRtgigi1t3rQYBGdknmTFdW/I2h1Gguh+Shc/WG/jVuMq10vFNNM8iUJJAxAEktbpfhGw0of6lrZu9yn4wAmxvq0DFICKMEJlsyvEZ/mDaPkR4D55xuJh+dLlFbzNZvyw0woMII0hbIarNmG+w=="}
{"schema_version": "1.1", "signed_data": "{\"dataset_hashes\": [\"a41be7b96fb85110521bf03d1530879e9ca94b9f5de19866757f6d184300fff7\", \"330af932f2dc1cae917f3bd0fb29395c4021319dd906189b7dc257d0ad58a617\"], \"environment\": {\"hostname\": \"GlamorPC\", \"platform\": \"Windows\", \"python_version\": \"3.10.11\"}, \"git_commit\": \"N/A\", \"hyperparameters\": {\"algorithm\": \"XGBoost\", \"max_depth\": 12, \"n_estimators\": 200}, \"model_name\": \"churn-predictor-v2\", \"schema_version\": \"1.1\", \"timestamp\": \"2025-08-28T01:06:31.117627\", \"type\": \"model\"}", "signature": "tq/y6Blz04u2iYZh5OqfyZChADA+osNIzwb9Z2g++AZjFu2hkywazf19rbTMsdx9J5s4BDz6rglfcFczRW/TXMECD3k91ZmAds/e0I+Xw42xeTnr7+jHKq5kPdV6Pan8yFVd9ikGso93ZDatX72rx+orIg41BggFN7ifYlKNnGD87zCypahI7Eev0frnD6w8GybmPcBMnCVLYlIo2nWpLgJELkVpwwagQ9rKA+WOlBbLe41ZizooSL/hhGJOXTuwYrkJpBZ69TIwCzihINr+joZBqYrPF+0E+CFohdc03b0SFv1OuNTo7dyqL9qpWdCMSi1iK0LfCukCO41Bvr2yHA=="}
{"schema_version": "1.1", "signed_data": "{\"id\": \"inf-0276b2064ad0\", \"input\": {\"customer_id\": 54321, \"features\": [0.4, 1.7, 0.2]}, \"model_name\": \"churn-predictor-v2\", \"output\": {\"churn_risk\": 0.42}, \"schema_version\": \"1.1\", \"timestamp\": \"2025-08-28T01:06:31.118634\", \"type\": \"inference\"}", "signature": "Lf9r1vcXOaCxSc11UKNvuDjx7FophWXBxAobYlixIJgNIk2toFtEdjB2zzJtQI5cYEAImhNHB8hdssKUv3Dths0SpKeMQjpb0x0aKvXolnNsJMnEnGP443IRfMTpkcHpRjCVjIfEvP8EtAh58z4yHE77cy2IlSUFu3exwSEcRFVqBXvIKlojQTEneERUwEDZjfniluomSCLXiVFYMIB+LefPHGkChCVVulmyFJ9ITquD4Wymp2/c2/knopqXSP00EFON4SBOD9/RyQAXAl5UxP0s6faD7NeZxAdJWh3CY31+5V3Vv8b9y/jroAvxWjbpuCZT20gkHemArawDae3s0w=="}
The following is the standalone verification script, validate_logs.py:
#!/usr/bin/env python3
"""
Usage:
    python validate_logs.py provenance_logs.jsonl public_key.pem
"""
import json
import base64
import sys

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.serialization import load_pem_public_key

EXPECTED_SCHEMA = "1.1"


def load_public_key(path: str):
    with open(path, "rb") as f:
        return load_pem_public_key(f.read())


def verify_signature(public_key, signed_data: str, signature_b64: str) -> bool:
    try:
        signature = base64.b64decode(signature_b64)
        public_key.verify(
            signature,
            signed_data.encode("utf-8"),
            padding.PSS(
                mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH,
            ),
            hashes.SHA256(),
        )
        return True
    except Exception:
        return False


def validate_file(jsonl_path: str, pubkey_path: str):
    public_key = load_public_key(pubkey_path)
    valid_count = 0
    failed_count = 0
    schema_mismatch = 0

    with open(jsonl_path, "r") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                envelope = json.loads(line.strip())
            except json.JSONDecodeError:
                print(f"❌ Line {line_no}: invalid JSON")
                failed_count += 1
                continue

            schema = envelope.get("schema_version")
            signed_data = envelope.get("signed_data")
            signature = envelope.get("signature")

            if schema != EXPECTED_SCHEMA:
                print(f"⚠️ Line {line_no}: schema version mismatch ({schema})")
                schema_mismatch += 1
                continue

            if not signed_data or not signature:
                print(f"❌ Line {line_no}: missing signed_data/signature")
                failed_count += 1
                continue

            if verify_signature(public_key, signed_data, signature):
                valid_count += 1
            else:
                print(f"❌ Line {line_no}: signature verification failed")
                failed_count += 1

    print("\n--- Validation Report ---")
    print(f"✅ Valid entries : {valid_count}")
    print(f"❌ Signature failures : {failed_count}")
    print(f"⚠️ Schema mismatches : {schema_mismatch}")
    print(f"📄 Total lines : {valid_count + failed_count + schema_mismatch}")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python validate_logs.py provenance_logs.jsonl public_key.pem")
        sys.exit(1)

    jsonl_file = sys.argv[1]
    pubkey_file = sys.argv[2]
    validate_file(jsonl_file, pubkey_file)
Output: a validation report summarizing valid entries, signature failures, and schema mismatches; for the log file above, all four envelopes verify.
6. Extending This for Real-World AI Workflows
An AI engineer could extend this design in several directions:
- Neo4j Importer: Build a provenance graph to visualize dataset → model → inference lineage (a sketch follows this list).
- Metrics integration: Log evaluation metrics (AUC, F1) into the model record.
- MLOps pipelines: Integrate into training jobs so every experiment auto-generates signed lineage logs.
- Cloud KMS keys: Replace the PEM private key with keys from AWS KMS, GCP KMS, or HashiCorp Vault.
- Verification service: Deploy a microservice that validates provenance logs on ingestion.
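As an example of the first direction, the sketch below turns the signed logs into Cypher MERGE statements that could be loaded into Neo4j; the labels and relationship name (Dataset, Model, TRAINED_ON) are illustrative choices, not something defined by the tracker.
import json

# Emit Cypher that builds a dataset -> model lineage graph from the signed logs
with open("provenance_logs.jsonl") as f:
    for line in f:
        record = json.loads(json.loads(line)["signed_data"])
        if record["type"] == "dataset":
            print(f"MERGE (:Dataset {{hash: '{record['hash']}', path: '{record['path']}'}});")
        elif record["type"] == "model":
            print(f"MERGE (:Model {{name: '{record['model_name']}'}});")
            for h in record["dataset_hashes"]:
                print(
                    f"MATCH (m:Model {{name: '{record['model_name']}'}}), "
                    f"(d:Dataset {{hash: '{h}'}}) MERGE (m)-[:TRAINED_ON]->(d);"
                )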
7. Why This Matters for You
As AI systems leave the lab and enter regulated domains (finance, healthcare, insurance), being able to say:
- “This prediction came from Model X at commit Y, trained on Dataset Z, verified cryptographically.”
…will be non-negotiable.
Implementing provenance today sets you ahead of compliance requirements tomorrow.
This ProvenanceTracker is a blueprint for trustworthy AI engineering: versioned, signed, and reproducible lineage for every dataset, model, and inference in your pipeline.
Note: The customer churn dataset can be downloaded from Kaggle, renamed, and placed in the data directory.