How to tag and release versioned OpenStreetMap extracts

To tag and release versioned OpenStreetMap extracts, combine semantic versioning with immutable spatial snapshots, attach standardized metadata, and automate extraction with osmium-tool (a CLI) or pyosmium (Python bindings for libosmium). Identify each release by a vMAJOR.MINOR.PATCH tag and pair it with a timestamped .osm.pbf or .geojson archive, a VERSION.json manifest, and a cryptographic checksum. Store artifacts in a versioned registry (Git LFS, S3, or an artifact repository), publish tags via Git, and record the underlying OSM replication sequence. Together these make each release reproducible and traceable, and let downstream GIS pipelines pin an exact version.

Core Versioning Architecture

OpenStreetMap data mutates continuously, so spatial extracts require dual identifiers: a human-readable semantic version and a machine-readable temporal anchor. The recommended pattern uses vMAJOR.MINOR.PATCH aligned with Semantic Versioning 2.0.0 principles:

  • MAJOR: Increments when the extraction boundary, schema mapping, coordinate reference system, or output format changes.
  • MINOR: Increments when feature filtering rules, tag transformations, or attribute enrichment pipelines are updated.
  • PATCH: Increments for routine data refreshes from upstream OSM replication sequences without altering extraction logic.
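
The bump rules above can be sketched as a small helper. This is a minimal illustration, not a prescribed implementation: the change-category names in BUMP_FOR_CHANGE and the function name bump_version are hypothetical labels chosen for this example.

```python
import re

# Hypothetical change categories, mapped to the version component each one bumps.
BUMP_FOR_CHANGE = {
    "boundary": "major",
    "schema": "major",
    "crs": "major",
    "format": "major",
    "filter": "minor",
    "tag_transform": "minor",
    "enrichment": "minor",
    "data_refresh": "patch",
}

# Strict vMAJOR.MINOR.PATCH: no leading zeros, no pre-release suffix.
_VERSION_RE = re.compile(r"^v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")


def bump_version(current: str, change: str) -> str:
    """Return the next vMAJOR.MINOR.PATCH tag for a given kind of change."""
    match = _VERSION_RE.match(current)
    if match is None:
        raise ValueError(f"Not a vMAJOR.MINOR.PATCH tag: {current!r}")
    major, minor, patch = (int(part) for part in match.groups())

    level = BUMP_FOR_CHANGE[change]  # raises KeyError for unknown change kinds
    if level == "major":
        return f"v{major + 1}.0.0"
    if level == "minor":
        return f"v{major}.{minor + 1}.0"
    return f"v{major}.{minor}.{patch + 1}"


print(bump_version("v2.3.7", "data_refresh"))  # v2.3.8
print(bump_version("v2.3.7", "filter"))        # v2.4.0
print(bump_version("v2.3.7", "crs"))           # v3.0.0
```

Resetting the lower components to zero on a MAJOR or MINOR bump follows the usual Semantic Versioning convention.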

Every release must embed the exact OSM replication timestamp (osmosis_replication_timestamp) and sequence ID (osmosis_replication_sequence_number) in the metadata. This allows downstream consumers to reconstruct the exact state of the planet file at extraction time. Aligning your tagging cadence with established Release Tagging Strategies for Spatial Basemaps ensures consistency across multi-region deployments and prevents schema drift in collaborative environments.

Automated Extraction & Tagging Pipeline

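The following Python script orchestrates metadata generation, checksumming, and Git tagging. It assumes Python 3.10 or later (for the `str | None` annotation syntax) and that osmium-tool and git are installed and available in $PATH. The script expects a pre-downloaded .osm.pbf file and must be run inside a Git working tree. It writes a release directory containing a symlink to the artifact, the manifest (named {region}-{version}.json), and the checksum file.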

#!/usr/bin/env python3
"""Automated OSM extract versioning, checksumming, and Git tagging."""

import subprocess
import hashlib
import json
import sys
from datetime import datetime, timezone
from pathlib import Path

def run_cmd(cmd: list[str], cwd: str | None = None) -> str:
    """Execute a shell command and return stripped stdout."""
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, cwd=cwd
    )
    return result.stdout.strip()

def sha256_file(filepath: str) -> str:
    """Compute SHA-256 hex digest for a given file."""
    h = hashlib.sha256()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def tag_and_release_osm_extract(
    region_name: str,
    pbf_path: str,
    version: str,
    output_dir: str = "./releases"
) -> None:
    """Generate manifest, checksum, and Git tag for an OSM extract."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    
    target_pbf = out / f"{region_name}-{version}.osm.pbf"
    meta_path = out / f"{region_name}-{version}.json"
    sha_path = out / f"{region_name}-{version}.sha256"

    # Symlink the source PBF into the release directory
    # (copy it instead if the release directory must be self-contained)
    if not Path(pbf_path).is_file():
        raise FileNotFoundError(f"Source PBF not found: {pbf_path}")
    if not target_pbf.exists():
        target_pbf.symlink_to(Path(pbf_path).resolve())

    # Compute checksum
    checksum = sha256_file(str(target_pbf))
    sha_path.write_text(f"{checksum}  {target_pbf.name}\n")

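    # Extract OSM replication metadata from the PBF header using osmium
    def header_option(name: str) -> str:
        """Read one header option; fall back to "unknown" if it is absent."""
        try:
            value = run_cmd(
                ["osmium", "fileinfo", "-g", f"header.option.{name}", str(target_pbf)]
            )
        except subprocess.CalledProcessError:
            return "unknown"
        return value or "unknown"

    timestamp = header_option("osmosis_replication_timestamp")
    sequence = header_option("osmosis_replication_sequence_number")

    # Build the release manifest (the VERSION.json described above)
    manifest = {
        "region": region_name,
        "version": version,
        "format": "osm.pbf",
        "file": target_pbf.name,
        "sha256": checksum,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "osm_replication_timestamp": timestamp,
        "osm_replication_sequence": sequence,
        "generator": "osmium-tool",
        "license": "ODbL 1.0"
    }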
    meta_path.write_text(json.dumps(manifest, indent=2) + "\n")

    # Git tagging
    try:
        run_cmd(["git", "add", str(meta_path), str(sha_path)])
        run_cmd(["git", "commit", "-m", f"release: {region_name} {version}"])
        # The version argument already carries the "v" prefix, so do not add another
        run_cmd(["git", "tag", "-a", version, "-m", f"OSM extract {region_name} {version}"])
        print(f"✅ Tagged and committed: {version}")
    except subprocess.CalledProcessError as e:
        print(f"⚠️ Git operation failed: {e.stderr}", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("Usage: python tag_release.py <region> <path/to/source.pbf> <vMAJOR.MINOR.PATCH>")
        sys.exit(1)
    tag_and_release_osm_extract(sys.argv[1], sys.argv[2], sys.argv[3])

Publishing & Registry Workflow

Once tagged locally, push artifacts and tags to your remote registry. For large .osm.pbf files, Git LFS prevents repository bloat while preserving version history:

git lfs track "*.osm.pbf"
git add .gitattributes
git commit -m "chore: configure LFS for spatial artifacts"
git push origin main --follow-tags

If your infrastructure relies on object storage, sync the release directory to S3 or an equivalent artifact repository using aws s3 sync or rclone. Maintain a strict naming convention ({region}-{version}.osm.pbf) to enable programmatic discovery. When coordinating across multiple extraction zones or schema branches, consult Branching & Merge Strategies for Spatial Datasets to isolate experimental filters from production basemaps. Always tag releases on the mainline branch after CI validation passes.
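
As an illustration of the programmatic discovery that the naming convention enables, the sketch below picks the highest version per region from a list of file names. The function names latest_releases and latest_in_directory are hypothetical, and the example assumes tags are always plain vMAJOR.MINOR.PATCH with no pre-release suffix.

```python
import re
from pathlib import Path

# Matches the {region}-{version}.osm.pbf convention, e.g. "berlin-v1.4.2.osm.pbf".
_NAME_RE = re.compile(
    r"^(?P<region>.+)-v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)\.osm\.pbf$"
)


def latest_releases(names: list[str]) -> dict[str, str]:
    """Map each region to the file name carrying its highest version."""
    best: dict[str, tuple[tuple[int, int, int], str]] = {}
    for name in names:
        match = _NAME_RE.match(name)
        if match is None:
            continue  # skip manifests, checksums, and unrelated files
        # Compare versions as integer tuples so that v1.10.0 sorts above v1.9.0.
        version = (int(match["major"]), int(match["minor"]), int(match["patch"]))
        region = match["region"]
        if region not in best or version > best[region][0]:
            best[region] = (version, name)
    return {region: name for region, (_, name) in best.items()}


def latest_in_directory(release_dir: str = "./releases") -> dict[str, str]:
    """Apply latest_releases() to the file names in a local release directory."""
    return latest_releases([p.name for p in Path(release_dir).iterdir()])
```

The same function works on a listing of object-store keys, provided the key names are reduced to their final path component first.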

Validation & Downstream Integration

Reproducibility hinges on cryptographic verification and metadata transparency. Downstream pipelines should:

  1. Verify the .sha256 file against the downloaded archive using sha256sum -c.
  2. Parse VERSION.json to confirm the OSM replication timestamp matches expected freshness thresholds.
  3. Validate referential integrity with osmium check-refs, and confirm the file is readable with ogrinfo, before loading into PostGIS, DuckDB, or tile servers.

For automated freshness checks, monitor the state.txt file published by the OSM replication server and trigger a PATCH release when the upstream sequence number advances beyond your defined threshold. Record the extraction boundary in the manifest, as GeoJSON or as an OSM relation ID, so that consumers know exactly which area a release covers when regions are merged.
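
A threshold check of this kind might look like the sketch below. It assumes the replication state file uses a Java-properties layout with a sequenceNumber line and backslash-escaped colons in the timestamp, which is how the public replication feeds are commonly served. Fetching the file is left out, and the names parse_state and needs_patch_release are hypothetical.

```python
# Sample body in the assumed state.txt layout (values are made up).
SAMPLE_STATE = r"""#Sat Jun 01 00:00:05 UTC 2024
sequenceNumber=4100
timestamp=2024-06-01T00\:00\:02Z
"""


def parse_state(text: str) -> dict[str, str]:
    """Parse a replication state body (properties style) into a dict."""
    state: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the leading comment
        key, _, value = line.partition("=")
        state[key] = value.replace("\\:", ":")  # undo the escaped colons
    return state


def needs_patch_release(state_text: str, released_sequence: int, threshold: int) -> bool:
    """True when upstream is more than `threshold` sequences past the last release."""
    upstream = int(parse_state(state_text)["sequenceNumber"])
    return upstream - released_sequence > threshold


print(parse_state(SAMPLE_STATE)["timestamp"])         # 2024-06-01T00:00:02Z
print(needs_patch_release(SAMPLE_STATE, 4000, 60))    # True
print(needs_patch_release(SAMPLE_STATE, 4090, 60))    # False
```

The released_sequence argument would come from the replication sequence number recorded in the previous release's manifest.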

By standardizing semantic tags, embedding replication anchors, and automating checksum generation, your team reduces manual release friction and makes every spatial extract auditable, cacheable, and reproducible.