How to Configure DVC for PostGIS and GeoJSON
To configure DVC for PostGIS and GeoJSON, treat GeoJSON as standard versioned files and handle PostGIS through reproducible export/import pipelines. Because DVC does not natively track live relational databases, you must serialize PostGIS tables into disk-friendly formats (GeoJSON, GeoPackage, or Parquet), track those serialized outputs with DVC, and automate the extraction using dvc.yaml stages. This workflow preserves spatial schema, coordinate reference systems (CRS), and topology while keeping Git lightweight and binaries in cloud storage.
Compatibility & Prerequisites
Before building the pipeline, verify your stack aligns with these baselines. Mismatched GDAL/Python wheels are a common cause of silent CRS drops or geometry corruption during export.
| Component | Minimum Version | Notes |
|---|---|---|
| DVC | 3.0.0+ | Modern dvc.yaml syntax is stable; legacy dvc run is deprecated |
| Python | 3.9+ | Required for psycopg2, geopandas, and shapely compatibility |
| PostgreSQL/PostGIS | 13+ / 3.0+ | Relies on modern spatial functions and ST_AsGeoJSON |
| GDAL/OGR | 3.4+ | Handles CRS transformations and format validation |
| Remote Storage | S3, GCS, Azure, SSH | DVC tracks metadata locally; heavy binaries live remotely |
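The baselines above can be checked from Python before the pipeline is built. The following is a minimal sketch, not an official DVC or GDAL tool: the helper names are hypothetical, and the distribution names in MINIMUMS (dvc, GDAL) are assumptions about how the packages were installed. Some geopandas installs bundle GDAL without a separate GDAL distribution, in which case the check reports it as not installed.

```python
import re
from importlib import metadata

# Minimum versions taken from the table; distribution names are assumptions.
MINIMUMS = {"dvc": "3.0.0", "GDAL": "3.4"}


def parse_version(text: str) -> tuple:
    """Turn '3.4', '3.4.1' or '3.0.0+' into a 3-tuple of ints, ignoring suffixes."""
    parts = [int(p) for p in re.findall(r"\d+", text)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


def check_stack(minimums=MINIMUMS) -> dict:
    """Return a status string per distribution: 'ok', 'not installed', or the gap."""
    report = {}
    for dist, minimum in minimums.items():
        try:
            installed = metadata.version(dist)
        except metadata.PackageNotFoundError:
            report[dist] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[dist] = "ok"
        else:
            report[dist] = f"{installed} < {minimum}"
    return report


if __name__ == "__main__":
    for name, status in check_stack().items():
        print(f"{name}: {status}")
```

The PostgreSQL and PostGIS versions live on the server, so they are better checked there, for example with SELECT version() and SELECT PostGIS_Version().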
Step 1: Initialize DVC & Configure Remote Storage
Initialize the repository and point DVC to your cloud backend. DVC stores lightweight .dvc pointer files in Git while pushing actual payloads to the remote. For teams managing multi-gigabyte shapefiles or raster mosaics, review Large File Handling in DVC for GIS to optimize chunking, cache eviction, and concurrent pull/push strategies.
dvc init
dvc remote add -d geodata_remote s3://your-bucket/dvc-geospatial
dvc remote modify geodata_remote credentialpath ~/.aws/credentials
git add .dvc/config .dvc/.gitignore .dvcignore
git commit -m "Initialize DVC with S3 remote"
Step 2: Version GeoJSON Directly
GeoJSON is a lightweight, text-based format that DVC handles natively without serialization overhead. Place source files in a structured directory and track them:
mkdir -p data/raw
cp administrative_boundaries.geojson data/raw/
dvc add data/raw/administrative_boundaries.geojson
git add data/raw/administrative_boundaries.geojson.dvc data/raw/.gitignore
git commit -m "Track initial GeoJSON boundary layer"
DVC hashes the file with MD5, stores a copy in .dvc/cache, and writes a small .dvc pointer file for Git to track. RFC 7946 fixes the coordinate reference system to WGS 84 (EPSG:4326) with longitude, latitude order, so a conforming file needs no extra CRS configuration. Avoid embedding custom CRS properties unless your downstream toolchain explicitly requires them.
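Before running dvc add, it can help to confirm that a file really follows the RFC 7946 conventions mentioned here. The sketch below uses only the standard library; the function name rfc7946_issues is hypothetical. It flags a leftover crs member and positions outside the longitude/latitude range, which often indicates projected coordinates or swapped axes. It cannot detect a swap when both values happen to fall inside the valid ranges, and it skips GeometryCollection members.

```python
import json


def _positions(coords):
    """Yield every [lon, lat, ...] position from arbitrarily nested coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords
    else:
        for child in coords:
            yield from _positions(child)


def rfc7946_issues(collection: dict) -> list:
    """Return human-readable warnings for a parsed GeoJSON FeatureCollection."""
    issues = []
    if "crs" in collection:
        issues.append("top-level 'crs' member present (not part of RFC 7946)")
    for index, feature in enumerate(collection.get("features", [])):
        geometry = feature.get("geometry") or {}
        for lon, lat, *_ in _positions(geometry.get("coordinates", [])):
            if not (-180 <= lon <= 180 and -90 <= lat <= 90):
                issues.append(
                    f"feature {index}: position ({lon}, {lat}) outside lon/lat range"
                )
                break  # one report per feature is enough
    return issues


if __name__ == "__main__":
    sample = json.loads(
        '{"type": "FeatureCollection", "features": [{"type": "Feature",'
        ' "properties": {}, "geometry": {"type": "Point",'
        ' "coordinates": [13.4, 52.5]}}]}'
    )
    print(rfc7946_issues(sample))
```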
Step 3: Automate PostGIS Export with DVC Pipelines
PostGIS requires explicit serialization. The production-ready pattern wraps a Python export script in a dvc.yaml stage. Any team member can then regenerate the export with a single command, and the DVC-tracked output records what the table contained when the export ran.
Create scripts/export_postgis.py:
import os

import geopandas as gpd
from sqlalchemy import create_engine


def export_table(table_name: str, output_path: str, crs: str = "EPSG:4326"):
    """Export a PostGIS table to GeoJSON in the target CRS."""
    db_url = os.getenv("POSTGIS_CONNECTION_STRING")
    if not db_url:
        raise EnvironmentError("POSTGIS_CONNECTION_STRING not set")

    engine = create_engine(db_url)
    # table_name is interpolated into the SQL text, so pass trusted values only
    gdf = gpd.read_postgis(f"SELECT * FROM {table_name}", con=engine, geom_col="geom")

    # Assign the CRS if the table has none, otherwise reproject when it differs
    if gdf.crs is None:
        gdf = gdf.set_crs(crs)
    elif gdf.crs != crs:
        gdf = gdf.to_crs(crs)

    # Validate geometry before writing
    invalid_mask = ~gdf.geometry.is_valid
    if invalid_mask.any():
        print(f"Warning: {invalid_mask.sum()} invalid geometries found. Applying buffer(0) fix.")
        # The active geometry column is named "geom" here, not "geometry"
        geom_col = gdf.geometry.name
        gdf.loc[invalid_mask, geom_col] = gdf.loc[invalid_mask, geom_col].buffer(0)

    gdf.to_file(output_path, driver="GeoJSON")
    print(f"Exported {len(gdf)} features to {output_path}")


if __name__ == "__main__":
    export_table("urban_zones", "data/raw/urban_zones.geojson")
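Because the export script builds its query by placing the table name directly into the SQL string, it is worth guarding that value if it ever comes from a CLI argument or config file. The following is a minimal sketch of one such guard; quote_table_name is a hypothetical helper, not part of geopandas or SQLAlchemy, and it deliberately accepts only plain identifiers with an optional schema prefix.

```python
import re

# Plain SQL identifier: a letter or underscore, then letters, digits, underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_table_name(name: str) -> str:
    """Return a double-quoted identifier such as "public"."urban_zones".

    Raises ValueError for anything that is not 'table' or 'schema.table'.
    """
    parts = name.split(".")
    if not 1 <= len(parts) <= 2 or not all(_IDENTIFIER.fullmatch(p) for p in parts):
        raise ValueError(f"Refusing unsafe table name: {name!r}")
    return ".".join(f'"{part}"' for part in parts)


if __name__ == "__main__":
    print(f"SELECT * FROM {quote_table_name('public.urban_zones')}")
```

Note that double-quoted identifiers are case-sensitive in PostgreSQL, so this assumes the name is passed exactly as the table was created (lowercase in the usual case).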
Define the pipeline in dvc.yaml following official DVC pipeline syntax:
stages:
  export_postgis:
    cmd: python scripts/export_postgis.py
    deps:
      - scripts/export_postgis.py
    outs:
      - data/raw/urban_zones.geojson
Execute and track:
export POSTGIS_CONNECTION_STRING="postgresql+psycopg2://user:pass@host:5432/dbname"
dvc repro
git add dvc.yaml dvc.lock data/raw/.gitignore
git commit -m "Add PostGIS export pipeline for urban zones"
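DVC re-runs the stage when the script or any other listed dependency changes, and skips it otherwise. It cannot see changes inside the database itself, so run dvc repro --force when the table contents change but the script does not. For a deeper dive into pipeline architecture and dependency graphs, see Geospatial Data Versioning Fundamentals & Architecture.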
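Since the stage lists only the script as a dependency, one possible way to make database changes visible to DVC is to write a small fingerprint file before each dvc repro and add that file under deps. The sketch below shows only the fingerprinting step and is an assumption about how a team might do this, not a DVC feature. The function names and the data/raw/urban_zones.fingerprint path are hypothetical, and the row count and last-modified value would have to come from your own query (for example a count plus the maximum of a timestamp column, if the table has one).

```python
import hashlib
import json
from pathlib import Path


def table_fingerprint(table: str, row_count: int, last_modified: str) -> str:
    """Hash a few cheap table statistics into a stable hex digest."""
    payload = json.dumps(
        {"table": table, "rows": row_count, "last_modified": last_modified},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def write_fingerprint(path: Path, fingerprint: str) -> None:
    """Write the digest to a file that a dvc.yaml stage can list under deps."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(fingerprint + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Illustrative values only; real ones would be read from PostGIS.
    digest = table_fingerprint("urban_zones", 1250, "2024-01-01T00:00:00Z")
    write_fingerprint(Path("data/raw/urban_zones.fingerprint"), digest)
    print(digest)
```

DVC compares file contents, so rewriting an unchanged fingerprint does not trigger a re-export. A fingerprint built from counts and timestamps can miss edits that change neither, so dvc repro --force remains the fallback.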
Step 4: Validate Spatial Integrity & CI/CD Integration
Versioning spatial data requires more than file tracking. Add a validation stage to catch topology errors before they propagate downstream.
stages:
  validate_geojson:
    cmd: python scripts/validate_spatial.py data/raw/urban_zones.geojson
    deps:
      - data/raw/urban_zones.geojson
      - scripts/validate_spatial.py
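The stage refers to scripts/validate_spatial.py without showing it. The following is a minimal, standard-library sketch of what such a script could contain; it is an assumption, not the original script. It checks only structural rules (non-empty geometries, closed polygon rings with at least four positions) and exits non-zero on failure so that dvc repro stops. Full topology checks such as self-intersection would need a geometry library like shapely.

```python
import json
import sys


def ring_problems(ring: list) -> list:
    """Check one linear ring against the basic GeoJSON ring rules."""
    problems = []
    if len(ring) < 4:
        problems.append("ring has fewer than 4 positions")
    if ring and ring[0] != ring[-1]:
        problems.append("ring is not closed")
    return problems


def validate(collection: dict) -> list:
    """Return a list of error strings; an empty list means the file passed."""
    if collection.get("type") != "FeatureCollection":
        return ["root object is not a FeatureCollection"]
    errors = []
    for index, feature in enumerate(collection.get("features", [])):
        geometry = feature.get("geometry")
        if not geometry or not geometry.get("coordinates"):
            errors.append(f"feature {index}: missing or empty geometry")
            continue
        if geometry["type"] == "Polygon":
            polygons = [geometry["coordinates"]]
        elif geometry["type"] == "MultiPolygon":
            polygons = geometry["coordinates"]
        else:
            polygons = []  # other geometry types are not ring-checked here
        for polygon in polygons:
            for ring in polygon:
                errors += [f"feature {index}: {p}" for p in ring_problems(ring)]
    return errors


def main(path: str) -> int:
    """Validate the file at path and return a process exit code."""
    with open(path, encoding="utf-8") as handle:
        errors = validate(json.load(handle))
    for error in errors:
        print(error, file=sys.stderr)
    return 1 if errors else 0


if __name__ == "__main__":
    if len(sys.argv) == 2:
        sys.exit(main(sys.argv[1]))
    print("usage: validate_spatial.py <path.geojson>", file=sys.stderr)
```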
Run dvc repro to chain export and validation automatically. This ensures that every dvc pull delivers spatially sound datasets. To automate this in CI/CD, use a GitHub Actions workflow:
name: DVC Spatial Pipeline
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-dvc@v1
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: $
          aws-secret-access-key: $
      - name: Run Pipeline
        env:
          POSTGIS_CONNECTION_STRING: $
        run: dvc repro
Troubleshooting & Best Practices
- Secrets Management: Never hardcode database credentials. Use environment variables or integrate with HashiCorp Vault/AWS Secrets Manager. DVC does not track .env files unless you add them explicitly, so also list them in .gitignore to keep them out of Git.
- Schema Drift: If PostGIS table structures change, update your export script and, if output paths change, the dvc.yaml stage. DVC will detect the script change and re-export on the next dvc repro.
- Format Selection: Use GeoJSON for web/JS ecosystems, GeoPackage for desktop GIS (QGIS/ArcGIS), and Parquet for analytical pipelines. DVC treats them identically, but downstream tools have strict format expectations.
- Cache Optimization: Set cache.type in .dvc/config to reflink, hardlink, or symlink on supported filesystems to speed up local dvc checkout operations. The copy type works everywhere but duplicates every file.
- Large Geometry Truncation: PostGIS ST_AsGeoJSON limits coordinates to 9 decimal digits unless you pass a different maxdecimaldigits value. Always validate output bounding boxes and coordinate precision against your project requirements.
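The bounding-box and precision check from the last bullet can be sketched with the standard library alone. The function names below are hypothetical, and the sketch assumes every feature has a non-null geometry with a coordinates member. As a rough guide, six decimal places of a degree correspond to about 0.11 m at the equator.

```python
from decimal import Decimal


def positions(coords):
    """Yield every [x, y, ...] position from arbitrarily nested coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords
    else:
        for child in coords:
            yield from positions(child)


def decimal_places(value) -> int:
    """Count digits after the decimal point in the shortest repr of a number."""
    exponent = Decimal(repr(value)).as_tuple().exponent
    return -exponent if exponent < 0 else 0


def summarize(collection: dict) -> dict:
    """Return the overall bounding box and the largest decimal-place count."""
    xs, ys, max_decimals = [], [], 0
    for feature in collection["features"]:
        for x, y, *_ in positions(feature["geometry"]["coordinates"]):
            xs.append(x)
            ys.append(y)
            max_decimals = max(max_decimals, decimal_places(x), decimal_places(y))
    return {
        "bbox": (min(xs), min(ys), max(xs), max(ys)),
        "max_decimals": max_decimals,
    }
```

Comparing the returned bbox against the known extent of the study area, and max_decimals against the precision the project requires, gives a quick check that an export was neither clipped nor rounded more than intended.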
By combining direct file tracking for GeoJSON with reproducible DVC stages for PostGIS, you achieve full spatial data lineage without bloating your Git history. The pipeline approach scales from single-developer projects to enterprise GIS teams, ensuring consistent, auditable, and cloud-backed geospatial workflows.