Why can't I use standard Git branching for GeoJSON or Shapefile data?

Git operates on line-level text diffs. A single vertex shift in GeoJSON can change hundreds of lines once the file is re-serialised, and binary formats like Shapefile or GeoTIFF appear as complete rewrites. Topology constraints between adjacent features are invisible to Git, so a merge that passes syntactically can still produce overlapping polygons or coordinate reference system mismatches.

What is the safest branching model for a high-frequency spatial data pipeline?

Trunk-based development with short-lived feature branches minimises long-lived divergence. Every branch is validated against the current `main` baseline, keeping geometric context fresh and making spatial rebasing computationally tractable.

How do I handle CRS inconsistencies during a branch merge?

Run CRS normalisation as a mandatory pre-merge gate. Use pyproj or GDAL's ogr2ogr to reproject all incoming features to the target branch's authority CRS before the merge commit. Flag any EPSG code mismatches as blocking errors in CI.

Branching & Merge Strategies for Spatial Datasets

Standard distributed version-control systems assume that differences between revisions are expressible as line-level text deltas — an assumption that breaks immediately when applied to polygons, rasters, or topology-constrained vector layers. GIS teams that bolt Git onto spatial workflows without adaptation end up with repositories that are brittle, opaque, and nearly impossible to merge cleanly at scale.

Why This Is Hard for Spatial Data

Spatial datasets introduce failure modes that have no equivalent in software source control.

Topology sensitivity. Adjacent polygons must share boundaries without gaps or overlaps. When two analysts independently edit neighbouring features on separate branches, each edit may be geometrically valid in isolation yet violate the topology constraint the moment the branches are merged. Standard three-way merge logic has no concept of “these two polygons now overlap” — it sees only lines of text.

Binary and large-file formats. GeoPackage, FlatGeobuf, Cloud Optimized GeoTIFF, and Shapefile bundles are either binary or compressed. Git's delta compression is far less effective on such files, so each changed version can cost close to a full copy, inflating the repository to unworkable sizes within weeks on an active project. Worse, binary files produce no meaningful diff output, so reviewers cannot evaluate what actually changed.

CRS coupling. Every spatial operation is relative to a coordinate reference system. A branch that silently reprojects features from EPSG:4326 to EPSG:3857 will merge without error yet corrupt any downstream analysis that assumes coordinates in degrees: the two systems share the WGS 84 datum, but one stores geographic degrees and the other projected metres. CRS drift is invisible to line-diff tools.
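To make the unit change concrete, the sketch below applies the standard spherical Mercator forward formula (the projection behind EPSG:3857) to one longitude/latitude pair. The function names `to_web_mercator` and `looks_like_degrees` are illustrative, and the range check is only a heuristic; a real pipeline would compare CRS metadata and reproject with pyproj rather than hand-rolling the maths.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used by spherical Web Mercator


def to_web_mercator(lon_deg: float, lat_deg: float) -> tuple[float, float]:
    """Forward spherical Mercator: degrees in, metres out."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def looks_like_degrees(x: float, y: float) -> bool:
    """Heuristic only: values inside lon/lat ranges might be degrees."""
    return -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0


lon, lat = 13.4, 52.5
x, y = to_web_mercator(lon, lat)

print(looks_like_degrees(lon, lat))  # the original pair fits degree ranges
print(looks_like_degrees(x, y))      # the projected pair is in the millions
```

The same feature carries numbers several orders of magnitude apart in the two systems, which is why a text-level merge of mixed branches goes through cleanly and still breaks every distance or area calculation afterwards.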

Non-linear GIS workflows. A topology rebuild, a projection batch, or a geometry simplification run affects every feature in a layer simultaneously — producing diffs that look catastrophic even when the change is routine. Reviewers have no way to distinguish a deliberate global transformation from an accidental mass corruption.

Schema and attribute drift. Adding a column, changing a field type, or renaming an attribute on one branch breaks ETL pipelines and query patterns on every other branch that assumes the old schema. Standard merge tools apply the new schema silently, with no downstream-impact assessment.

These constraints require a fundamentally different approach: decoupled logical branching, geometry-level diffing, and automated spatial validation gates that run before any merge is finalised.

Core Architectural Components

Effective spatial branching rests on four distinct layers, each addressing a different failure mode described above.

1. Branching Model Layer

The branching model governs how work is isolated, promoted, and integrated. Three models cover the majority of real-world geospatial workflows:

Trunk-based with short-lived feature branches suits high-frequency pipelines — daily satellite ingestion, continuous cadastral updates, real-time sensor feeds. All production-ready data lives on main. Contributors open branches for specific, bounded tasks (correcting digitisation errors, applying a projection update, adding attribute fields) and merge back quickly. Short-lived branches reduce the geometric context that must be reconciled at merge time, keeping spatial rebasing tractable. Teams that need to isolate experiments from the primary dataset will find the discipline described in Feature Branching for GIS Development Teams essential — particularly the conventions around branch naming, lifetime limits, and mandatory topology checks before merge.

Environment-driven branching (dev → staging → prod) suits spatial data that feeds regulated outputs: production mapping services, legal boundary datasets, regulatory reports. Each branch represents a promotion environment rather than a feature. Changes enter dev for initial validation, advance to staging for integration testing with downstream consumers, and reach prod only after passing automated topology and schema checks. This model enforces a strict separation between experimental transformations and certified production layers.

Dataset-centric forking treats each external data source — municipal zoning, federal land surveys, commercial POI feeds — as a dedicated branch synchronised with the canonical repository on a defined cadence. Upstream changes are audited before merging into the master spatial layer. When merging external contributions, the spatial diff algorithms for polygon data cluster covers how to compute set-theoretic differences between feature collections, reconcile coordinate precision, and detect geometric regressions before they reach the integrated dataset.

2. Spatial-Aware Diffing Layer

Replacing line-based diff with geometry-level comparison is the single highest-leverage architectural decision in spatial versioning. A geometry-aware diff engine evaluates changes at the feature level: it identifies added, deleted, modified, and topologically altered geometries, normalises coordinate precision across branches, aligns attribute schemas, and flags violations before a merge is committed.
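A feature-level diff of this kind can be sketched in plain Python. The example assumes GeoJSON-like dictionaries whose features carry a `feature_id` property (the same key used by the validation script later in this article); `normalise` and `diff_features` are hypothetical helpers. It classifies features as added, deleted, geometry-changed, or attribute-changed after rounding coordinates, and it does not attempt topological comparison.

```python
def normalise(coords, precision=6):
    """Recursively round nested coordinate lists to a fixed precision."""
    if isinstance(coords, (int, float)):
        return round(coords, precision)
    return [normalise(c, precision) for c in coords]


def diff_features(base: dict, branch: dict, precision: int = 6) -> dict:
    """Compare two FeatureCollection-like dicts keyed by properties['feature_id']."""
    def index(fc):
        return {f["properties"]["feature_id"]: f for f in fc["features"]}

    old, new = index(base), index(branch)
    report = {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "geometry_changed": [],
        "attributes_changed": [],
    }
    for fid in sorted(old.keys() & new.keys()):
        a, b = old[fid], new[fid]
        same_type = a["geometry"]["type"] == b["geometry"]["type"]
        same_coords = normalise(a["geometry"]["coordinates"], precision) == normalise(
            b["geometry"]["coordinates"], precision
        )
        if not (same_type and same_coords):
            report["geometry_changed"].append(fid)
        if a["properties"] != b["properties"]:
            report["attributes_changed"].append(fid)
    return report


def point(fid, zone, x, y):
    """Build a minimal point feature for the example."""
    return {
        "properties": {"feature_id": fid, "zone": zone},
        "geometry": {"type": "Point", "coordinates": [x, y]},
    }


base = {"features": [point("p1", "R1", 1.0, 2.0), point("p2", "R2", 3.0, 4.0),
                     point("p3", "R3", 5.0, 6.0)]}
branch = {"features": [point("p1", "R1-A", 1.00000001, 2.0),  # sub-precision noise
                       point("p3", "R3", 5.5, 6.0),            # real move
                       point("p4", "R4", 7.0, 8.0)]}           # new feature

report = diff_features(base, branch)
print(report)
```

Rounding before comparison keeps sub-precision noise (as on `p1`) out of the geometry report, so reviewers see only the move on `p3`.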

For raster data, pixel-level diffing is computationally prohibitive; tile-based checksums or band-statistics comparisons are the practical alternative. For point clouds, delta compression techniques enable efficient per-return diffs against a registered baseline.
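As a rough illustration of the tile-checksum approach, the sketch below hashes each tile's bytes and reports the tile ids that differ between two branches. How tiles are read out of a real raster is out of scope here; the `bytes` values, the tile-id strings, and both function names are stand-ins chosen for the example.

```python
import hashlib


def tile_checksums(tiles: dict[str, bytes]) -> dict[str, str]:
    """Map each tile id to the SHA-256 digest of its bytes."""
    return {tile_id: hashlib.sha256(data).hexdigest() for tile_id, data in tiles.items()}


def changed_tiles(base: dict[str, str], branch: dict[str, str]) -> list[str]:
    """Tile ids whose checksum differs, or that exist on only one side."""
    all_ids = base.keys() | branch.keys()
    return sorted(t for t in all_ids if base.get(t) != branch.get(t))


base_sums = tile_checksums({"0/0": b"aaaa", "0/1": b"bbbb"})
branch_sums = tile_checksums({"0/0": b"aaaa", "0/1": b"bbbX", "1/0": b"cccc"})

print(changed_tiles(base_sums, branch_sums))
```

Only the changed or new tiles need closer inspection, which keeps the comparison cost proportional to the edit rather than to the raster.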

The delta tracking algorithms that underpin this layer record only the changed vertices and attributes, not full feature copies — keeping branch payloads small even for large datasets.

3. Storage and Branch Isolation Layer

Branching strategies collapse when the storage layer cannot support concurrent spatial operations efficiently. Two architectures are in production use:

Pointer-based storage (DVC, Git-LFS) keeps metadata and pointers in Git while actual spatial files live in cloud object storage (s3://, gs://, Azure Blob). Branch operations are lightweight because only the pointer files change; the binary payloads stay in object storage and are referenced by content hash. Teams managing large raster catalogs that rely on pointer synchronisation for raster datasets benefit most from this architecture, as it avoids downloading unchanged tiles during branch checkouts.

Relational schema isolation (PostGIS) implements branching at the database level via schema namespacing, row-level versioning tables, or temporal tables. For short-lived edits, PostgreSQL savepoints allow transactional branching within a single transaction: the branch is a savepoint, validation runs inside the transaction, and a failed check rolls back to the savepoint without leaving partial data in the target schema. Savepoints do not outlive their transaction, so longer-lived branches need one of the schema- or table-based mechanisms. Spatial indexes (GiST, SP-GiST, BRIN) must be maintained per branch to prevent cross-branch query degradation.
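The savepoint pattern can be sketched without a database server by using Python's built-in `sqlite3` as a stand-in; the `SAVEPOINT`, `ROLLBACK TO SAVEPOINT`, and `RELEASE SAVEPOINT` statements are also valid PostgreSQL syntax. The `parcels` table, the `area_m2 > 0` rule (standing in for a real check such as `ST_IsValid`), and the function name are assumptions made for the example.

```python
import sqlite3


def apply_branch_edit(conn: sqlite3.Connection, rows: list[tuple[str, float]]) -> bool:
    """Insert rows inside a savepoint and undo them if validation fails."""
    conn.execute("SAVEPOINT branch_edit")
    conn.executemany("INSERT INTO parcels (feature_id, area_m2) VALUES (?, ?)", rows)
    # Placeholder validation: a real pipeline would run spatial checks here.
    invalid = conn.execute("SELECT COUNT(*) FROM parcels WHERE area_m2 <= 0").fetchone()[0]
    if invalid:
        conn.execute("ROLLBACK TO SAVEPOINT branch_edit")
    conn.execute("RELEASE SAVEPOINT branch_edit")
    return invalid == 0


# isolation_level=None lets the script issue BEGIN/COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE parcels (feature_id TEXT PRIMARY KEY, area_m2 REAL)")

conn.execute("BEGIN")
conn.execute("INSERT INTO parcels VALUES ('p1', 120.0)")
ok_good = apply_branch_edit(conn, [("p2", 80.0)])
ok_bad = apply_branch_edit(conn, [("p3", 45.0), ("p4", -5.0)])  # rejected as a unit
conn.execute("COMMIT")

ids = [row[0] for row in conn.execute("SELECT feature_id FROM parcels ORDER BY feature_id")]
print(ok_good, ok_bad, ids)
```

The rejected edit is rolled back as a unit, so the valid row `p3` that arrived alongside the invalid `p4` does not leak into the table.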

Design rules that apply regardless of storage backend:

Treat reference datasets (administrative boundaries, hydrography layers) as immutable or append-only branches that feature branches can read but not modify.
Store only modified geometries and attributes in feature branches; merge deltas rather than full copies.
Maintain spatial indexes independently per branch to avoid cross-branch query interference.

4. Merge Validation and Conflict Resolution Layer

Merge conflicts in spatial data rarely manifest as text collisions. They appear as overlapping geometries, conflicting attribute assignments, or divergent coordinate transformations. The automated conflict detection system must parse spatial relationships, apply deterministic resolution rules, and surface genuinely ambiguous cases for human review — without halting the pipeline for routine merges that have no spatial conflict.

Resolution strategies:

Geometric precedence rules: define which branch’s geometry wins when features overlap — last-timestamp, source-authority, union, or intersection — and encode this as a deterministic policy rather than a case-by-case decision.
Attribute reconciliation: merge conflicting metadata using type-safe casting and audit trails that preserve both values; never silently drop a column.
Topology repair automation: run post-merge snapping, gap-filling, and overlap removal using shapely operations or PostGIS functions before the merge commit is finalised.
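A deterministic precedence policy can be as small as a sort key. The sketch below combines two of the rules listed above, source-authority first and last-timestamp as the tie-breaker; the authority names, their ranking, and the `edited_at` field are hypothetical and would come from project policy.

```python
from datetime import datetime

# Hypothetical ranking: a lower number means higher authority.
AUTHORITY_RANK = {"national-survey": 0, "municipal": 1, "field-team": 2}


def resolve(candidates: list[dict]) -> dict:
    """Pick the winning version: highest authority, then most recent edit."""
    newest_first = sorted(
        candidates,
        key=lambda c: datetime.fromisoformat(c["edited_at"]),
        reverse=True,
    )
    # min() returns the first minimal item, so ties go to the newest edit.
    return min(newest_first, key=lambda c: AUTHORITY_RANK[c["source_authority"]])


candidates = [
    {"branch": "a", "source_authority": "municipal", "edited_at": "2024-06-01T09:00:00"},
    {"branch": "b", "source_authority": "national-survey", "edited_at": "2024-03-01T09:00:00"},
    {"branch": "c", "source_authority": "national-survey", "edited_at": "2024-04-15T09:00:00"},
    {"branch": "d", "source_authority": "field-team", "edited_at": "2024-07-01T09:00:00"},
]

winner = resolve(candidates)
print(winner["branch"])
```

Because the rule is encoded once, the same inputs always produce the same winner, and the policy itself can be reviewed and versioned like any other code.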

Implementation Patterns

The pipeline logic differs materially across data types. The diagram below shows the validation DAG that runs on every branch push.

Vector Datasets

Python-based pipelines typically use geopandas for schema and attribute validation, shapely for geometry checks, and pyproj for CRS normalisation. A minimal pre-merge validation script:

import geopandas as gpd
from shapely.validation import explain_validity
import pyproj

TARGET_CRS = pyproj.CRS("EPSG:4326")

def validate_branch(path: str) -> list[str]:
    """Return a list of blocking errors; an empty list means the branch may merge."""
    errors = []
    gdf = gpd.read_file(path)

    # Stage 1: schema lint
    required_cols = {"feature_id", "source_authority", "valid_from"}
    missing = required_cols - set(gdf.columns)
    if missing:
        errors.append(f"Missing required columns: {sorted(missing)}")

    # Stage 2a: CRS normalisation gate
    crs_ok = False
    if gdf.crs is None:
        errors.append("No CRS defined; cannot merge.")
    elif not gdf.crs.equals(TARGET_CRS):
        errors.append(
            f"CRS mismatch: branch has {gdf.crs.to_epsg()}, "
            f"target expects {TARGET_CRS.to_epsg()}"
        )
    else:
        crs_ok = True

    # Stage 2b: geometry validity
    # Null geometries are reported separately because explain_validity() needs a geometry.
    has_geom = gdf.geometry.notna()
    for idx, row in gdf[~has_geom].iterrows():
        errors.append(f"Feature {row.get('feature_id', idx)}: missing geometry")
    for idx, row in gdf[has_geom & ~gdf.is_valid].iterrows():
        reason = explain_validity(row.geometry)
        errors.append(f"Feature {row.get('feature_id', idx)}: {reason}")

    # Stage 3: bounding-box sanity, only meaningful once the CRS is confirmed as EPSG:4326
    if crs_ok and has_geom.any():
        min_lon, min_lat, max_lon, max_lat = gdf.total_bounds
        if min_lon < -180 or max_lon > 180 or min_lat < -90 or max_lat > 90:
            errors.append("Coordinates exceed EPSG:4326 bounds.")

    return errors

Raster Datasets

Raster branches skip geometry diffing and instead validate tile completeness, nodata consistency, and band-statistics drift. Cloud Optimized GeoTIFF branches are compared by tile checksum, not pixel value. Band statistics (mean, standard deviation, percentile) are recorded at tag time and compared against the previous release; a drift beyond a configurable threshold blocks the merge.
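The band-statistics gate can be sketched with the standard library. The example assumes a band is available as a flat list of pixel values and uses a 5% relative-change threshold; both the threshold and the function names are illustrative, and a real pipeline would read the statistics from the raster itself.

```python
import statistics


def band_stats(pixels: list[float]) -> dict[str, float]:
    """Summary statistics recorded for one band at tag time."""
    return {"mean": statistics.fmean(pixels), "stdev": statistics.pstdev(pixels)}


def drifted(previous: dict[str, float], current: dict[str, float],
            threshold: float = 0.05) -> list[str]:
    """Names of statistics whose relative change exceeds the threshold."""
    flagged = []
    for name, old in previous.items():
        scale = abs(old) if old != 0 else 1.0  # avoid dividing by zero
        if abs(current[name] - old) / scale > threshold:
            flagged.append(name)
    return flagged


release = band_stats([10, 20, 30, 40])
minor_edit = band_stats([10, 20, 30, 41])      # small local change
global_shift = band_stats([20, 30, 40, 50])    # every pixel offset by 10

print(drifted(release, minor_edit))
print(drifted(release, global_shift))
```

A non-empty result would block the merge; the uniform offset moves the mean but leaves the spread untouched, which is exactly the kind of signal a reviewer needs to tell a recalibration from local edits.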

Point-Cloud Datasets

Point clouds require registration validation before merging. Branches containing LiDAR captures must report a point-to-point alignment error (typically measured in centimetres) against the reference cloud. Density maps are compared tile by tile; tiles whose density drops by more than a defined threshold are flagged for human review. The delta compression patterns described in the LiDAR point-cloud delta compression article also apply here: storing only the changed returns rather than full sweeps keeps branch payloads manageable.

Operational Workflows & Governance

Branching Conventions

Consistent branch naming makes automated routing possible. Recommended convention:

<type>/<dataset-slug>/<short-description>

Examples:

feat/cadastral-parcels/add-zoning-attribute
fix/hydrography/correct-river-topology
release/basemap-2024q4/v3.2.0
env/staging/integration-test

The type prefix allows CI to apply different validation profiles: feat/ branches run full topology enforcement; fix/ branches run targeted checks scoped to the changed features; release/ branches run the complete validation suite plus performance benchmarks.
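One way CI might route on the prefix is a small parser that maps the branch name to a list of checks. The check names and the fallback used for `env/` branches are assumptions made for the sketch; only the three profiles for `feat/`, `fix/`, and `release/` follow the description above.

```python
import re

BRANCH_RE = re.compile(
    r"^(?P<type>feat|fix|release|env)/(?P<dataset>[a-z0-9-]+)/(?P<description>[a-z0-9.-]+)$"
)

# Hypothetical check names; the grouping mirrors the profiles described in the text.
PROFILES = {
    "feat": ["schema", "crs", "topology-full"],
    "fix": ["schema", "crs", "topology-changed-features"],
    "release": ["schema", "crs", "topology-full", "benchmarks"],
}
DEFAULT_PROFILE = ["schema", "crs"]  # assumed fallback, e.g. for env/ branches


def validation_profile(branch: str) -> list[str]:
    """Return the checks to run for a branch, or raise on a malformed name."""
    match = BRANCH_RE.match(branch)
    if match is None:
        raise ValueError(f"Branch name does not follow <type>/<dataset-slug>/<short-description>: {branch}")
    return PROFILES.get(match["type"], DEFAULT_PROFILE)


print(validation_profile("feat/cadastral-parcels/add-zoning-attribute"))
print(validation_profile("release/basemap-2024q4/v3.2.0"))
```

Rejecting malformed names outright keeps the routing honest: a branch that does not parse never falls through to a weaker profile by accident.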

Pre-Commit Hook Requirements

Every contributor’s local environment should enforce the minimum validation set before a commit is accepted:

# .pre-commit-config.yaml (relevant hooks)
repos:
  - repo: local
    hooks:
      - id: crs-check
        name: CRS consistency
        language: python
        entry: python scripts/validate_crs.py
        files: \.(geojson|gpkg|fgb)$

      - id: topology-check
        name: Topology validity
        language: python
        entry: python scripts/validate_topology.py
        files: \.(geojson|gpkg|fgb)$

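      - id: no-binary-blobs
        name: Block untracked binary spatial files
        # "fail" rejects files by name; pygrep would search file contents instead
        language: fail
        entry: 'Raw raster files must go through DVC or Git-LFS, not a direct commit.'
        files: \.(tif|img|ecw)$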

The no-binary-blobs hook prevents raw raster files from being committed directly to Git. All binary spatial payloads must route through DVC or Git-LFS — committing them directly is the most common single cause of repository bloat and uncorrectable history.

Team Sync Cadence

For teams running environment-driven branching, a weekly sync meeting covering staging → prod promotions keeps the pipeline from becoming a bottleneck. The agenda should cover the topology validation report from the previous week's merges, any CRS normalisation exceptions, and schema change proposals with their downstream impact. For trunk-based teams, asynchronous review via pull-request comments is sufficient, provided the CI pipeline enforces the validation gates described above.

The conflict resolution and team synchronisation topic covers the human-coordination dimension in depth — manual review triggers, priority queuing for conflicting edits, and reconciliation patterns for distributed field teams.

Security & Compliance Boundaries

Spatial data carries legal and regulatory risk that source code typically does not. Cadastral boundaries, environmental survey extents, and classified infrastructure polygons are subject to data-sharing agreements that restrict who can read, modify, or export them.

Access control at the branch level. Branches containing regulated datasets (sensitive land titles, protected area boundaries, infrastructure security perimeters) must be restricted to named contributors with verified need-to-know. In Git-based workflows, this means branch protection rules that prevent direct pushes and require code-owner approval for merge. In database-backed workflows, row-level security policies on PostGIS tables enforce the same isolation. The security boundaries in spatial repositories guide covers the full access-control stack, including setting up secure access controls for versioned shapefiles.

Audit trail requirements. Every merge that touches a regulated layer must produce an immutable record: who approved, what validation gates passed, which CRS was active, and what the bounding extent of the change was. Git commit metadata captures some of this, but a supplementary audit log (written to append-only storage at merge time) is required for regulatory reporting. Log entries should include the committer’s identity, the dataset version tag before and after the merge, and a hash of the topology validation report.

Encryption at rest and in transit. Binary spatial payloads stored in object storage should use server-side encryption with customer-managed keys. In transit, all DVC remote operations should use TLS; S3 bucket policies should deny non-TLS requests. For PostGIS branches, connection encryption (sslmode=require) must be enforced and verified in the CI environment.

Export controls. Some spatial datasets are subject to export restrictions. Branches containing such data should be tagged with a classification label in their branch description and frontmatter, and CI should block any operation that would publish classified features to a public-facing tile service or GeoJSON export endpoint.

Failure Modes & Anti-Patterns

These are the patterns that most reliably cause merge-induced data corruption in production spatial repositories.

Committing binary spatial files directly to Git. Large GeoTIFFs and GeoPackage files bloat the repository immediately, and the damage is costly to undo: removing the blobs means rewriting history and force-pushing, which changes the hash of every affected commit and breaks existing clones and references. Remediation: run git filter-repo to strip blobs above a size threshold, then retroactively migrate the files to DVC. Prevention: the no-binary-blobs pre-commit hook described above.

Skipping CRS validation before merge. A branch that silently reprojects features from EPSG:32634 to EPSG:4326 will merge cleanly at the text level. Downstream, every spatial join, buffer, and area calculation will produce wrong results because the coordinate units changed. Remediation: add a blocking CRS gate to CI and reproject the offending branch back to the canonical CRS before re-submitting the merge request.

Ignoring topology before merge. Self-intersecting polygons, unclosed rings, and spike vertices pass geometry-existence checks but fail topology validation. If a layer with topology errors is merged into a production boundary dataset, every downstream spatial join against that layer will produce incorrect results — often silently. Remediation: run ST_IsValid / ST_MakeValid on the branch before opening a merge request, and block the merge in CI until all features pass.

Long-lived feature branches with wide spatial footprints. A branch that touches every feature in a large polygon layer will conflict with almost any other concurrent branch. When this branch is finally merged, the topology check must compare the full layer against the current main state — a slow, expensive operation with a high probability of producing conflicts. Remediation: decompose large edits into spatially bounded sub-branches covering defined tile extents; merge each tile branch independently.

Schema changes without a migration plan. Adding a non-nullable column on a feature branch causes every other branch that writes to that table to fail immediately after the merge. Remediation: treat schema changes as breaking releases (MAJOR version bump), communicate them to all branch owners before merging, and provide a migration script that backfills the new column for existing features.
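Detecting the change is the easy half of a migration plan. The sketch below compares two column-to-type mappings and, following the rule above, treats any difference as requiring a MAJOR bump; the type strings and function names are illustrative.

```python
def schema_diff(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two column -> type mappings."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(col for col in shared if old[col] != new[col]),
    }


def requires_major_bump(diff: dict[str, list[str]]) -> bool:
    """Any schema change counts as breaking under the policy described above."""
    return any(diff.values())


main_schema = {"feature_id": "str", "area_m2": "float", "zone": "str"}
branch_schema = {"feature_id": "str", "area_m2": "int", "land_use": "str"}

diff = schema_diff(main_schema, branch_schema)
print(diff)
print(requires_major_bump(diff))
```

The resulting report is also a convenient thing to attach to the notice sent to other branch owners before the merge.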

Merging branches with misaligned precision tolerances. Two branches may each use a different coordinate precision (e.g., 6 decimal places vs 8). When merged, vertices that should be shared between adjacent polygons are no longer exactly coincident, introducing hairline gaps that topology checks miss at coarse tolerances. Remediation: define a project-wide coordinate precision standard (e.g., 6 decimal places for EPSG:4326), enforce it in the crs-check pre-commit hook, and run ST_SnapToGrid on all incoming features before merge.
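The effect of snapping can be seen on a single shared edge. The sketch rounds each vertex to six decimal places as a plain-Python stand-in for what `ST_SnapToGrid` does in PostGIS; `snap_to_grid` and the sample coordinates are made up for the example.

```python
def snap_to_grid(coords: list[tuple[float, float]], places: int = 6) -> list[tuple[float, float]]:
    """Round every (x, y) vertex to a fixed number of decimal places."""
    return [(round(x, places), round(y, places)) for x, y in coords]


# The same boundary as stored by two branches with different precision.
edge_a = [(13.40000012, 52.50000031), (13.41, 52.51)]
edge_b = [(13.4000001234567, 52.5000002), (13.41, 52.51)]

shared_before = set(edge_a) & set(edge_b)
shared_after = set(snap_to_grid(edge_a)) & set(snap_to_grid(edge_b))

print(len(shared_before), len(shared_after))
```

Rounding is not a complete fix: two nearly identical values that fall on opposite sides of a rounding boundary still snap to different grid points, so a topology check at a matching tolerance remains necessary after snapping.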

Feature Branching for GIS Development Teams — naming conventions, lifetime limits, and topology-check patterns for short-lived spatial branches
Spatial Diff Algorithms for Polygon Data — geometry-level comparison engines, set-theoretic feature diffs, and raster tile checksums
Automated Conflict Detection in Merge Requests — CI integration, geometric precedence rules, and human-review escalation
Release Tagging Strategies for Spatial Basemaps — semantic versioning for spatial datasets, release manifests, and tile-index metadata
Conflict Resolution & Team Synchronisation Workflows — distributed field team coordination, manual review triggers, and attribute reconciliation patterns
Geospatial Data Versioning Fundamentals & Architecture — delta tracking, pointer synchronisation, and the foundational storage architecture that branching depends on