Parsing GeoJSON & Shapefile APIs: A Production-Ready Python Workflow

Parsing GeoJSON & Shapefile APIs remains a foundational requirement for modern geospatial ETL pipelines. While cloud-native formats like Parquet and cloud-optimized GeoTIFFs are gaining traction, municipal portals, environmental agencies, and legacy GIS platforms continue to expose vector data through REST endpoints that return either RFC 7946-compliant GeoJSON or zipped ESRI Shapefiles. For GIS analysts, data engineers, and urban tech teams, building resilient ingestion logic that handles both formats without manual intervention is critical to maintaining automated data pipelines.

This guide outlines a tested, step-by-step workflow for fetching, validating, normalizing, and persisting vector data from public and authenticated APIs. The patterns presented here align with broader strategies for Mastering Geospatial Data Ingestion in Python, emphasizing reproducibility, memory efficiency, and schema enforcement.

Prerequisites

Before implementing the ingestion workflow, ensure your environment meets the following baseline:

  • Python 3.9+ with virtual environment isolation
  • Core Libraries: requests>=2.31.0, geopandas>=1.0.0, shapely>=2.0.0, pyproj>=3.6.0
  • System Dependencies: GDAL/OGR compiled headers (typically resolved via conda install -c conda-forge geopandas or apt install libgdal-dev)
  • API Access: Endpoint documentation, rate limit awareness, and optional authentication credentials

Modern geospatial APIs frequently implement pagination, spatial bounding filters, and content negotiation. Understanding how to programmatically negotiate these parameters prevents incomplete downloads, schema drift, and out-of-memory crashes during bulk operations.

Step-by-Step Workflow

1. Endpoint Configuration & Request Strategy

Identify whether the API serves GeoJSON natively (application/json or application/geo+json) or packages Shapefiles in .zip archives. Configure base URLs, query parameters, and HTTP headers. For spatially constrained requests, many providers accept bbox parameters formatted as west,south,east,north. When designing spatial filters, refer to Extracting bounding boxes from GeoJSON APIs to ensure coordinate ordering and projection alignment match the provider’s expectations. Misaligned bounding boxes are a common cause of empty result sets or truncated geometries.

2. Fetching & Streaming Responses

Use requests with streaming enabled to avoid loading multi-megabyte payloads into memory. For GeoJSON, parse the JSON incrementally or validate the structure before conversion. For Shapefiles, stream the binary response directly into a zipfile object and pass the in-memory archive to geopandas.

When endpoints require credentials, never hardcode tokens. Instead, implement a token-refresh routine or leverage environment variables. For enterprise deployments, consult Handling authentication tokens for ArcGIS REST services to understand token lifecycle management and automatic renewal patterns.

import requests
import io
import geopandas as gpd
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def fetch_vector_data(url: str, params: dict = None, headers: dict = None) -> gpd.GeoDataFrame:
    session = requests.Session()
    retry_strategy = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
    session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
    
    headers = headers or {"Accept": "application/json, application/geo+json, application/zip"}
    response = session.get(url, params=params, headers=headers, stream=True, timeout=60)
    response.raise_for_status()
    
    content_type = response.headers.get("Content-Type", "").lower()
    
    if "application/zip" in content_type or url.endswith(".zip"):
        return gpd.read_file(io.BytesIO(response.content))
    else:
        # Fallback: parse GeoJSON directly from response text
        return gpd.read_file(io.StringIO(response.text), driver="GeoJSON")

3. Geometry Validation & Schema Normalization

Raw API responses frequently contain self-intersecting polygons, collapsed lines, or mixed attribute naming conventions. Apply strict validation using shapely.is_valid_reason to diagnose issues, then repair geometries using shapely.make_valid(). Standardizing column names and dropping null geometries early prevents downstream pipeline failures.

import shapely
import pandas as pd

def normalize_gdf(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    # Drop rows with missing geometries
    gdf = gdf.dropna(subset=["geometry"])
    
    # Validate and repair geometries
    invalid_mask = ~gdf.geometry.is_valid
    if invalid_mask.any():
        gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask, "geometry"].apply(shapely.make_valid)
        gdf = gdf[gdf.geometry.is_valid]
        
    # Standardize column names: lowercase, replace spaces with underscores
    gdf.columns = [col.lower().replace(" ", "_") for col in gdf.columns]
    
    return gdf

4. CRS Harmonization & Spatial Indexing

Coordinate Reference System (CRS) mismatches cause silent spatial errors. Always inspect the source CRS, explicitly reproject to a consistent target (e.g., EPSG:4326 for web mapping or a local projected CRS for analysis), and build a spatial index. While raster workflows often rely on Syncing STAC Catalogs with pystac-client for metadata discovery, vector pipelines require explicit CRS enforcement at ingestion time.

TARGET_CRS = "EPSG:4326"

def harmonize_crs(gdf: gpd.GeoDataFrame, target_crs: str = TARGET_CRS) -> gpd.GeoDataFrame:
    if gdf.crs is None:
        raise ValueError("Input GeoDataFrame lacks CRS definition. Assign manually before harmonization.")
    if gdf.crs.to_epsg() != int(target_crs.split(":")[1]):
        gdf = gdf.to_crs(target_crs)
    gdf.sindex  # Build spatial index for fast spatial joins/queries
    return gdf

5. Persistence & Output Formatting

Persist validated data using columnar or spatial database formats rather than raw CSV/JSON. GeoPackage (.gpkg) and Parquet with geopandas’s to_parquet() method offer excellent compression and schema preservation. Partition large datasets by administrative boundaries or temporal windows to optimize query performance.

def persist_data(gdf: gpd.GeoDataFrame, output_path: str, format: str = "parquet"):
    if format == "parquet":
        gdf.to_parquet(output_path, compression="zstd", index=False)
    elif format == "gpkg":
        gdf.to_file(output_path, driver="GPKG", layer="ingested_data", index=False)
    else:
        raise ValueError("Unsupported output format. Use 'parquet' or 'gpkg'.")

6. Error Handling & Pipeline Resilience

Production ingestion must survive transient network failures, malformed payloads, and provider-side schema changes. Wrap the core workflow in a structured exception handler, log diagnostic information, and implement graceful degradation. When dealing with community-driven or highly dynamic endpoints, patterns similar to those used for Fetching OSM Data via Overpass API—such as query validation before execution and strict timeout enforcement—prove invaluable.

import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

def run_ingestion_pipeline(api_url: str, params: dict, output_path: str):
    try:
        logging.info("Fetching vector data from %s", api_url)
        gdf = fetch_vector_data(api_url, params)
        
        logging.info("Normalizing geometries and schema")
        gdf = normalize_gdf(gdf)
        
        logging.info("Harmonizing CRS to %s", TARGET_CRS)
        gdf = harmonize_crs(gdf)
        
        logging.info("Persisting to %s", output_path)
        persist_data(gdf, output_path)
        logging.info("Pipeline completed successfully. %d features ingested.", len(gdf))
        
    except requests.exceptions.RequestException as e:
        logging.error("Network/API error during fetch: %s", e)
    except ValueError as e:
        logging.error("Data validation error: %s", e)
    except Exception as e:
        logging.exception("Unexpected pipeline failure: %s", e)

Production Checklist

Before deploying this workflow to a scheduler (e.g., Airflow, Prefect, or GitHub Actions), verify the following:

  • Memory Profiling: Confirm that stream=True and in-memory zipfile handling prevent OOM errors on payloads >500MB.
  • Schema Drift Monitoring: Log attribute column counts and types on each run. Alert on unexpected schema changes.
  • Idempotency: Ensure repeated runs overwrite or append safely without duplicating features.
  • GDAL Version Alignment: Match fiona and geopandas versions to the system GDAL installation to avoid silent driver failures.
  • Rate Limit Compliance: Implement exponential backoff and respect Retry-After headers.

Conclusion

Parsing GeoJSON & Shapefile APIs requires more than a simple HTTP GET and a format converter. By enforcing streaming downloads, rigorous geometry validation, explicit CRS harmonization, and structured error handling, teams can build ingestion pipelines that remain stable across provider updates and scale efficiently. The patterns outlined here provide a reusable foundation for integrating legacy GIS exports into modern data stacks, ensuring that spatial data remains a reliable asset rather than an operational bottleneck.