Implementing fallback write routing during network partitions

A fleet of edge gateways streams sensor telemetry to a central InfluxDB cluster over cellular or shared WAN links, and those links drop. When cellular backhaul degrades, a gateway loses connectivity, or a regional endpoint blackholes traffic, every synchronous write to InfluxDB fails and the ingestion thread either blocks or discards data. This page shows how to keep telemetry lossless through that window: intercept the write failure, confirm a genuine partition with circuit-breaker semantics rather than a single timeout, spool the failed batch to a crash-safe local buffer, and replay it in order once the primary recovers. It is the write-path implementation behind the broader fallback routing and high-availability patterns, and it leans on the same Python client orchestration patterns used elsewhere in the ingestion tier.

Prerequisites

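InfluxDB 2.7+ for the Flux task used in server-side reconciliation. InfluxDB 3.x does not run Flux tasks, so on 3.x only the client-side buffer and drain apply and the staging replay needs a different scheduler.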
Python 3.9+ with influxdb-client 1.36+ on each ingestion service.
SQLite 3.24+ (bundled with modern Python) for the local buffer. WAL mode itself needs only 3.7+; 3.24+ is required if you extend the buffer with INSERT ... ON CONFLICT upserts.
A primary write bucket (production_telemetry) plus a staging bucket (fallback_staging) for server-side replay.
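A token with read/write on both buckets, permission to create tasks, and read access to the _tasks system bucket for the verification query. An operator or all-access token covers all of this.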
Local disk on the edge host sized for the longest partition you must survive (see the buffer-sizing note in step 2).
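
As a rough aid for the last two prerequisites, the sketch below checks the local runtime against the version floors listed above and estimates buffer size from write rate and outage length. The function names, the 1.5x overhead factor, and the sample figures are illustrative assumptions, not measured values.

```python
import sqlite3
import sys


def meets_prerequisites() -> bool:
    """True if this interpreter satisfies the Python and SQLite floors above."""
    return sys.version_info >= (3, 9) and sqlite3.sqlite_version_info >= (3, 24, 0)


def estimate_buffer_bytes(points_per_sec: float, avg_line_bytes: int,
                          outage_seconds: int, overhead: float = 1.5) -> int:
    """Rough disk estimate for one partition.

    overhead is an assumed multiplier for JSON framing, SQLite pages and the
    WAL file; measure it on real data before relying on it.
    """
    return int(points_per_sec * avg_line_bytes * outage_seconds * overhead)
```

For a hypothetical gateway writing 200 points per second at about 120 bytes per line, `estimate_buffer_bytes(200, 120, 6 * 3600)` puts a six-hour outage at roughly 778 MB under these assumptions.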

Why a single failed request is not a partition

Relying on one HTTP status code to trigger fallback is unreliable. TCP retransmission timeouts, load-balancer buffering, and proxy health-check intervals all produce transient failures that clear on the next attempt, so flipping to the fallback path on the first error causes needless buffering and duplicate writes on replay. A correct implementation treats partition detection as a state machine with a failure threshold, and prioritises write continuity over immediate consistency — the data lands somewhere durable now and reconciles later. This is the same durability-first stance that governs the wider InfluxDB data lifecycle and architecture fundamentals, where ingestion resilience is a first-class concern rather than an afterthought.

Solution walkthrough

1. Detect the partition with a three-state circuit breaker

Model the connection to InfluxDB as a breaker with three states, and drive transitions from confirmed failures rather than single errors:

Closed — normal operation. Writes go straight to InfluxDB; latency and error rate are monitored continuously.
Open — partition confirmed. Synchronous writes are short-circuited immediately and every batch is serialized to the local buffer. Health probes continue at a reduced cadence.
Half-Open — recovery suspected. A controlled subset of writes (the buffer drain) is routed to the primary. Success returns the breaker to Closed; any failure reverts it to Open.
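
The three states and their transitions can be written down as a small lookup table. This is a minimal sketch; the event names (`threshold_reached`, `probe_passed`, `drain_succeeded`, `drain_failed`) are hypothetical labels for the conditions described above, not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Any pair not listed leaves the
# breaker where it is.
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.OPEN, "probe_passed"): State.HALF_OPEN,
    (State.HALF_OPEN, "drain_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "drain_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the breaker state after `event`; unknown pairs keep the state."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in one table makes it easy to see that there is no direct path from Open to Closed: recovery always passes through Half-Open.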

Confirm a partition only when several probes agree. A practical rule is three consecutive /health failures inside a 15-second window, or write latency exceeding the 99th-percentile baseline by more than 400%. The health probe itself is a cheap, isolated check you can run on its own cadence:

python

def _check_health(client) -> bool:
    """A single partition probe: True only if the primary answers 'pass'."""
    try:
        return client.health().status == "pass"
    except Exception:
        # Any transport-level failure counts as an unhealthy probe.
        return False

The key parameter is the consecutive failure count, not a rate. Three failures in a row is a far stronger partition signal than three failures out of a hundred, which is just noise on a busy link.
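
One way to express the "three consecutive failures inside a 15-second window" rule is a small detector that is fed one probe result at a time. The class name and the injectable `clock` parameter are assumptions made for this sketch so that it can be tested without waiting on real time.

```python
import time
from collections import deque
from typing import Callable


class ConsecutiveFailureDetector:
    """Confirms a partition after `threshold` probe failures in a row that
    all fall inside `window_s` seconds. Any passing probe resets the run."""

    def __init__(self, threshold: int = 3, window_s: float = 15.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.clock = clock
        # Timestamps of the most recent consecutive failures only.
        self.fail_times = deque(maxlen=threshold)

    def record(self, probe_passed: bool) -> bool:
        """Feed one probe result; returns True once a partition is confirmed."""
        if probe_passed:
            self.fail_times.clear()
            return False
        now = self.clock()
        self.fail_times.append(now)
        return (len(self.fail_times) == self.threshold
                and now - self.fail_times[0] <= self.window_s)
```

A caller would pass the result of each health probe to `record()` and open the breaker the first time it returns True. Failures spread further apart than the window never confirm a partition, which is the noise-rejection behaviour described above.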

2. Intercept failed writes into a durable local buffer

The influxdb-client library supplies retry and batching but has no native fallback routing, so wrap it in a dispatcher that catches ConnectionError, TimeoutError, and InfluxDBError, spools the failed batch to disk, and only then acknowledges the upstream producer. The buffer uses SQLite in Write-Ahead Logging (WAL) mode for crash-safe, ACID persistence without blocking the ingest thread, storing each batch as serialized Line Protocol so timestamps and tags replay exactly.

python

import json
import time
import sqlite3
import threading
import logging
from enum import Enum
from typing import List
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.exceptions import InfluxDBError
from influxdb_client.client.write_api import SYNCHRONOUS
from urllib3.exceptions import HTTPError

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class FallbackWriteRouter:
    def __init__(self, url, token, org, bucket, fallback_db="telemetry_fallback.db"):
        self.org = org
        self.bucket = bucket
        self.state = CircuitState.CLOSED
        self.lock = threading.RLock()

        # Synchronous writes: a failed request raises inside write(), so the
        # except clause there can spool the batch. The batching write API
        # flushes on a background thread and reports failures through
        # callbacks, so its exceptions would never reach this dispatcher.
        # The breaker and the disk buffer are the retry path, so the client
        # itself does not retry. timeout is in milliseconds.
        self.client = InfluxDBClient(url=url, token=token, org=org, timeout=5_000)
        self.write_api = self.client.write_api(write_options=SYNCHRONOUS)

        self._init_fallback_db(fallback_db)

        # Circuit-breaker thresholds.
        self.failure_count = 0
        self.max_failures = 3
        self.last_failure_time = 0.0

    def _init_fallback_db(self, db_path: str):
        # check_same_thread=False: the connection is shared across the ingest
        # thread and the background drain thread (guarded by self.lock).
        self.db_conn = sqlite3.connect(db_path, timeout=10.0, check_same_thread=False)
        self.db_conn.execute("PRAGMA journal_mode=WAL;")     # crash-safe, non-blocking
        self.db_conn.execute("PRAGMA synchronous=NORMAL;")   # durable enough, far faster
        self.db_conn.execute("""
            CREATE TABLE IF NOT EXISTS telemetry_buffer (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                payload TEXT NOT NULL,
                ingested_at REAL NOT NULL
            )
        """)
        self.db_conn.commit()

    def _transition_state(self, new_state: CircuitState):
        with self.lock:
            old = self.state
            self.state = new_state
            if old != new_state:
                logging.info(f"Circuit breaker: {old.value} -> {new_state.value}")

    def write(self, points: List[Point]):
        # In OPEN state, never touch the network: spool straight to disk.
        if self.state == CircuitState.OPEN:
            self._buffer_to_disk(points)
            return
        try:
            self.write_api.write(bucket=self.bucket, record=points)
            with self.lock:
                self.failure_count = 0
                if self.state == CircuitState.HALF_OPEN:
                    self._transition_state(CircuitState.CLOSED)
        except (InfluxDBError, HTTPError, OSError) as e:
            # HTTPError covers urllib3 transport failures (refused connection,
            # read timeout); OSError covers ConnectionError and TimeoutError.
            self._handle_write_failure(points, e)

    def _handle_write_failure(self, points: List[Point], error: Exception):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.max_failures:
                self._transition_state(CircuitState.OPEN)
        self._buffer_to_disk(points)
        logging.warning(f"Write failed, routed to fallback buffer: {error}")

    def _buffer_to_disk(self, points: List[Point]):
        serialized = [p.to_line_protocol() for p in points]
        # Hold the lock so the drain thread cannot interleave statements on the
        # shared connection; the inner context manager commits the insert or
        # rolls it back if it raises.
        with self.lock, self.db_conn:
            self.db_conn.execute(
                "INSERT INTO telemetry_buffer (payload, ingested_at) VALUES (?, ?)",
                (json.dumps(serialized), time.time()),
            )

    def close(self):
        self.write_api.close()
        self.client.close()
        self.db_conn.close()

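The critical parameters are check_same_thread=False and PRAGMA synchronous=NORMAL. Sharing one connection between the ingest and drain threads is only safe when every statement on it runs under the shared RLock. In WAL mode, synchronous=NORMAL avoids an fsync on every insert and still survives a process crash, but a power loss or kernel crash can roll back the most recent commits; switch to synchronous=FULL if the edge host can lose power and those last batches matter. Serializing to Line Protocol rather than pickling Point objects means the buffer replays identically even if the class definition changes between crash and recovery. That guarantee depends on every Point carrying an explicit timestamp: a line without one is stamped with the server's clock at replay time, not at capture time.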
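
SQLite reports the journal mode that is actually in effect, so it is worth checking the result of the PRAGMA instead of assuming WAL was enabled (an in-memory database, for example, reports `memory`). The sketch below does that and round-trips one Line Protocol batch through a temporary buffer file. The helper names `open_buffer` and `round_trip` and the sample line are hypothetical.

```python
import json
import os
import sqlite3
import tempfile
from typing import List


def open_buffer(db_path: str) -> sqlite3.Connection:
    """Open the spool database and fail loudly if WAL could not be enabled."""
    conn = sqlite3.connect(db_path, timeout=10.0, check_same_thread=False)
    # The PRAGMA returns the journal mode now in effect as a one-row result.
    mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"WAL not available, journal mode is {mode!r}")
    conn.execute("PRAGMA synchronous=NORMAL;")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS telemetry_buffer ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, ingested_at REAL NOT NULL)"
    )
    conn.commit()
    return conn


def round_trip(lines: List[str]) -> List[str]:
    """Store one batch in a throwaway buffer file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_buffer(os.path.join(tmp, "buffer.db"))
        try:
            with conn:  # commit the insert
                conn.execute(
                    "INSERT INTO telemetry_buffer (payload, ingested_at) VALUES (?, ?)",
                    (json.dumps(lines), 0.0),
                )
            payload = conn.execute("SELECT payload FROM telemetry_buffer").fetchone()[0]
            return json.loads(payload)
        finally:
            conn.close()
```

Because the payload is plain text, what comes back is byte-for-byte the line that went in, explicit nanosecond timestamp included.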

Buffer sizing and backpressure. Local disk is finite, so cap the buffer and decide what happens when a partition outlasts it. When the buffer crosses a threshold (for example 500 MB or one million rows) you have three levers: drop the lowest-priority telemetry (debug metrics) while preserving critical signals; push backpressure upstream, for example by answering HTTP producers with 429 or by delaying MQTT acknowledgements, so producers slow down; or rotate older rows to secondary storage such as an NVMe scratch volume or compressed Parquet files. Choosing explicitly keeps the edge host stable through prolonged outages instead of letting an unbounded buffer exhaust the disk.
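
The first two levers can be sketched against the buffer table directly. This example assumes a hypothetical integer `priority` column (0 for debug metrics, higher for more critical signals) that the schema shown earlier does not have, and an assumed high-water mark of 80% of the cap.

```python
import sqlite3


def buffered_rows(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT count(*) FROM telemetry_buffer").fetchone()[0]


def should_throttle(conn: sqlite3.Connection, max_rows: int,
                    high_water: float = 0.8) -> bool:
    """True once the buffer passes the high-water mark; the caller can then
    start refusing or delaying producer traffic."""
    return buffered_rows(conn) >= high_water * max_rows


def enforce_cap(conn: sqlite3.Connection, max_rows: int) -> int:
    """Drop the lowest-priority, oldest rows until the buffer fits max_rows.

    Relies on the assumed `priority` column. Returns the number of rows dropped.
    """
    excess = buffered_rows(conn) - max_rows
    if excess <= 0:
        return 0
    with conn:  # one transaction for the whole trim
        conn.execute(
            "DELETE FROM telemetry_buffer WHERE id IN ("
            " SELECT id FROM telemetry_buffer"
            " ORDER BY priority ASC, id ASC LIMIT ?)",
            (excess,),
        )
    return excess
```

Calling `should_throttle` before each insert and `enforce_cap` after it keeps the decision explicit: producers are slowed first, and data is discarded only when the hard cap is reached.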

3. Drain the buffer and reconcile server-side

When the breaker leaves OPEN, drain the buffer in insertion order and delete each batch only after the primary confirms the write, so an interrupted drain resumes cleanly. Probe for health before draining so a still-partitioned primary is not hammered:

python

def drain_buffer(self, batch_size: int = 200):
    """Method of FallbackWriteRouter; call it periodically from the drain thread."""
    if self.state == CircuitState.OPEN:
        # Only move to HALF_OPEN (and drain) once a probe passes;
        # otherwise stay OPEN and keep buffering.
        if _check_health(self.client):
            self._transition_state(CircuitState.HALF_OPEN)
        else:
            return

    # The connection is shared with the ingest thread, so read under the lock.
    with self.lock:
        rows = self.db_conn.execute(
            "SELECT id, payload FROM telemetry_buffer ORDER BY id ASC LIMIT ?",
            (batch_size,),
        ).fetchall()
    if not rows:
        return

    for row_id, payload_json in rows:
        try:
            line_protocol = json.loads(payload_json)
            self.write_api.write(bucket=self.bucket, record=line_protocol)
        except Exception as e:
            # Stop on the first failure; unwritten rows stay buffered for
            # the next cycle. Breaking here preserves strict ordering.
            logging.error(f"Replay failed for batch {row_id}: {e}")
            if self.state == CircuitState.HALF_OPEN:
                # A failed drain reverts the breaker, as described in step 1.
                self._transition_state(CircuitState.OPEN)
            break
        # Delete only after write() returned without raising.
        with self.lock, self.db_conn:
            self.db_conn.execute("DELETE FROM telemetry_buffer WHERE id = ?", (row_id,))

For fleets where thousands of gateways all recover at once, application-level replay can stampede the primary. In that case, have each gateway drain into a fallback_staging bucket and let InfluxDB reconcile server-side on its own schedule. A Flux task moves staged points into the production bucket at a controlled cadence, keeping reconciliation load off the ingest tier:

flux

// Reconcile partitioned telemetry from staging into production.
option task = {name: "replay_fallback_telemetry", every: 5m}

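// range() filters on each point's original timestamp, not on when it was
// staged, so the window must reach back past the longest partition you
// expect to replay. -24h is an assumed bound; a -10m window would skip
// every point captured more than ten minutes before the task runs.
from(bucket: "fallback_staging")
    |> range(start: -24h)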
    |> filter(fn: (r) => r._measurement == "sensor_readings")
    |> drop(columns: ["_start", "_stop"])
    |> to(bucket: "production_telemetry", org: "iot-ops")

This yields a dual-layer model: the Python dispatcher handles immediate partition isolation on the edge, while the server-side task manages historical reconciliation, retention alignment, and downstream tiering. Because InfluxDB writes are keyed by measurement, tag set, and timestamp, replaying an already-written point simply overwrites it with an identical value, so the staging replay is naturally idempotent as long as the original timestamps are preserved.
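
The idempotency argument is easy to check with a toy model of point identity. The dictionary below is only a stand-in for the storage engine, not InfluxDB itself, and the `write_point` helper is hypothetical; it shows why a replay with the original timestamp is harmless while a replay with a new timestamp creates a duplicate.

```python
from typing import Dict, Tuple

# A point is identified by measurement, tag set and timestamp.
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...], int]


def write_point(store: Dict[SeriesKey, Dict[str, float]], measurement: str,
                tags: Dict[str, str], fields: Dict[str, float],
                timestamp_ns: int) -> None:
    """Toy write: the same key lands on the same entry and the fields are
    merged, with the latest value winning."""
    key = (measurement, tuple(sorted(tags.items())), timestamp_ns)
    store.setdefault(key, {}).update(fields)


store: Dict[SeriesKey, Dict[str, float]] = {}
ts = 1_700_000_000_000_000_000
tags = {"gateway": "gw-01"}

write_point(store, "sensor_readings", tags, {"temperature": 21.5}, ts)
write_point(store, "sensor_readings", tags, {"temperature": 21.5}, ts)  # replay
points_after_replay = len(store)  # still one point

# The same reading replayed with a fresh timestamp is a different point.
write_point(store, "sensor_readings", tags, {"temperature": 21.5}, ts + 1)
points_after_restamp = len(store)  # now two points
```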

Gotchas and edge cases

Acknowledging the producer before the buffer commit. If you tell the upstream producer the write succeeded before _buffer_to_disk has committed, a crash in that window loses the batch silently — the producer will never resend it. Acknowledge only after the SQLite transaction commits. WAL mode makes that commit cheap, so there is no throughput reason to acknowledge early.
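
The ordering can be made hard to get wrong by putting the commit and the acknowledgement in one helper. This is a minimal sketch: `spool_then_ack` and its `ack` callback are hypothetical names standing in for whatever acknowledgement your producer protocol uses.

```python
import json
import sqlite3
import time
from typing import Callable, List


def spool_then_ack(conn: sqlite3.Connection, lines: List[str],
                   ack: Callable[[], None]) -> None:
    """Commit the batch to the local buffer, then acknowledge the producer."""
    with conn:  # commits on success, rolls back and re-raises on error
        conn.execute(
            "INSERT INTO telemetry_buffer (payload, ingested_at) VALUES (?, ?)",
            (json.dumps(lines), time.time()),
        )
    # Reached only if the commit succeeded. An exception above skips the
    # ack, so the producer still holds the batch and can resend it.
    ack()
```

If the insert raises, the exception propagates to the caller and `ack` is never invoked, which is the behaviour the producer needs in order to retry.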

Draining without a health gate. Skipping the probe in drain_buffer and replaying blindly during a flapping partition pushes each batch straight back into the failure path, churning the breaker between Half-Open and Open and burning the retry budget. Always confirm health first, and keep the OPEN-state probe cadence slow (10–30 seconds) so transient routing instability does not read as recovery.
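
A slow probe cadence can be enforced with a small gate in front of the drain loop. The sketch below is one possible shape; the `ProbeGate` class, its 15-second default, and the injectable `clock` are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class ProbeGate:
    """Rate-limits recovery probes while the breaker is OPEN."""

    def __init__(self, probe: Callable[[], bool], min_interval_s: float = 15.0,
                 clock: Callable[[], float] = time.monotonic):
        self.probe = probe
        self.min_interval_s = min_interval_s
        self.clock = clock
        self._last_probe: Optional[float] = None

    def may_drain(self) -> bool:
        """True only if a probe was due and it passed. The probe is never
        called more than once per min_interval_s."""
        now = self.clock()
        if self._last_probe is not None and now - self._last_probe < self.min_interval_s:
            return False
        self._last_probe = now
        return self.probe()
```

The drain loop can call `may_drain()` as often as it likes; the primary sees at most one probe per interval, and a drain starts only on a probe that passed.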

Losing ordering on partial drains. Replaying in id order and breaking on the first failure preserves the original write sequence. If you instead delete rows optimistically or drain concurrently across threads, a mid-drain failure can leave older buffered points behind newer ones — harmless for last-value queries but wrong for anything that reads monotonic sequences or computes deltas.
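
The stop-on-first-failure rule can be exercised without a live InfluxDB by swapping the write call for a stand-in. In this sketch `drain_in_order` and its `send` parameter are hypothetical; `send` plays the role of the write to the primary and raises when the link is down.

```python
import json
import sqlite3
from typing import Callable, List


def drain_in_order(conn: sqlite3.Connection,
                   send: Callable[[List[str]], None],
                   batch_size: int = 200) -> int:
    """Replay buffered batches oldest-first and stop at the first failure.

    Returns the number of batches replayed and deleted.
    """
    rows = conn.execute(
        "SELECT id, payload FROM telemetry_buffer ORDER BY id ASC LIMIT ?",
        (batch_size,),
    ).fetchall()
    replayed = 0
    for row_id, payload in rows:
        try:
            send(json.loads(payload))
        except Exception:
            break  # leave this row and everything newer in the buffer
        with conn:  # delete only after send() returned
            conn.execute("DELETE FROM telemetry_buffer WHERE id = ?", (row_id,))
        replayed += 1
    return replayed
```

If `send` fails on the second of three batches, the first is deleted and the other two stay in place, so the next cycle resumes from the failed batch and the primary still receives everything in the original order.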

Buffer TTL shorter than the longest partition. If the local buffer expires rows before the link comes back, a long partition drops points that were never written upstream. Size the buffer lifetime against the longest outage you must survive, not against typical ones. The primary bucket's retention period is the natural upper bound: points older than it are generally rejected on arrival, so holding them locally any longer buys nothing. If the buffer holds sensitive identifiers, encrypt it at rest with SQLCipher or OS-level disk encryption.

Verification

Confirm the buffer drains to empty after a simulated partition, and that no points were lost, by checking the local buffer depth and the run history of the reconciliation task. First, on the edge host:

bash

sqlite3 telemetry_fallback.db \
  "SELECT count(*) AS buffered, min(ingested_at) AS oldest FROM telemetry_buffer;"

A healthy, fully recovered gateway returns 0 buffered rows. Then verify the server-side task is landing staged data and reporting success in the _tasks system bucket:

flux

from(bucket: "_tasks")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "runs")
    |> filter(fn: (r) => r.status != "success")   // any row here is a failed replay
    |> keep(columns: ["_time", "status", "scheduledFor"])

No rows from the second query means every reconciliation run succeeded, and an empty buffer on each gateway means every partitioned write was replayed.
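
The edge-host check is also easy to script for a monitoring agent. The sketch below reads the same two values as the sqlite3 command above; the function name `buffer_status` is hypothetical.

```python
import sqlite3
import time
from typing import Optional, Tuple


def buffer_status(db_path: str) -> Tuple[int, Optional[float]]:
    """Return (buffered rows, age in seconds of the oldest row).

    The age is None when the buffer is empty.
    """
    conn = sqlite3.connect(db_path, timeout=10.0)
    try:
        count, oldest = conn.execute(
            "SELECT count(*), min(ingested_at) FROM telemetry_buffer"
        ).fetchone()
    finally:
        conn.close()
    age = None if oldest is None else time.time() - oldest
    return count, age
```

A recovered gateway reports `(0, None)`. A non-zero count whose age keeps growing suggests the drain loop is stuck or the breaker never left Open.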

Related

Fallback routing and high availability: the parent guide to write continuity and endpoint failover.
Building fallback chains for missing data: the read-side counterpart, substituting values when a source series is absent.
Automating security token rotation for InfluxDB writes: keeping the write token valid across the same edge fleet.
