
πŸ› ️ PostgreSQL Replication Troubleshooting Guide: Recovering from Large Replication Lag in a Patroni-Managed Cluster

 

A comprehensive step-by-step guide to diagnose and resolve replication lag issues in PostgreSQL clusters managed by Patroni

πŸ–₯️ Environment Overview

Before diving into the troubleshooting process, let's establish our cluster environment:

Cluster Configuration:

  • Cluster Name: postgres-cluster
  • Management Tool: Patroni
  • PostgreSQL Version: 13+ (recommended)

Node Details:

  • Primary Node: postgres-primary (192.168.1.11)
  • Replica Node 1: postgres-replica1 (192.168.1.14)
  • Replica Node 2: postgres-replica2 (192.168.1.17)

This setup represents a typical high-availability PostgreSQL cluster where Patroni manages automatic failover and cluster coordination.

πŸ” Understanding Replication Lag

Replication lag occurs when replica nodes fall behind the primary node in processing Write-Ahead Log (WAL) files. This can happen due to:

  • Network latency or connectivity issues
  • High write volume on the primary
  • Resource constraints on replica nodes
  • WAL file accumulation
  • Disk I/O bottlenecks

Warning Signs:

  • Lag measurements in GB rather than MB
  • Queries on replicas returning stale data
  • Increased storage usage on primary due to WAL retention
  • Application performance degradation
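
If any of these signs appear, the quickest way to quantify lag is on the primary itself. A query along these lines (columns as exposed by `pg_stat_replication` on PostgreSQL 13) reports per-replica lag in bytes and as elapsed time:

```sql
-- Run on the primary: one row per connected standby
SELECT application_name,
       client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;
```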

🚨 Step-by-Step Recovery Process

1️⃣ Initial Cluster Health Assessment

Start by checking the overall cluster status from the primary node:

```bash
ssh postgres@postgres-primary
patronictl -c /opt/patroni/patroni_postgres-primary.conf list -e
```

Expected Output Analysis:

```
+ Cluster: postgres-cluster ------+--------------+-----------+----+-----------+
| Member            | Host         | Role         | State     | TL | Lag in MB |
+-------------------+--------------+--------------+-----------+----+-----------+
| postgres-primary  | 192.168.1.11 | Leader       | running   | 25 |           |
| postgres-replica1 | 192.168.1.14 | Replica      | running   | 22 |  11592167 |
| postgres-replica2 | 192.168.1.17 | Sync Standby | streaming | 25 |        18 |
+-------------------+--------------+--------------+-----------+----+-----------+
```

Key Indicators:

  • Timeline (TL): Should be consistent across healthy nodes
  • Lag in MB: Values over 1000 MB indicate significant issues
  • State: "streaming" is healthy, "running" with high lag needs attention
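
For unattended monitoring, the same check can be scripted against `patronictl list --format tsv`. This is a hypothetical helper (the TSV column order — Cluster, Member, Host, Role, State, TL, Lag in MB — is assumed from the table above); a captured sample stands in for the live command:

```bash
# Flag any member whose lag exceeds 1000 MB.
# Live input would come from:
#   patronictl -c /opt/patroni/patroni_postgres-primary.conf list --format tsv
sample_tsv=$'postgres-cluster\tpostgres-replica1\t192.168.1.14\tReplica\trunning\t22\t11592167\npostgres-cluster\tpostgres-replica2\t192.168.1.17\tSync Standby\tstreaming\t25\t18'
printf '%s\n' "$sample_tsv" |
  awk -F'\t' '$7 > 1000 { print $2 " lagging " $7 " MB" }'
# → postgres-replica1 lagging 11592167 MB
```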

2️⃣ Replication Slot Investigation

Examine replication slots to understand WAL retention:

```bash
psql -h 192.168.1.11 -U postgres -c "SELECT slot_name, plugin, slot_type, database, active, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;"
```

What to Look For:

  • Active slots: Should show 'true' for healthy replicas
  • LSN positions: Large gaps indicate replication delays
  • Unused slots: May cause WAL bloat and should be dropped
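
To make "large gaps" concrete: an LSN like `19/3A2F4B80` is a 64-bit WAL position written as two hex words, so the gap between two LSNs is plain arithmetic. This is a minimal sketch of what `pg_wal_lsn_diff()` computes server-side, with made-up example LSNs:

```bash
# Convert an X/Y hex LSN to an absolute byte position:
# high word * 2^32 + low word.
lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))
}

primary_lsn="19/3A2F4B80"   # e.g. pg_current_wal_lsn() on the primary
replica_lsn="19/3A2F0000"   # e.g. restart_lsn of the replica's slot
echo "gap: $(( $(lsn_to_bytes "$primary_lsn") - $(lsn_to_bytes "$replica_lsn") )) bytes"
# → gap: 19328 bytes
```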

3️⃣ Critical Decision Point: Reinitialize the Lagging Replica

When lag exceeds manageable limits (typically >5GB), reinitializing is often the fastest recovery method:

Standard Reinitialize Command:

```bash
patronictl -c /opt/patroni/patroni_postgres-primary.conf reinit postgres-cluster postgres-replica1
```

Non-Interactive Reinitialize (skips the confirmation prompt, e.g. in scripts):

```bash
patronictl -c /opt/patroni/patroni_postgres-primary.conf reinit postgres-cluster postgres-replica1 --force
```

⚠️ Important Considerations:

  • This process will rebuild the replica from scratch
  • The replica will be temporarily unavailable during reconstruction
  • All existing WAL files on the replica will be discarded
  • The --force flag only skips the interactive confirmation prompt; the rebuild itself is identical to a standard reinit

4️⃣ Monitoring Reinitialize Progress

Track the recovery process with periodic status checks:

```bash
# Check every 30 seconds during recovery
watch -n 30 "patronictl -c /opt/patroni/patroni_postgres-primary.conf list -e"
```

Progress Indicators:

  • Initial: Replica shows as "stopped" or "stopped(failed)"
  • During sync: Status changes to "creating replica"
  • Near completion: State becomes "streaming" with reducing lag
  • Complete: Lag drops to normal levels (<100 MB)

✅ Post-Recovery Validation

5️⃣ Comprehensive Health Checks

5.1 πŸ“Š PostgreSQL Log Analysis

Monitor replica logs for recovery success indicators:

```bash
ssh postgres@postgres-replica1
tail -f /var/lib/pgsql/data/log/postgresql-*.log
```

Success Indicators:

  • "database system is ready to accept read only connections"
  • "started streaming WAL from primary"
  • "consistent recovery state reached"
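
Rather than eyeballing `tail -f`, the milestones can be counted with a single grep. Shown here against a captured sample (message wording varies slightly across PostgreSQL versions, hence the `read[ -]only` alternation); in practice point it at the live log files:

```bash
sample_log='LOG:  consistent recovery state reached at 19/3A2F4B80
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 19/3A000000 on timeline 25'
printf '%s\n' "$sample_log" |
  grep -cE 'ready to accept read[ -]only connections|started streaming WAL|consistent recovery state reached'
# prints 3 (all three milestones present)
```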

5.2 πŸ”— Replication Slot Verification

Confirm all replication slots are functioning:

```bash
psql -h 192.168.1.11 -U postgres -c "SELECT slot_name, active, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE active = true;"
```

5.3 🎯 Recovery Mode Confirmation

Verify replica is in proper recovery mode:

```bash
psql -h 192.168.1.14 -U postgres -c "SELECT pg_is_in_recovery();"
```

Expected Result: t (true)

5.4 πŸ“‘ WAL Receiver Status Check

Examine WAL receiver process health:

```bash
psql -h 192.168.1.14 -U postgres -x -c "SELECT pid, status, receive_start_lsn, receive_start_tli, received_lsn, received_tli, last_msg_send_time, last_msg_receipt_time FROM pg_stat_wal_receiver;"
```

Health Indicators:

  • Status: Should be "streaming"
  • Recent timestamps: last_msg_receipt_time within last few seconds
  • LSN progression: received_lsn should be advancing
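
These three indicators can be checked in one pass on the replica; a query along these lines turns the raw timestamps and LSNs into an age and a replay backlog:

```sql
-- Run on the replica: receiver staleness and unapplied-WAL backlog
SELECT status,
       now() - last_msg_receipt_time AS receipt_age,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_backlog_bytes
FROM pg_stat_wal_receiver;
```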

πŸ› ️ Troubleshooting Common Issues

Issue 1: WAL Segment Errors

Symptoms: "requested WAL segment has already been removed" in the replica logs

Solution: Reinitialize the replica. Once the primary has recycled the WAL segments the replica still needs, streaming replication cannot resume and a full rebuild is the only way forward.

Issue 2: Persistent Lag Growth

Symptoms: Lag continues increasing after reinitialize

Root Causes & Solutions:

  • Disk Space: Ensure adequate storage on replica (>20% free)
  • Network Issues: Test connectivity with ping and iperf
  • Resource Constraints: Monitor CPU, memory, and I/O with htop, iostat
  • Configuration Issues: Review wal_keep_size, max_wal_senders, max_replication_slots
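
To review those parameters without hunting through config files, query `pg_settings` directly:

```sql
-- Current values of the replication-related settings mentioned above
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('wal_keep_size', 'max_slot_wal_keep_size',
               'max_wal_senders', 'max_replication_slots');
```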

Issue 3: Replica Connection Failures

Symptoms: Cannot connect to replica after reinitialize

Solutions:

  • Verify the Patroni service status: systemctl status patroni (in this setup Patroni, not systemd, starts and stops PostgreSQL)
  • Check firewall rules: firewall-cmd --list-all
  • Validate authentication settings in pg_hba.conf
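
For reference, entries along these lines in pg_hba.conf would cover both replication and client traffic for this cluster's subnet (the role name "replicator" and the auth method are assumptions; adjust to your environment):

```
# Hypothetical pg_hba.conf entries for the 192.168.1.0/24 subnet
host  replication  replicator  192.168.1.0/24  scram-sha-256
host  all          all         192.168.1.0/24  scram-sha-256
```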

Issue 4: Slow Reinitialize Process

Symptoms: Reinitialize takes hours to complete

Optimizations:

  • Reduce checkpoint pressure during the copy by temporarily raising max_wal_size (checkpoint_segments was removed in PostgreSQL 9.5)
  • Optimize network throughput between primary and replica
  • Consider a parallel-capable backup tool such as pgBackRest as Patroni's replica creation method (pg_basebackup itself copies single-threaded)

πŸ›‘️ Preventive Measures

Configuration Optimizations

WAL Management:

```sql
-- Increase WAL retention (adjust based on your lag tolerance)
ALTER SYSTEM SET wal_keep_size = '10GB';

-- Prevent excessive WAL accumulation
ALTER SYSTEM SET max_slot_wal_keep_size = '20GB';

-- Optimize WAL writing
ALTER SYSTEM SET wal_buffers = '64MB';
ALTER SYSTEM SET wal_writer_delay = '10ms';
```

Replication Settings:

```sql
-- Increase replication connections
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;

-- Optimize streaming
ALTER SYSTEM SET wal_sender_timeout = '60s';
ALTER SYSTEM SET wal_receiver_timeout = '60s';
```
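
Note that in a Patroni-managed cluster, Patroni owns postgresql.conf, so to keep these parameters consistent across all members (and surviving a reinit) it is common to apply them through the DCS with patronictl edit-config rather than ALTER SYSTEM. A sketch of the corresponding configuration fragment, using the same values as above:

```yaml
# Applied via: patronictl -c /opt/patroni/patroni_postgres-primary.conf edit-config
postgresql:
  parameters:
    wal_keep_size: 10GB
    max_slot_wal_keep_size: 20GB
    wal_buffers: 64MB
    max_wal_senders: 10
    max_replication_slots: 10
```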

Monitoring Setup

Essential Metrics to Track:

  • Replication lag: Query pg_stat_replication regularly
  • WAL generation rate: Monitor pg_stat_wal
  • Disk usage: Track WAL directory growth
  • Network latency: Between primary and replicas

Alerting Thresholds:

  • Warning: Lag > 100 MB
  • Critical: Lag > 1 GB
  • Emergency: Lag > 5 GB or replica disconnected
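
As a minimal sketch, the thresholds above translate into a shell helper like this (lag given in MB; a hypothetical building block for an alerting script):

```bash
lag_severity() {
  if   [ "$1" -gt 5120 ]; then echo emergency   # > 5 GB
  elif [ "$1" -gt 1024 ]; then echo critical    # > 1 GB
  elif [ "$1" -gt 100  ]; then echo warning     # > 100 MB
  else echo ok
  fi
}

lag_severity 18         # → ok
lag_severity 11592167   # → emergency
```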

Regular Maintenance Tasks

Weekly Checks:

  • Review replication slot usage
  • Monitor WAL file accumulation
  • Verify replica connectivity
  • Check disk space on all nodes

Monthly Reviews:

  • Analyze replication performance trends
  • Review and optimize PostgreSQL configuration
  • Test failover procedures
  • Update monitoring dashboards
