
🛠️ PostgreSQL Replication Troubleshooting Guide: Recovering from Large Replication Lag in a Patroni-Managed Cluster

 

A comprehensive step-by-step guide to diagnose and resolve replication lag issues in PostgreSQL clusters managed by Patroni

🖥️ Environment Overview

Before diving into the troubleshooting process, let's establish our cluster environment:

Cluster Configuration:

  • Cluster Name: postgres-cluster
  • Management Tool: Patroni
  • PostgreSQL Version: 13+ (recommended)

Node Details:

  • Primary Node: postgres-primary (192.168.1.11)
  • Replica Node 1: postgres-replica1 (192.168.1.14)
  • Replica Node 2: postgres-replica2 (192.168.1.17)

This setup represents a typical high-availability PostgreSQL cluster where Patroni manages automatic failover and cluster coordination.

🔍 Understanding Replication Lag

Replication lag occurs when replica nodes fall behind the primary node in processing Write-Ahead Log (WAL) files. This can happen due to:

  • Network latency or connectivity issues
  • High write volume on the primary
  • Resource constraints on replica nodes
  • WAL file accumulation
  • Disk I/O bottlenecks

Warning Signs:

  • Lag measurements in GB rather than MB
  • Queries on replicas returning stale data
  • Increased storage usage on primary due to WAL retention
  • Application performance degradation
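
You can quantify the lag directly before reaching for patronictl by querying pg_stat_replication on the primary. The sketch below assumes the primary address and postgres superuser used throughout this guide; pg_wal_lsn_diff() turns the LSN gap into bytes.

bash
# Byte-level replication lag per connected replica, measured on the primary
psql -h 192.168.1.11 -U postgres -c "SELECT application_name, client_addr, state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes FROM pg_stat_replication;"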

🚨 Step-by-Step Recovery Process

1️⃣ Initial Cluster Health Assessment

Start by checking the overall cluster status from the primary node:

bash
ssh postgres@postgres-primary
patronictl -c /opt/patroni/patroni_postgres-primary.conf list -e

Expected Output Analysis:

+ Cluster: postgres-cluster -------+--------------+-----------+----+-----------+
| Member            | Host         | Role         | State     | TL | Lag in MB |
+-------------------+--------------+--------------+-----------+----+-----------+
| postgres-primary  | 192.168.1.11 | Leader       | running   | 25 |           |
| postgres-replica1 | 192.168.1.14 | Replica      | running   | 22 |  11592167 |
| postgres-replica2 | 192.168.1.17 | Sync Standby | streaming | 25 |        18 |
+-------------------+--------------+--------------+-----------+----+-----------+

Key Indicators:

  • Timeline (TL): Should be consistent across healthy nodes
  • Lag in MB: Values over 1000 MB indicate significant issues
  • State: "streaming" is healthy, "running" with high lag needs attention
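
The lower timeline on postgres-replica1 (22 vs 25) also indicates it has not yet replayed through the most recent promotions. To see when and why the cluster switched timelines, patronictl keeps a failover history; a quick check, reusing the same configuration file as before:

bash
# Show the cluster's timeline/failover history
patronictl -c /opt/patroni/patroni_postgres-primary.conf history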

2️⃣ Replication Slot Investigation

Examine replication slots to understand WAL retention:

bash
psql -h 192.168.1.11 -U postgres -c "SELECT slot_name, plugin, slot_type, database, active, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;"

What to Look For:

  • Active slots: Should show 'true' for healthy replicas
  • LSN positions: Large gaps indicate replication delays
  • Unused slots: May cause WAL bloat and should be dropped
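
If a slot shows active = false and no longer belongs to any standby, dropping it stops it from pinning WAL on the primary. A minimal sketch, using a hypothetical leftover slot name; only drop slots you are certain are orphaned, since Patroni recreates slots for members it still manages.

bash
# Drop a hypothetical orphaned slot so it stops retaining WAL (verify it is unused first)
psql -h 192.168.1.11 -U postgres -c "SELECT pg_drop_replication_slot('old_replica_slot');"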

3️⃣ Critical Decision Point: Reinitialize the Lagging Replica

When lag exceeds manageable limits (typically >5GB), reinitializing is often the fastest recovery method:

Standard Reinitialize Command:

bash
patronictl -c /opt/patroni/patroni_postgres-primary.conf reinit postgres-cluster postgres-replica1

Non-Interactive Reinitialize (skip the confirmation prompt):

bash
patronictl -c /opt/patroni/patroni_postgres-primary.conf reinit postgres-cluster postgres-replica1 --force

⚠️ Important Considerations:

  • This process will rebuild the replica from scratch
  • The replica will be temporarily unavailable during reconstruction
  • All existing WAL files on the replica will be discarded
  • Use --force to skip the interactive confirmation prompt, for example when running the reinit from a script (a quick pre-flight check is sketched after this list)
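
Before launching the rebuild, it is worth confirming the replica has room for a full copy of the primary's data directory. A rough pre-flight check, assuming the same PGDATA path that appears in the log locations later in this guide:

bash
# Size of the data directory on the primary vs. free space on the replica (paths are assumptions)
ssh postgres@postgres-primary "du -sh /var/lib/pgsql/data"
ssh postgres@postgres-replica1 "df -h /var/lib/pgsql"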

4️⃣ Monitoring Reinitialize Progress

Track the recovery process with periodic status checks:

bash
# Check every 30 seconds during recovery
watch -n 30 "patronictl -c /opt/patroni/patroni_postgres-primary.conf list -e"

Progress Indicators:

  • Initial: Replica shows as "stopped" or "stopped(failed)"
  • During sync: Status changes to "creating replica"
  • Near completion: State becomes "streaming" with reducing lag
  • Complete: Lag drops to normal levels (<100 MB)
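
Once the member reports "streaming" again, you can also confirm on the replica itself that WAL is being received and replayed. A sketch using the standard recovery functions, assuming the replica1 address from this guide:

bash
# On the rebuilt replica: last WAL position received vs. last position replayed
psql -h 192.168.1.14 -U postgres -c "SELECT pg_last_wal_receive_lsn() AS received, pg_last_wal_replay_lsn() AS replayed, pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_gap_bytes;"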

✅ Post-Recovery Validation

5️⃣ Comprehensive Health Checks

5.1 📊 PostgreSQL Log Analysis

Monitor replica logs for recovery success indicators:

bash
ssh postgres@postgres-replica1
tail -f /var/lib/pgsql/data/log/postgresql-*.log

Success Indicators:

  • "database system is ready to accept read only connections"
  • "started streaming WAL from primary"
  • "consistent recovery state reached"
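
Rather than watching the full log scroll past, you can grep for these specific messages; a small sketch assuming the same log location as above:

bash
# Search recent replica logs for the recovery milestones listed above
grep -E "ready to accept read only connections|started streaming WAL|consistent recovery state reached" /var/lib/pgsql/data/log/postgresql-*.log | tail -n 20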

5.2 🔗 Replication Slot Verification

Confirm all replication slots are functioning:

bash
psql -h 192.168.1.11 -U postgres -c "SELECT slot_name, active, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE active = true;"

5.3 🎯 Recovery Mode Confirmation

Verify replica is in proper recovery mode:

bash
psql -h 192.168.1.14 -U postgres -c "SELECT pg_is_in_recovery();"

Expected Result: t (true)

5.4 📡 WAL Receiver Status Check

Examine WAL receiver process health:

bash
psql -h 192.168.1.14 -U postgres -x -c "SELECT pid, status, receive_start_lsn, receive_start_tli, written_lsn, flushed_lsn, received_tli, last_msg_send_time, last_msg_receipt_time FROM pg_stat_wal_receiver;"

Health Indicators:

  • Status: Should be "streaming"
  • Recent timestamps: last_msg_receipt_time within last few seconds
  • LSN progression: written_lsn and flushed_lsn should be advancing
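
LSN positions express the gap in bytes; to express it in time, the replica can report when it last replayed a transaction. A sketch, again assuming the replica1 address (note that this value also grows when the primary is simply idle, so read it together with the LSN checks):

bash
# Approximate replay delay in seconds, measured on the replica
psql -h 192.168.1.14 -U postgres -c "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0) AS replay_delay_seconds;"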

🛠️ Troubleshooting Common Issues

Issue 1: WAL Segment Errors

Symptoms: Replica logs report "requested WAL segment has already been removed"

Solution: Reinitialize the replica (Step 3); the rebuild takes a fresh base backup, so the replica no longer depends on WAL segments the primary has already recycled. Raising wal_keep_size or max_slot_wal_keep_size (see Preventive Measures) reduces the chance of a recurrence.

Issue 2: Persistent Lag Growth

Symptoms: Lag continues increasing after reinitialize

Root Causes & Solutions (spot checks are sketched after this list):

  • Disk Space: Ensure adequate storage on replica (>20% free)
  • Network Issues: Test connectivity with ping and iperf
  • Resource Constraints: Monitor CPU, memory, and I/O with htop, iostat
  • Configuration Issues: Review wal_keep_size, max_wal_senders, max_replication_slots
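
A few quick commands cover most of these root causes; the checks below are a sketch (they assume iperf3 is installed and that an iperf3 server has been started on the primary with iperf3 -s):

bash
# Disk space and I/O pressure on the lagging replica
ssh postgres@postgres-replica1 "df -h /var/lib/pgsql && iostat -x 5 3"

# Raw network throughput from replica to primary (primary must be running 'iperf3 -s')
ssh postgres@postgres-replica1 "iperf3 -c 192.168.1.11 -t 10"

# Current values of the settings that influence WAL retention and streaming
psql -h 192.168.1.11 -U postgres -c "SELECT name, setting, unit FROM pg_settings WHERE name IN ('wal_keep_size', 'max_slot_wal_keep_size', 'max_wal_senders', 'max_replication_slots');"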

Issue 3: Replica Connection Failures

Symptoms: Cannot connect to replica after reinitialize

Solutions:

  • Verify the Patroni service, which manages the PostgreSQL process: systemctl status patroni
  • Check firewall rules: firewall-cmd --list-all
  • Validate authentication settings in pg_hba.conf

Issue 4: Slow Reinitialize Process

Symptoms: Reinitialize takes hours to complete

Optimizations:

  • Temporarily raise max_wal_size on the primary to reduce checkpoint pressure during the copy (checkpoint_segments was removed in PostgreSQL 9.5)
  • Optimize the network path between primary and replica
  • Consider a backup tool with parallel transfer support (for example pgBackRest) as Patroni's replica creation method; pg_basebackup itself copies over a single connection (a manual example is sketched below)
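
If you ever need to rebuild a replica by hand rather than through patronictl, a plain pg_basebackup looks roughly like the sketch below. The replicator role name is an assumption, and Patroni should be stopped on the node first so it does not compete for the data directory.

bash
# Manual base backup from the primary into an empty data directory on the replica
# 'replicator' is a hypothetical replication role; -X stream keeps WAL flowing during the copy
pg_basebackup -h 192.168.1.11 -U replicator -D /var/lib/pgsql/data -X stream -c fast -P -R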

🛡️ Preventive Measures

Configuration Optimizations

WAL Management:

sql
-- Increase WAL retention (adjust based on your lag tolerance)
ALTER SYSTEM SET wal_keep_size = '10GB';

-- Prevent excessive WAL accumulation
ALTER SYSTEM SET max_slot_wal_keep_size = '20GB';

-- Optimize WAL writing
ALTER SYSTEM SET wal_buffers = '64MB';
ALTER SYSTEM SET wal_writer_delay = '10ms';

Replication Settings:

sql
-- Increase replication connections
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;

-- Optimize streaming
ALTER SYSTEM SET wal_sender_timeout = '60s';
ALTER SYSTEM SET wal_receiver_timeout = '60s';
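
One Patroni-specific caveat: Patroni renders postgresql.conf from its own configuration, so settings you want applied cluster-wide are usually better managed through the dynamic configuration than through ALTER SYSTEM. A sketch of both routes, reusing the configuration file path from earlier; note that some of the parameters above (wal_buffers, max_wal_senders, max_replication_slots) only take effect after a PostgreSQL restart.

bash
# Preferred in a Patroni-managed cluster: edit the dynamic configuration stored in the DCS
patronictl -c /opt/patroni/patroni_postgres-primary.conf edit-config

# If you do use ALTER SYSTEM, reload so that reloadable parameters take effect
psql -h 192.168.1.11 -U postgres -c "SELECT pg_reload_conf();"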

Monitoring Setup

Essential Metrics to Track:

  • Replication lag: Query pg_stat_replication regularly
  • WAL generation rate: Monitor pg_stat_wal (available in PostgreSQL 14 and later)
  • Disk usage: Track WAL directory growth
  • Network latency: Between primary and replicas

Alerting Thresholds:

  • Warning: Lag > 100 MB
  • Critical: Lag > 1 GB
  • Emergency: Lag > 5 GB or replica disconnected

Regular Maintenance Tasks

Weekly Checks:

  • Review replication slot usage
  • Monitor WAL file accumulation
  • Verify replica connectivity
  • Check disk space on all nodes

Monthly Reviews:

  • Analyze replication performance trends
  • Review and optimize PostgreSQL configuration
  • Test failover procedures
  • Update monitoring dashboards
