🛠️ PostgreSQL Replication Troubleshooting Guide: Recovering from Large Replication Lag in a Patroni-Managed Cluster
A comprehensive step-by-step guide to diagnose and resolve replication lag issues in PostgreSQL clusters managed by Patroni
🖥️ Environment Overview
Before diving into the troubleshooting process, let's establish our cluster environment:
Cluster Configuration:
- Cluster Name: postgres-cluster
- Management Tool: Patroni
- PostgreSQL Version: 13+ (recommended)
Node Details:
- Primary Node: postgres-primary (192.168.1.11)
- Replica Node 1: postgres-replica1 (192.168.1.14)
- Replica Node 2: postgres-replica2 (192.168.1.17)
This setup represents a typical high-availability PostgreSQL cluster where Patroni manages automatic failover and cluster coordination.
🔍 Understanding Replication Lag
Replication lag occurs when replica nodes fall behind the primary node in processing Write-Ahead Log (WAL) files. This can happen due to:
- Network latency or connectivity issues
- High write volume on the primary
- Resource constraints on replica nodes
- WAL file accumulation
- Disk I/O bottlenecks
Warning Signs:
- Lag measurements in GB rather than MB
- Queries on replicas returning stale data
- Increased storage usage on primary due to WAL retention
- Application performance degradation
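If you suspect lag, a quick way to quantify it is to query `pg_stat_replication` on the primary. The sketch below assumes the example cluster's primary address and a `postgres` superuser; adjust both for your environment.

```bash
# Byte-level lag per replica, measured on the primary from pg_stat_replication.
# Host and user follow the example cluster in this guide; adjust for your environment.
psql -h 192.168.1.11 -U postgres -x -c "
SELECT application_name,
       client_addr,
       state,
       sync_state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS sent_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;"
```

A healthy replica normally shows lag in the KB-to-low-MB range; values in the GB range match the warning signs above.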
🚨 Step-by-Step Recovery Process
1️⃣ Initial Cluster Health Assessment
Start by checking the overall cluster status from the primary node:
```bash
ssh postgres@postgres-primary
patronictl -c /opt/patroni/patroni_postgres-primary.conf list -e
```
Expected Output Analysis:
```
+ Cluster: postgres-cluster -----+--------------+-----------+----+-----------+
| Member            | Host         | Role         | State     | TL | Lag in MB |
+-------------------+--------------+--------------+-----------+----+-----------+
| postgres-primary  | 192.168.1.11 | Leader       | running   | 25 |           |
| postgres-replica1 | 192.168.1.14 | Replica      | running   | 22 |  11592167 |
| postgres-replica2 | 192.168.1.17 | Sync Standby | streaming | 25 |        18 |
+-------------------+--------------+--------------+-----------+----+-----------+
```
Key Indicators:
- Timeline (TL): Should be consistent across healthy nodes
- Lag in MB: Values over 1000 MB indicate significant issues
- State: "streaming" is healthy, "running" with high lag needs attention
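For scripted checks, patronictl can emit the same table as machine-readable output; recent Patroni releases support `-f json` on `patronictl list`. The jq filter below assumes the JSON keys mirror the column headers shown above.

```bash
# Scripted version of the health check: one tab-separated line per member.
# Assumes patronictl supports "-f json" and that the JSON keys match the table headers above.
patronictl -c /opt/patroni/patroni_postgres-primary.conf list -f json \
  | jq -r '.[] | [.Member, .Role, .State, (."Lag in MB" // 0)] | @tsv'
```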
2️⃣ Replication Slot Investigation
Examine replication slots to understand WAL retention:
```bash
psql -h 192.168.1.11 -U postgres -c "SELECT slot_name, plugin, slot_type, database, active, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;"
```
What to Look For:
- Active slots: Should show 'true' for healthy replicas
- LSN positions: Large gaps indicate replication delays
- Unused slots: May cause WAL bloat and should be dropped
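To see how much WAL each slot is actually pinning on the primary, compare its `restart_lsn` with the current WAL position. This sketch assumes PostgreSQL 13 or later, where the `wal_status` column is available.

```bash
# WAL retained by each replication slot on the primary (PostgreSQL 13+).
psql -h 192.168.1.11 -U postgres -c "
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn;"
```

A slot whose `wal_status` is `lost` can no longer feed its replica; once you have confirmed the replica will be rebuilt, dropping the slot releases the retained WAL.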
3️⃣ Critical Decision Point: Reinitialize the Lagging Replica
When lag exceeds manageable limits (typically >5GB), reinitializing is often the fastest recovery method:
Standard Reinitialize Command:
```bash
patronictl -c /opt/patroni/patroni_postgres-primary.conf reinit postgres-cluster postgres-replica1
```
Force Reinitialize (when WAL files are missing):
```bash
patronictl -c /opt/patroni/patroni_postgres-primary.conf reinit postgres-cluster postgres-replica1 --force
```
⚠️ Important Considerations:
- This process will rebuild the replica from scratch
- The replica will be temporarily unavailable during reconstruction
- All existing WAL files on the replica will be discarded
- Use `--force` only when standard reinit fails due to missing WAL segments
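Before kicking off a reinit, it is worth confirming the replica has enough free space for a full copy of the primary's data directory. The paths below follow the log location used later in this guide (`/var/lib/pgsql/data`); adjust them if your layout differs.

```bash
# Size of the primary's data directory vs. free space on the replica's volume.
ssh postgres@postgres-primary  "du -sh /var/lib/pgsql/data"
ssh postgres@postgres-replica1 "df -h /var/lib/pgsql"
```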
4️⃣ Monitoring Reinitialize Progress
Track the recovery process with periodic status checks:
```bash
# Check every 30 seconds during recovery
watch -n 30 "patronictl -c /opt/patroni/patroni_postgres-primary.conf list -e"
```
Progress Indicators:
- Initial: Replica shows as "stopped" or "stopped(failed)"
- During sync: Status changes to "creating replica"
- Near completion: State becomes "streaming" with reducing lag
- Complete: Lag drops to normal levels (<100 MB)
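Beyond watching the table interactively, it can help to keep a timestamped record so you can confirm the lag is actually shrinking. A minimal sketch, logging the lagging member once a minute:

```bash
# Append a timestamped snapshot of the lagging member to a log file once a minute.
while true; do
  {
    date +'%F %T'
    patronictl -c /opt/patroni/patroni_postgres-primary.conf list -e | grep postgres-replica1
  } >> /tmp/reinit_progress.log
  sleep 60
done
```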
✅ Post-Recovery Validation
5️⃣ Comprehensive Health Checks
5.1 📊 PostgreSQL Log Analysis
Monitor replica logs for recovery success indicators:
```bash
ssh postgres@postgres-replica1
tail -f /var/lib/pgsql/data/log/postgresql-*.log
```
Success Indicators:
- "database system is ready to accept read only connections"
- "started streaming WAL from primary"
- "consistent recovery state reached"
5.2 🔗 Replication Slot Verification
Confirm all replication slots are functioning:
```bash
psql -h 192.168.1.11 -U postgres -c "SELECT slot_name, active, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE active = true;"
```
5.3 🎯 Recovery Mode Confirmation
Verify replica is in proper recovery mode:
```bash
psql -h 192.168.1.14 -U postgres -c "SELECT pg_is_in_recovery();"
```
Expected Result: `t` (true)
5.4 📡 WAL Receiver Status Check
Examine WAL receiver process health:
```bash
# Note: PostgreSQL 13 replaced received_lsn with written_lsn and flushed_lsn in this view.
psql -h 192.168.1.14 -U postgres -x -c "SELECT pid, status, receive_start_lsn, receive_start_tli, written_lsn, flushed_lsn, received_tli, last_msg_send_time, last_msg_receipt_time FROM pg_stat_wal_receiver;"
```
Health Indicators:
- Status: Should be "streaming"
- Recent timestamps: last_msg_receipt_time within last few seconds
- LSN progression: written_lsn and flushed_lsn should be advancing
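As a complementary, time-based check on the replica, compare the last replayed transaction timestamp with the current clock:

```bash
# Wall-clock replay delay and LSN positions, measured on the replica.
psql -h 192.168.1.14 -U postgres -c "
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay,
       pg_last_wal_receive_lsn()               AS receive_lsn,
       pg_last_wal_replay_lsn()                AS replay_lsn;"
```

Keep in mind that `replay_delay` grows on an idle primary even when replication is perfectly healthy, because no new transactions are being shipped.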
🛠️ Troubleshooting Common Issues
Issue 1: WAL Segment Errors
Symptoms: "requested WAL segment has already been removed"
Solution: Use the `--force` flag during reinitialize to bypass missing WAL files
Issue 2: Persistent Lag Growth
Symptoms: Lag continues increasing after reinitialize
Root Causes & Solutions:
- Disk Space: Ensure adequate storage on replica (>20% free)
- Network Issues: Test connectivity with `ping` and `iperf`
- Resource Constraints: Monitor CPU, memory, and I/O with `htop` and `iostat`
- Configuration Issues: Review `wal_keep_size`, `max_wal_senders`, and `max_replication_slots` (see the check below)
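The configuration check mentioned in the last item can be done with one query against the primary:

```bash
# Current values of the replication-related settings referenced above.
psql -h 192.168.1.11 -U postgres -c "
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('wal_keep_size', 'max_wal_senders', 'max_replication_slots', 'max_slot_wal_keep_size');"
```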
Issue 3: Replica Connection Failures
Symptoms: Cannot connect to replica after reinitialize
Solutions:
- Verify PostgreSQL service status: `systemctl status postgresql`
- Check firewall rules: `firewall-cmd --list-all`
- Validate authentication settings in `pg_hba.conf`
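Two quick reachability checks, run from the primary or any admin host. The sketch assumes the default PostgreSQL port (5432) and Patroni's default REST API port (8008); adjust if your cluster uses different ports.

```bash
# Is PostgreSQL on the replica accepting connections?
pg_isready -h 192.168.1.14 -p 5432 -U postgres

# Patroni's REST API also reports the member's state (default port 8008).
curl -s http://192.168.1.14:8008/health
```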
Issue 4: Slow Reinitialize Process
Symptoms: Reinitialize takes hours to complete
Optimizations:
- Increase `max_wal_size` temporarily so catch-up replay is not throttled by frequent checkpoints (the older `checkpoint_segments` parameter was removed in PostgreSQL 9.5)
- Optimize the network between primary and replica
- Consider seeding the replica with `pg_basebackup` over a dedicated, fast link; since `pg_basebackup` copies data over a single connection, a backup tool with parallel workers (such as pgBackRest) may be faster for very large clusters
🛡️ Preventive Measures
Configuration Optimizations
WAL Management:
```sql
-- Increase WAL retention (adjust based on your lag tolerance)
ALTER SYSTEM SET wal_keep_size = '10GB';

-- Prevent excessive WAL accumulation
ALTER SYSTEM SET max_slot_wal_keep_size = '20GB';

-- Optimize WAL writing
ALTER SYSTEM SET wal_buffers = '64MB';
ALTER SYSTEM SET wal_writer_delay = '10ms';
```
Replication Settings:
```sql
-- Increase replication connections
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;

-- Optimize streaming
ALTER SYSTEM SET wal_sender_timeout = '60s';
ALTER SYSTEM SET wal_receiver_timeout = '60s';
```
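In a Patroni-managed cluster you will typically apply such parameters through `patronictl edit-config`, so Patroni keeps them consistent across members. However you apply them, reload the configuration and then check whether any change is still waiting for a restart:

```bash
# Reload the configuration, then list any settings that still require a restart
# (max_wal_senders, max_replication_slots, and wal_buffers are restart-only).
psql -h 192.168.1.11 -U postgres -c "SELECT pg_reload_conf();"
psql -h 192.168.1.11 -U postgres -c "
SELECT name, setting, pending_restart
FROM pg_settings
WHERE pending_restart;"
```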
Monitoring Setup
Essential Metrics to Track:
- Replication lag: Query `pg_stat_replication` regularly
- WAL generation rate: Monitor `pg_stat_wal`
- Disk usage: Track WAL directory growth
- Network latency: Between primary and replicas
Alerting Thresholds:
- Warning: Lag > 100 MB
- Critical: Lag > 1 GB
- Emergency: Lag > 5 GB or replica disconnected
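A minimal sketch of a check that maps the current lag onto these thresholds; the host, user, and plain `echo` notifications are placeholders for your own alerting pipeline.

```bash
#!/usr/bin/env bash
# Map current replication lag (measured on the primary) onto the alert thresholds above.
set -euo pipefail

PRIMARY=192.168.1.11
LAG_BYTES=$(psql -h "$PRIMARY" -U postgres -At -c \
  "SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)::bigint
   FROM pg_stat_replication;")
LAG_MB=$(( LAG_BYTES / 1024 / 1024 ))

if   [ "$LAG_MB" -gt 5120 ]; then echo "EMERGENCY: replication lag ${LAG_MB} MB"
elif [ "$LAG_MB" -gt 1024 ]; then echo "CRITICAL: replication lag ${LAG_MB} MB"
elif [ "$LAG_MB" -gt 100  ]; then echo "WARNING: replication lag ${LAG_MB} MB"
else                              echo "OK: replication lag ${LAG_MB} MB"
fi
```

Note that a fully disconnected replica disappears from `pg_stat_replication`, so a production check should also compare the number of rows against the expected replica count.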
Regular Maintenance Tasks
Weekly Checks:
- Review replication slot usage
- Monitor WAL file accumulation
- Verify replica connectivity
- Check disk space on all nodes
Monthly Reviews:
- Analyze replication performance trends
- Review and optimize PostgreSQL configuration
- Test failover procedures
- Update monitoring dashboards