🛠️ PostgreSQL Replication Troubleshooting Guide: Recovering from Large Replication Lag in a Patroni-Managed Cluster
A comprehensive step-by-step guide to diagnose and resolve replication lag issues in PostgreSQL clusters managed by Patroni
🖥️ Environment Overview
Before diving into the troubleshooting process, let's establish our cluster environment:
Cluster Configuration:
- Cluster Name: postgres-cluster
- Management Tool: Patroni
- PostgreSQL Version: 13+ (recommended)
Node Details:
- Primary Node: postgres-primary (192.168.1.11)
- Replica Node 1: postgres-replica1 (192.168.1.14)
- Replica Node 2: postgres-replica2 (192.168.1.17)
This setup represents a typical high-availability PostgreSQL cluster where Patroni manages automatic failover and cluster coordination.
🔍 Understanding Replication Lag
Replication lag occurs when replica nodes fall behind the primary node in processing Write-Ahead Log (WAL) files. This can happen due to:
- Network latency or connectivity issues
- High write volume on the primary
- Resource constraints on replica nodes
- WAL file accumulation
- Disk I/O bottlenecks
Warning Signs:
- Lag measurements in GB rather than MB
- Queries on replicas returning stale data
- Increased storage usage on primary due to WAL retention
- Application performance degradation
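If you suspect lag, a quick way to quantify it is to query `pg_stat_replication` on the primary. The sketch below assumes the example cluster's primary address and a `postgres` superuser; adjust both for your environment.

```bash
# Byte-level lag per replica, measured on the primary from pg_stat_replication.
# Host and user follow the example cluster in this guide; adjust for your environment.
psql -h 192.168.1.11 -U postgres -x -c "
SELECT application_name,
       client_addr,
       state,
       sync_state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS sent_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;"
```

A healthy replica normally shows lag in the KB-to-low-MB range; values in the GB range match the warning signs above.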
🚨 Step-by-Step Recovery Process
1️⃣ Initial Cluster Health Assessment
Start by checking the overall cluster status from the primary node:
```bash
ssh postgres@postgres-primary
patronictl -c /opt/patroni/patroni_postgres-primary.conf list -e
```
Expected Output Analysis:
```
+ Cluster: postgres-cluster -----+--------------+-----------+----+-----------+
| Member            | Host         | Role         | State     | TL | Lag in MB |
+-------------------+--------------+--------------+-----------+----+-----------+
| postgres-primary  | 192.168.1.11 | Leader       | running   | 25 |           |
| postgres-replica1 | 192.168.1.14 | Replica      | running   | 22 |  11592167 |
| postgres-replica2 | 192.168.1.17 | Sync Standby | streaming | 25 |        18 |
+-------------------+--------------+--------------+-----------+----+-----------+
```
Key Indicators:
- Timeline (TL): Should be consistent across healthy nodes
- Lag in MB: Values over 1000 MB indicate significant issues
- State: "streaming" is healthy, "running" with high lag needs attention
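For scripted checks, patronictl can emit the same table as machine-readable output; recent Patroni releases support `-f json` on `patronictl list`. The jq filter below assumes the JSON keys mirror the column headers shown above.

```bash
# Scripted version of the health check: one tab-separated line per member.
# Assumes patronictl supports "-f json" and that the JSON keys match the table headers above.
patronictl -c /opt/patroni/patroni_postgres-primary.conf list -f json \
  | jq -r '.[] | [.Member, .Role, .State, (."Lag in MB" // 0)] | @tsv'
```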
2️⃣ Replication Slot Investigation
Examine replication slots to understand WAL retention:
```bash
psql -h 192.168.1.11 -U postgres -c "SELECT slot_name, plugin, slot_type, database, active, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;"
```
What to Look For:
- Active slots: Should show 'true' for healthy replicas
- LSN positions: Large gaps indicate replication delays
- Unused slots: May cause WAL bloat and should be dropped
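To see how much WAL each slot is actually pinning on the primary, compare its `restart_lsn` with the current WAL position. This sketch assumes PostgreSQL 13 or later, where the `wal_status` column is available.

```bash
# WAL retained by each replication slot on the primary (PostgreSQL 13+).
psql -h 192.168.1.11 -U postgres -c "
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn;"
```

A slot whose `wal_status` is `lost` can no longer feed its replica; once you have confirmed the replica will be rebuilt, dropping the slot releases the retained WAL.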
3️⃣ Critical Decision Point: Reinitialize the Lagging Replica
When lag exceeds manageable limits (typically >5GB), reinitializing is often the fastest recovery method:
Standard Reinitialize Command:
```bash
patronictl -c /opt/patroni/patroni_postgres-primary.conf reinit postgres-cluster postgres-replica1
```
Force Reinitialize (when WAL files are missing):
```bash
patronictl -c /opt/patroni/patroni_postgres-primary.conf reinit postgres-cluster postgres-replica1 --force
```
⚠️ Important Considerations:
- This process will rebuild the replica from scratch
- The replica will be temporarily unavailable during reconstruction
- All existing WAL files on the replica will be discarded
- Use `--force` only when standard reinit fails due to missing WAL segments
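Before kicking off a reinit, it is worth confirming the replica has enough free space for a full copy of the primary's data directory. The paths below follow the log location used later in this guide (`/var/lib/pgsql/data`); adjust them if your layout differs.

```bash
# Size of the primary's data directory vs. free space on the replica's volume.
ssh postgres@postgres-primary  "du -sh /var/lib/pgsql/data"
ssh postgres@postgres-replica1 "df -h /var/lib/pgsql"
```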
4️⃣ Monitoring Reinitialize Progress
Track the recovery process with periodic status checks:
```bash
# Check every 30 seconds during recovery
watch -n 30 "patronictl -c /opt/patroni/patroni_postgres-primary.conf list -e"
```
Progress Indicators:
- Initial: Replica shows as "stopped" or "stopped(failed)"
- During sync: Status changes to "creating replica"
- Near completion: State becomes "streaming" with reducing lag
- Complete: Lag drops to normal levels (<100 MB)
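Beyond watching the table interactively, it can help to keep a timestamped record so you can confirm the lag is actually shrinking. A minimal sketch, logging the lagging member once a minute:

```bash
# Append a timestamped snapshot of the lagging member to a log file once a minute.
while true; do
  {
    date +'%F %T'
    patronictl -c /opt/patroni/patroni_postgres-primary.conf list -e | grep postgres-replica1
  } >> /tmp/reinit_progress.log
  sleep 60
done
```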
✅ Post-Recovery Validation
5️⃣ Comprehensive Health Checks
5.1 📊 PostgreSQL Log Analysis
Monitor replica logs for recovery success indicators:
```bash
ssh postgres@postgres-replica1
tail -f /var/lib/pgsql/data/log/postgresql-*.log
```
Success Indicators:
- "database system is ready to accept read only connections"
- "started streaming WAL from primary"
- "consistent recovery state reached"
5.2 🔗 Replication Slot Verification
Confirm all replication slots are functioning:
```bash
psql -h 192.168.1.11 -U postgres -c "SELECT slot_name, active, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE active = true;"
```
5.3 🎯 Recovery Mode Confirmation
Verify replica is in proper recovery mode:
```bash
psql -h 192.168.1.14 -U postgres -c "SELECT pg_is_in_recovery();"
```
Expected Result: `t` (true)
5.4 📡 WAL Receiver Status Check
Examine WAL receiver process health:
```bash
# Note: PostgreSQL 13 replaced received_lsn with written_lsn and flushed_lsn in this view.
psql -h 192.168.1.14 -U postgres -x -c "SELECT pid, status, receive_start_lsn, receive_start_tli, written_lsn, flushed_lsn, received_tli, last_msg_send_time, last_msg_receipt_time FROM pg_stat_wal_receiver;"
```
Health Indicators:
- Status: Should be "streaming"
- Recent timestamps: last_msg_receipt_time within last few seconds
- LSN progression: written_lsn and flushed_lsn should be advancing
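As a complementary, time-based check on the replica, compare the last replayed transaction timestamp with the current clock:

```bash
# Wall-clock replay delay and LSN positions, measured on the replica.
psql -h 192.168.1.14 -U postgres -c "
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay,
       pg_last_wal_receive_lsn()               AS receive_lsn,
       pg_last_wal_replay_lsn()                AS replay_lsn;"
```

Keep in mind that `replay_delay` grows on an idle primary even when replication is perfectly healthy, because no new transactions are being shipped.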
🛠️ Troubleshooting Common Issues
Issue 1: WAL Segment Errors
Symptoms: "requested WAL segment has already been removed"
Solution: Use the `--force` flag during reinitialize to bypass missing WAL files
Issue 2: Persistent Lag Growth
Symptoms: Lag continues increasing after reinitialize
Root Causes & Solutions:
- Disk Space: Ensure adequate storage on replica (>20% free)
- Network Issues: Test connectivity with `ping` and `iperf`
- Resource Constraints: Monitor CPU, memory, and I/O with `htop` and `iostat`
- Configuration Issues: Review `wal_keep_size`, `max_wal_senders`, and `max_replication_slots` (see the check below)
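The configuration check mentioned in the last item can be done with one query against the primary:

```bash
# Current values of the replication-related settings referenced above.
psql -h 192.168.1.11 -U postgres -c "
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('wal_keep_size', 'max_wal_senders', 'max_replication_slots', 'max_slot_wal_keep_size');"
```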
Issue 3: Replica Connection Failures
Symptoms: Cannot connect to replica after reinitialize
Solutions:
- Verify PostgreSQL service status: `systemctl status postgresql`
- Check firewall rules: `firewall-cmd --list-all`
- Validate authentication settings in `pg_hba.conf`
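Two quick reachability checks, run from the primary or any admin host. The sketch assumes the default PostgreSQL port (5432) and Patroni's default REST API port (8008); adjust if your cluster uses different ports.

```bash
# Is PostgreSQL on the replica accepting connections?
pg_isready -h 192.168.1.14 -p 5432 -U postgres

# Patroni's REST API also reports the member's state (default port 8008).
curl -s http://192.168.1.14:8008/health
```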
Issue 4: Slow Reinitialize Process
Symptoms: Reinitialize takes hours to complete
Optimizations:
- Increase `max_wal_size` temporarily so catch-up replay is not throttled by frequent checkpoints (the older `checkpoint_segments` parameter was removed in PostgreSQL 9.5)
- Optimize the network between primary and replica
- Consider seeding the replica with `pg_basebackup` over a dedicated, fast link; since `pg_basebackup` copies data over a single connection, a backup tool with parallel workers (such as pgBackRest) may be faster for very large clusters
🛡️ Preventive Measures
Configuration Optimizations
WAL Management:
```sql
-- Increase WAL retention (adjust based on your lag tolerance)
ALTER SYSTEM SET wal_keep_size = '10GB';

-- Prevent excessive WAL accumulation
ALTER SYSTEM SET max_slot_wal_keep_size = '20GB';

-- Optimize WAL writing
ALTER SYSTEM SET wal_buffers = '64MB';
ALTER SYSTEM SET wal_writer_delay = '10ms';
```
Replication Settings:
```sql
-- Increase replication connections
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;

-- Optimize streaming
ALTER SYSTEM SET wal_sender_timeout = '60s';
ALTER SYSTEM SET wal_receiver_timeout = '60s';
```
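In a Patroni-managed cluster you will typically apply such parameters through `patronictl edit-config`, so Patroni keeps them consistent across members. However you apply them, reload the configuration and then check whether any change is still waiting for a restart:

```bash
# Reload the configuration, then list any settings that still require a restart
# (max_wal_senders, max_replication_slots, and wal_buffers are restart-only).
psql -h 192.168.1.11 -U postgres -c "SELECT pg_reload_conf();"
psql -h 192.168.1.11 -U postgres -c "
SELECT name, setting, pending_restart
FROM pg_settings
WHERE pending_restart;"
```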
Monitoring Setup
Essential Metrics to Track:
- Replication lag: Query `pg_stat_replication` regularly
- WAL generation rate: Monitor `pg_stat_wal`
- Disk usage: Track WAL directory growth
- Network latency: Between primary and replicas
Alerting Thresholds:
- Warning: Lag > 100 MB
- Critical: Lag > 1 GB
- Emergency: Lag > 5 GB or replica disconnected
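A minimal sketch of a check that maps the current lag onto these thresholds; the host, user, and plain `echo` notifications are placeholders for your own alerting pipeline.

```bash
#!/usr/bin/env bash
# Map current replication lag (measured on the primary) onto the alert thresholds above.
set -euo pipefail

PRIMARY=192.168.1.11
LAG_BYTES=$(psql -h "$PRIMARY" -U postgres -At -c \
  "SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)::bigint
   FROM pg_stat_replication;")
LAG_MB=$(( LAG_BYTES / 1024 / 1024 ))

if   [ "$LAG_MB" -gt 5120 ]; then echo "EMERGENCY: replication lag ${LAG_MB} MB"
elif [ "$LAG_MB" -gt 1024 ]; then echo "CRITICAL: replication lag ${LAG_MB} MB"
elif [ "$LAG_MB" -gt 100  ]; then echo "WARNING: replication lag ${LAG_MB} MB"
else                              echo "OK: replication lag ${LAG_MB} MB"
fi
```

Note that a fully disconnected replica disappears from `pg_stat_replication`, so a production check should also compare the number of rows against the expected replica count.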
Regular Maintenance Tasks
Weekly Checks:
- Review replication slot usage
- Monitor WAL file accumulation
- Verify replica connectivity
- Check disk space on all nodes
Monthly Reviews:
- Analyze replication performance trends
- Review and optimize PostgreSQL configuration
- Test failover procedures
- Update monitoring dashboards