基础信息:
节点 | 角色 | pg版本 | 节点IP | os版本 | pg安装方式 |
---|---|---|---|---|---|
node1 | primary | 14.12 | 10.204.121.214 | Ubuntu2204 | apt |
node2 | standby | 14.12 | 10.204.121.214 | Ubuntu2204 | apt |
安装pg(所有节点)
sudo apt update
sudo apt install postgresql-14 -y
sudo systemctl status postgresql@14-main.service
sudo -u postgres psql -c "SELECT version();"
安装repmgr(所有节点)
下面安装的是14匹配的repmgr,如果是其他版本把14换成其他版本即可
sudo apt-get install postgresql-14-repmgr -y
配置主库node1
修改配置文件 /etc/postgresql/14/main/postgresql.conf
sed -i '/cluster_name/s/14\/main/node1/' /etc/postgresql/14/main/postgresql.conf
cat <<EOF >> /etc/postgresql/14/main/postgresql.conf
listen_addresses = '*'
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'hot_standby'
hot_standby = on
archive_mode = on # repmgr 本身不需要 WAL 文件归档。
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'
wal_log_hints = 'on'
EOF
- 创建 repmgr 用户
su -l postgres -c "createuser -s repmgr"
su -l postgres -c "createdb repmgr -O repmgr"
- 配置 /etc/postgresql/14/main/pg_hba.conf
这个hba文件是为了配置repmgr用户无密码访问repmgr数据库的,所以上面创建用户的时候没有设置密码,trust就是这个意思,还有其他的配置项可以看配置文件里的解释
sed -i '/# DO NOT DISABLE!/i \
local replication repmgr trust \
host replication repmgr 127.0.0.1/32 trust \
host replication repmgr 0.0.0.0/0 trust \
local repmgr repmgr trust \
host repmgr repmgr 127.0.0.1/32 trust \
host repmgr repmgr 0.0.0.0/0 trust' /etc/postgresql/14/main/pg_hba.conf
- 配置/etc/postgresql/14/main/repmgr.conf,内容如下
cat <<EOF > /etc/postgresql/14/main/repmgr.conf
node_id=1
node_name='${PRIMARY}'
conninfo='host=${PRIMARY} user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/14/main'
# 日志管理
log_level='INFO'
log_file='/tmp/repmgr.log'
# failover设置
failover='automatic'
promote_command='/usr/bin/repmgr standby promote -f /etc/postgresql/14/main/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/postgresql/14/main/repmgr.conf --log-to-file --upstream-node-id=%n'
# 用于repmgr启停pg
service_start_command = 'sudo pg_ctlcluster 14 main start'
service_stop_command = 'sudo pg_ctlcluster 14 main stop'
service_restart_command = 'sudo pg_ctlcluster 14 main restart'
service_reload_command = 'sudo pg_ctlcluster 14 main reload'
EOF
上面的变量按实际情况去修改
- 配置后重启数据库
sudo systemctl restart postgresql@14-main.service
- 注册主库
# 注册主节点
su -l postgres -c "repmgr -f /etc/postgresql/14/main/repmgr.conf primary register"
# 查看注册后情况,如下说明已注册成功
su -l postgres -c "repmgr -f /etc/postgresql/14/main/repmgr.conf cluster show"
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=10.204.121.218 user=repmgr dbname=repmgr connect_timeout=2 password=test2024
配置从库node2
- 修改配置文件 /etc/postgresql/14/main/postgresql.conf
sed -i '/cluster_name/s/14\/main/node2/' /etc/postgresql/14/main/postgresql.conf
cat <<EOF >> /etc/postgresql/14/main/postgresql.conf
listen_addresses = '*'
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'hot_standby'
hot_standby = on
archive_mode = on # repmgr 本身不需要 WAL 文件归档。
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'
wal_log_hints = 'on'
EOF
- 配置/etc/postgresql/14/main/repmgr.conf,内容如下
cat <<EOF > /etc/postgresql/14/main/repmgr.conf
node_id=2
node_name='${STANDBY}'
conninfo='host=${STANDBY} user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/14/main'
# 日志管理
log_level='INFO'
log_file='/tmp/repmgr.log'
# failover设置
failover='automatic'
promote_command='/usr/bin/repmgr standby promote -f /etc/postgresql/14/main/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/postgresql/14/main/repmgr.conf --log-to-file --upstream-node-id=%n'
# 用于repmgr启停pg
service_start_command = 'sudo pg_ctlcluster 14 main start'
service_stop_command = 'sudo pg_ctlcluster 14 main stop'
service_restart_command = 'sudo pg_ctlcluster 14 main restart'
service_reload_command = 'sudo pg_ctlcluster 14 main reload'
EOF
- 配置 /etc/postgresql/14/main/pg_hba.conf
# 添加repmgr无密码登录pg
sed -i '/# DO NOT DISABLE!/i \
local replication repmgr trust \
host replication repmgr 127.0.0.1/32 trust \
host replication repmgr 0.0.0.0/0 trust \
local repmgr repmgr trust \
host repmgr repmgr 127.0.0.1/32 trust \
host repmgr repmgr 0.0.0.0/0 trust' /etc/postgresql/14/main/pg_hba.conf
- 停止pg 删除从库数据
systemctl stop postgresql@14-main.service
rm -rf /var/lib/postgresql/14/main/*
- 注册并启动从库
# 测试克隆数据,-h 是主节点的IP地址
su -l postgres -c "repmgr -h $primaryIP -U repmgr -d repmgr -f /etc/postgresql/14/main/repmgr.conf standby clone --dry-run"
# 正常的话,执行clone
su -l postgres -c "repmgr -h $primaryIP -U repmgr -d repmgr -f /etc/postgresql/14/main/repmgr.conf standby clone"
# 启动pg
systemctl start postgresql@14-main.service
# 注册从库
su -l postgres -c "repmgr -f /etc/postgresql/14/main/repmgr.conf standby register"
# 查看注册好的从库,目前主从都建立完成
su -l postgres -c "repmgr -f /etc/postgresql/14/main/repmgr.conf cluster show"
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=10.204.121.218 user=repmgr dbname=repmgr connect_timeout=2 password=test2024
2 | node2 | standby | running | node1 | default | 100 | 1 | host=10.204.121.219 user=repmgr dbname=repmgr connect_timeout=2 password=test2024
到此已经完成了pg的主从搭建,目前也可以故障转移,但是需要手动操作,后面接着部署repmgr自动故障切换
创建repmgrd(所有节点)
repmgrd 作为运行在集群中每个节点上的一个管理和监控的守护程序,可以自动进行故障转移和维护复制关系,并提供有关每个节点状态的监控信息。
- 修改repmgrd service 配置文件,新的,如下
cat <<EOF > /etc/default/repmgrd
REPMGRD_ENABLED=yes
REPMGRD_CONF="/etc/postgresql/14/main/repmgr.conf"
REPMGRD_OPTS="--daemonize=false"
EOF
- 重启repmgrd
# 启动
systemctl restart repmgrd
- 查看repmgrd在集群的状态,显示running说明正常。到这里就可以自动故障转移了
su - postgres -c 'repmgr -f /etc/postgresql/14/main/repmgr.conf service status'
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+-------------+---------+---------+--------------------
1 | node1 | primary | * running | | running | 1541134 | no | n/a
2 | node2 | standby | running | node1 | running | n/a | n/a | n/a
repmgr缺点:主节点宕机恢复后,需要人工手动加入到集群
主节点宕机恢复
重新加入之前的主节点为当前的从节点,可以使用repmgr node rejoin 命令,官方链接: repmgr-node-rejoin但是测试没有成功,所以用重新clone的方式恢复
systemctl stop postgresql@14-main.service
rm -rf /var/lib/postgresql/14/main/*
# $primaryIP 替换成当前主节点IP
su -l postgres -c "repmgr -h $primaryIP -U repmgr -d repmgr -f /etc/postgresql/14/main/repmgr.conf standby clone --dry-run"
# 上一步正常执行后,执行clone
su -l postgres -c "repmgr -h 10.204.121.219 -U repmgr -d repmgr -f /etc/postgresql/14/main/repmgr.conf standby clone"
systemctl start postgresql@14-main.service
su -l postgres -c "repmgr -f /etc/postgresql/14/main/repmgr.conf standby register --force"
# 查看集群状态是否已同步
su -l postgres -c "repmgr -f /etc/postgresql/14/main/repmgr.conf cluster show"
上面是一步一步的部署测试方法,本人又写了个自动部署脚本,用来自动安装,链接如下:
https://gitee.com/znblog/db-ha/tree/master/postgres