一、
前言
postgresql 的yum部署其实还是有点东西的,本文就做一个小小的记录,高可用方面repmgr插件还是非常不错的,但如何部署以及部署后如何使用也是一个难点,因此,也在本文里做一个记录
环境介绍:
第一台服务器:操作系统是centos7,内核版本是3.10,IP地址是192.168.123.17 VMware虚拟机,内存4G,cpu4核,计划在此服务器上安装postgresql12.4版本以及repmgr高可用插件。该服务器是主节点
第二台服务器:操作系统是centos7,内核版本是3.10,IP地址是192.168.123.20 VMware虚拟机,内存4G,cpu4核,计划在此服务器上安装postgresql12.4版本以及repmgr高可用插件。该服务器是从节点
二、
详细的yum安装postgreslq数据库
1,配置yum源
yum安装postgresql比较简单,源配置官方源就可用了,命令如下(2台服务器都执行):
yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
为了防止某些库安装的时候提示缺少依赖,增加阿里云的库和阿里云的epel库
wget -O /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo
wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
增加不常用的rpm包依赖仓库:
[centos-sclo-sclo]
name=CentOS-7 - SCLo sclo
baseurl=https://mirrors.tuna.tsinghua.edu.cn/centos/7/sclo/$basearch/sclo/
gpgcheck=0
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo
[centos-sclo-rh]
name=CentOS-7 - SCLo rh
baseurl=https://mirrors.tuna.tsinghua.edu.cn/centos/7/sclo/$basearch/rh/
gpgcheck=0
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo
2,
安装postgresql的组件和postgresql的server服务(2台服务器都执行)
组件比较多,主要是lib库(postgresql运行的依赖库),插件扩展库(postgresql的插件库),开发库(一些postgresql的特殊组件需要使用的库),连接库(开发微服务使用的库)
后面的几个看自己的需求安装不,比如postgresql12-plperl
版本号根据自己需求来,我这里是12.4, 不想安这个低版本的,修改12.4就可以了,比如全部的12.4修改为12.8
yum install postgresql12-server-12.4 postgresql12-libs-12.4 postgresql12-devel-12.4 postgresql12-contrib-12.4 postgresql12-tcl postgresql12-odbc postgresql12-12.4 postgresql12-plperl-12.4 postgresql12-pltcl-12.4 postgresql12-plpython-12.4 postgresql12-plpython3-12.4 postgresql12-llvmjit-12.4 postgresql12-devel-12.4 -y
3,
postgresql的环境变量定义和一些特殊处理(两台服务器都执行)
在/etc/profile 文件末尾添加如下变量,计划postgresql安装在/data/pgsql/data目录下
PGDATA=/data/pgsql/data
export PGDATA
PGHOME=/data/pgsql
export PGHOME
PATH=$PATH:$PGHOME/bin:$PGDATA:/usr/pgsql-12/bin/
export PATH PGHOME PGDATA
写入完成后,激活变量:
source /etc/profile
切换用户到普通用户postgres,将该用户的环境变量文件全部注释掉,结果如下:
-bash-4.2$ whoami
postgres
-bash-4.2$ cat ~/.bash_profile
#[ -f /etc/profile ] && source /etc/profile
#PGDATA=/var/lib/pgsql/12/data
#export PGDATA
# If you want to customize your settings,
# Use the file below. This is not overridden
# by the RPMS.
#[ -f /var/lib/pgsql/.pgsql_profile ] && source /var/lib/pgsql/.pgsql_profile
注释的原因是我们不希望使用默认的/var/lib/12目录存放数据库的数据文件,注销普通用户再次su - postgres 重新激活变量后,在普通用户下,可以看到postgresql数据库相关变量已经固定了:
[root@centos10 media]# su - postgres
Last login: Sat Mar 30 05:19:48 CST 2024 on pts/0
-bash-4.2$ echo $PGDATA
/data/pgsql/data
4,创建相关目录并赋予普通用户的权限,准备初始化工作(root用户执行,两台服务器都执行):
mkdir -p /data/pgsql/
chown -Rf postgres. /data/pgsql
5,执行数据库初始化(仍然是切换到普通用户postgres):
由于前面的准备工作比较充分,环境变量已经完全配置好了,因此,执行initdb就可以完成初始化了,/data/pgsql/data/目录以及其下的文件将在初始化的时候创建好
初始化日志如下:
[root@centos7 media]# su - postgres
Last login: Sat Mar 30 05:25:23 CST 2024 on pts/0
-bash-4.2$ initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory /data/pgsql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Asia/Shanghai
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
pg_ctl -D /data/pgsql/data -l logfile start
6,
在root用户下管理postgresql的启停
由于是yum安装的,因此,启停脚本是已经写好的,可以直接用,但我们数据库的数据文件存放路径修改了,因此,需要先修改一下启停脚本:
/usr/lib/systemd/system/postgresql-12.service
Environment=PGDATA=/var/lib/pgsql/12/data/
修改为
Environment=PGDATA=/data/pgsql/data/
这个启停脚本名字太长了,修改一下名字 就叫pg12吧:
mv /usr/lib/systemd/system/postgresql-12.service /usr/lib/systemd/system/pg12.service
启动postgresql数据库并查看进行确认是否启动成功:
systemctl enable pg12 && systemctl start pg12
查看postgreslq的进程:
[root@centos7 media]# ps aux |grep postgre
postgres 16161 0.2 0.4 397292 17424 ? Ss 05:51 0:00 /usr/pgsql-12/bin/postmaster -D /data/pgsql/data/
postgres 16164 0.0 0.0 251676 2008 ? Ss 05:51 0:00 postgres: logger
postgres 16166 0.0 0.0 397292 2060 ? Ss 05:51 0:00 postgres: checkpointer
postgres 16167 0.0 0.0 397292 2288 ? Ss 05:51 0:00 postgres: background writer
postgres 16168 0.0 0.0 397292 2064 ? Ss 05:51 0:00 postgres: walwriter
postgres 16169 0.0 0.0 397844 3212 ? Ss 05:51 0:00 postgres: autovacuum launcher
postgres 16170 0.0 0.0 251672 2120 ? Ss 05:51 0:00 postgres: stats collector
postgres 16171 0.0 0.0 397844 2808 ? Ss 05:51 0:00 postgres: logical replication launcher
root 16173 0.0 0.0 112712 964 pts/0 S+ 05:51 0:00 grep --color=auto postgre
7,
远程连接和postgres用户密码设置
修改主配置文件 /data/pgsql/data/postgresql.conf
listen_addresses = '*'
port = 15433
max_connections = 1000
wal_level = logical
这里是设置数据库连接端口是15433
修改安全连接文件pg_hba.conf:
末尾添加如下两行,这两行的意思是复制用户replication可以随意连接,但其它用户远程连接的时候需要密码,本地登录的时候不校验密码
host replication all 0.0.0.0/0 trust
host all all 0.0.0.0/0 md5
8,
本地登录数据库设置密码
su - postgres -c "psql -p15433"
alter user postgres with password '123456';
此时,数据库算是完全安装完毕,连接端口15433,连接密码123456 后面就不在重复这些信息了
随意安装一个插件,看看lib库什么的是否正常:
[root@centos10 ~]# su - postgres -c "psql -p15433"
psql (12.4)
Type "help" for help.
postgres=# create extension
adminpack btree_gin dblink file_fdw hstore_plperlu insert_username jsonb_plperl jsonb_plpythonu ltree_plpython3u pg_buffercache pgrowlocks pg_visibility plpython3u postgres_fdw tablefunc unaccent
amcheck btree_gist dict_int fuzzystrmatch hstore_plpython2u intagg jsonb_plperlu lo ltree_plpythonu pgcrypto pg_stat_statements plperl plpythonu refint tcn "uuid-ossp"
autoinc citext dict_xsyn hstore hstore_plpython3u intarray jsonb_plpython2u ltree moddatetime pg_freespacemap pgstattuple plperlu pltcl seg tsm_system_rows xml2
bloom cube earthdistance hstore_plperl hstore_plpythonu isn jsonb_plpython3u ltree_plpython2u pageinspect pg_prewarm pg_trgm plpython2u pltclu sslinfo tsm_system_time
postgres=# create extension pg_trgm ;
CREATE EXTENSION
二,
repmgr的基本安装部署和配置
repmgr的介绍:
Repmgr是2ndQuadrant(第二象限公司)开发的一款复制的开源工具套件,用于管理PostgreSQL服务器集群中的复制和故障转移。
最初,它主要是为了简化流副本的管理,后来发展成为一个完整的故障转移管理套件。它通过设置备用服务器,监视复制以及执行管理任务(如故障转移或手动切换操作)的工具,增强了PostgreSQL内置的热备份功能。
其中各组件功能如下:
No.1 Repmgrd 守护进程
它主动监视复制集群中的服务器并执行以下任务:
1)监控和记录集群复制性能;
2)通过检测主服务器故障并提升最合适的备用服务器来执行故障转移;
3)将有关群集中事件的通知提供给用户定义的脚本,该脚本可以执行诸如通过电子邮件发送警报等任务;
4)repmgrd 根据本地数据库角色不同,其功能也不同:
-
主库:repmgrd仅监控本地数据库,负责自动恢复、同异步切换;
-
备库:repmgrd监控本地数据库和主数据库,负责自动切换、复制槽删除。
No.2 Repmgr命令管理
用于执行管理任务的命令行工具,主要有以下方面作用:
1)设置备用服务器;
2)将备用服务器升级为主服务器;
3)切换主服务器和备用服务器;
4)显示复制群集中的服务器状态。
No.3 用户和元数据 为了有效地管理复制集群,repmgr提供专用数据库存储和管理有关repmgr集群服务的相关信息。
此模式在 部署repmgr服务时,由repmgr扩展自动创建,该扩展在初始化repmgr -administered集群(repmgr主寄存器) 的第一步中安装,包含以下对象:
-
表:
repmgr.events:记录感兴趣的事件;
repmgr.nodes:复制群集中每个服务器的连接和状态信息 ;
repmgr.monitoring_history:repmgrd写入的历史备用监视信息。
-
视图
repmgr.show_nodes:基于表repmgr.nodes,另外显示服务器上游节点的名称。
repmgr.replication_status:启用repmgrd的监视时,显示每个备用数据库的当前监视状态。
repmgr元数据模式可以存储在现有的数据库或在自己的专用数据库。
请注意,repmgr元数据模式不能驻留在不属于repmgr管理的复制集群的数据库服务器上。数据库用户必须可供repmgr访问此数据库并执行必要的更改。此用户不需要是超级用户,但是某些操作(如初始安装repmgr扩展)将需要超级用户连接(可以使用命令 行选项--superuser在需要时指定 )。
1、
计划repmgr安装方式为yum,安装命令如下:
yum install repmgr_12-5.4.1 repmgr_12-devel-5.4.1 repmgr_12-llvmjit-5.4.1 -y
2、
主数据库装载repmgr插件
修改主数据库17的主配置文件,data/pgsql/data/postgresql.conf,修改完毕后重启服务器
主要是定义数据库的端口,以及归档命令
listen_addresses = '*'
port = 15433
max_connections = 1000
shared_buffers = 128MB
dynamic_shared_memory_type = posix
wal_level = logical
max_wal_size = 1GB
min_wal_size = 80MB
archive_mode = on
archive_command = 'test ! -f /data/pgsql/arclog/%f && cp %p /data/pgsql/arclog/%f' #
log_destination = 'stderr'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_line_prefix = '%m [%p] '
log_timezone = 'Asia/Shanghai'
datestyle = 'iso, mdy'
timezone = 'Asia/Shanghai'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
shared_preload_libraries = 'repmgr'
wal_log_hints = on
DETAIL: pg_rewind requires "wal_log_hints" to be enabled
根据以上配置,创建目录/data/pgsql/arclog并赋予postgres这个用户权限,最后重启服务器
mkdir -p /data/pgsql/arclog
chown -Rf postgres. /data/pgsql/arclog/
systemctl restart pg12
ntpdate ntp.aliyun.com
3、
主节点17repmgr的配置文件 /etc/repmgr/12/repmgr.conf 增加如下内容:
node_id=1
node_name='pg1'
conninfo='host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456'
data_directory='/data/pgsql/data'
replication_user='repmgr'
replication_type='physical'
#location='pg1'
use_replication_slots=true
witness_sync_interval=15
log_level='DEBUG'
log_facility='STDERR'
log_file='/var/log/repmgr/pg2.log'
pg_bindir='/usr/pgsql-12/bin/'
ssh_options='-q -o ConnectTimeout=10'
promote_check_timeout=60
promote_check_interval=1
node_rejoin_timeout=60
failover='automatic'
#priority=100
connection_check_type=ping
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
service_start_command = '/usr/bin/systemctl start pg12'
service_stop_command = '/usr/bin/systemctl stop pg12'
service_restart_command = '/usr/bin/systemctl restart pg12'
service_reload_command = '/usr/bin/systemctl reload pg12'
service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'
repmgrd_service_start_command = '/usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose'
repmgrd_service_stop_command = '/usr/bin/kill `cat /run/repmgr/repmgrd-12.pid`'
4、
根据以上配置文件,数据库添加repmgr用户
[root@centos7 ~]# su - postgres -c 'psql -p15433'
psql (12.4)
Type "help" for help.
postgres=# create user repmgr with password '123456';
CREATE ROLE
postgres=# create database repmgr;
CREATE DATABASE
ALTER USER repmgr WITH REPLICATION;
5、
切换普通用户,利用repmgr注册主节点
[root@centos7 ~]# su - postgres
Last login: Sat Mar 30 23:08:48 CST 2024 on pts/0
-bash-4.2$ repmgr primary register
INFO: connecting to primary database...
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
fe_sendauth: no password supplied
DETAIL: attempted to connect using:
user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=
失败了,原因是没有传递密码变量,执行变量命令,并重新注册,注册的时候临时给repmgr用户超级权限,注册成功
export PGPASSWORD="123456"
-bash-4.2$ repmgr primary register -S postgres
WARNING: following problems with command line parameters detected:
--superuser ignored when executing PRIMARY REGISTER
INFO: connecting to primary database...
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: attempting to install extension "repmgr"
DEBUG: established superuser connection as "postgres"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered
可以看到,扩展插件repmgr启用了,具体路径为repmgr数据库下的名为repmgr的schema下
命令行也可以佐证确实是安装了插件repmgr:
repmgr=# set search_path to repmgr ;
SET
repmgr=# \dx
List of installed extensions
Name | Version | Schema | Description
---------+---------+------------+------------------------------------
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
repmgr | 5.4 | repmgr | Replication manager for PostgreSQL
(2 rows)
如果觉得注册的不对,可以重新注册,重新注册命令如下:
-bash-4.2$ repmgr primary unregister -S postgres --force
WARNING: following problems with command line parameters detected:
--superuser ignored when executing PRIMARY UNREGISTER
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: node "pg1" (ID: 1) was successfully unregistered
看到 successfully后,再次执行注册就可以,查看repmgr集群的状态:
-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------
1 | pg1 | primary | * running | | pg1 | 100 | 1 | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr
6、
从节点20的repmgr的配置文件
node_id=2
node_name='pg2'
conninfo='host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456'
data_directory='/data/pgsql/data'
replication_user='repmgr'
replication_type='physical'
#location='pg2'
use_replication_slots=true
witness_sync_interval=15
log_level='DEBUG'
log_facility='STDERR'
log_file='/var/log/repmgr/pg2.log'
pg_bindir='/usr/pgsql-12/bin/'
ssh_options='-q -o ConnectTimeout=10'
promote_check_timeout=60
promote_check_interval=1
node_rejoin_timeout=60
failover='automatic'
#priority=120
connection_check_type=ping
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
service_start_command = '/usr/bin/systemctl start pg12'
service_stop_command = '/usr/bin/systemctl stop pg12'
service_restart_command = '/usr/bin/systemctl restart pg12'
service_reload_command = '/usr/bin/systemctl reload pg12'
service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'
repmgrd_service_start_command = '/usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose'
repmgrd_service_stop_command = '/usr/bin/kill `cat /run/repmgr/repmgrd-12.pid`'
7、
从节点利用流复制搭建主从集群,通过repmgr命令来执行(postgres这个普通用户执行)
repmgr standby clone -h 192.168.123.17 -p15433 -d repmgr --force -U repmgr -c
HINT: this may take some time; consider using the -c/--fast-checkpoint option
日志如下:
NOTICE: destination directory "/data/pgsql/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.123.17 port=15433 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
DEBUG: 1 node records returned by source node
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: upstream_node_id determined as 1
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
WARNING: directory "/data/pgsql/data" exists but is not empty
NOTICE: -F/--force provided - deleting existing data directory "/data/pgsql/data"
DEBUG: replication slot "repmgr_slot_2" exists but is inactive; reusing
NOTICE: starting backup (using pg_basebackup)...
INFO: executing:
/usr/pgsql-12/bin/pg_basebackup -l "repmgr base backup" -D /data/pgsql/data -h 192.168.123.17 -p 15433 -U repmgr -c fast -X stream -S repmgr_slot_2
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
could not identify current directory: No such file or directory
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: /usr/bin/systemctl start pg12
HINT: after starting the server, you need to register this standby with "repmgr standby register"
命令执行完成后,就可以启动从节点了:
systemctl start pg12
查看进程可以看到从节点正常复制了:
[root@centos10 ~]# ps aux |grep postgre
root 1637 0.0 0.0 191784 2336 pts/0 S Mar30 0:00 su - postgres
postgres 1638 0.0 0.0 115444 2064 pts/0 S+ Mar30 0:00 -bash
postgres 1744 0.3 1.1 450504 46412 ? Ss 00:05 0:00 /usr/pgsql-12/bin/postmaster -D /data/pgsql/data/
postgres 1745 0.0 0.0 260664 2084 ? Ss 00:05 0:00 postgres: logger
postgres 1746 0.0 0.0 450708 3100 ? Ss 00:05 0:00 postgres: startup recovering 000000010000000000000013
postgres 1747 0.0 0.0 450504 2112 ? Ss 00:05 0:00 postgres: checkpointer
postgres 1748 0.0 0.0 450504 2120 ? Ss 00:05 0:00 postgres: background writer
postgres 1749 0.0 0.0 262784 2008 ? Ss 00:05 0:00 postgres: stats collector
postgres 1750 0.4 0.1 457520 4700 ? Ss 00:05 0:00 postgres: walreceiver streaming 0/130007C8
root 1752 0.0 0.0 112712 964 pts/1 S+ 00:05 0:00 grep --color=auto postgre
主节点正常发送wal日志了:
[root@centos7 ~]# ps aux |grep postgres
postgres 2205 0.0 1.1 450504 46464 ? Ss 00:01 0:00 /usr/pgsql-12/bin/postmaster -D /data/pgsql/data/
postgres 2207 0.0 0.0 260664 2084 ? Ss 00:01 0:00 postgres: logger
postgres 2209 0.0 0.1 450908 4180 ? Ss 00:01 0:00 postgres: checkpointer
postgres 2210 0.0 0.0 450764 3916 ? Ss 00:01 0:00 postgres: background writer
postgres 2211 0.0 0.1 450504 6296 ? Ss 00:01 0:00 postgres: walwriter
postgres 2212 0.0 0.0 451784 3352 ? Ss 00:01 0:00 postgres: autovacuum launcher
postgres 2213 0.0 0.0 262784 2196 ? Ss 00:01 0:00 postgres: archiver last was 000000010000000000000012.00000028.backup
postgres 2214 0.0 0.0 262940 2352 ? Ss 00:01 0:00 postgres: stats collector
postgres 2215 0.0 0.0 451624 2856 ? Ss 00:01 0:00 postgres: logical replication launcher
root 2216 0.0 0.0 191784 2336 pts/0 S 00:01 0:00 su - postgres
postgres 2217 0.0 0.0 115444 2004 pts/0 S+ 00:01 0:00 -bash
postgres 2260 0.0 0.0 451812 3848 ? Ss 00:05 0:00 postgres: walsender repmgr 192.168.123.20(38408) streaming 0/130008B0
postgres 2265 0.0 0.1 452464 7044 ? Ss 00:07 0:00 postgres: postgres postgres 192.168.123.1(56048) idle
root 2294 0.0 0.0 112712 960 pts/1 S+ 00:10 0:00 grep --color=auto postgres
8、
注册从节点
-bash-4.2$ repmgr standby register
INFO: connecting to local node "pg2" (ID: 2)
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: connecting to primary database
DEBUG: connecting to: "user=repmgr dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
INFO: standby registration complete
NOTICE: standby node "pg2" (ID: 2) successfully registered
9、
查看集群状态
-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
1 | pg1 | primary | * running | | default | 100 | 1 | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
2 | pg2 | standby | running | pg1 | default | 100 | 1 | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456
三、
自动故障转移和测试
如果需要启用自动故障转移,那么,我们需要开启repmgrd服务,每个节点都开启
开启比较简单,一条命令就可以了
systemctl start repmgr-12
服务启动完毕后,可以查看进程是否正确:
[root@centos7 ~]# ps aux |grep repm
postgres 5915 0.2 0.2 453488 9716 ? Ss 09:28 0:00 postgres: repmgr repmgr 192.168.123.17(35544) idle
postgres 5917 0.0 0.0 84276 2160 ? S 09:28 0:00 /usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose
root 5925 0.0 0.0 112712 960 pts/3 S+ 09:29 0:00 grep --color=auto repm
注意,这里问题来了,该服务是依赖于postgresql数据库的进程的,数据库必须是在线的才可以启停repmgrd服务
那么,很多人会疑惑,这个服务有什么用?其实很简单,就是故障转移用的,这些是关键配置,在发生宕机的情况下,这些命令会自动的启停数据库,例如,主节点宕机,那么,从节点的该服务将会在规定时间内检查主节点,如果一直找不到主,将会把自己提升为主,先执行这个命令:
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
然后执行这个命令:
service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'
特别注意:这些命令全部需要绝对路径,否则会报找不到命令,配置文件修改后,重启repmgrd服务就可以生效
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
service_start_command = '/usr/bin/systemctl start pg12'
service_stop_command = '/usr/bin/systemctl stop pg12'
service_restart_command = '/usr/bin/systemctl restart pg12'
service_reload_command = '/usr/bin/systemctl reload pg12'
service_promote_command = '/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data'
repmgrd_service_start_command = '/usr/pgsql-12/bin/repmgrd -f /etc/repmgr/12/repmgr.conf -p /run/repmgr/repmgrd-12.pid -d --verbose'
repmgrd_service_stop_command = '/usr/bin/kill `cat /run/repmgr/repmgrd-12.pid`'
下面是将主节点数据库服务关闭后,从节点的repmgrd服务打印的日志/var/log/repmgr/pg2.log
[2024-03-31 09:13:29] [INFO] 2 total nodes registered
[2024-03-31 09:13:29] [INFO] primary node "pg1" (ID: 1) and this node have the same location ("default")
[2024-03-31 09:13:29] [INFO] no other sibling nodes - we win by default
[2024-03-31 09:13:29] [DEBUG] election result: WON
[2024-03-31 09:13:29] [NOTICE] this node is the only available candidate and will now promote itself
[2024-03-31 09:13:29] [DEBUG] get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 1
[2024-03-31 09:13:29] [INFO] promote_command is:
"/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file"
[2024-03-31 09:13:29] [NOTICE] redirecting logging output to "/var/log/repmgr/pg2.log"
[2024-03-31 09:13:29] [DEBUG] connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
[2024-03-31 09:13:29] [DEBUG] connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
[2024-03-31 09:13:29] [DEBUG] connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
[2024-03-31 09:13:29] [NOTICE] promoting standby to primary
[2024-03-31 09:13:29] [DETAIL] promoting server "pg2" (ID: 2) using "/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data"
[2024-03-31 09:13:29] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2024-03-31 09:13:29] [DEBUG] setting node 2 as primary and marking existing primary as failed
[2024-03-31 09:13:29] [NOTICE] STANDBY PROMOTE successful
[2024-03-31 09:13:29] [DETAIL] server "pg2" (ID: 2) was successfully promoted to primary
[2024-03-31 09:13:29] [DEBUG] result of promote_command: 0
[2024-03-31 09:13:29] [INFO] checking state of node 2, 1 of 6 attempts
再次启动原主节点的postgresql数据库,在主节点查看集群状态,可以发现脑裂了:
-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+----------------------+----------+----------+----------+----------+------------------------------------------------------------------------------
1 | pg1 | primary | * running | | default | 100 | 1 | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
2 | pg2 | standby | ! running as primary | | default | 100 | 2 | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456
WARNING: following issues were detected
- node "pg2" (ID: 2) is registered as standby but running as primary
在从节点,可以看到显示的不一样,不过仍然是脑裂:
-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
1 | pg1 | primary | - failed | ? | default | 100 | | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
2 | pg2 | primary | * running | | default | 100 | 2 | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456
WARNING: following issues were detected
- unable to connect to node "pg1" (ID: 1)
解决方案:
首先,在从节点强制踢掉自己,数据库服务不能关闭,命令如下:
repmgr primary unregister --force
然后关闭从节点的数据库服务,重新从现主17服务器拉取:
repmgr standby clone -h 192.168.123.17 -p15433 -d repmgr --force -U repmgr -c
同步拉取完毕后,在重新强制注册为standby节点,就恢复正常了:
-bash-4.2$ repmgr standby register --force
INFO: connecting to local node "pg2" (ID: 2)
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: connecting to primary database
DEBUG: connecting to: "user=repmgr password=1234562 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: standby registration complete
NOTICE: standby node "pg2" (ID: 2) successfully registered
再次查看集群状态,发现恢复正常了:
-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
1 | pg1 | primary | * running | | default | 100 | 1 | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
2 | pg2 | standby | running | pg1 | default | 100 | 1 | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456
总结:
脑裂的处理步骤是,首先,将异常的节点也就是 running as primary的节点踢掉,这个节点在哪,就在哪个节点踢掉,踢节点也就是unregister,踢完后standby clone,重新克隆的时候需要停止异常节点数据库服务,也就是原从节点的数据库服务,完事后在重新注册并再次启动原从节点的数据库就可以了;而脑裂产生的原因是原主节点数据库挂掉后,又重新启动数据库服务了,重新启动数据库后,原主认为自己是主节点,而原从节点已经提升为主节点,它也认为自己是主节点,造成集群内有两个主节点的信息了,那么,此时集群的主从复制就是混乱状态
如果脑裂很快发现按上述步骤没什么问题,但如果时间很长才发现,那么,需要根据pg_controlldata命令确定哪个是主,哪个是从,Latest checkpoint's oldestXID高的为主,反之为从,在从节点按上述操作即可
-bash-4.2$ pg_controldata
pg_control version number: 1201
Catalog version number: 201909212
Database system identifier: 7351900556498275811
Database cluster state: in production
pg_control last modified: Sun 31 Mar 2024 11:41:05 AM CST
Latest checkpoint location: 0/2F001528
Latest checkpoint's REDO location: 0/2F0014F0
Latest checkpoint's REDO WAL file: 00000001000000000000002F
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:3382
Latest checkpoint's NextOID: 16454
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 479
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 3382
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Sun 31 Mar 2024 11:41:05 AM CST
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: logical
wal_log_hints setting: on
max_connections setting: 1000
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: e76af00554364729517eece5fa0fcf6f7f0662da3a154cabb730c2ab24edd267
重新加入节点:
repmgr node rejoin -h 192.168.123.20 -p 15433 -Upostgres -d repmgr --force-rewind
四,
主备手动切换
repmgr standby switchover --force -S postgres
这个命令比较方便,在从节点执行,不过需要配置免密,并且需要standby从库有上级,比如下面的,Upstream这一栏里,表明pg1的上级是pg2,pg2是主节点,pg1是从节点,
-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
1 | pg1 | standby | running | pg2 | default | 100 | 2 | host=192.168.123.17 port=15433 user=repmgr dbname=repmgr password=123456
2 | pg2 | primary | * running | | default | 100 | 3 | host=192.168.123.20 port=15433 user=repmgr dbname=repmgr password=123456
那么,想快速的把这两个节点调换一个位置就是上面的switchover命令了,非常的方便,比如,现在我又想把pg2切换到从,pg1切换到主,那么,应该是在pg1上执行switchover命令
在切换期间,pg2的数据库需要重启一次,重启完毕后,看pg1的日志如下:
EBUG: minimum of 1 free slots (0 for siblings) required; 10 available
NOTICE: attempting to pause repmgrd on 2 nodes
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: pausing repmgrd on node "pg1" (ID: 1)
DEBUG: pausing repmgrd on node "pg2" (ID: 2)
NOTICE: local node "pg1" (ID: 1) will be promoted to primary; current primary "pg2" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "pg2" (ID: 2)
DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node service --action=stop --checkpoint --superuser=postgres
NOTICE: issuing CHECKPOINT on node "pg2" (ID: 2)
DETAIL: executing server command "/usr/bin/systemctl stop pg12"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 3 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 4 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 5 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 6 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 7 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 8 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 9 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 10 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 11 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 12 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 13 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 14 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 15 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 16 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_OK
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 17 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_NO_RESPONSE
DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf node status --is-shutdown-cleanly
NOTICE: current primary has been cleanly shut down at location 0/3C000028
DEBUG: local node last receive LSN is 0/3C0000A0, primary shutdown checkpoint LSN is 0/3C000028
NOTICE: promoting standby to primary
DETAIL: promoting server "pg1" (ID: 1) using "/usr/pgsql-12/bin/pg_ctl promote -D /data/pgsql/data"
waiting for server to promote.... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
DEBUG: setting node 1 as primary and marking existing primary as failed
NOTICE: STANDBY PROMOTE successful
DETAIL: server "pg1" (ID: 1) was successfully promoted to primary
DEBUG: executing:
/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf --no-wait -d \'user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433\' node rejoin > /tmp/node-rejoin.1711860263.log 2>&1 && echo "1" || echo "0"
DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 192.168.123.20 /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf --no-wait -d \'user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433\' node rejoin > /tmp/node-rejoin.1711860263.log 2>&1 && echo "1" || echo "0"
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
could not connect to server: Connection refused
Is the server running on host "192.168.123.20" and accepting
TCP/IP connections on port 15433?
DETAIL: attempted to connect using:
user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=
ERROR: execution of "repmgr node rejoin" on demotion candidate "pg2" (ID: 2) failed
DETAIL: check log file "/tmp/node-rejoin.1711860263.log" on "pg2" for details
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
could not connect to server: Connection refused
Is the server running on host "192.168.123.20" and accepting
TCP/IP connections on port 15433?
DETAIL: attempted to connect using:
user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=
INFO: sleeping 1 second; 1 of 60 attempts ("standby_reconnect_timeout") to reconnect to demoted primary
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: replication slot "repmgr_slot_1" deleted on node 2
NOTICE: switchover was successful
DETAIL: node "pg1" is now primary and node "pg2" is attached as standby
DEBUG: unpausing repmgrd on node "pg1" (ID: 1)
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: unpausing repmgrd on node "pg2" (ID: 2)
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
NOTICE: STANDBY SWITCHOVER has completed successfully
在强调一次,需要配置免密才可以使用此功能,因为从切换到主的时候,需要ssh到主做一点事情,切换完毕后,原主也就是新从需要在注册一下,到pg2执行:
-bash-4.2$ repmgr standby register -F
INFO: connecting to local node "pg2" (ID: 2)
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: connecting to primary database
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
INFO: standby registration complete
NOTICE: standby node "pg2" (ID: 2) successfully registered
再看集群状态就十分正常了:
-bash-4.2$ repmgr service status
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.20 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr password=123456 dbname=repmgr host=192.168.123.17 port=15433 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path="
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+------+---------+-----------+----------+---------+------+---------+--------------------
1 | pg1 | primary | * running | | running | 7631 | no | n/a
2 | pg2 | standby | running | pg1 | running | 5511 | no | 0 second(s) ago