一 问题描述
发现线上环境一个从库出现延迟,延迟了2天了,还没追上主库。
查看当前运行的sql及事务,发现这个sql语句是在delete一个没主键的表。
二 问题模拟
这里在测试环境复现下这个问题。
2.1 在主库造数据
use baidd;
CREATE TABLE test2(id INT primary key AUTO_INCREMENT);
INSERT INTO test2(id) VALUES(1),(2),(3),(4);
INSERT INTO test2(id) SELECT id+(SELECT COUNT(*) FROM test2) FROM test2;
INSERT INTO test2(id) SELECT id+(SELECT COUNT(*) FROM test2) FROM test2;
……
多次运行如上sql。
2.2 在主库删除这个表的主键
alter table test2 modify id int;
alter table test2 drop primary key;
2.3 在主库清空这个表数据
#待这两百万条数据同步到从库后,在主库上删除这两百万条数据
delete from test2;
三 问题排查
3.1 在从库观察主从复制
#发现五分钟了从库还没同步删除掉,查看主从状态,发现复制位点很久都不变
root@localhost [(none)]>show slave status \G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 10.106.210.206
Master_User: repl
Master_Port: 14728
Connect_Retry: 60
Master_Log_File: mysql-bin.000025
Read_Master_Log_Pos: 84258028
Relay_Log_File: host01-relay-bin.000004
Relay_Log_Pos: 63196660
Relay_Master_Log_File: mysql-bin.000025
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 63196524
Relay_Log_Space: 84258422
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 96
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 210206
Master_UUID: 408abbb7-dc1b-11ee-a91f-fefcfee1f844
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set: 408abbb7-dc1b-11ee-a91f-fefcfee1f844:1-1120,
b5896746-dc1a-11ee-b573-fefcfee443b5:1-1391
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
ERROR:
No query specified
注意这里:
Relay_Log_File: host01-relay-bin.000004
Relay_Log_Pos: 63196660
3.2 查看当前运行的sql
#查看当前运行的事务
select * from information_schema.INNODB_TRX;
发现trx_query为空,以前线上从库出现延迟,能看到当前运行的sql的,不知为何这次看不到。
#查看该事务对应的进程
select * from information_schema.PROCESSLIST where id in(select trx_mysql_thread_id from information_schema.INNODB_TRX);
Info也是为空。
select * from `performance_schema`.threads where processlist_id =
(
select id from information_schema.PROCESSLIST where id in(select trx_mysql_thread_id from information_schema.INNODB_TRX)
);
3.3 查看binlog里对应的sql
用3.2命令查不到导致延迟的sql内容,只能去binlog里查看了。
查看从库中继日志里延迟时卡住的位置对应的sql。
上面3.1那里show slave status:
Relay_Log_File: host01-relay-bin.000004
Relay_Log_Pos: 63196660
mysqlbinlog -v host01-relay-bin.000004 > 000004.log
less 000004.log
搜索63196660
可以看到这个位置,在操作baidd.t2这个表:
#240427 17:47:17 server id 210206 end_log_pos 63196737 CRC32 0x71c38d61 Table_map: `baidd`.`test2` mapped to number 107
再一直往下翻,看到有很多对应的delete语句:
3.4 在从库查看捕获的慢sql有没有主键
show indexes from baidd.test2;
如果没有主键,就需要先在从库上为这个表建下主键,先解决这个延迟的问题。
四 着手处理
思路:
- 先在从库杀掉这个慢的delete的sql
- 在从库加主键,然后在从库delete就很快了
- 在主库加主键,从库报主键冲突时,跳过这个错误,继续同步即可
4.1 在从库停掉主从复制
stop slave;
发现2分钟都stop不掉,因为当前有慢事务在运行。
4.2 在从库杀掉这个慢事务
#查看慢事务对应的进程id,注意别多杀了,只杀运行时间很久的,导致延迟的sql
select * from information_schema.PROCESSLIST where id in(select trx_mysql_thread_id from information_schema.INNODB_TRX);
这里是kill 55588
Kill完后,stop slave就能很快执行完了。
4.3 在从库上给这个表加主键
set sql_log_bin=0;
SET GLOBAL super_read_only=off;
show variables like '%super_read_only%';
alter table test2 add primary key(id);
SET GLOBAL super_read_only=on;
4.4 开启同步
start slave;
set sql_log_bin=1;
发现很快就同步完了。
4.5 在主库上建下主键,避免以后再出现这个问题
alter table test2 add primary key(id);
4.6 在从库跳过这个建主键的事务
因为上面已经在从库建过主键了,所以从库复制会停止,提示主键冲突,可以通过跳过这个事务来处理。
4.6.1 查看从库复制状态
root@localhost [(none)]>show slave status \G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 10.106.210.206
Master_User: repl
Master_Port: 14728
Connect_Retry: 60
Master_Log_File: mysql-bin.000025
Read_Master_Log_Pos: 84259101
Relay_Log_File: host01-relay-bin.000005
Relay_Log_Pos: 1195
Relay_Master_Log_File: mysql-bin.000025
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1068
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'ANONYMOUS' at source log mysql-bin.000025, end_log_pos 84259101. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 84258897
Relay_Log_Space: 84259617
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1068
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'ANONYMOUS' at source log mysql-bin.000025, end_log_pos 84259101. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 210206
Master_UUID: 408abbb7-dc1b-11ee-a91f-fefcfee1f844
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 240427 18:04:40
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set: 408abbb7-dc1b-11ee-a91f-fefcfee1f844:1-1120,
b5896746-dc1a-11ee-b573-fefcfee443b5:1-1391
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
1 row in set, 1 warning (0.00 sec)
注意看这几个:
#查看具体报错
select * from performance_schema.replication_applier_status_by_worker
LAST_ERROR_MESSAGE显示报错:
Worker 1 failed executing transaction 'ANONYMOUS' at source log mysql-bin.000025, end_log_pos 84259101; Error 'Multiple primary key defined' on query. Default database: 'baidd'. Query: 'alter table test2 add primary key(id)'
4.6.2 在从库跳过这个事务
4.6.2.1 针对未开启GTID的从库
show variables like '%gtid_mode%'; #off表示未开启gtid
只需要在从库执行以下命令就能跳过一个事务。
set sql_log_bin=0;
set global sql_slave_skip_counter=1;
4.6.2.2 针对开启了GTID的从库
思路:
将gtid_next设置为自己环境Executed_Gtid_Set的下一个位置,然后开始一个空事务。
示例:
set sql_log_bin=0;
set gtid_next='408abbb7-dc1b-11ee-a91f-fefcfee1f844:1121';
begin;
commit;
set gtid_next='automatic';
4.6.3 开启同步
start slave;
set sql_log_bin=1;
4.6.4 验证主从同步状态
show slave status \G;
确保复制没问题。
五 教训
表没有主键的时候,delete很多数据,会导致从库出现很长时间延迟,因此需要严格把控,定期检查,确保主库上不存在没有主键的表。