记录一次mysql长事务的经历

一.项目介绍

二.问题暴漏

三.问题排查

1.连接池方向

2.数据库方向

四.代码模拟

五.错误原因分析

1.MySQL参数优化

2.代码优化

六.总结

一.项目介绍

项目是springboot+nacos的微服务架构,商城购物类系统,分多个服务,问题出现在众多服务中的单个服务

二.问题暴漏

在系统日志中显示错误:connection holder is null,空连接

Caused by: java.sql.SQLException: connection holder is null
	at com.alibaba.druid.pool.DruidPooledConnection.checkStateInternal(DruidPooledConnection.java:1155)
	at com.alibaba.druid.pool.DruidPooledConnection.checkState(DruidPooledConnection.java:1148)
	at com.alibaba.druid.pool.DruidPooledConnection.prepareStatement(DruidPooledConnection.java:336)

这种错误是偶发现象,日志出现时,重启后消失,使用druid连接池

三.问题排查

1.连接池方向

初步认为连接池的问题,通过给druid开源项目提issue,尝试调整druid参数

参数调优过程:

1.调整maxActive,加大连接数

2.开启remove-abandoned,调整remove-abandoned-timeout和log-abandoned,清除未使用的连接

3.druid从1.1.9升级当时最新版本1.2.23

4.开启test-while-idle,启用keep-alive,加大min-evictable-idle-time-millis,用于连接检测

5.开启test-on-borrow和test-on-return,连接在借还时进行检测

spring:
  datasource:
    url: jdbc:mysql://localhost:3306/emall?useUnicode=true&characterEncoding=utf-8&useSSL=false&serverTimezone=UTC
    username: root
    password: 123456
    driver-class-name: com.mysql.cj.jdbc.Driver
    druid:
      # 初始化时建立物理连接的个数
      initial-size: 0
      # 最小空闲连接数
      min-idle: 0
      # 设置连接池中最大活动连接数
      max-active: 10
      # 获取连接时最大等待时间，单位毫秒,1000毫秒等1秒
      max-wait: 30000
      # 多久进行一次空闲连接的检测，单位毫秒（60,000 毫秒为1分钟）
      time-between-eviction-runs-millis: 60000
      # 连接在池中最小生存的时间，单位毫秒
      min-evictable-idle-time-millis: 300000
      # 设置用于连接验证的 SQL 查询语句
      validation-query: SELECT 1
      # 在空闲时检测连接有效性
      test-while-idle: true
      # 获取连接时不测试连接，影响性能
      test-on-borrow: false
      # 归还连接时不测试连接，影响性能
      test-on-return: false
      # 启用keepAlive，确保连接池中的连接不被数据库服务端关闭
      keep-alive: true
      # 是否清除长时间未使用的连接
      remove-abandoned: false
      #定义连接被认定为被遗弃之前的超时时间（单位：秒）
      remove-abandoned-timeout: 1800
      #是否记录遗弃连接的日志信息
      log-abandoned: true  # 记录遗弃连接的日志

通过以上几个参数的调整,错误是有变化的,好像是一套组合拳,removeAbandoned: true/testOnBorrow=true/升级版本,三个操作下去,出现connection closed

Caused by: java.sql.SQLException: connection closed
  at com.alibaba.druid.pool.DruidPooledConnection.checkStateInternal(DruidPooledConnection.java:1178)
  at com.alibaba.druid.pool.DruidPooledConnection.checkState(DruidPooledConnection.java:1169)
  at com.alibaba.druid.pool.DruidPooledConnection.prepareStatement(DruidPooledConnection.java:356)

关闭removeAbandoned: false,出现connection disabled

Caused by: java.sql.SQLException: connection disabled
        at com.alibaba.druid.pool.DruidPooledConnection.checkStateInternal(DruidPooledConnection.java:1184)
        at com.alibaba.druid.pool.DruidPooledConnection.checkState(DruidPooledConnection.java:1169)
        at com.alibaba.druid.pool.DruidPooledConnection.prepareStatement(DruidPooledConnection.java:356)

错误都出自druid连接池的DruidPooledConnection的checkStateInternal方法

以上操作并没有解决问题,

2.数据库方向

1.对mysql数据库进行参数调优,应该是开启testOnReturn: true,配置mysq的'max_connections'最大连接数为2000,组合操作下来,看到的新错误,Lock wait timeout exceeded; try restarting transaction

Cause: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Lock wait timeout exceeded; try restarting transaction

另一个错误是Statement cancelled due to timeout or client request

Cause: com.mysql.cj.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request

2.调整mysql以下参数进行观察

SET innodb_lock_wait_timeout = 100; #增加锁等待时间为100s
set global innodb_print_all_deadlocks=1;#控制是否将所有死锁信息打印到 MySQL 错误日志
SET GLOBAL max_allowed_packet = 10M; 指定客户端与服务器之间通信时单个数据包的最大大小4改10

以上操作没有解决问题

3.通过查询数据库中的事务,锁,连接情况,查看是否有慢sql,性能查询语句:

SELECT
  trx.trx_id  事务的唯一标识符,
  trx.trx_state   事务的状态,
  trx.trx_started 事务开始的时间,
  trx.trx_mysql_thread_id 事务关联的MySQL线程ID,
  trx.trx_isolation_level 事物隔离级别,
  t.name 线程名称,
  pl.Id 线程ID,
  pl.USER 线程所属的用户,
  pl.HOST 线程连接的主机,
  pl.Command  线程的当前命令,
  pl.TIME 线程的运行时间,
  pl.State 线程的状态,
  pl.Info 正在执行的SQL语句,
  es.THREAD_ID ,
  es.SQL_TEXT 当前执行的SQL语句,
  es.CURRENT_SCHEMA 当前使用的数据库,
  es.TIMER_WAIT/1000000000 语句的执行时间,
  il.lock_id 锁的唯一标识符,
  il.lock_mode 锁的模式,
  il.lock_type 锁的类型,
  il.lock_table 被锁定的表,
  il.lock_index 被锁定的索引,
  il.lock_space 被锁定的表空间,
  il.lock_page 被锁定的页,
  il.lock_rec 被锁定的记录
FROM
  information_schema.INNODB_TRX trx
  LEFT JOIN information_schema.PROCESSLIST pl ON trx.trx_mysql_thread_id = pl.Id
  LEFT JOIN information_schema.INNODB_LOCKS il ON trx.trx_id = il.lock_trx_id
	LEFT JOIN PERFORMANCE_SCHEMA.threads t ON t.processlist_id =  pl.id
	LEFT JOIN PERFORMANCE_SCHEMA.events_statements_current es ON es.THREAD_ID  = t.thread_id

查询mysql大量的长事务和锁等待,事务执行时间太长,这个是明显能查到的问题,应该是事务长时间未提交导致,即着手先解决长事务问题.

四.代码模拟

通过对事务不提交操作,模拟长事务

数据库出现长时间的事务未提交,锁类型:表锁和行锁,锁模式:IX:表级别意向排他锁,X:行级别排他锁,REC_NOT_GAP: 记录锁（不包括间隙,确保锁定的是精确的行记录，避免幻读）,InnoDB存储引擎使用多种锁模式来实现事务隔离和一致性

再执行一次方法,出现锁等待,

五.错误原因分析

可能长事务存在时间超过了druid连接池的生存时间,连接被连接池回收关闭,程序报错,并且大量的长事务造成锁等待或者死锁,消耗资源.

1.MySQL参数优化

可以缩短mysql的wait_timeout一个连接在服务器端闲置多长时间后会被自动关闭,默认是8小时,比如600秒

2.代码优化

1.缩小事务控制范围,比如,只有在更新数据库时,再开启事务,对于查询等这类逻辑操作无需开启事务.

2.发生异常时,捕捉异常,进行事务回滚.

3.逻辑判断需要提前结束方法时.对于已经开启的事务记得关闭.

六.总结

问题为何偶发,并且系统自上线的这几年来只有近期才出现问题.

可能是在近期MySQL提升了隔离级别,由读已提交（Read Committed）改为可重复读（Repeatable Read）,为保证数据的一致性,数据库的锁竞争会变的激烈,并发事务对同一资源（如表或行）的访问需求较高，导致锁(行锁或者表锁)争用严重,并且造成死锁,虽然 InnoDB 通常会自动检测和处理死锁，但在某些复杂情况下，可能会导致事务超时。较高的事务隔离级别（如 SERIALIZABLE）可能会导致更多的锁竞争和等待超时。之前适合RC级别的代码写法已经不适用RR级别.