PostgreSQL Study Notes and Knowledge Summary (155) | [performance] Replacing IN VALUES with ANY in WHERE Clauses During Optimization

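Table of Contents

Note: this article draws on the following bloggers, books, and websites:

1. Reference book: 《PostgreSQL数据库内核分析》 (PostgreSQL Database Kernel Analysis)
2. Reference book: 《数据库事务处理的艺术：事务管理与并发控制》 (The Art of Database Transaction Processing: Transaction Management and Concurrency Control)
3. The PostgreSQL source repository
4. The home page of Hironobu Suzuki, the well-known Japanese PostgreSQL expert
5. Reference book: 《PostgreSQL中文手册》 (PostgreSQL Chinese Manual)
6. Reference book: 《PostgreSQL指南：内幕探索》 (The Internals of PostgreSQL)
7. Reference book: 《事务处理概念与技术》 (Transaction Processing: Concepts and Techniques)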

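1. All of the content comes from the open-source community, GitHub, and the contributors listed above. This article is also free and open (it may contain mistakes, and corrections in the comments are welcome).
2. Purpose of this article: open sharing, to start a discussion and learn together.
3. This article provides no resources, involves no transactions, and is not affiliated with any organization or institution.
4. Feel free to copy and paste it for personal use as needed, but reposting and commercial use are not permitted (writing is hard work, thanks for understanding 💖).
5. The content is based on the PostgreSQL master source code.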

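Replacing IN VALUES with ANY in WHERE Clauses During Optimization

Quick Overview
Background of the Feature
- Introduction
- References
Source Code Analysis
- Analysis of the Existing Grammar
- Analysis of the New Patch

Quick Overview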

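Learning goals:

After working on database kernel development for a long time, one can slip into the illusion of early success and youthful arrogance. On closer reflection, though, I am not particularly outstanding and am still far from where I want to be. Perhaps a polished surface hides a hollow interior. Whenever I think of this, I feel as if I were standing at the edge of an abyss at midnight, treading on thin ice. To sleep soundly again, starting today I am pausing other PostgreSQL-based compatibility feature work and will focus for a while on learning and sharing the fundamentals and internals of Postgres.

Learning content (see the table of contents):

1. Replacing IN VALUES with ANY in WHERE clauses during optimization

Study time:

October 21, 2024, 21:53:26

Learning output:

1. One review of PostgreSQL fundamentals
2. One CSDN technical blog post
3. In-depth study of the PostgreSQL kernel

Note: all of the study environments below are CentOS 8 + PostgreSQL master + Oracle 19c + MySQL 8.0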

postgres=# select version();
                                                  version                                                   
------------------------------------------------------------------------------------------------------------
 PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-21), 64-bit
(1 row)

postgres=#

#-----------------------------------------------------------------------------#

SQL> select * from v$version;          

BANNER        Oracle Database 19c EE Extreme Perf Release 19.0.0.0.0 - Production	
BANNER_FULL	  Oracle Database 19c EE Extreme Perf Release 19.0.0.0.0 - Production Version 19.17.0.0.0	
BANNER_LEGACY Oracle Database 19c EE Extreme Perf Release 19.0.0.0.0 - Production	
CON_ID 0


#-----------------------------------------------------------------------------#

mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.27    |
+-----------+
1 row in set (0.06 sec)

mysql>

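Background of the Feature

Original article: https://danolivo.substack.com/p/postgresql-values-any-transformation

Discussion: Replace IN VALUES with ANY in WHERE clauses during optimization

Introduction

As usual, this project was prompted by several user reports containing typical complaints such as "SQL Server executes the query faster" or "Postgres doesn't pick up my index". The common root of these reports is the frequently used VALUES sequence, which is usually transformed into a SEMI JOIN in the query tree.

I also want to discuss a general question: should an open-source DBMS correct user mistakes? I mean optimizing the query before the search for the best plan begins: eliminating self-joins and subqueries and simplifying expressions, all of which the user could also achieve through proper query tuning. The question is not that simple, because DBAs point out that the cost of query planning in Oracle grows with the complexity of the query text, most likely because of the wide range of optimization rules, among other reasons.

Now let's turn our attention to the VALUES construct. Interestingly, it is used not only in INSERT commands but also frequently appears in SELECT queries in the form of a set-inclusion test: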

SELECT * FROM something WHERE x IN (VALUES (1), (2), ...);

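In the query plan, this syntactic construct is transformed into a SEMI JOIN. To demonstrate the essence of the problem, let's generate a test table in which one column has a non-uniform data distribution: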

postgres=# select version();
                                     version                                     
---------------------------------------------------------------------------------
 PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.1.0, 64-bit
(1 row)

postgres=# CREATE EXTENSION tablefunc;
CREATE EXTENSION
postgres=# CREATE TABLE norm_test AS
postgres-#   SELECT abs(r::integer) AS x, 'abc'||r AS payload
postgres-#   FROM normal_rand(1000, 1., 10.) AS r;
SELECT 1000
postgres=# CREATE INDEX ON norm_test (x);
CREATE INDEX
postgres=# ANALYZE norm_test;
ANALYZE
postgres=#

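Here, column x of the norm_test table holds the absolute values of integers drawn from a normal distribution with a mean of 1 and a standard deviation of 10 [1]. There are not many distinct values, and all of them will be included in the MCV statistics. So, despite the uneven distribution, the number of duplicates of each value can be computed accurately. We have also, naturally, created an index on this column, which simplifies scanning the table. Now let's execute the query. It is simple, right? Executing it with two index scan iterations would be reasonable. In Postgres, however, we get: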

postgres=# explain (verbose, costs off, analyze) SELECT * FROM norm_test WHERE x IN (VALUES (1), (29));
                                   QUERY PLAN                                    
---------------------------------------------------------------------------------
 Hash Semi Join (actual time=0.024..0.288 rows=97 loops=1)
   Output: norm_test.x, norm_test.payload
   Hash Cond: (norm_test.x = "*VALUES*".column1)
   ->  Seq Scan on public.norm_test (actual time=0.012..0.127 rows=1000 loops=1)
         Output: norm_test.x, norm_test.payload
   ->  Hash (actual time=0.005..0.006 rows=2 loops=1)
         Output: "*VALUES*".column1
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Values Scan on "*VALUES*" (actual time=0.001..0.002 rows=2 loops=1)
               Output: "*VALUES*".column1
 Planning Time: 0.522 ms
 Execution Time: 0.354 ms
(12 rows)

postgres=#

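From here on, I simplify the EXPLAIN output slightly to make it easier to follow.

Hmm, a sequential scan over every tuple in the table when two index scans would be enough? Let's disable HashJoin and see what happens: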

postgres=# SET enable_hashjoin = 'off';
SET
postgres=# explain (verbose, costs off, analyze) SELECT * FROM norm_test WHERE x IN (VALUES (1), (29));
                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 Nested Loop (actual time=0.184..0.309 rows=97 loops=1)
   Output: norm_test.x, norm_test.payload
   ->  Unique (actual time=0.010..0.014 rows=2 loops=1)
         Output: "*VALUES*".column1
         ->  Sort (actual time=0.009..0.010 rows=2 loops=1)
               Output: "*VALUES*".column1
               Sort Key: "*VALUES*".column1
               Sort Method: quicksort  Memory: 25kB
               ->  Values Scan on "*VALUES*" (actual time=0.002..0.003 rows=2 loops=1)
                     Output: "*VALUES*".column1
   ->  Bitmap Heap Scan on public.norm_test (actual time=0.089..0.135 rows=48 loops=2)
         Output: norm_test.x, norm_test.payload
         Recheck Cond: (norm_test.x = "*VALUES*".column1)
         Heap Blocks: exact=10
         ->  Bitmap Index Scan on norm_test_x_idx (actual time=0.061..0.061 rows=48 loops=2)
               Index Cond: (norm_test.x = "*VALUES*".column1)
 Planning Time: 0.442 ms
 Execution Time: 0.373 ms
(18 rows)

postgres=#

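Now you can see that Postgres has squeezed out the maximum: in a single pass over the VALUES set, it performs an index scan on the table for each outer value. This is much more interesting than the previous option. However, it is still not as simple as a regular index scan. Moreover, if you look at the EXPLAIN output more closely, you will see that the optimizer made a mistake when predicting the cardinality of the join and of the index scan. What happens if you rewrite the query without VALUES: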

postgres=# explain (verbose, costs off, analyze) SELECT * FROM norm_test WHERE x IN (1, 29);
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.norm_test (actual time=0.069..0.166 rows=97 loops=1)
   Output: x, payload
   Recheck Cond: (norm_test.x = ANY ('{1,29}'::integer[]))
   Heap Blocks: exact=8
   ->  Bitmap Index Scan on norm_test_x_idx (actual time=0.055..0.055 rows=97 loops=1)
         Index Cond: (norm_test.x = ANY ('{1,29}'::integer[]))
 Planning Time: 0.110 ms
 Execution Time: 0.192 ms
(8 rows)

postgres=# show enable_hashjoin ;
 enable_hashjoin 
-----------------
 off
(1 row)

postgres=# reset enable_hashjoin ;
RESET
postgres=# explain (verbose, costs off, analyze) SELECT * FROM norm_test WHERE x IN (1, 29);
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.norm_test (actual time=0.049..0.127 rows=97 loops=1)
   Output: x, payload
   Recheck Cond: (norm_test.x = ANY ('{1,29}'::integer[]))
   Heap Blocks: exact=8
   ->  Bitmap Index Scan on norm_test_x_idx (actual time=0.033..0.034 rows=97 loops=1)
         Index Cond: (norm_test.x = ANY ('{1,29}'::integer[]))
 Planning Time: 0.117 ms
 Execution Time: 0.157 ms
(8 rows)

postgres=#

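As you can see, we get a query plan that consists of an index scan only, and its cost is almost halved (execution time drops from about 0.35 ms to under 0.2 ms in the runs above). At the same time, by estimating each value in the set individually, with both values present in the MCV statistics, Postgres can accurately predict the cardinality of this scan.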
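The per-element estimation described above can be illustrated with a small C sketch. This is not the planner's code: `McvEntry`, `mcv_frequency`, and `estimate_any_selectivity` are hypothetical names, and the model is deliberately simplified. It assumes the array elements are distinct and returns 0 for a value missing from the MCV list, where the real planner would fall back to other statistics.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for one most-common-value entry. */
typedef struct McvEntry
{
    int     value;      /* a column value that appears in the MCV list */
    double  frequency;  /* fraction of table rows holding that value */
} McvEntry;

/* Frequency of one value, or 0.0 when it is not in the MCV list. */
static double
mcv_frequency(const McvEntry *mcv, size_t n_mcv, int value)
{
    for (size_t i = 0; i < n_mcv; i++)
    {
        if (mcv[i].value == value)
            return mcv[i].frequency;
    }
    return 0.0;
}

/*
 * Estimate the selectivity of "x = ANY (elems)" one element at a time.
 * Distinct constants match disjoint sets of rows under equality, so the
 * per-element estimates are simply added up and clamped to 1.0.
 * Assumption: elems contains no duplicates.
 */
double
estimate_any_selectivity(const McvEntry *mcv, size_t n_mcv,
                         const int *elems, size_t n_elems)
{
    double  sel = 0.0;

    for (size_t i = 0; i < n_elems; i++)
        sel += mcv_frequency(mcv, n_mcv, elems[i]);

    return (sel > 1.0) ? 1.0 : sel;
}
```

Multiplying the result by the table's row count gives the row estimate. For instance, with made-up frequencies of 0.050 and 0.047 for the values 1 and 29 and a 1000-row table, the sketch yields roughly 97 rows.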

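So using a VALUES sequence is not a big problem in itself (you can always use HashJoin and hash the inner VALUES), but it is a source of danger:

- The optimizer may choose NestLoop, and with a huge VALUES list this degrades performance.
- A SeqScan may suddenly be chosen instead of an IndexScan.
- The optimizer makes significant estimation errors when predicting the cardinality of the JOIN and of its underlying operations.

By the way, why would anyone need such an expression?

My guess is that this is a special case in automated systems (ORMs or REST APIs) that test whether an object belongs to a particular set of objects. Since VALUES describes a relational table and the values of such a list are table rows, we are most likely dealing with a situation where each row represents an object instance in the application. Our case is the corner case in which the object is characterized by a single attribute. If my guess is wrong, please correct me in the comments. Perhaps someone knows other reasons?

So passing an x IN VALUES construct to the optimizer is risky. Why not fix the situation by converting this VALUES construct into an array? Then we have a construct like x = ANY [...], which is a special case of the ScalarArrayOpExpr operation in the Postgres code. It simplifies the query tree and eliminates the unnecessary join. In addition, the Postgres cardinality estimation machinery can work with the array-inclusion check. If the array is small enough (< 100 elements), it performs the statistical estimation element by element. Postgres can also optimize the array search by hashing the values (if the required memory fits within work_mem). Everyone is happy, right?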
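To make the idea of searching an array through a hash table concrete, here is a minimal, self-contained C sketch. It is only an illustration of the technique, not the executor's implementation: `IntSet`, `intset_build`, and `intset_contains` are hypothetical names, and the fixed table size is an assumption chosen for small demo arrays.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SET_BITS  6
#define SET_SLOTS (1u << SET_BITS)   /* 64 slots, enough for small demo arrays */

/* Hypothetical open-addressing hash set over the array constants. */
typedef struct IntSet
{
    int     keys[SET_SLOTS];
    bool    used[SET_SLOTS];
} IntSet;

/* Multiplicative hash: keep the top SET_BITS bits of the 32-bit product. */
static size_t
slot_for(int key)
{
    return (size_t) (((uint32_t) key * 2654435761u) >> (32 - SET_BITS));
}

/*
 * Build the set once from the array elements. Returns false when the array
 * is too large for this fixed-size sketch (one slot always stays free so
 * that probing terminates).
 */
bool
intset_build(IntSet *set, const int *elems, size_t n)
{
    if (n >= SET_SLOTS)
        return false;

    for (size_t i = 0; i < SET_SLOTS; i++)
        set->used[i] = false;

    for (size_t i = 0; i < n; i++)
    {
        size_t  s = slot_for(elems[i]);

        /* Linear probing; stop at a free slot or an existing duplicate. */
        while (set->used[s] && set->keys[s] != elems[i])
            s = (s + 1) & (SET_SLOTS - 1);
        set->keys[s] = elems[i];
        set->used[s] = true;
    }
    return true;
}

/* Probe once per scanned row instead of walking the whole array. */
bool
intset_contains(const IntSet *set, int key)
{
    size_t  s = slot_for(key);

    while (set->used[s])
    {
        if (set->keys[s] == key)
            return true;
        s = (s + 1) & (SET_SLOTS - 1);
    }
    return false;
}
```

The set is built once per query and probed once per row, so the per-row cost no longer grows with the length of the array, which is the point of hashing a long constant list.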

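Well, we decided to try this in our optimization lab, and surprisingly it turned out to be relatively simple. The first quirk we ran into is that the transformation only works for operations on scalar values: so far, it is generally impossible to transform an expression of the form (x,y) IN (VALUES (1,1), (2,2), ...) so that the result exactly matches the behavior before the transformation. Why? It is hard to explain. The reason lies in the design of the comparison operators for the record type: teaching Postgres to use such operators exactly as it does for scalar types would require a substantial redesign of the type cache. Second, you must remember to check this subquery (yes, VALUES is represented as a subquery in the query tree) for volatile functions. And that is it: a single pass of a query tree mutator performs the transformation, much like [2], replacing VALUES with an array and constifying it where possible. Curiously, the transformation is possible even when VALUES contains parameters, function calls, and complex expressions, as shown below:

-- This is the plan produced by current PostgreSQL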

[postgres@localhost:~/test/bin]$ ./psql 
psql (18devel)
Type "help" for help.

postgres=# CREATE TEMP TABLE onek (ten int, two real, four real);
CREATE TABLE
postgres=# PREPARE test (int,numeric, text) AS
postgres-#   SELECT ten FROM onek
postgres-#   WHERE sin(two)*four/($3::real) IN (VALUES (sin($2)), (2), ($1));
PREPARE
postgres=# explain (verbose, costs off, analyze) EXECUTE test(1, 2, '3');
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Hash Semi Join (actual time=0.010..0.011 rows=0 loops=1)
   Output: onek.ten
   Hash Cond: (((sin((onek.two)::double precision) * onek.four) / '3'::real) = "*VALUES*".column1)
   ->  Seq Scan on pg_temp.onek (actual time=0.009..0.009 rows=0 loops=1)
         Output: onek.ten, onek.two, onek.four
   ->  Hash (never executed)
         Output: "*VALUES*".column1
         ->  Values Scan on "*VALUES*" (never executed)
               Output: "*VALUES*".column1
 Planning Time: 1.317 ms
 Execution Time: 0.062 ms
(11 rows)

postgres=#

Below is the plan with their patch applied:

[postgres@localhost:~/test/bin]$ ./psql 
psql (18devel)
Type "help" for help.

postgres=# CREATE TEMP TABLE onek (ten int, two real, four real);
CREATE TABLE
postgres=# PREPARE test (int,numeric, text) AS
  SELECT ten FROM onek
  WHERE sin(two)*four/($3::real) IN (VALUES (sin($2)), (2), ($1));
PREPARE
postgres=# explain (verbose, costs off, analyze) EXECUTE test(1, 2, '3');
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pg_temp.onek (actual time=0.009..0.010 rows=0 loops=1)
   Output: ten
   Filter: (((sin((onek.two)::double precision) * onek.four) / '3'::real) = ANY ('{0.9092974268256817,2,1}'::double precision[]))
 Planning Time: 1.336 ms
 Execution Time: 0.036 ms
(5 rows)

postgres=#

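The feature is currently being tested. The query tree structure is quite stable, and because there is little dependence on the kernel version, there is no reason to modify the code; it can be used in Postgres down to version 10 or even earlier. As usual, you can use the binary of the library compiled in a typical Ubuntu 22 environment. It has no UI and can be loaded statically or dynamically.

Now for the real holy war I mentioned above. Because we implement this as an external library, we have to intercept the planner hook (to simplify the query tree before optimization), which costs us an extra pass over the query tree. Obviously, most queries in a system do not need this transformation, and for them the operation only adds overhead. But when it does apply, it can have a noticeable effect (and from my observations, it does).

Until recently, the PostgreSQL community held a consensus [3, 4]: if a problem can be solved by changing the query itself, there is no point in complicating the kernel code, because doing so inevitably raises maintenance costs and (recall the Oracle experience) affects the performance of the optimizer itself.

However, looking through the core commits, I noticed that the community's opinion seems to be changing. For example, this year they made the subquery-to-SEMI JOIN transformation more complex by adding support for correlated subqueries [5]. Shortly afterwards, they allowed the parent query to receive information about the sort order of a subquery's result [6], although previously, to simplify planning, a query and its subqueries were planned independently. This looks like a way of re-planning subqueries, doesn't it?

What do you think? Can an open-source project support multiple transformation rules that eliminate the redundancy and complexity introduced by users, making queries more readable and understandable? And most importantly, is it worth it?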

References

[1] F.41. tablefunc — functions that return tables
[2] OR-clause support for indexes
[3] Discussion on missing optimizations, 2017
[4] BUG #18643: EXPLAIN estimated rows mismatch, 2024
[5] Commit 9f13376. pull-up correlated subqueries
[6] Commit a65724d. Propagate pathkeys from CTEs up to the outer query

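Source Code Analysis

Analysis of the Existing Grammar

The original author's patch note reads as follows:

Transform an accidentally appearing 'x IN (VALUES ...)' expression into 'x = ANY ...'. The second variant is better because it lets the planner avoid one unnecessary SEMI JOIN operator.
This form of expression usually appears in auto-generated queries, as a corner case of searching for an object within a set of others when the object is described by a single attribute.
Let this unusual optimization live in core, because the planner spends only a few extra cycles when no such construct is present.

The current grammar for x = ANY ... and x IN ... is as follows: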

a_expr:		c_expr									{ $$ = $1; }
			...
			| a_expr subquery_Op sub_type '(' a_expr ')'		%prec Op
				{
					if ($3 == ANY_SUBLINK)
						$$ = (Node *) makeA_Expr(AEXPR_OP_ANY, $2, $1, $5, @2);
					else
						$$ = (Node *) makeA_Expr(AEXPR_OP_ALL, $2, $1, $5, @2);
				}
			...
			;

sub_type:	ANY										{ $$ = ANY_SUBLINK; }
			| SOME									{ $$ = ANY_SUBLINK; }
			| ALL									{ $$ = ALL_SUBLINK; }
		;

a_expr:		c_expr									{ $$ = $1; }
			...
			| a_expr IN_P in_expr
				{
					/* in_expr returns a SubLink or a list of a_exprs */
					if (IsA($3, SubLink))
					{
						/* generate foo = ANY (subquery) */
						SubLink	   *n = (SubLink *) $3;

						n->subLinkType = ANY_SUBLINK;
						n->subLinkId = 0;
						n->testexpr = $1;
						n->operName = NIL;		/* show it's IN not = ANY */
						n->location = @2;
						$$ = (Node *) n;
					}
					else
					{
						/* generate scalar IN expression */
						$$ = (Node *) makeSimpleA_Expr(AEXPR_IN, "=", $1, $3, @2);
					}
				}
			...
			;

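Let's look at the related execution plans. Here I added a GUC parameter on top of the original patch to control the transformation:

[Image: the related execution plans]

As shown above, the execution plan for IN VALUES is poor in this case, which is exactly what this patch sets out to fix.

Let's continue by breaking down in_expr from the grammar above: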

in_expr:	select_with_parens
				{
					SubLink	   *n = makeNode(SubLink);

					n->subselect = $1;
					/* other fields will be filled later */
					$$ = (Node *) n;
				}
			| '(' expr_list ')'						{ $$ = (Node *) $2; }
		;

select_with_parens:
			'(' select_no_parens ')'				{ $$ = $2; }
			| '(' select_with_parens ')'			{ $$ = $2; }
		;

select_no_parens:
			simple_select						{ $$ = $1; }
			...
			
simple_select:
				...
				| values_clause							{ $$ = $1; }
				...

values_clause:
			VALUES '(' expr_list ')'
				{
					SelectStmt *n = makeNode(SelectStmt);

					n->valuesLists = list_make1($3);
					$$ = (Node *) n;
				}
			| values_clause ',' '(' expr_list ')'
				{
					SelectStmt *n = (SelectStmt *) $1;

					n->valuesLists = lappend(n->valuesLists, $4);
					$$ = (Node *) n;
				}
		;

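At this point, a brief summary:

- IN VALUES is built as a SubLink->subselect during parsing
- IN without VALUES is built as $$ = (Node *) makeSimpleA_Expr(AEXPR_IN, "=", $1, $3, @2)
- ANY ... is built as $$ = (Node *) makeA_Expr(AEXPR_OP_ANY, $2, $1, $5, @2)

Next, let's debug the following SQL:

[Image: the SQL statements being debugged]

As shown above, the two SQL statements have the same execution plan. Next, let's look at the internal transformation process (for the SQL below):

[Image: the internal transformation process]

The call stack at this point is as follows: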

transformAExprIn(ParseState * pstate, A_Expr * a)
transformExprRecurse(ParseState * pstate, Node * expr)
transformExpr(ParseState * pstate, Node * expr, ParseExprKind exprKind)
transformWhereClause(ParseState * pstate, Node * clause, ParseExprKind exprKind, const char * constructName)
transformSelectStmt(ParseState * pstate, SelectStmt * stmt)
transformStmt(ParseState * pstate, Node * parseTree)
transformOptionalSelectInto(ParseState * pstate, Node * parseTree)
transformExplainStmt(ParseState * pstate, ExplainStmt * stmt)
transformStmt(ParseState * pstate, Node * parseTree)
transformOptionalSelectInto(ParseState * pstate, Node * parseTree)
transformTopLevelStmt(ParseState * pstate, RawStmt * parseTree)
parse_analyze_fixedparams(RawStmt * parseTree, const char * sourceText, const Oid * paramTypes, int numParams, QueryEnvironment * queryEnv)
pg_analyze_and_rewrite_fixedparams(RawStmt * parsetree, const char * query_string, const Oid * paramTypes, int numParams, QueryEnvironment * queryEnv) 
exec_simple_query(const char * query_string)
...

Analysis of the New Patch

The call entry point is as follows:

[Image: the call entry point]

The call stack at this point is as follows:

pull_up_sublinks_qual_recurse(PlannerInfo * root, Node * node, Node ** jtlink1, Relids available_rels1, Node ** jtlink2, Relids available_rels2)
pull_up_sublinks_jointree_recurse(PlannerInfo * root, Node * jtnode, Relids * relids)
pull_up_sublinks(PlannerInfo * root)
subquery_planner(PlannerGlobal * glob, Query * parse, PlannerInfo * parent_root, _Bool hasRecursion, double tuple_fraction, SetOperationStmt * setops)
standard_planner(Query * parse, const char * query_string, int cursorOptions, ParamListInfo boundParams)
planner(Query * parse, const char * query_string, int cursorOptions, ParamListInfo boundParams)
pg_plan_query(Query * querytree, const char * query_string, int cursorOptions, ParamListInfo boundParams)
standard_ExplainOneQuery(Query * query, int cursorOptions, IntoClause * into, ExplainState * es, const char * queryString, ParamListInfo params, QueryEnvironment * queryEnv)
ExplainOneQuery(Query * query, int cursorOptions, IntoClause * into, ExplainState * es, const char * queryString, ParamListInfo params, QueryEnvironment * queryEnv)
ExplainQuery(ParseState * pstate, ExplainStmt * stmt, ParamListInfo params, DestReceiver * dest)
standard_ProcessUtility(PlannedStmt * pstmt, const char * queryString, _Bool readOnlyTree, ProcessUtilityContext context, ParamListInfo params, QueryEnvironment * queryEnv, DestReceiver * dest, QueryCompletion * qc)
ProcessUtility(PlannedStmt * pstmt, const char * queryString, _Bool readOnlyTree, ProcessUtilityContext context, ParamListInfo params, QueryEnvironment * queryEnv, DestReceiver * dest, QueryCompletion * qc)
PortalRunUtility(Portal portal, PlannedStmt * pstmt, _Bool isTopLevel, _Bool setHoldSnapshot, DestReceiver * dest, QueryCompletion * qc)
FillPortalStore(Portal portal, _Bool isTopLevel)
PortalRun(Portal portal, long count, _Bool isTopLevel, _Bool run_once, DestReceiver * dest, DestReceiver * altdest, QueryCompletion * qc)
exec_simple_query(const char * query_string)
...

The GUC parameter enable_convert_values_to_any shown above is my own addition and can be ignored!

Next comes the core function of this patch, convert_VALUES_to_ANY:

// src/backend/optimizer/plan/subselect.c

/*
 * Transform appropriate testexpr and const VALUES expression to SaOpExpr.
 *
 * Return NULL, if transformation isn't allowed.
 */
ScalarArrayOpExpr *
convert_VALUES_to_ANY(Query *query, Node *testexpr)
{
	RangeTblEntry	   *rte;
	Node			   *leftop;
	Oid					consttype;
	int16				typlen;
	bool				typbyval;
	char				typalign;
	ArrayType		   *arrayConst;
	Oid					arraytype;
	Node			   *arrayNode;
	Oid					matchOpno;
	Form_pg_operator	operform;
	ScalarArrayOpExpr  *saopexpr;
	ListCell		   *lc;
	Oid					inputcollid;
	HeapTuple			opertup;
	bool				have_param = false;
	List			   *consts = NIL;

	/* Extract left side of SAOP from test expression */

	if (!IsA(testexpr, OpExpr) ||
		list_length(((OpExpr *) testexpr)->args) != 2 ||
		!is_simple_values_sequence(query))
		return NULL;

	rte = linitial_node(RangeTblEntry,query->rtable);
	leftop = linitial(((OpExpr *) testexpr)->args);
	matchOpno = ((OpExpr *) testexpr)->opno;
	inputcollid = linitial_oid(rte->colcollations);

	foreach (lc, rte->values_lists)
	{
		List *elem = lfirst(lc);
		Node *value = linitial(elem);

		value = eval_const_expressions(NULL, value);

		if (!IsA(value, Const))
			have_param = true;
		else if (((Const *) value)->constisnull)
			/*
			 * Constant expression isn't converted because it is a NULL.
			 * NULLS just not supported by the construct_array routine.
			 */
			return NULL;

		consts = lappend(consts, value);

	}
	Assert(list_length(consts) == list_length(rte->values_lists));

	consttype = linitial_oid(rte->coltypes);
	Assert(list_length(rte->coltypes) == 1 && OidIsValid(consttype));
	arraytype = get_array_type(linitial_oid(rte->coltypes));
	if (!OidIsValid(arraytype))
		return NULL;

	/* TODO: remember parameters */
	if (have_param)
	{
		/*
		 * We need to construct an ArrayExpr given we have Param's not just
		 * Const's.
		 */
		ArrayExpr  *arrayExpr = makeNode(ArrayExpr);

		/* array_collid will be set by parse_collate.c */
		arrayExpr->element_typeid = consttype;
		arrayExpr->array_typeid = arraytype;
		arrayExpr->multidims = false;
		arrayExpr->elements = consts;
		arrayExpr->location = -1;

		arrayNode = (Node *) arrayExpr;
	}
	else
	{
		int			i = 0;
		ListCell   *lc1;
		Datum	   *elems;

		/* Direct creation of Const array */

		elems = (Datum *) palloc(sizeof(Datum) * list_length(consts));
		foreach (lc1, consts)
			elems[i++] = lfirst_node(Const, lc1)->constvalue;

		get_typlenbyvalalign(consttype, &typlen, &typbyval, &typalign);

		arrayConst = construct_array(elems, i, consttype,
									 typlen, typbyval, typalign);
		arrayNode = (Node *) makeConst(arraytype, -1, inputcollid,
									   -1, PointerGetDatum(arrayConst),
									   false, false);
		pfree(elems);
	}

	/* Lookup for operator to fetch necessary information for the SAOP node */
	opertup = SearchSysCache1(OPEROID, ObjectIdGetDatum(matchOpno));
	if (!HeapTupleIsValid(opertup))
		elog(ERROR, "cache lookup failed for operator %u", matchOpno);

	operform = (Form_pg_operator) GETSTRUCT(opertup);

	/* Build the SAOP expression node */
	saopexpr = makeNode(ScalarArrayOpExpr);
	saopexpr->opno = matchOpno;
	saopexpr->opfuncid = operform->oprcode;
	saopexpr->hashfuncid = InvalidOid;
	saopexpr->negfuncid = InvalidOid;
	saopexpr->useOr = true;
	saopexpr->inputcollid = inputcollid;
	saopexpr->args = list_make2(leftop, arrayNode);
	saopexpr->location = -1;

	ReleaseSysCache(opertup);

	return saopexpr;
}

When every value is a Const, the Const array is created directly:

	else
	{
		int			i = 0;
		ListCell   *lc1;
		Datum	   *elems;

		/* Direct creation of Const array */

		elems = (Datum *) palloc(sizeof(Datum) * list_length(consts));
		foreach (lc1, consts)
			elems[i++] = lfirst_node(Const, lc1)->constvalue;

		get_typlenbyvalalign(consttype, &typlen, &typbyval, &typalign);

		arrayConst = construct_array(elems, i, consttype,
									 typlen, typbyval, typalign);
		arrayNode = (Node *) makeConst(arraytype, -1, inputcollid,
									   -1, PointerGetDatum(arrayConst),
									   false, false);
		pfree(elems);
	}


Next, the operator is looked up in the system cache:

operform = (Form_pg_operator) GETSTRUCT(opertup);

{ oid => '96', oid_symbol => 'Int4EqualOperator', descr => 'equal',
  oprname => '=', oprcanmerge => 't', oprcanhash => 't', oprleft => 'int4',
  oprright => 'int4', oprresult => 'bool', oprcom => '=(int4,int4)',
  oprnegate => '<>(int4,int4)', oprcode => 'int4eq', oprrest => 'eqsel',
  oprjoin => 'eqjoinsel' },

Finally, the ANY expression itself is built:

	/* Build the SAOP expression node */
	saopexpr = makeNode(ScalarArrayOpExpr);
	saopexpr->opno = matchOpno;
	saopexpr->opfuncid = operform->oprcode;
	saopexpr->hashfuncid = InvalidOid;
	saopexpr->negfuncid = InvalidOid;
	saopexpr->useOr = true;
	saopexpr->inputcollid = inputcollid;
	saopexpr->args = list_make2(leftop, arrayNode);
	saopexpr->location = -1;

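This part of the implementation is consistent with make_scalar_array_op in the parser; interested readers can dig deeper!

When the list contains something other than plain Consts (VALUES with parameters, function calls, complex expressions, and so on), an ArrayExpr has to be constructed instead: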

	if (have_param)
	{
		/*
		 * We need to construct an ArrayExpr given we have Param's not just
		 * Const's.
		 */
		ArrayExpr  *arrayExpr = makeNode(ArrayExpr);

		/* array_collid will be set by parse_collate.c */
		arrayExpr->element_typeid = consttype;
		arrayExpr->array_typeid = arraytype;
		arrayExpr->multidims = false;
		arrayExpr->elements = consts;
		arrayExpr->location = -1;

		arrayNode = (Node *) arrayExpr;
	}
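The choice between the two branches, together with the early exit on a NULL constant, can be summarized in a small standalone C sketch. The enums and the function `choose_array_form` are hypothetical stand-ins for the PostgreSQL node types, and other reasons for refusing the transformation (for example a missing array type) are not modeled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of one VALUES entry after constant folding. */
typedef enum ValueKind
{
    VALUE_CONST,        /* folded to a non-NULL Const */
    VALUE_NULL_CONST,   /* folded to a NULL Const */
    VALUE_NON_CONST     /* Param, function call, or other expression */
} ValueKind;

/* Hypothetical description of what the transformation would build. */
typedef enum ArrayForm
{
    ARRAY_NONE,         /* transformation refused, the SubLink is kept */
    ARRAY_CONST,        /* a single Const holding the whole array */
    ARRAY_EXPR          /* an ArrayExpr built from the element expressions */
} ArrayForm;

/*
 * Mirror the control flow of the loop over values_lists: a NULL constant
 * aborts the transformation, any non-Const element forces the ArrayExpr
 * path, and only an all-Const list becomes a Const array.
 */
ArrayForm
choose_array_form(const ValueKind *values, size_t n)
{
    bool    have_param = false;

    for (size_t i = 0; i < n; i++)
    {
        if (values[i] == VALUE_NULL_CONST)
            return ARRAY_NONE;
        if (values[i] == VALUE_NON_CONST)
            have_param = true;
    }
    return have_param ? ARRAY_EXPR : ARRAY_CONST;
}
```

Under this model, a list like (sin($2)), (2), ($1) takes the ArrayExpr path, while (1), (29) becomes a Const array.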


The catalog metadata is as follows:

{ oid => '701', array_type_oid => '1022',
  descr => 'double-precision floating point number, 8-byte storage',
  typname => 'float8', typlen => '8', typbyval => 'FLOAT8PASSBYVAL',
  typcategory => 'N', typispreferred => 't', typinput => 'float8in',
  typoutput => 'float8out', typreceive => 'float8recv', typsend => 'float8send',
  typalign => 'd' },

{ oid => '670', descr => 'equal',
  oprname => '=', oprcanmerge => 't', oprcanhash => 't', oprleft => 'float8',
  oprright => 'float8', oprresult => 'bool', oprcom => '=(float8,float8)',
  oprnegate => '<>(float8,float8)', oprcode => 'float8eq', oprrest => 'eqsel',
  oprjoin => 'eqjoinsel' },

The arrayNode in these two cases looks as follows:

[Image: arrayNode node tree]

{ oid => '1604', descr => 'sine',
  proname => 'sin', prorettype => 'float8', proargtypes => 'float8',
  prosrc => 'dsin' },

{ oid => '1746', descr => 'convert numeric to float8',
  proname => 'float8', prorettype => 'float8', proargtypes => 'numeric',
  prosrc => 'numeric_float8' },
 
{ oid => '316', descr => 'convert int4 to float8',
  proname => 'float8', proleakproof => 't', prorettype => 'float8',
  proargtypes => 'int4', prosrc => 'i4tod' },

[Image: arrayNode node tree]

To print the nodes above I use the VS Code debug console, as follows:

-exec call elog_node_display(15, "have_param_true", arrayNode, 1)

-exec call elog_node_display(15, "have_param_false", arrayNode, 1)

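Readers interested in this can take a look at my earlier blog post:

PostgreSQL Study Notes and Knowledge Summary (72) | An In-Depth Look at How pgNodeGraph, the Open-Source Node Tree Printing Tool for PostgreSQL, Works, and a Statement on Continuing to Maintain pgNodeGraph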