clickhouse MPPDB数据库--新特性使用示例

clickhouse 新特性:

从clickhouse 22.3至最新的版本24.3.2.23,clickhouse在快速发展中,每个版本都增加了一些新的特性,在数据写入、查询方面都有性能加速。
本文根据clickhouse blog中的clickhouse release blog中,学习并梳理了一些在实际工作中可能用到的新特性。

以下是如何基于docker,如果试用这些新性

docker run -d --name=ch -p 8123:8123 -p 9000:9000 -p 9009:9009 --ulimit nofile=262144:262144 -v D:/ch/latest/external:/external:rw -v  chlatest:/var/lib/clickhouse:rw -v D:/ch/latest/logs:/var/log/clickhouse-server:rw -v D:/ch/latest/etc/clickhouse-server:/etc/clickhouse-server:rw clickhouse/clickhouse-server:24.3.2.23

docker exec -it bash

clickhouse-client --format_csv_delimiter=','

transform函数

进行字典替换

transform(x, array_from, array_to, default)
transform(T, Array(T), Array(U), U) -> U
transform(x, array_from, array_to)

UK-house-price-dataset.csv

CREATE TABLE uk_price_paid
(
	price UInt32,
	date Date,
	postcode1 LowCardinality(String),
	postcode2 LowCardinality(String),
	type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
	is_new UInt8,
	duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
	addr1 String,
	addr2 String,
	street LowCardinality(String),
	locality LowCardinality(String),
	town LowCardinality(String),
	district LowCardinality(String),
	county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2);

INSERT INTO uk_price_paid
WITH
   splitByChar(' ', postcode) AS p
SELECT
	toUInt32(price_string) AS price,
	parseDateTimeBestEffortUS(time) AS date,
	p[1] AS postcode1,
	p[2] AS postcode2,
	transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
	b = 'Y' AS is_new,
	transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration, addr1, addr2, street, locality, town, district, county
FROM file('UK-house-price-dataset.csv','CSV','uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
);


SELECT transform(number, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'], NULL) AS numbers
FROM system.numbers
LIMIT 10

读取文件

可以自动识别文件的类型,推荐字段类型

SELECT * FROM (
WITH
   splitByChar(' ', postcode) AS p
SELECT
	toUInt32(price_string) AS price,
	parseDateTimeBestEffortUS(time) AS date,
	p[1] AS postcode1,
	p[2] AS postcode2,
	transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
	b = 'Y' AS is_new,
	transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration, addr1, addr2, street, locality, town, district, county
FROM file('UK-house-price-dataset.csv','CSV','uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
) SETTINGS format_csv_delimiter=','
) LIMIT 2;

自定义函数

根据需要,编写自定义函数

CREATE OR REPLACE TABLE line_changes
(
    version UInt32,
    line_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3),
    line_number UInt32,
    line_content String,
    time datetime default now()
)
ENGINE = MergeTree
ORDER BY time;

INSERT INTO default.line_changes (version,line_change_type,line_number,line_content) VALUES
(1, 'Add'   , 1, 'ClickHouse provides SQL'),
(2, 'Add'   , 2, 'with improvements'),
(3, 'Add'   , 3, 'that makes it more friendly for analytical tasks.'),
(4, 'Add'   , 2, 'with many extensions'),
(5, 'Modify', 3, 'and powerful improvements'),
(6, 'Delete', 1, ''),
(7, 'Add'   , 1, 'ClickHouse provides a superset of SQL');

-- add a string (str) into an array (arr) at a specific position (pos)
CREATE OR REPLACE FUNCTION add AS (arr, pos, str) -> 
  arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos));
  
-- delete the element at a specific position (pos) from an array (arr)
CREATE OR REPLACE FUNCTION delete AS (arr, pos) -> 
  arrayConcat(arraySlice(arr, 1, pos-1), arraySlice(arr, pos+1));

-- replace the element at a specific position (pos) in an array (arr)
CREATE OR REPLACE FUNCTION modify AS (arr, pos, str) -> 
  arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos+1));

arrayFold

SELECT arrayFold((acc, v) -> (acc + v), [10, 20, 30],  0::UInt64) AS sum;


CREATE OR REPLACE VIEW text_version AS
WITH T1 AS (
  SELECT arrayZip(
    groupArray(line_change_type),
    groupArray(line_number),
    groupArray(line_content)) as line_ops
  FROM (SELECT * FROM line_changes 
        WHERE version <= {version:UInt32} ORDER BY version ASC)
)
SELECT arrayJoin(
  arrayFold((acc, v) -> 
    if(v.'change_type' = 'Add',       add(acc, v.'line_nr', v.'content'),
    if(v.'change_type' = 'Delete', delete(acc, v.'line_nr'),
    if(v.'change_type' = 'Modify', modify(acc, v.'line_nr', v.'content'), []))),
    line_ops::Array(Tuple(change_type String, line_nr UInt32, content String)),
    []::Array(String))) as lines
FROM T1;

SELECT * FROM text_version(version = 3);

Parallel window functions

窗口函数采用并行计算,性能大幅提升

SELECT
	country,
	day,
	max(tempAvg) AS temperature,
	avg(temperature) OVER (PARTITION BY country ORDER BY day ASC ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS moving_avg_temp
FROM noaa
WHERE country != ''
GROUP BY
	country,
	date AS day
ORDER BY
	country ASC,
	day ASC

FINAL

基于FINAL及enable_vertical_final,在如下引擎
ReplacingMergeTree、 AggregatingMergeTree引擎中,可以快速查询到最新的数据

SELECT
    postcode1,
    formatReadableQuantity(avg(price))
FROM uk_property_offers FINAL
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3;

SELECT
    postcode1,
    formatReadableQuantity(avg(price))
FROM uk_property_offers
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3
SETTINGS enable_vertical_final = 1;

Variant Type

SET allow_experimental_variant_type=1, 
   use_variant_as_common_type = 1;
   
SELECT
    map('Hello', 1, 'World', 'Mark') AS x,
    toTypeName(x) AS type
FORMAT Vertical;


SELECT
    arrayJoin([1, true, 3.4, 'Mark']) AS value,
    toTypeName(value)
Row 1:
──────
x:    {'Hello':1,'World':'Mark'}
type: Map(String, Variant(String, UInt8))


   ┌─value─┬─toTypeName(value)─────────────────────┐
1. │ true  │ Variant(Bool, Float64, String, UInt8) │
2. │ true  │ Variant(Bool, Float64, String, UInt8) │
3. │ 3.4   │ Variant(Bool, Float64, String, UInt8) │
4. │ Mark  │ Variant(Bool, Float64, String, UInt8) │
   └───────┴───────────────────────────────────────┘

字符相似性函数

  • byteHammingDistance: the Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or equivalently, the minimum number of errors that could have transformed one string into the other. In a more general context, the Hamming distance is one of several string metrics for measuring the edit distance between two sequences. It is named after the American mathematician Richard Hamming.

    • karolin” and “kathrin” is 3.
    • karolin” and “kerstin” is 3.
    • kathrin” and “kerstin” is 4.
    • 0000 and 1111 is 4.
    • 2173896 and 2233796 is 3.
  • editDistance:a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other.

  • damerauLevenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

  • jaroWinklerSimilarity: a string metric measuring an edit distance between two sequences. It is a variant of the Jaro distance metric

  • levenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

https://clickhouse.com/docs/en/sql-reference/functions/string-functions#dameraulevenshteindistance

CREATE TABLE domains
(
	`domain` String,
	`rank` Float64
)
ENGINE = MergeTree
ORDER BY domain;

INSERT INTO domains SELECT
	c2 AS domain,
	1 / c1 AS rank
FROM url('domains.csv', 'CSV');

SELECT
    domain,
    levenshteinDistance(domain, 'facebook.com') AS d1,
    damerauLevenshteinDistance(domain, 'facebook.com') AS d2,
    jaroSimilarity(domain, 'facebook.com') AS d3,
    jaroWinklerSimilarity(domain, 'facebook.com') AS d4
FROM domains
ORDER BY d1 ASC
LIMIT 10 
Query id: 6f499f27-8274-4787-819a-b510322bdce3

    ┌─domain────────┬─d1─┬─d2─┬─────────────────d3─┬─────────────────d4─┐
 1. │ facebook.com  │  0 │  0 │                  1 │                  1 │
 2. │ facebonk.com  │  1 │  1 │ 0.8838383838383838 │ 0.9303030303030303 │
 3. │ fabebook.com  │  1 │  1 │  0.914141414141414 │ 0.9313131313131312 │
 4. │ facabook.com  │  1 │  1 │ 0.9444444444444443 │  0.961111111111111 │
 5. │ facobook.com  │  1 │  1 │ 0.8535353535353535 │ 0.8974747474747474 │
 6. │ facebook1.com │  1 │  1 │ 0.9743589743589745 │ 0.9846153846153847 │
 7. │ faceook.com   │  1 │  1 │ 0.9722222222222221 │ 0.9833333333333333 │
 8. │ faacebook.com │  1 │  1 │ 0.9743589743589745 │ 0.9794871794871796 │
 9. │ faceboock.com │  1 │  1 │ 0.9326923076923077 │ 0.9596153846153846 │
10. │ facebool.com  │  1 │  1 │ 0.9444444444444443 │ 0.9666666666666666 │
    └───────────────┴────┴────┴────────────────────┴────────────────────┘

Vectorized distance functions

可以作为向量数据库使用,支持L2,cosineDistance,IP三种向量相似度的度量方法

https://clickhouse.com/blog/clickhouse-release-24-02

WITH 'dog' AS search_term,
(
  SELECT vector
  FROM glove
  WHERE word = search_term
  LIMIT 1
) AS target_vector
SELECT word, cosineDistance(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;

WITH
	'dog' AS search_term,
	(
    	SELECT vector
    	FROM glove
    	WHERE word = search_term
    	LIMIT 1
	) AS target_vector
SELECT
	word,
	1 - dotProduct(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;

Adaptive asynchronous inserts

Asynchronous inserts shift data batching from the client side to the server side: data from insert queries is inserted into a buffer first and then written to the database storage later or asynchronously respectively.
在这里插入图片描述

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/518290.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

【C++入门】关键字、命名空间以及输入输出

&#x1f49e;&#x1f49e; 前言 hello hello~ &#xff0c;这里是大耳朵土土垚~&#x1f496;&#x1f496; &#xff0c;欢迎大家点赞&#x1f973;&#x1f973;关注&#x1f4a5;&#x1f4a5;收藏&#x1f339;&#x1f339;&#x1f339; &#x1f4a5;个人主页&#x…

基于Python近红外光谱分析与机器学、深度学习方法融合技术应用

郁磊副教授&#xff0c;主要从事MATLAB 编程、机器学习与数据挖掘、数据可视化和软件开发、人工智能近红外光谱分析、生物医学系统建模与仿真&#xff0c;具有丰富的实战应用经验&#xff0c;主编《MATLAB智能算法30个案例分析》、《MATLAB神经网络43个案例分析》相关著作。已发…

6:算法基础--6.1:线性结构 ,6.2:查找算法

转上一节&#xff1a; http://t.csdnimg.cn/ql5Cdhttp://t.csdnimg.cn/ql5Cd 课程内容提要&#xff1a; 6&#xff1a;知识点考点详解 6.1&#xff1a;线性结构 通常分析时间复杂度的方法是从算法中选取-种对于所研究的问题来说是基本运算的操作&#xff0c;以 该操作重…

51单片机入门:认识开发板

认识开发板 板载资源&#xff1a; 数码管模块 说明&#xff1a; 2个四位一体共阴数码管 详细&#xff1a; 2个四位一体&#xff1a;两个独立的四位数码管&#xff0c;每个四位数码管都是“一体”的设计&#xff0c;也就是说&#xff0c;每个数码管内部集成了四个独立的七段LE…

【Linux】Ubuntu 磁盘管理

准备一个U盘或者SD卡&#xff08;含读卡器&#xff09;&#xff0c;并将其格式化成 FAT32 格式&#xff0c;不要使用NTFS格式&#xff08;这是微软的专利&#xff0c;大部分Linux系统不支持&#xff09;和exFAT格式&#xff08;有的Linux系统也不支持&#xff09;。 如果Ubun…

Lafida多目数据集实测

Lafida 数据集 paper&#xff1a;J. Imaging | Free Full-Text | LaFiDa—A Laserscanner Multi-Fisheye Camera Dataset 官网数据&#xff1a;https://www.ipf.kit.edu/english/projekt_cv_szenen.php 官网&#xff1a;KIT-IPF-Software and Datasets - LaFiDa 标定数据下载&…

【蓝桥杯嵌入式】9届程序题刷题记录及反思

一、题目内容分析 二、LCD单字符高亮显示实现 本次要求显示两个字符&#xff0c;此函数高亮pos及它后面一个字符 void highlight(uint8_t *str,uint8_t pos) {int i 0;for(i 0; i < 20; i){if(i ! pos && i! (pos1))LCD_DisplayChar(Line3,(320 - (16 * i)),st…

Python输出不了中文怎么解决

在文件头加上#encoding&#xff1a;utf-8即可。 # encoding: utf-8 print helloworld print u"学习" print (unicode("学习", encoding"utf-8")) shell输出&#xff1a; helloworld 学习 学习 还可以用#-*- coding: UTF-8 -*- 来指定。

LangChain学习笔记—RAG(检索增强生成)

LangChain LangChain是一个软件开发框架&#xff0c;可以更轻松地使用大型语言模型&#xff08;LLM&#xff09;创建应用程序。它是一个具有 Python 和 JavaScript 代码库的开源工具。LangChain 允许开发人员将 GPT-4 等 LLM 与外部数据相结合&#xff0c;为聊天机器人、代码理…

C++ | Leetcode C++题解之第10题正则表达式匹配

题目&#xff1a; 题解&#xff1a; class Solution { public:bool isMatch(string s, string p) {int m s.size();int n p.size();auto matches [&](int i, int j) {if (i 0) {return false;}if (p[j - 1] .) {return true;}return s[i - 1] p[j - 1];};vector<…

WGCAT工单系统使用指南 - 工单有哪几种状态

WGCAT工单管理系统设计的工单生命周期比较简单易懂 1、待接收 2、处理中 3、已拒绝 4、已完成 5、已关闭

CYP450综述-20年-地表最强系列-文献精读-4

Discovery and modification of cytochrome P450 for plant natural products biosynthesis 发现与改造细胞色素P450以合成植物天然产品 一篇关于植物CYP450的综述&#xff0c;地表最强&#xff0c;总结的最全面的版本之一&#xff0c;各位看官有推荐请留言评论区~ Discovery…

App.vue触发axios报错及解决方案

App.vue触发axios报错及解决方案 修改根目录下vue.config.js文件 module.exports {publicPath: ./,assetsDir: assets,configureWebpack: {devServer: {client: {overlay: false}}} }重新npm run dev 搞定

python作业

1.找出10000以内能被5或6整除&#xff0c;但不能被两者同时整除的数(函数) 2.写一个方法&#xff0c;计算列表所有偶数下标元素的和(注意返回值) 3.根据完整的路径从路径中分离文件路径、文件名及扩展名。 4.根据标点符号对字符串进行分行 5.去掉字符串数组中每个字符串的空格 …

波奇学Linux:

面向数据报&#xff1a;udp没有发送缓冲区&#xff0c;发送几次数据报&#xff0c;读取几次数据报&#xff0c;write和read一一对应 tcp通信时只管识别数据&#xff0c;在应用层才对字节进行拼接分析&#xff0c;得到完整请求 简单来说&#xff1a;udp之间传递的是报文&#x…

【打印SQL执行日志】⭐️Mybatis-Plus通过配置在控制台打印执行日志

目录 前言 一、Mybatis-Plus 开启日志的方式 二、测试 三、日志分析 章末 前言 小伙伴们大家好&#xff0c;相信大家平时在处理问题时都有各自的方式&#xff0c;最常用以及最好用的感觉还是断点调试&#xff0c;但是涉及到操作数据库的执行时&#xff0c;默认的话在控制台…

Excel、PowerQuery 和 ChatGPT 终极手册(上)

原文&#xff1a;Ultimate ChatGPT Handbook for Enterprises 译者&#xff1a;飞龙 协议&#xff1a;CC BY-NC-SA 4.0 序言 在不断发展的数据管理和分析领域中&#xff0c;掌握 Excel 的查找功能不仅是一种技能&#xff0c;更是高效数据处理的基石。《使用 Power Query 和 Ch…

kex_exchange_identification: read: Connection reset by peer

换一台机器&#xff0c;登录到远程ip地址&#xff0c;查看ssh的日志。 sudo grep ssh /var/log/auth.log | grep 8.22(登录失败的ip) 可以看到&#xff0c;刚开始是因为登录密码不对&#xff0c;后边是直接拒绝了。应该是sshd的一种保护机制.超过多少次失败&#xff0c;后…

【热门话题】计算机视觉入门:探索数字世界中的“视觉智能”

&#x1f308;个人主页: 鑫宝Code &#x1f525;热门专栏: 闲话杂谈&#xff5c; 炫酷HTML | JavaScript基础 ​&#x1f4ab;个人格言: "如无必要&#xff0c;勿增实体" 文章目录 计算机视觉入门&#xff1a;探索数字世界中的“视觉智能”摘要正文一、计算机视…

@RequstBody,IOC,DI,@Autowired,@Resource,lombok,

要使用Jason数据格式必须用post方法&#xff0c;因为是通过请求体传送的&#xff0c;get没有请求体 Data不包括有参构造和无参构造方法