【数据中台】开源项目(5)-Amoro

介绍

        Amoro is a Lakehouse management system built on open data lake formats. Working with compute engines including Flink, Spark, and Trino, Amoro brings pluggable and self-managed features for Lakehouse to provide out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused and lake-native architecture。
        Amoro定位是一个搭建在 Apache Iceberg之上的流式湖仓服务,流式强调向实时能力的拓展,服务则强调管理、标准化度量,以及其他可以抽象到基础软件中的湖仓一体能力。
通过 Amoro,用户可以在 Flink、Spark、Trino 等引擎上实现更加优化的 CDC、流式更新、OLAP 等功能, 结合数据湖高效的离线处理能力,Arctic 能够服务于更多流批混用的场景;同时,Arctic 的结构自优化、并发冲突解决以及标准化的湖仓管理功能,将有效减少用户在数据湖管理和优化上的负担。
        开源地址: GitHub - NetEase/amoro: Amoro is a Lakehouse management system built on open data lake formats.

Amoro架构

The architecture of Amoro is as follows:
The core components of Amoro include:
  • AMS: Amoro Management Service provides Lakehouse management features, like self-optimizing, data expiration, etc. It also provides a unified catalog service for all computing engines, which can also be combined with existing metadata services.
  • Plugins: Amoro provides a wide selection of external plugins to meet different scenarios.
  • Optimizers: The self-optimizing execution engine plugin asynchronously performs merging, sorting, deduplication, layout optimization, and other operations on all type table format tables.
  • Terminal: SQL command-line tools, provide various implementations like local Spark and Kyuubi.
  • LogStore: Provide millisecond to second level SLAs for real-time data processing based on message queues like Kafka and Pulsar.

支持的格式

Amoro can manage tables of different table formats, similar to how MySQL/ClickHouse can choose different storage engines. Amoro meets diverse user needs by using different table formats. Currently, Amoro supports three table formats:
  • Iceberg format: means using the native table format of the Apache Iceberg, which has all the features and characteristics of Iceberg.
  • Mixed-Iceberg format: built on top of Iceberg format, which can accelerate data processing using LogStore and provides more efficient query performance and streaming read capability in CDC scenarios.
  • Mixed-Hive format: has the same features as the Mixed-Iceberg tables but is compatible with a Hive table. Support upgrading Hive tables to Mixed-Hive tables, and allow Hive’s native read and write methods after upgrading.

支持的引擎

Iceberg format

Iceberg format tables use the engine integration method provided by the Iceberg community. For details, please refer to: Iceberg Docs.

Paimon format

Paimon format tables use the engine integration method provided by the Paimon community. For details, please refer to: Paimon Docs.

Mixed format

Amoro support multiple processing engines for Mixed format as below:
Processing Engine
Version
Batch Read
Batch Write
Batch Overwrite
Streaming Read
Streaming Write
Create Table
Alter Table
Flink
1.15.x, 1.16.x and 1.17.x
Spark
3.1, 3.2, 3.3
Hive
2.x, 3.x
Trino
406

应用场景

Self-managed streaming Lakehouse

Amoro makes it easier for users to handle the challenges of writing to a real-time data lake, such as ingesting append-only event logs or CDC data from databases. In these scenarios, the rapid increase of fragment and redundant files cannot be ignored. To address this issue, Amoro provides a pluggable streaming data self-optimizing mechanism that automatically compacts fragment files and removes expired data, ensuring high-quality table queries while reducing system costs.

Stream-and-batch-fused data pipeline

Whether in the AI or BI business field , the requirement for real-time analysis is becoming increasingly high. The traditional approach of using one streaming job to complete all data processing from the source to the end is no longer applicable. There is an increasing demand for layered construction of streaming data pipeline, and the traditional layered construction approach based on message queues can cause a inconsistency problem between the streaming and batch data processing. Building a unified stream-and-batch-fused pipeline based on new data lake formats is the future direction for solving these problems. Amoro fully leverages the characteristics of the new data lake table formats about unified streaming and batch processing, not only ensuring the quality of data in the streaming pileline but also enhancing critical features such as incremental reading for CDC data and streaming dimension table association, helping users to build a stream-and-batch-fused data pipeline.

Cloud-native Lakehouse

Currently, most data platforms and products are tightly coupled with their underlying infrastructure(such as the storage layer). The migration of infrastructure, such as switching to cloud-native OSS, may require extensive adaptation efforts or even be impossible. However, Amoro provides an infra-decoupled, lake-native architecture built on top of the infrastructure. This allows products based on Amoro to interact with the underlying infrastructure through a unified interface (Amoro Catalog service), protecting upper-layer products from the impact of infrastructure switch.

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/219834.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

Salesforce认证考试,这5招让你轻松过关!

认证是很多求职者获得第一份Salesforce工作的敲门砖。认证不仅是个人能力的体现,而且在学习备考的过程中,可以更系统地梳理知识,了解最新的产品和功能,对Salesforce有更全面和深入的认识。 大多数Salesforce从业者都至少持有一项…

Maxwell学习笔记

1 概述 Maxwell 是由美国 Zendesk 开源,用 Java 编写的 MySQL 实时抓取软件。 实时读取MySQL 二进制日志 Binlog,并生成 JSON 格式的消息,作为生产者发送给 Kafka,Kinesis、RabbitMQ、Redis、Google Cloud Pub/Sub、文件或其它平台…

多线程--11--ConcurrentHashMap

ConcurrentHashMap与HashMap等的区别 HashMap线程不安全 我们知道HashMap是线程不安全的,在多线程环境下,使用Hashmap进行put操作会引起死循环,导致CPU利用率接近100%,所以在并发情况下不能使用HashMap。 ConcurrentHashMap 主…

UVM实现component之间transaction级别的通信

my_model是从i_agt中得到my_transaction,并把 my_transaction传递给my_scoreboard。在UVM中,通常使用TLM(Transaction Level Modeling)实现component之间transaction级别 的通信。 在UVM的transaction级别的通信 中,数…

【已验证】SqlBulkCopy 执行批量插入的时候报超时问题-解决办法

把datatable里面的数据插入到数据库,但是数据量大的情况下批量插入会提示超时,所以把datatable的数据分批写入数据库的 using (SqlConnection connection new SqlConnection(ConnectionString)){connection.Open();int pageSize 100000;//SqlBulkCopy大…

Linux各目录结构说明

文章目录 目录说明源码放哪里?拓展:Linux里面安装软件是装在home目录还是opt目录还是/usr/local好? bin boot dev etc home lib lib64 lostfound media mnt opt proc root run sbin srv sys tmp usr var 目录说明 bin 存放二进制可执行文件&…

C语言每日一题(46)整数转罗马数字

力扣网12 整数转罗马数字 题目描述 罗马数字包含以下七种字符: I, V, X, L,C,D 和 M。 字符 数值 I 1 V 5 X 10 L 50 C 100 D …

手机如何设置防骚扰电话?

很多人都曾接到过烦人的推销电话,这些电话不仅让人感到烦恼,而且有时候还会接二连三地打来,让人不胜其烦。我们的手机号码似乎已经被泄露,很难避免这些骚扰。 有时,我们因无法忍受骚扰电话而选择立即将其拉黑&#xff…

ROW_NUMBER()函数——(分组后取每组最新的两条数据)

ROW_NUMBER() 功能:简单的说row_number()从1开始,为每一条分组记录返回一个数字。 用法一: ROW_NUMBER() OVER (ORDER BY col DESC) 说明:先把col列降序,再为降序后的每条col记录返回一个序号 用法二&#xf…

Linux系统---图书管理中的同步问题

顾得泉:个人主页 个人专栏:《Linux操作系统》 《C/C》 《LeedCode刷题》 键盘敲烂,年薪百万! 一、问题描述 (1)图书馆阅览室最多能够容纳N(N5)名学生,若有更多学生想…

玩转大数据7:数据湖与数据仓库的比较与选择

1. 引言 在当今数字化的世界中,数据被视为一种宝贵的资源,而数据湖和数据仓库则是两种重要的数据处理工具。本文将详细介绍这两种工具的概念、作用以及它们之间的区别和联系。 1.1. 数据湖的概念和作用 数据湖是一个集中式存储和处理大量数据的平台&a…

【FreeRTOS】消息队列——简介、常用API函数、注意事项、项目实现

在嵌入式系统开发中,任务间的通信是非常常见的需求。FreeRTOS提供了多种任务间通信的机制,其中之一就是消息队列。消息队列是一种非常灵活和高效的方式,用于在不同的任务之间传递数据。通过消息队列,任务可以异步地发送和接收消息…

优化您的Mac体验——System Dashboard Pro for Mac(系统仪表板)

作为Mac用户,我们都希望能够拥有一个高效、流畅的电脑体验。然而,在长时间使用后,我们的Mac可能会变得越来越慢,导致我们的工作效率下降。这时候,System Dashboard Pro for Mac(系统仪表板)就可以派上用场了。它是一款…

vivado时序方法检查2

TIMING-4 &#xff1a; 时钟树上的基准时钟重新定义无效 时钟树上的时钟重新定义无效。基准时钟 <clock_name> 是在时钟 <clock_name> 下游定义的 &#xff0c; 并覆盖其插入延迟和/ 或波形定义。 描述 基准时钟必须在时钟树的源时钟上定义。例如 &#xff0…

mongdb配置ssl

mongodb5.0.9 centos7.6 x86 1、正常启动mongod -f mongodb.conf 【前言】 ssl配置流程步骤&#xff0c;按照以下顺序处理即可。 1.生成证书&#xff0c;根证书&#xff0c;服务端证书&#xff0c;客户端证书 2.配置服务端ssl配置&#xff0c;测试she…

Arrays类 - Java

Arrays类 Arrays类1、常用方法案例 Arrays类 1、常用方法 Arrays 里面包含了一系列静态方法&#xff0c;用于管理或操作数组(比如排序和搜索)。 toString 返回数组的字符串形式 Arrays.toString(arr)【案例1】 sort 排序(自然排序和定制排序) Integer arr[]{1, -1, 7, 0, 8…

云服务器部署过程(从零开始)

首先介绍如何在 Linux 上复制粘贴 CtrlInsert&#xff0c;或者CtrlshiftC复制文本&#xff0c;使用ShiftInsert或CtrlshiftV 在终端中粘贴文本。 搭建java部署环境 要搭建java部署环境&#xff0c;那么首先就需要在Linux上安装jdk&#xff0c;MySQL等必需工具&#xff0c;接…

01_阿里云_Xshell连接服务器

PC使用Xshell连接阿里云服务器 问题引出 之前使用Xshell连接阿里云服务器连接的好好的&#xff0c;今天准备上去服务器学习Linux发现连不上了&#xff0c;后来发现是防火墙的问题&#xff0c;还有阿里云的安全组也需要设置 解决方案 方法一&#xff1a;&#xff08;简单粗暴…

CGAL的周期三角剖分(相关信息较少)

CGAL的周期二维三角剖分类旨在表示二维平面上的一组点的三角剖分。该三角剖分形成其计算空间的分区。它是一个单纯复体&#xff0c;即它包含任何k-单纯形的所有关联j-单纯形&#xff08;j<k&#xff09;&#xff0c;并且两个k-单纯形要么不重叠&#xff0c;要么共享一个公共…

博客访问量到达2万了!

博客访问量到达2万了&#xff01;这也发生的太快了吧&#xff0c;前两天才1万7千访问量&#xff0c;用了平台送的1500的流量券&#xff0c;粉丝从1个&#xff08;N年前的&#xff09;&#xff0c;蹭蹭的往上涨&#xff0c;这也太“假”了吧。关键我也是个菜鸟自学者&#xff0c…