Building Data Multi-Agents with LLMs to Empower a Data Analysis Platform, Part ②: Data Governance II (Automated Processing)

Background

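Building on the multi-agent design for data analysis described in the previous article, this article continues to explore and test using an LLM to fully automate data governance, and make it more intelligent, on top of a private knowledge base. The overall design is described below.
The system consists of three agents plus a Plan-and-Execute Agent. The first agent retrieves the knowledge chunks relevant to the user's question from the enterprise's private data-standards knowledge base and generates an answer from them. The second agent runs data quality analysis on the enterprise's or the user's data. The third agent summarizes the user's question or the other agents' results into a report. The Plan-and-Execute Agent plans tasks from the user's question and schedules the agents to execute them. **The workflow exercised in this article is:** retrieve the data governance process or the standard value range of the data from the user's private knowledge base; following that standard or process, analyze the outliers and data quality of the data uploaded by the user; then write a data quality report.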

I. Agent ①: Knowledge Retrieval

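The first agent is built with create_retriever_tool and ZeroShotAgent. It retrieves relevant knowledge chunks from the private knowledge base and generates answers from them.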

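# Import paths assume a LangChain 0.1.x layout; adjust them for your installed version.
from langchain.agents import AgentExecutor, ZeroShotAgent
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.tools.retriever import create_retriever_tool
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# `llm` is assumed to be a language model initialized elsewhere.
model_id = "iic/nlp_corom_sentence-embedding_english-base"  # defined but not used below
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
# Load the data-standards document and split it into overlapping chunks
loader = Docx2txtLoader('./standard documents (1).docx')
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=150)
split_docs = text_splitter.split_documents(docs)
# Index the chunks in a vector store and expose the retriever as a tool
vectordb = Chroma.from_documents(documents=split_docs, embedding=embeddings)
retriever_tool = create_retriever_tool(
    vectordb.as_retriever(),
    "Search_for_data_retriever",
    "Retrieves knowledge chunks related to the query from the data-standards knowledge base.",
)

tools_kg = [retriever_tool]
# Conversation memory (defined here but not passed to the agent below)
memory = ConversationBufferMemory(
    memory_key="chat_history",
    human_prefix="Input",
    ai_prefix="Response",
    input_key="question",
    return_messages=False,
)
template_kg = 'You are a knowledge retrieval and generation tool that searches for relevant text based on user questions and generates answers accordingly.'
agent_kg = ZeroShotAgent.from_llm_and_tools(
    llm=llm,
    tools=tools_kg,
    prefix=template_kg,
)
# Wrap the agent in an executor so it can be invoked as a tool (referenced later as agent_KG)
agent_KG = AgentExecutor(agent=agent_kg, tools=tools_kg, handle_parsing_errors=True, verbose=True)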
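To make the two splitter parameters above concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: RecursiveCharacterTextSplitter also tries to split on separators such as paragraph and line breaks, which this sketch ignores. The function name `split_with_overlap` is a hypothetical helper, not part of LangChain.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=150):
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Hypothetical sample text of 1200 characters
sample_text = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample_text)
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks, at the cost of indexing some text twice.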

II. Agent ②: Data Analysis

1. Prompt design: the prompt has three parts: a brief introduction to the tool, basic information about the data (dhead), and a reference example.

TEMPLATE_dt = """You are working with a pandas dataframe in Python. The name of the dataframe is `df`.
It is important to understand the attributes of the dataframe before working with it. This is the result of running `df.head().to_markdown()`

<df>
{dhead}
</df>

You are not meant to use only these rows to answer questions - they are meant as a way of telling you about the shape and schema of the dataframe.
You also do not have to use only the information here to answer questions - you can run intermediate queries to do exploratory data analysis to give you more information as needed.

You possess an essential tool, `python_repl_dt`: with this tool, you can analyze and process the data retrieved from df using Python code.

When facing a question, assess whether you need to employ these tools iteratively.

Example:
<question>It is known that the price interval for apples in the market is between 5 and 20 yuan per kilogram; your task is to detect any anomalous values within the set of apple market price quotes. </question>
<logic>
  First, scrutinize the user's query, in which the user has furnished an essential detail: the benchmark data range for apple prices is from 5 to 20 yuan per kilogram.
  Then, leverage `python_repl_dt` to detect any anomalous values based on the retrieved data.
  The code written for python_repl_dt:
  '''import pandas as pd
     df[df['apple'].lt(5) | df['apple'].gt(20)].to_csv('apple_market_price_anomalous.csv')'''
  The output should print the anomalous rows and write a file named apple_market_price_anomalous.csv.
</logic>
If you have not found the precise answer, please iteratively use the tools.
"""

2. Building the tool and assembling the agent

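from langchain.agents import AgentExecutor, ZeroShotAgent
from langchain_experimental.tools import PythonAstREPLTool

# `df` is assumed to be the user's uploaded data, already loaded as a pandas DataFrame.
# Use PythonAstREPLTool so the agent can run Python code against `df`
repl = PythonAstREPLTool(
    locals={"df": df},
    name="python_repl_dt",
    description="Generates Python code to analyze the dataframe `df` (the data named pig_market_data), runs the code, and outputs both the code and the results of the computation.",
    # args_schema=PythonInputs,
)
tools_re = [repl]

# Agent_dt: fill the prompt with the first rows of the data (to_markdown() needs the tabulate package)
template_dt = TEMPLATE_dt.format(dhead=df.head().to_markdown())
agent_dt = ZeroShotAgent.from_llm_and_tools(
    llm=llm,
    tools=tools_re,
    prefix=template_dt,
)
agent_dt = AgentExecutor(
    agent=agent_dt, tools=tools_re, max_iterations=50,
    handle_parsing_errors=True, early_stopping_method="generate", verbose=True,
)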
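The range check that the prompt's example asks the agent to generate can be sketched in plain Python, without pandas, to show the logic on its own. The sample rows, the province labels, and the helper names `find_out_of_range` and `rows_to_csv` are hypothetical; the [70, 200] kg range is the slaughter-weight standard retrieved later in this article.

```python
import csv
import io


def find_out_of_range(rows, column, low, high):
    """Return the rows whose `column` value lies outside [low, high]."""
    return [row for row in rows if not (low <= row[column] <= high)]


def rows_to_csv(rows, fieldnames):
    """Serialize rows to CSV text (in memory, so nothing is written to disk)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data: slaughter weight in kg per province
sample = [
    {"province": "A", "weight": 65},
    {"province": "B", "weight": 120},
    {"province": "C", "weight": 210},
    {"province": "D", "weight": 70},
    {"province": "E", "weight": 200},
]
outliers = find_out_of_range(sample, "weight", 70, 200)
report = rows_to_csv(outliers, ["province", "weight"])
```

Both bounds are treated as inclusive here, which matches the `lt(...) | gt(...)` pattern in the prompt: a value exactly on the boundary is not flagged.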

III. Agent ③: Summary Report

1. Prompt design: define a basic framework for the summary report, covering purpose, background, process, conclusions, and recommendations.

template_bg = """
    As a Data Analyst, please compose a comprehensive data report addressing the user's inquiry: {question}, adhering to the following guidelines:

  1.Introduction: Begin with a clear and concise overview of the purpose of the report, briefly outlining the user's question or problem statement. This section should set the context and provide a roadmap for the content that follows.

  2.Data Scope and Source: Specify the scope of the data analyzed, including relevant timeframes, geographical coverage, and any specific datasets utilized. Mention the sources from which the data was obtained, emphasizing their credibility and relevance to the analysis.

  3.Methodology: Describe the analytical methods employed to examine the data, including any statistical techniques, models, or tools used. Explain why these methods were chosen and how they contribute to answering the user's question. Outline any assumptions, limitations, or caveats associated with the methodology.

  4.Key Findings: Present the main insights derived from the analysis in a structured and visually appealing manner, using tables, charts, graphs, or other appropriate visualizations. Accompany each finding with a clear explanation and, where applicable, quantitative measures such as percentages, averages, or trends. Ensure findings are directly responsive to the user's inquiry and are contextualized within the broader data landscape.

  5.Interpretation and Implications: Interpret the key findings, drawing meaningful conclusions and highlighting their significance. Relate these conclusions back to the user's question, explaining how they address the initial concerns or objectives. Discuss any potential implications for business decisions, strategy, or further research, and offer actionable recommendations where appropriate.

  6.Quality Assurance and Limitations: Discuss the steps taken to ensure data quality throughout the analysis process, such as data cleaning, validation, and outlier detection. Acknowledge any limitations or challenges encountered during the analysis, including data gaps, inconsistencies, or inherent biases, and discuss how these may have influenced the results and conclusions.

  7.Conclusion and Next Steps: Summarize the key takeaways from the report, reinforcing the most important findings and their implications. Suggest potential avenues for future analysis or data collection that could further enhance understanding or address remaining questions. Encourage user engagement by inviting feedback or follow-up inquiries.

  8.Appendix and Supporting Materials: Include any additional information, detailed calculations, or raw data that support the analysis but would disrupt the flow of the main report. This might include detailed statistical outputs, full dataset summaries, or detailed descriptions of complex methodologies.

  By adhering to these guidelines, your data report will effectively communicate the results of your analysis, address the user's question thoroughly, and provide a robust foundation for informed decision-making.
"""

2. Building the agent

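from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Pipe the question into the report prompt, then the LLM, then parse the reply into a string
prompt = ChatPromptTemplate.from_template(template_bg)
chain_bg = (
    {"question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)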
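What the chain above does can be sketched with ordinary functions: the input is placed under the `question` key, formatted into the template, sent to the model, and the reply is returned as a string. This is only an illustration of the data flow, not LangChain's implementation; `make_report_chain` and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in that echoes the first prompt line instead of calling a model.

```python
def make_report_chain(template, llm):
    """Compose: question -> formatted prompt -> llm -> string output."""
    def chain(question):
        prompt_text = template.format(question=question)  # fill the {question} slot
        reply = llm(prompt_text)                           # call the model
        return str(reply)                                  # normalize the reply to a string
    return chain


def fake_llm(prompt_text):
    # Hypothetical stand-in for a real model call
    return "REPORT FOR: " + prompt_text.splitlines()[0]


demo_chain = make_report_chain("Write a data report addressing: {question}", fake_llm)
demo_output = demo_chain("abnormal pig weights")
```

Because the chain takes a single string and returns a single string, it can be registered as a tool function in the same way as the two agents.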

IV. Designing and Building the Plan-and-Execute Agent

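1. Option 1: build a task-planning and scheduling agent with ZeroShotAgent plus a prompt
(1) Prompt design: the prompt has three parts: basic information about the agent (its role and tool library), the task execution flow, and a reference example.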

template= """
            As an AI data analyst, it is crucial to optimize workflow by utilizing tools within the tool library, especially "Search_for_data_standard", "Analysis_for_pig_market_data" and "Tool_for_Data_analysis_report". Below are streamlined methods for efficiently handling user inquiries:

1. **Task Understanding & Strategy Formulation**:
   - Firstly, comprehensively understand the user's requirements and constraints, aligning with industry expertise and best practices.
   - Develop personalized data analysis strategies, setting operational guidelines for each data item based on established norms.
   - Based on the user's questions, output a matching task list.

2. **Utilizing "Search_for_data_standard" to Establish Data Analysis Framework and Standards**:
   - Within the strategy, define standards and guidelines for each data item to ensure alignment with business context and analysis objectives.

3. **Performing Data Analysis Using "Analysis_for_pig_market_data" Tool**:
   - Employ the "Analysis_for_pig_market_data" tool to analyze data uploaded by the user, following the framework and standards established by "Search_for_data_standard".
   - That is, the input of tool "Analysis_for_pig_market_data" should contain the framework and standards established by "Search_for_data_standard".

4. **Utilizing "Tool_for_Data_analysis_report" to Craft a Report**:
   - Compose a data analysis report based on industry knowledge and standards (queried with "Search_for_data_standard"), data analysis results ("Analysis_for_pig_market_data" output), and user inquiries.

Example:
<question>Please calculate the cost and profit of the company's batch products for this year, and identify the batches of products that exceed the average cost. And generate a financial report</question>
<logic>
  First, Based on the user's questions, plan a task list and execution process:
  1. Use tool "Search_for_data_standard" to search and generate the company's financial management system and cost accounting process;
  2. According to the system or process, use tool "Analysis_for_pig_market_data" to analyze financial data, including batch cost, profit, and batch product ranking;
  3. Write financial analysis reports using "Tool_for_Data_analysis_report" tools based on management systems, cost accounting processes, data analysis, etc.
<task1>Use tool "Search_for_data_standard" to search and generate......
task1 output: The cost of product batches should be between 10-20 yuan/kg
<task2>use tool "Analysis_for_pig_market_data" to analyze financial data......
task2 input: the batch cost of the product should be between 10-20 yuan/kg. Analyze products with unreasonable batch costs
task2 output:
 '''import pandas as pd
     df[df['apple'].lt(10) | df['apple'].gt(20)].to_csv("the batch cost of the product  anomalous.csv", index=False)'''
<task3>Write a report using tool "Tool_for_Data_analysis_report"......
task3 input: the company's financial management system and cost accounting process,Financial data analysis results.....
task3 output: Our company's batch cost analysis overview is as follows: the average cost of a batch is 15 yuan/batch, and the batches of products that exceed the abnormal cost warning value include A, B, C, etc......
</logic>

        """

(2) Building the tool library and the planning agent (P-Agent)

from langchain.agents import AgentExecutor, Tool, ZeroShotAgent

# Each tool wraps one of the agents or chains built above.
# agent_KG must be an invocable executor wrapping the knowledge retrieval agent.
tools = [
    Tool(
        name="Search_for_data_standard",
        func=agent_KG.invoke,
        description="By retrieving relevant text, obtain industry standards, data analysis strategies related to user questions, and summarize and generate answers.",
    ),
    Tool(
        name="Analysis_for_pig_market_data",
        func=agent_dt.invoke,
        description="A tool used to execute code that analyzes the data named pig_market_data, where the data is the slaughter weight and price data of the pig market.",
    ),
    Tool(
        name="Tool_for_Data_analysis_report",
        func=chain_bg.invoke,
        description="As a Data Analyst, please compose a comprehensive data report addressing the user's inquiry.",
    ),
]

agent = ZeroShotAgent.from_llm_and_tools(
    llm=llm,
    tools=tools,
    prefix=template,
)
agent_executor = AgentExecutor(
    agent=agent, tools=tools, max_iterations=150,
    handle_parsing_errors=True, early_stopping_method="generate", verbose=True,
)
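A ZeroShotAgent drives its tools through text: at each step the model writes an `Action:` line naming a tool and an `Action Input:` line, or a `Final Answer:`. The sketch below shows, in simplified form, how such output can be parsed and dispatched to a tool registry. It is an illustration under assumptions, not LangChain's actual parser; `parse_step`, `run_step`, and the lambda tool are hypothetical.

```python
import re

# Matches "Action: <tool>" followed by "Action Input: <input>"
ACTION_RE = re.compile(
    r"Action\s*:\s*(.*?)\s*\n\s*Action\s+Input\s*:\s*(.*)", re.DOTALL
)


def parse_step(text):
    """Classify one model reply as a final answer, a tool call, or a parse error."""
    if "Final Answer:" in text:
        return {"kind": "finish", "output": text.split("Final Answer:", 1)[1].strip()}
    match = ACTION_RE.search(text)
    if match is None:
        return {"kind": "error", "output": "Could not parse LLM output"}
    return {
        "kind": "call",
        "tool": match.group(1).strip(),
        "tool_input": match.group(2).strip(),
    }


def run_step(text, tools):
    """Run the tool named in the reply, or return the final answer / error text."""
    parsed = parse_step(text)
    if parsed["kind"] != "call":
        return parsed["output"]
    tool = tools.get(parsed["tool"])
    if tool is None:
        return f"{parsed['tool']} is not a valid tool, try one of {sorted(tools)}"
    return tool(parsed["tool_input"])


# Hypothetical registry with one stub tool
demo_tools = {"Search_for_data_standard": lambda query: "standard for: " + query}
reply = (
    "Thought: I need the standard first.\n"
    "Action: Search_for_data_standard\n"
    "Action Input: weight standards for pigs"
)
observation = run_step(reply, demo_tools)
```

Returning the error or "not a valid tool" text as an observation, instead of raising, lets the model see the problem and retry on the next iteration, which is the same idea as setting handle_parsing_errors=True on the executor.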

2. Option 2: use LangChain's plan-and-execute module

from langchain_experimental.plan_and_execute import PlanAndExecute, load_agent_executor, load_chat_planner

planner = load_chat_planner(llm,template)
executor = load_agent_executor(llm, tools, verbose=True)

agent = PlanAndExecute(planner=planner, executor=executor, verbose=True)
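The plan-and-execute pattern separates two phases: a planner turns the question into an ordered list of steps, and an executor then runs the steps one by one, with earlier results available to later steps. A minimal sketch of that control flow follows. The `Step` class here is a local stand-in, and `parse_plan`, `run_plan`, and `stub_execute` are hypothetical; a real executor would call an agent with tools instead of the stub.

```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    value: str


def parse_plan(plan_text):
    """Split a numbered plan ("1. ...", "2. ...") into Step objects."""
    parts = re.split(r"\n\s*\d+\.\s+", "\n" + plan_text.strip())
    # parts[0] is any text before the first numbered item
    return [Step(value=part.strip()) for part in parts[1:] if part.strip()]


def run_plan(steps, execute):
    """Run steps in order; each call sees the (step, result) pairs so far."""
    history = []
    for step in steps:
        result = execute(step.value, list(history))
        history.append((step.value, result))
    return history


def stub_execute(step_text, history):
    # Hypothetical executor: a real one would invoke an agent with tools
    return f"done ({len(history)} earlier results available)"


plan_text = (
    "Plan:\n"
    "1. Search the weight standard\n"
    "2. Analyze the data against the standard\n"
    "3. Write the report"
)
steps = parse_plan(plan_text)
history = run_plan(steps, stub_execute)
```

Note that a splitter of this kind attaches any free text that follows the last numbered item to that last step, so the planner's output format matters.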

V. Results

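Task 1:
Retrieve the data standard (the standard weight range for live pigs) from the private standards knowledge base, analyze the outliers in the user's data against this standard, generate a table of abnormal data, and write a data quality analysis report based on the data standard and the abnormal data.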

agent_executor.invoke({'input':'''Obtain the weight standards for pigs in the market; 
According to this standard, Identify abnormal weight values in data named pig_market_data from various provinces and cities across the country that exceed the slaughter standard weight range,
And generate a csv containing the identified outliers,above all, 
Write a data quality analysis report.'''})

① Task planning:

steps=[Step(value='Use tool "Search_for_data_standard" to search and generate weight standards for pigs in the market; ')
Step(value='According to the weight standards, use tool "Analysis_for_pig_market_data" to analyze the data named pig_market_data from various provinces and cities across the country, identifying abnormal weight values that exceed the slaughter standard weight range; ')
Step(value='Generate a CSV file containing the identified outliers;')
Step(value='Write a data quality analysis report based on the weight standards, data analysis results, and identified outliers.\n</logic>\n\n<task1>Use tool "Search_for_data_standard" to search and generate weight standards for pigs in the market.\ntask1 output: The weight standards for pigs in the market should be between 100-150 kg.\n\n<task2>Use tool "Analysis_for_pig_market_data" to analyze the data named pig_market_data from various provinces and cities across the country.\ntask2 input: The weight standards for pigs in the market should be between 100-150 kg. Analyze data to identify abnormal weight values that exceed the slaughter standard weight range.\ntask2 output: \n```python\nimport pandas as pd\ndf[df[\'weight\'].lt(100) | df[\'weight\'].gt(150)].to_csv("abnormal_weight_values.csv", index=False)\n```\n\n<task3>Write a report using tool "Tool_for_Data_analysis_report" based on weight standards, data analysis results, and identified outliers.\ntask3 input: Weight standards for pigs in the market, data analysis results, identified outliers in the data.\ntask3 output: The data quality analysis report includes information on the weight standards, analysis of abnormal weight values exceeding the slaughter standard weight range, and recommendations for data quality improvement.\n')]

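② Run the first agent: knowledge retrieval and generation. The standard slaughter weight for live pigs retrieved and generated from the private knowledge base is 70 to 200 kg.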

③ Run the second agent: data quality analysis. Using the standard slaughter weight range [70, 200], it queries for outliers and generates an abnormal-data file.

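④ Run the third agent: data quality report writing. It writes the report following the provided data analysis report template.

Task 2:
Retrieve and generate the data quality analysis process from the private knowledge base, analyze the outliers in the user's data following this process, generate a table of abnormal data, and write a data quality analysis report based on the data standard and the abnormal data.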

agent_executor.invoke({'input':'Obtain the data quality analysis process for pigs in the market; According to this process, Identify abnormal weight values in data named pig_market_data from various provinces and cities across the country that exceed the slaughter standard weight range,And generate a csv containing the identified outliers,above all, Write a data quality analysis report.'})

① Task planning

I need to first search for the data quality analysis process for pigs in the market to understand the standards and guidelines. Then, I should analyze the pig_market_data to identify abnormal weight values that exceed the slaughter standard weight range. Finally, I need to generate a csv file with the identified outliers and write a data quality analysis report.


② Run the first agent: retrieve and generate the data analysis process

Observation: {'input': 'Data quality analysis process for pigs in the market', 'output': 'The data quality analysis process for pigs in the market involves querying abnormal data, using methods like Z-score and GNN to discover anomalies, and generating a data report to evaluate data quality in terms of integrity, consistency, timeliness, and accuracy.'}


③ Run the second agent: following the retrieved data quality analysis process, it analyzes data quality using the Z-score algorithm and generates the abnormal-data file.
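For reference, Z-score outlier detection flags values that lie more than a chosen number of standard deviations from the mean. The sketch below uses only the standard library. The function name `zscore_outliers`, the threshold values, and the sample weights are assumptions for illustration, not the code the agent generated in this run.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for values with |z| above `threshold`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    flagged = []
    for index, value in enumerate(values):
        z = (value - mean) / std
        if abs(z) > threshold:
            flagged.append((index, value, z))
    return flagged


# Hypothetical slaughter weights in kg, with one obviously wrong entry
weights = [110, 115, 120, 118, 112, 116, 119, 114, 117, 113, 400]
flagged = zscore_outliers(weights, threshold=2.0)
```

Unlike the fixed range used in Task 1, this method needs no external standard, but on small samples a single extreme value inflates the standard deviation, so the threshold has to be chosen with the sample size in mind.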

④ Run the third agent: data quality analysis report

VI. Discussion

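This test provides initial validation that embedding LLM-built Data Multi-Agents in a data platform for data governance is feasible: multiple agents worked together, and data quality was analyzed automatically based on a private knowledge base. The following issues surfaced during testing and still need consideration:
1. Model selection: in this exercise the LLM at the core of each agent was tested; in principle, each agent can use a different LLM as its core. Based on the tests, a suitable model can be chosen for each agent's purpose. For example, the data analysis (Python code) agent can use a model strong at code generation, and the Plan-and-Execute Agent can use a model strong at task planning. Some domains may even call for fine-tuning a dedicated model.
2. The knowledge retrieval agent may need more recent RAG techniques so that it can retrieve and generate more accurate knowledge chunks from a large volume of data standards.
3. For the data quality analysis agent to handle more complex scenarios, such as multiple tables, multiple databases, and relationships between tables, a repository of basic data-resource information is needed (metadata, data lineage, table relationships, and so on), so that the LLM can work more comprehensively and accurately.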
