😄 Large language models (LLMs) can answer many different kinds of questions. But an LLM's knowledge comes from its training data: it has neither the user's own information (personal data, a company's proprietary data) nor information about recent events (articles or news published after the model was trained). This limits the answers it can give.
Therefore, building on the personal-data access capabilities that LangChain provides, we can show developers how to use LangChain to build LLM applications that access a user's personal data and offer personalized services.
This post focuses on: document loading.
- A user's personal data can come in many forms: PDF documents, videos, web pages, and so on.
- To give an LLM access to a user's personal data through LangChain, we first have to load and process this diverse, unstructured data.
- In this post we cover how to load documents (PDFs, videos, web pages, etc.), which is the first step toward accessing personal data.
Table of contents
- 0. Initializing the OpenAI environment
- 1. PDF documents
- 2. YouTube audio
- 3. Web documents
- 4. Notion documents
0. Initializing the OpenAI environment
from langchain.chat_models import ChatOpenAI
import os
import openai
# To run this API setup, replace api_key in this directory's .env file with your own
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read the local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
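Before making any API calls, it can be worth checking that the key was actually picked up from the environment, since a missing `.env` entry otherwise surfaces later as a confusing authentication error. A minimal sketch; the helper name `require_api_key` is our own convention, not part of any library:

```python
import os

def require_api_key(var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing with a clear message if it is missing."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set -- check your .env file")
    return key

# Would slot into the setup above as:
# openai.api_key = require_api_key()
```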
1. PDF documents
# Install the pypdf dependency
!pip install -q pypdf
Load the PDF with LangChain's PyPDFLoader:
- Create a PyPDFLoader instance, passing the path of the PDF document to load.
- As an example we use the transcript of Andrew Ng's 2009 machine learning course, available at: https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("./data/MachineLearning-Lecture01.pdf")
# Call the PyPDFLoader's load method to load the PDF file
pages = loader.load()
print(len(pages)) # prints 22 -- exactly the PDF's page count; the document is chunked by page
pages
22
[Document(page_content='MachineLearning-Lecture01 \nInstructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \nof the class, and then we\'ll start to talk a bit about machine learning. \nBy way of introduction, my name\'s Andrew Ng and I\'ll be instru ctor for this class. And so \nI personally work in machine learning, and I\' ve worked on it for about 15 years now, and \nI actually think that machine learning is th e most exciting field of all the computer \nsciences. So I\'m actually always excited about teaching this class. Sometimes I actually \nthink that machine learning is not only the most exciting thin g in computer science, but \nthe most exciting thing in all of human e ndeavor, so maybe a little bias there. \nI also want to introduce the TAs, who are all graduate students doing research in or \nrelated to the machine learni ng and all aspects of machin e learning. Paul Baumstarck \nworks in machine learning and computer vision. Catie Chang is actually a neuroscientist \nwho applies machine learning algorithms to try to understand the human brain. Tom Do \nis another PhD student, works in computa tional biology and in sort of the basic \nfundamentals of human learning. Zico Kolter is the head TA — he\'s head TA two years \nin a row now — works in machine learning a nd applies them to a bunch of robots. And \nDaniel Ramage is — I guess he\'s not here — Daniel applies l earning algorithms to \nproblems in natural language processing. \nSo you\'ll get to know the TAs and me much be tter throughout this quarter, but just from \nthe sorts of things the TA\'s do, I hope you can already tell that machine learning is a \nhighly interdisciplinary topic in which just the TAs find l earning algorithms to problems \nin computer vision and biology and robots a nd language. 
And machine learning is one of \nthose things that has and is having a large impact on many applications. \nSo just in my own daily work, I actually frequently end up talking to people like \nhelicopter pilots to biologists to people in computer systems or databases to economists \nand sort of also an unending stream of people from industry coming to Stanford \ninterested in applying machine learni ng methods to their own problems. \nSo yeah, this is fun. A couple of weeks ago, a student actually forwar ded to me an article \nin "Computer World" about the 12 IT skills th at employers can\'t say no to. So it\'s about \nsort of the 12 most desirabl e skills in all of IT and all of information technology, and \ntopping the list was actually machine lear ning. So I think this is a good time to be \nlearning this stuff and learning algorithms and having a large impact on many segments \nof science and industry. \nI\'m actually curious about something. Learni ng algorithms is one of the things that \ntouches many areas of science and industrie s, and I\'m just kind of curious. How many \npeople here are computer science majors, are in the computer science department? Okay. \nAbout half of you. How many people are from EE? Oh, okay, maybe about a fifth. How ', metadata={'source': './data/MachineLearning-Lecture01.pdf', 'page': 0}),
Document(page_content="many biologers are there here? Wow, just a few, not many. I'm surprised. Anyone from \nstatistics? Okay, a few. So where are the rest of you from? \nStudent : iCME. \nInstructor (Andrew Ng) : Say again? \nStudent : iCME. \nInstructor (Andrew Ng) : iCME. Cool. \nStudent : [Inaudible]. \nInstructor (Andrew Ng) : Civi and what else? \nStudent : [Inaudible] \nInstructor (Andrew Ng) : Synthesis, [inaudible] systems. Yeah, cool. \nStudent : Chemi. \nInstructor (Andrew Ng) : Chemi. Cool. \nStudent : [Inaudible]. \nInstructor (Andrew Ng) : Aero/astro. Yes, right. Yeah, okay, cool. Anyone else? \nStudent : [Inaudible]. \nInstructor (Andrew Ng) : Pardon? MSNE. All ri ght. Cool. Yeah. \nStudent : [Inaudible]. \nInstructor (Andrew Ng) : Pardon? \nStudent : [Inaudible]. \nInstructor (Andrew Ng) : Endo — \nStudent : [Inaudible]. \nInstructor (Andrew Ng) : Oh, I see, industry. Okay. Cool. Great, great. So as you can \ntell from a cross-section of th is class, I think we're a very diverse audience in this room, \nand that's one of the things that makes this class fun to teach and fun to be in, I think. ", metadata={'source': './data/MachineLearning-Lecture01.pdf', 'page': 1}),
....]
Each Document object has two attributes:
- 1. .page_content
- 2. .metadata
# Pick one page and take a look: page_content
pages[0].page_content
MachineLearning-Lecture01 \nInstructor (Andrew Ng): Okay. Good morn....
# Descriptive data about the document page: metadata
pages[0].metadata
{'source': './data/MachineLearning-Lecture01.pdf', 'page': 0}
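Since PyPDFLoader chunks by page, a common next step is to stitch the pages back into a single string for whole-document processing. The sketch below is self-contained: the `Document` dataclass is a minimal stand-in mimicking the two attributes shown above, and the page texts are shortened placeholders for the real `loader.load()` output.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Minimal stand-in for langchain's Document: the same two attributes as above
    page_content: str
    metadata: dict = field(default_factory=dict)

# Shortened placeholder pages standing in for the loader.load() output shown above
pages = [
    Document("MachineLearning-Lecture01 Instructor (Andrew Ng): Okay. ...",
             {"source": "./data/MachineLearning-Lecture01.pdf", "page": 0}),
    Document("many biologers are there here? ...",
             {"source": "./data/MachineLearning-Lecture01.pdf", "page": 1}),
]

# Stitch all pages into one string, separated by newlines
full_text = "\n".join(p.page_content for p in pages)
print(len(pages), "pages,", len(full_text), "characters")
```

With the real loader the same one-liner works unchanged, because each element of `pages` exposes the same `page_content` attribute.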
2. YouTube audio
In this part, given a YouTube video link, we walk through:
- 1. Using LangChain's loading tools to download the audio of the given YouTube link to local disk
- 2. Using the OpenAIWhisperParser tool to transcribe these audio files into readable text
# Install the dependencies
!pip -q install yt_dlp
!pip -q install pydub
!pip install ffmpeg
!pip install youtube-dl
!pip install ffprobe-python
Build a GenericLoader instance to download the YouTube video's audio to local disk and load it.
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
url="https://www.youtube.com/watch?v=_PHdzsQaDgw"
save_dir="./data/"
# Create a GenericLoader instance
loader = GenericLoader(
    # Download the audio of the YouTube video at url and save it to the local path save_dir
    YoutubeAudioLoader([url], save_dir),
    # Use the OpenAIWhisperParser parser to transcribe the audio into text
    OpenAIWhisperParser()
)
# Call the GenericLoader's load method to load the video's audio file
pages = loader.load()
print(pages)
print("Page_content: ", pages[0].page_content)
print("Meta Data: ", pages[0].metadata)
[Document(page_content='大家好,欢....', metadata={'source': 'data})]
Page_content: 大家好,欢迎来到我的频道 今天...
Meta Data: {'source': 'data....}
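Whisper transcription goes through OpenAI's paid API, so it can be worth caching the transcript to disk instead of re-transcribing the same video on every run. A minimal sketch; the caching helper and the file path are our own convention, not a LangChain feature:

```python
from pathlib import Path

def cache_transcript(text: str, cache_file: str) -> str:
    """Write the transcript to disk on the first run; later runs read the cached copy."""
    path = Path(cache_file)
    if not path.exists():
        path.write_text(text, encoding="utf-8")
    return path.read_text(encoding="utf-8")

# e.g. cache_transcript(pages[0].page_content, "./data/transcript.txt")
```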
3. Web documents
Taking a markdown document on GitHub as an example, we learn how to load it.
Build a WebBaseLoader instance to load the web page.
from langchain.document_loaders import WebBaseLoader
# Create a WebBaseLoader instance
url = "https://github.com/datawhalechina/d2l-ai-solutions-manual/blob/master/docs/README.md"
header = {'User-Agent': 'python-requests/2.27.1',
'Accept-Encoding': 'gzip, deflate, br',
'Accept': '*/*',
'Connection': 'keep-alive'}
loader = WebBaseLoader(web_path=url,header_template=header)
# Call the WebBaseLoader's load method to load the page
pages = loader.load()
print("Type of pages: ", type(pages))
print("Length of pages: ", len(pages))
page = pages[0]
print("Type of page: ", type(page))
print("Page_content: ", page.page_content)
print("Meta Data: ", page.metadata)
Because this is a web page, the data is quite messy. In general, data like this needs further post-processing.
import json
convert_to_json = json.loads(page.page_content)
extracted_markdown = convert_to_json['payload']['blob']['richText'] # extract the markdown content
print(extracted_markdown)
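The `['payload']['blob']['richText']` key path above is specific to how GitHub's blob page currently embeds its content in JSON, and it may change without notice. The extraction pattern itself, with a guard that fails loudly if the structure changes, can be sketched against a miniature stand-in payload:

```python
import json

# Miniature stand-in for the JSON that GitHub embeds in the page (the real payload is far larger)
raw = json.dumps({"payload": {"blob": {"richText": "# README\nSome markdown content"}}})

data = json.loads(raw)
# Walk the nested keys defensively so a changed page structure raises a clear error
markdown = data.get("payload", {}).get("blob", {}).get("richText")
if markdown is None:
    raise ValueError("page structure changed: richText not found")
print(markdown.splitlines()[0]) # -> # README
```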
4. Notion documents
- Open the example Notion document (https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f) and click the Duplicate button in the top right to copy it into your own Notion workspace.
- Click the ⋯ button in the top right and choose Export as Markdown & CSV. The export comes as a zip archive.
- Unzip it and save the markdown documents to the local path ./data/Notion_DB/ (matching the path used in the code below).
Use NotionDirectoryLoader to load the Notion markdown documents.
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("./data/Notion_DB")
pages = loader.load()
print("Length of pages: ", len(pages))
page = pages[0]
print("Type of page: ", type(page))
print("Page_content: ", page.page_content)
print("Meta Data: ", page.metadata)