完整地实现了推荐系统的构建、实验和评估过程,为不同推荐算法在同一数据集上的性能比较提供了可重复实验的框架

{
   
 "cells": [
  {
   
   "cell_type": "markdown",
   "metadata": {
   },
   "source": [
    "# 基于用户的协同过滤算法"
   ]
  },
  {
   
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
   },
   "outputs": [],
   "source": [
    "# 导入包\n",
    "import random\n",
    "import math\n",
    "import time\n",
    "from tqdm import tqdm"
   ]
  },
  {
   
   "cell_type": "markdown",
   "metadata": {
   },
   "source": [
    "## 一. 通用函数定义"
   ]
  },
  {
   
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
   },
   "outputs": [],
   "source": [
    "# 定义装饰器,监控运行时间\n",
    "def timmer(func):\n",
    "    def wrapper(*args, **kwargs):\n",
    "        start_time = time.time()\n",
    "        res = func(*args, **kwargs)\n",
    "        stop_time = time.time()\n",
    "        print('Func %s, run time: %s' % (func.__name__, stop_time - start_time))\n",
    "        return res\n",
    "    return wrapper"
   ]
  },
  {
   
   "cell_type": "markdown",
   "metadata": {
   },
   "source": [
    "### 1. 数据处理相关\n",
    "1. load data\n",
    "2. split data"
   ]
  },
  {
   
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
   },
   "outputs": [],
   "source": [
    "class Dataset():\n",
    "    \n",
    "    def __init__(self, fp):\n",
    "        # fp: data file path\n",
    "        self.data = self.loadData(fp)\n",
    "    \n",
    "    @timmer\n",
    "    def loadData(self, fp):\n",
    "        data = []\n",
    "        for l in open(fp):\n",
    "            data.append(tuple(map(int, l.strip().split('::')[:2])))\n",
    "        return data\n",
    "    \n",
    "    @timmer\n",
    "    def splitData(self, M, k, seed=1):\n",
    "        '''\n",
    "        :params: data, 加载的所有(user, item)数据条目\n",
    "        :params: M, 划分的数目,最后需要取M折的平均\n",
    "        :params: k, 本次是第几次划分,k~[0, M)\n",
    "        :params: seed, random的种子数,对于不同的k应设置成一样的\n",
    "        :return: train, test\n",
    "        '''\n",
    "        train, test = [], []\n",
    "        random.seed(seed)\n",
    "        for user, item in self.data:\n",
    "            # 这里与书中的不一致,本人认为取M-1较为合理,因randint是左右都覆盖的\n",
    "            if random.randint(0, M-1) == k:  \n",
    "                test.append((user, item))\n",
    "            else:\n",
    "                train.append((user, item))\n",
    "\n",
    "        # 处理成字典的形式,user->set(items)\n",
    "        def convert_dict(data):\n",
    "            data_dict = {}\n",
    "            for user, item in data:\n",
    "                if user not in data_dict:\n",
    "                    data_dict[user] = set()\n",
    "                data_dict[user].add(item)\n",
    "            data_dict = {k: list(data_dict[k]) for k in data_dict}\n",
    "            return data_dict\n",
    "\n",
    "        return convert_dict(train), convert_dict(test)"
   ]
  },
  {
   
   "cell_type": "markdown",
   "metadata": {
   },
   "source": [
    "### 2. 评价指标\n",
    "1. Precision\n",
    "2. Recall\n",
    "3. Coverage\n",
    "4. Popularity(Novelty)"
   ]
  },
  {
   
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
   },
   "outputs": [],
   "source": [
    "class Metric():\n",
    "    \n",
    "    def __init__(self, train, test, GetRecommendation):\n",
    "        '''\n",
    "        :params: train, 训练数据\n",
    "        :params: test, 测试数据\n",
    "        :params: GetRecommendation, 为某个用户获取推荐物品的接口函数\n",
    "        '''\n",
    "        self.train = train\n",
    "        self.test = test\n",
    "        self.GetRecommendation = GetRecommendation\n",
    "        self.recs = self.getRec()\n",
    "        \n",
    "    # 为test中的每个用户进行推荐\n",
    "    def getRec(self):\n",
    "        recs = {}\n",
    "        for user in self.test:\n",
    "            rank = self.GetRecommendation(user)\n",
    "            recs[user] = rank\n",
    "        return recs\n",
    "        \n",
    "    # 定义精确率指标计算方式\n",
    "    def precision(self):\n",
    "        all, hit = 0, 0\n",
    "        for user in self.test:\n",
    "            test_items = set(self.test[user])\n",
    "            rank = self.recs[user]\n",
    "            for item, score in rank:\n",
    "                if item in test_items:\n",
    "                    hit += 1\n",
    "            all += len(rank)\n",
    "        return round(hit / all * 100, 2)\n",
    "    \n",
    "    # 定义召回率指标计算方式\n",
    "    def recall(self):\n",
    "        all, hit = 0, 0\n",
    "        for user in self.test:\n",
    "            test_items = set(self.test[user])\n",
    "            rank = self.recs[user]\n",
    "            for item, score in rank:\n",
    "                if item in test_items:\n",
    "                    hit += 1\n",
    "            all += len(test_items)\n",
    "        return round(hit / all * 100, 2)\n",
    "    \n",
    "    # 定义覆盖率指标计算方式\n",
    "    def coverage(self):\n",
    "        all_item, recom_item = set(), set()\n",
    "        for user in self.test:\n",
    "            for item in self.train[user]:\n",
    "                all_item.add(item)\n",
    "            rank = self.recs[user]\n",
    "            for item, score in rank:\n",
    "                recom_item.add(item)\n",
    "        return round(len(recom_item) / len(all_item) * 100, 2)\n",
    "    \n",
    "    # 定义新颖度指标计算方式\n",
    "    def popularity(self):\n",
    "        # 计算物品的流行度\n",
    "        item_pop = {}\n",
    "        for user in self.train:\n",
    "            for item in self.train[user]:\n",
    "                if item not in item_pop:\n",
    "                    item_pop[item] = 0\n",
    "                item_pop[item] += 1\n",
    "\n",
    "        num, pop = 0, 0\n",
    "        for user in self.test:\n",
    "            rank = self.recs[user]\n",
    "            for item, score in rank:\n",
    "                # 取对数,防止因长尾问题带来的被流行物品所主导\n",
    "                pop += math.log(1 + item_pop[item])\n",
    "                num += 1\n",
    "        return round(pop / num, 6)\n",
    "    \n",
    "    def eval(self):\n",
    "        metric = {'Precision': self.precision(),\n",
    "                  'Recall': self.recall(),\n",
    "                  'Coverage': self.coverage(),\n",
    "                  'Popularity': self.popularity()}\n",
    "        print('Metric:', metric)\n",
    "        return metric"
   ]
  },
  {
   
   "cell_type": "markdown",
   "metadata": {
   },
   "source": [
    "## 二. 算法实现\n",
    "1. Random\n",
    "2. MostPopular\n",
    "3. UserCF\n",
    "4. UserIIF"
   ]
  },
  {
   
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
   },
   "outputs": [],
   "source": [
    "# 1. 随机推荐\n",
    "def Random(train, K, N):\n",
    "    '''\n",
    "    :params: train, 训练数据集\n",
    "    :params: K, 可忽略\n",
    "    :params: N, 超参数,设置取TopN推荐物品数目\n",
    "    :return: GetRecommendation,推荐接口函数\n",
    "    '''\n",
    "    items = {}\n",
    "    for user in train:\n",
    "        for item in train[user]:\n",
    "            items[item] = 1\n",
    "    \n",
    "    def GetRecommendation(user):\n",
    "        # 随机推荐N个未见过的\n",
    "        user_items = set(train[user])\n",
    "        rec_items = {k: items[k] for k in items if k not in user_items}\n",
    "        rec_items = list(rec_items.items())\n",
    "        random.shuffle(rec_items)\n",
    "        return rec_items[:N]\n",
    "    \n",
    "    return GetRecommendation"
   ]
  },
  {
   
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
   },
   "outputs": [],
   "source": [
    "# 2. 热门推荐\n",
    "def MostPopular(train, K, N):\n",
    "    '''\n",
    "    :params: train, 训练数据集\n",
    "    :params: K, 可忽略\n",
    "    :params: N, 超参数,设置取TopN推荐物品数目\n",
    "    :return: GetRecommendation, 推荐接口函数\n",
    "    '''\n",
    "    items = {}\n",
    "    for user in train:\n",
    "        for item in train[user]:\n",
    "            if item not in items:\n",
    "                items[item] = 0\n",
    "            items[item] += 1\n",
    "        \n",
    "    def GetRecommendation(user):\n",
    "        # 随机推荐N个没见过的最热门的\n",
    "        user_items = set(train[user])\n",
    "        rec_items = {k: items[k] for k in items if k not in user_items}\n",
    "        rec_items = list(sorted(rec_items.items(), key=lambda x: x[1], reverse=True))\n",
    "        return rec_items[:N]\n",
    "    \n",
    "    return GetRecommendation"
   ]
  },
  {
   
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
   },
   "outputs": [],
   "source": [
    "# 3. 基于用户余弦相似度的推荐\n",
    "def UserCF(train, K, N):\n",
    "    '''\n",
    "    :params: train, 训练数据集\n",
    "    :params: K, 超参数,设置取TopK相似用户数目\n",
    "    :params: N, 超参数,设置取TopN推荐物品数目\n",
    "    :return: GetRecommendation, 推荐接口函数\n",
    "    '''\n",
    "    # 计算item->user的倒排索引\n",
    "    item_users = {}\n",
    "    for user in train:\n",
    "        for item in train[user]:\n",
    "            if item not in item_users:\n",
    "                item_users[item] = []\n",
    "            item_users[item].append(user)\n",
    "    \n",
    "    # 计算用户相似度矩阵\n",
    "    sim = {}\n",
    "    num = {}\n",
    "    for item in item_users:\n",
    "        users = item_users[item]\n",
    "        for i in range(len(users)):\n",
    "            u = users[i]\n",
    "            if u not in num:\n",
    "                num[u] = 0\n",
    "            num[u] += 1\n",
    "            if u not in sim:\n",
    "                sim[u] = {}\n",
    "            for j in range(len(users)):\n",
    "                if j == i: continue\n",
    "                v = users[j]\n",
    "                if v not in sim[u]:\n",
    "                    sim[u][v] = 0\n",
    "                sim[u][v] += 1\n",
    "    for u in sim:\n",
    "        for v in sim[u]:\n",
    "            sim[u][v] /= math.sqrt(num[u] * num[v])\n",
    "    \n",
    "    # 按照相似度排序\n",
    "    sorted_user_sim = {k: list(sorted(v.items(), \\\n",
    "                               key=lambda x: x[1], reverse=True)) \\\n",
    "                       for k, v in sim.items()}\n",
    "    \n",
    "    # 获取接口函数\n",
    "    def GetRecommendation(user):\n",
    "        items = {}\n",
    "        seen_items = set(train[user])\n",
    "        for u, _ in sorted_user_sim[user][:K]:\n",
    "            for item in train[u]:\n",
    "                # 要去掉用户见过的\n",
    "                if item not in seen_items:\n",
    "                    if item not in items:\n",
    "                        items[item] = 0\n",
    "                    items[item] += sim[user][u]\n",
    "        recs = list(sorted(items.items(), key=lambda x: x[1], reverse=True))[:N]\n",
    "        return recs\n",
    "    \n",
    "    return GetRecommendation"
   ]
  },
  {
   
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
   },
   "outputs": [],
   "source": [
    "# 4. 基于改进的用户余弦相似度的推荐\n",
    "def UserIIF(train, K, N):\n",
    "    '''\n",
    "    :params: train, 训练数据集\n",
    "    :params: K, 超参数,设置取TopK相似用户数目\n",
    "    :params: N, 超参数,设置取TopN推荐物品数目\n",
    "    :return: GetRecommendation, 推荐接口函数\n",
    "    '''\n",
    "    # 计算item->user的倒排索引\n",
    "    item_users = {}\n",
    "    for user in train:\n",
    "        for item in train[user]:\n",
    "            if item not in item_users:\n",
    "                item_users[item] = []\n",
    "            item_users[item].append(user)\n",
    "    \n",
    "    # 计算用户相似度矩阵\n",
    "    sim = {}\n",
    "    num = {}\n",
    "    for item in item_users:\n",
    "        users = item_users[item]\n",
    "        for i in range(len(users)):\n",
    "            u = users[i]\n",
    "            if u not in num:\n",
    "                num[u] = 0\n",
    "            num[u] += 1\n",
    "            if u not in sim:\n",
    "                sim[u] = {}\n",
    "            for j in range(len(users)):\n",
    "                if j == i: continue\n",
    "                v = users[j]\n",
    "                if v not in sim[u]:\n",
    "                    sim[u][v] = 0\n",
    "                # 相比UserCF,主要是改进了这里\n",
    "                sim[u][v] += 1 / math.log(1 + len(users))\n",
    "    for u in sim:\n",
    "        for v in sim[u]:\n",
    "            sim[u][v] /&#

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/962351.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

sizeof和strlen的对比与一些杂记

1.sizeof和strlen的对比 1.1sizeof (1)sizeof是一种操作符 (2)sizeof计算的是类型或变量所占空间的大小,单位是字节 注意事项: (1)sizeof 返回的值类型是 size_t,这是一…

书生大模型实战营6

文章目录 L1——基础岛玩转书生「多模态对话」与「AI搜索」产品MindSearch 开源的 AI 搜索引擎书生浦语 InternLM 开源模型官方的对话类产品书生万象 InternVL 开源的视觉语言模型官方的对话产品在知乎上的提交 L1——基础岛 玩转书生「多模态对话」与「AI搜索」产品 MindSea…

three.js+WebGL踩坑经验合集(6.1):负缩放,负定矩阵和行列式的关系(2D版本)

春节忙完一轮,总算可以继续来写博客了。希望在春节假期结束之前能多更新几篇。 这一篇会偏理论多一点。笔者本没打算在这一系列里面重点讲理论,所以像相机矩阵推导这种网上已经很多优质文章的内容,笔者就一笔带过。 然而关于负缩放&#xf…

Baklib解析内容中台与人工智能技术带来的价值与机遇

内容概要 在数字化转型的浪潮中,内容中台与人工智能技术的结合为企业提供了前所未有的发展机遇。内容中台作为一种新的内容管理和生产模式,通过统一管理和协调各种内容资源,帮助企业更高效地整合内外部数据。而人工智能技术则以其强大的数据…

Learning Vue 读书笔记 Chapter 4

4.1 Vue中的嵌套组件和数据流 我们将嵌套的组件称为子组件,而包含它们的组件则称为它们的父组件。 父组件可以通过 props向子组件传递数据,而子组件则可以通过自定义事件(emits)向父组件发送事件。 4.1.1 使用Props向子组件传递…

小程序电商运营内容真实性增强策略及开源链动2+1模式AI智能名片S2B2C商城系统源码的应用探索

摘要:随着互联网技术的不断发展,小程序电商已成为现代商业的重要组成部分。然而,如何在竞争激烈的市场中增强小程序内容的真实性,提高用户信任度,成为电商运营者面临的一大挑战。本文首先探讨了通过图片、视频等方式增…

【游戏设计原理】96 - 成就感

成就感是玩家体验的核心,它来自完成一件让自己满意的任务,而这种任务通常需要一定的努力和挑战。游戏设计师的目标是通过合理设计任务,不断为玩家提供成就感,保持他们的参与热情。 ARCS行为模式(注意力、关联性、自信…

Linux系统上安装与配置 MySQL( CentOS 7 )

目录 1. 下载并安装 MySQL 官方 Yum Repository 2. 启动 MySQL 并查看运行状态 3. 找到 root 用户的初始密码 4. 修改 root 用户密码 5. 设置允许远程登录 6. 在云服务器配置 MySQL 端口 7. 关闭防火墙 8. 解决密码错误的问题 前言 在 Linux 服务器上安装并配置 MySQL …

读书笔记-《Redis设计与实现》(一)数据结构与对象(下)

各位朋友新年快乐~ 今天我们来继续学习 Redis 。 01 整数集合 当集合仅包含整数值,并且元素数量不多时,Redis 就会采用整数集合来作为集合键的底层实现。 typedef struct intset {// 编码方式uint32_t encoding;// 元素数量uint32_t length;// 数组in…

IP服务模型

1. IP数据报 IP数据报中除了包含需要传输的数据外,还包括目标终端的IP地址和发送终端的IP地址。 数据报通过网络从一台路由器跳到另一台路由器,一路从IP源地址传递到IP目标地址。每个路由器都包含一个转发表,该表告诉它在匹配到特定目标地址…

上海亚商投顾:沪指冲高回落 大金融板块全天强势 上海亚商投

上海亚商投顾前言:无惧大盘涨跌,解密龙虎榜资金,跟踪一线游资和机构资金动向,识别短期热点和强势个股。 一.市场情绪 市场全天冲高回落,深成指、创业板指午后翻绿。大金融板块全天强势,天茂集团…

数据分析系列--④RapidMiner进行关联分析(案例)

一、核心概念 1.项集(Itemset) 2.规则(Rule) 3.支持度(Support) 3.1 支持度的定义 3.2 支持度的意义 3.3 支持度的应用 3.4 支持度的示例 3.5 支持度的调整 3.6 支持度与其他指标的关系 4.置信度&#xff0…

HTB靶场Adminstrator

文章目录 靶机信息域环境初步信息收集与权限验证FTP 登录尝试SMB 枚举尝试WinRM 登录olivia域用户枚举 获取Michael权限BloodHound 提取域信息GenericAll 获取Benjamin权限ForceChangePasswordftp登录benjamin 获取Emily权限pwsafehashcat 获取Ethan权限获取管理员(Administrat…

C语言指针专题三 -- 指针数组

目录 1. 指针数组的核心原理 2. 指针数组与二维数组的区别 3. 编程实例 4. 常见陷阱与防御 5. 总结 1. 指针数组的核心原理 指针数组是一种特殊数组,其所有元素均为指针类型。每个元素存储一个内存地址,可指向不同类型的数据(通常指向同…

Spring Boot - 数据库集成06 - 集成ElasticSearch

Spring boot 集成 ElasticSearch 文章目录 Spring boot 集成 ElasticSearch一:前置工作1:项目搭建和依赖导入2:客户端连接相关构建3:实体类相关注解配置说明 二:客户端client相关操作说明1:检索流程1.1&…

深入理解MySQL 的 索引

索引是一种用来快速检索数据的一种结构, 索引使用的好不好关系到对应的数据库性能方面, 这篇文章我们就来详细的介绍一下数据库的索引。 1. 页面的大小: B 树索引是一种 Key-Value 结构,通过 Key 可以快速查找到对应的 Value。B 树索引由根页面(Root&am…

vue之pinia组件的使用

1、搭建pinia环境 cnpm i pinia #安装pinia的组件 cnpm i nanoid #唯一id,相当于uuid cnpm install axios #网络请求组件 2、存储读取数据 存储数据 >> Count.ts文件import {defineStore} from piniaexport const useCountStore defineStore(count,{// a…

【MySQL — 数据库增删改查操作】深入解析MySQL的 Update 和 Delete 操作

1. 测试数据 mysql> select* from exam1; ----------------------------------------- | id | name | Chinese | Math | English | ----------------------------------------- | 1 | 唐三藏 | 67.0 | 98.0 | 56.0 | | 2 | 孙悟空 | 87.0 | 78.…

数据结构与算法之二叉树: LeetCode LCP 10. 二叉树任务调度 (Ts版)

二叉树任务调度 https://leetcode.cn/problems/er-cha-shu-ren-wu-diao-du/description/ 描述 任务调度优化是计算机性能优化的关键任务之一。在任务众多时,不同的调度策略可能会得到不同的总体执行时间,因此寻求一个最优的调度方案是非常有必要的 通…

JAVA 接口、抽象类的关系和用处 详细解析

接口 - Java教程 - 廖雪峰的官方网站 一个 抽象类 如果实现了一个接口,可以只选择实现接口中的 部分方法(所有的方法都要有,可以一部分已经写具体,另一部分继续保留抽象),原因在于: 抽象类本身…