《斯坦福数据挖掘教程·第三版》读书笔记(英文版) Chapter 2 MapReduce and the New Software Stack

来源:《斯坦福数据挖掘教程·第三版》对应的公开英文书和PPT

Chapter 2 MapReduce and the New Software Stack

Computing cluster means large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches.

The software stack begins with a new form of file system, called a “distributed file system,” which features much larger units than the disk blocks in a conventional operating system. Distributed file systems also provide replication of data or redundancy to protect against the frequent media failures that occur when data is distributed over thousands of low-cost compute nodes.

Compute nodes are stored on racks, perhaps 8–64 on a rack. The nodes on a single rack are connected by a network, typically gigabit Ethernet. There can be many racks of compute nodes, and racks are connected by another level of network or a switch. The bandwidth of inter-rack communication is somewhat greater than the inter-rack Ethernet, but given the number of pairs of nodes that might need to communicate between racks, this bandwidth may be essential.

在这里插入图片描述

Some important calculations take minutes or even hours on thousands of compute nodes. If we had to abort and restart the computation every time one component failed, then the computation might never complete successfully.
The solution to this problem takes two forms:

  1. Files must be stored redundantly. If we did not duplicate the file at several compute nodes, then if one node failed, all its files would be unavailable until the node is replaced. If we did not back up the files at all, and the disk crashes, the files would be lost forever.
  2. Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks. This strategy is followed by the MapReduce programming system.

在这里插入图片描述
在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

Summary of Chapter 2

  • Cluster Computing: A common architecture for very large-scale applications is a cluster of compute nodes (processor chip, main memory, and disk). Compute nodes are mounted in racks, and the nodes on a rack are connected, typically by gigabit Ethernet. Racks are also connected by a high-speed network or switch.
  • Distributed File Systems: An architecture for very large-scale file systems has developed recently. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times, on different compute nodes or racks.
  • MapReduce: This programming system allows one to exploit parallelism inherent in cluster computing, and manages the hardware failures that can occur during a long computation on many nodes. Many Map tasks and many Reduce tasks are managed by a Master process. Tasks on a failed compute node are rerun by the Master.
  • The Map Function: This function is written by the user. It takes a collection of input objects and turns each into zero or more key-value pairs. Keys are not necessarily unique.
  • The Reduce Function: A MapReduce programming system sorts all the key-value pairs produced by all the Map tasks, forms all the values associated with a given key into a list and distributes key-list pairs to Reduce tasks. Each Reduce task combines the elements on each list, by applying the function written by the user. The results produced by all the Reduce tasks form the output of the MapReduce process.
  • Reducers: It is often convenient to refer to the application of the Reduce function to a single key and its associated value list as a “reducer.”
  • Hadoop: This programming system is an open-source implementation of a distributed file system (HDFS, the Hadoop Distributed File System) and MapReduce (Hadoop itself). It is available through the Apache Foundation.
  • Managing Compute-Node Failures: MapReduce systems support restart of tasks that fail because their compute node, or the rack containing that node, fail. Because Map and Reduce tasks deliver their output only after they finish (the blocking property), it is possible to restart a failed task without concern for possible repetition of the effects of that task. It is necessary to restart the entire job only if the node at which the Master
    executes fails.
  • Applications of MapReduce: While not all parallel algorithms are suitable for implementation in the MapReduce framework, there are simple implementations of matrix-vector and matrix-matrix multiplication. Also, the principal operators of relational algebra are easily implemented in MapReduce.
  • Workflow Systems: MapReduce has been generalized to systems that support any acyclic collection of functions, each of which can be instantiated by any number of tasks, each responsible for executing that function on a portion of the data.
  • Spark: This popular workflow system introduces Resilient, Distributed Datasets (RDD’s) and a language in which many common operations on RDD’s can be written. Spark has a number of efficiencies, including lazy evaluation of RDD’s to avoid secondary storage of intermediate results and the recording of lineage for RDD’s so they can be reconstructed as needed.
  • TensorFlow: This workflow system is specifically designed to support machine-learning. Data is represented as multidimensional arrays, or tensors, and built-in operations perform many powerful operations, such as linear algebra and model training.
  • Recursive Workflows: When implementing a recursive collection of functions, it is not always possible to preserve the ability to restart any failed task, because recursive tasks may have produced output that was consumed by another task before the failure. A number of schemes for checkpointing parts of the computation to allow restart of single tasks, or restart all tasks from a recent point, have been proposed.
  • Communication-Cost: Many applications of MapReduce or similar systems do very simple things for each task. Then, the dominant cost is usually the cost of transporting data from where it is created to where it is used. In these cases, efficiency of a MapReduce algorithm can be estimated by calculating the sum of the sizes of the inputs to all the tasks.
  • Multiway Joins: It is sometimes more efficient to replicate tuples of the relations involved in a join and have the join of three or more relations computed as a single MapReduce job. The technique of Lagrangean multipliers can be used to optimize the degree of replication for each of the participating relations.
  • Star Joins: Analytic queries often involve a very large fact table joined with smaller dimension tables. These joins can always be done efficiently by the multiway-join technique. An alternative is to distribute the fact table and replicate the dimension tables permanently, using the same strategy as would be used if we were taking the multiway join of the fact table and every dimension table.
  • Replication Rate and Reducer Size: It is often convenient to measure communication by the replication rate, which is the communication per input. Also, the reducer size is the maximum number of inputs associated with any reducer. For many problems, it is possible to derive a lower bound on replication rate as a function of the reducer size.
  • Representing Problems as Graphs: It is possible to represent many problems that are amenable to MapReduce computation by a graph in which nodes represent inputs and outputs. An output is connected to all the inputs that are needed to compute that output.
  • Mapping Schemas: Given the graph of a problem, and given a reducer size, a mapping schema is an assignment of the inputs to one or more reducers so that no reducer is assigned more inputs than the reducer size permits, and yet for every output there is some reducer that gets all the inputs needed to compute that output. The requirement that there be a mapping schema for any MapReduce algorithm is a good expression of what makes MapReduce algorithms different from general parallel computations.
  • Matrix Multiplication by MapReduce: There is a family of one-pass MapReduce algorithms that performs multiplication of n × n matrices with the minimum possible replication rate r = 2 n 2 q r =\frac {2n^2}q r=q2n2, where q is the reducer size. On the other hand, a two-pass MapReduce algorithm for the same problem with the same reducer size can use up to a factor of n less communication.

END

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/15583.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

centos8 mysql 主从复制

♥️作者:小刘在C站 ♥️个人主页:小刘主页 ♥️每天分享云计算网络运维课堂笔记,努力不一定有收获,但一定会有收获加油!一起努力,共赴美好人生! ♥️夕阳下,是最美的绽放,树高千尺,落叶归根人生不易,人间真情 目录 Linux centos8

使用D435i深度相机运行ORB-SLAM3

下载安装链接 下载ORB-SLAM3地址: git clone https://github.com/UZ-SLAMLab/ORB_SLAM3.git eigen3多版本安装:https://blog.csdn.net/weixin_41756645/article/details/129570141 ORB-SLAM2中eigen3版本为:3.2.10版本 ORB-SLAM3中eigen3版…

【分布式】一致性哈希和哈希槽

当我们拥有了多台存储服务器之后,现在有多个key,希望可以将这些个key均匀的缓存到这些服务器上,可以使用哪些方案呢? 1. 普通哈希取模法 1.1 直接哈希取模 这是一种最容易想到的方法,使用取模算法hash(k…

AI绘图实战(七):室内设计线稿渲染、景观设计手绘稿改动、建筑照片转线稿|Stable Diffusion成为设计师生产力工具

S:AI能取代设计师么? I :至少在设计行业,目前AI扮演的主要角色还是超级工具,要顶替?除非甲方对设计效果无所畏惧~~ 预先学习: 安装及其问题解决参考:《Windows安装Stable Diffusion …

javaScript:cropperjs是一款非常强大却又简单的图片裁剪工具

cropperjs是一款非常强大却又简单的图片裁剪工具,它可以进行非常灵活的配置,支持手机端使用,支持包括IE9以上的现代浏览器。(关键是使用方法简单,几行代码就可以搞定) 官方github文档:GitHub -…

流程图拖拽视觉编程-流程编辑器

目录 一、简介 二、流程编辑器-视图实现 三、参考资料 一、简介 前期文章: 流程图拖拽视觉编程--概述_Jason~shen的博客-CSDN博客 本期内容: 本期将介绍流程编辑器模块的实现方法,效果图如下所示。该模块基于QT Graphics/View实现&…

使用FFMPEG库封装264视频和acc音频数据到MP4文件中

准备 ffmepeg 4.4 一段H264的视频文件 一段acc格式的音频文件 封装流程 1.使用avformat_open_input分别打开视频和音频文件,初始化其AVFormatContext,使用avformat_find_stream_info获取编码器基本信息 2.使用avformat_alloc_output_context2初始化…

solidity 安全 如何阻止重入攻击

什么是可重入攻击? 我们使用合约的过程中,经常会遇到这种情况,智能合约能够调用外部的合约;这些外部合约又可以回调到调用他们的智能合约;在这种情况下,我们说智能合约被重新输入,这种情况被称为…

Hive ---- Hive 安装

Hive ---- Hive 安装 1. Hive安装地址2. Hive安装部署1. 安装Hive2. 启动并使用Hive 3. MySQL安装1. 安装MySQL2. 配置MySQL3. 卸载MySQL说明 4. 配置Hive元数据存储到MySQL1. 配置元数据到MySQL2. 验证元数据是否配置成功3. 查看MySQL中的元数据 5. Hive服务部署1. hiveserver…

图像处理:均值滤波算法

目录 前言 概念介绍 基本原理 Opencv实现中值滤波 Python手写实现均值滤波 参考文章 前言 在此之前,我曾在此篇中推导过图像处理:推导五种滤波算法(均值、中值、高斯、双边、引导)。这在此基础上,我想更深入地研…

wvp开发环境搭建

代码下载地址 代码下载地址 https://gitee.com/pan648540858/wvp-GB28181-pro.git 开发工具 采用jetbrain idea 利用开发工具下载代码 文件-新建-来自版本控制的项目 url是上面的代码下载链接,点击克隆即可 下图是已经克隆并打开的代码 安装依赖环境 安装redi…

d2l Transformer

终于到变形金刚了,他的主要特征在于多头自注意力的使用,以及摒弃了rnn的操作。 目录 1.原理 2.多头注意力 3.逐位前馈网络FFN 4.层归一化 5.残差连接 6.Encoder 7.Decoder 8.训练 9.预测 1.原理 主要贡献:1.纯使用attention的Enco…

计算机网络学习03(OSI、TCP/IP网络分层模型详解))

1、OSI 七层模型 OSI 七层模型 是国际标准化组织提出一个网络分层模型,其大体结构以及每一层提供的功能如下图所示: 每一层都专注做一件事情,并且每一层都需要使用下一层提供的功能比如传输层需要使用网络层提供的路由和寻址功能&#xff0…

创建NAT模式KVM虚拟机

创建NAT模式KVM虚拟机 1 添加脚本执行权限(上传脚本文件至root目录)。 首先需要给脚本赋予执行权限。 # chmod x qemu-ifup-NAT 2 启动虚拟机。 通过命令启动虚拟机。(记得安装net-tools) # yum install net-tools -y # qemu-kvm -m 1024 -drive fi…

WSL怎么使用本机进行代理联网

文章目录 WSL怎么使用本机代理进行联网问题来源设置v2rayN设置wsl总结参考 WSL怎么使用本机代理进行联网 问题来源 使用WSL克隆github的代码网速很慢,无响应,导致项目无法下载,真的愁人。就想到为WSL设置xx上网,是否就会好很多。…

超级详细的华为OSPF实验及配置

什么是OSPF? 开放式最短路径优先OSPF(Open Shortest Path First)是IETF组织开发的一个基于链路状态的内部网关协议(Interior Gateway Protocol)。 目前针对IPv4协议使用的是OSPF Version 2(RFC2328&#x…

网络安全:通过445端口暴力破解植入木马。

网络安全:通过445端口暴力破解植入木马。 木马制作工具,如:灰鸽子等等 445端口是文件共享端口。可以进入对方文件硬盘进行植入木马: 使用文件共享进入对方磁盘: 在cmd输入net use \\x.x.x.x\ipc$ 之后会让你输入账号…

“数字中国·福启海丝”多屏互动光影艺术秀27日在福州举办

作为深化“数字海丝”的核心区、海上丝绸之路的枢纽城市,为喜迎第六届数字中国建设峰会盛大召开之际,福州市人民政府特此举办“数字中国福启海丝”多屏互动光影秀活动。本次光影秀活动是由福建省文化和旅游厅指导,福州市人民政府主办&#xf…

AutoGPT、AgentGPT、BabyAGI、HuggingGPT、CAMEL:各种基于GPT-4自治系统总结

ChatGPT和LLM技术的出现使得这些最先进的语言模型席卷了世界,不仅是AI的开发人员,爱好者和一些组织也在研究探索集成和构建这些模型的创新方法。各种平台如雨后春笋般涌现,集成并促进新应用程序的开发。 AutoGPT的火爆让我们看到越来越多的自…

机器学习实战:Python基于SVD奇异值分解进行矩阵分解(八)

文章目录 1 前言1.1 奇异值分解1.2 奇异值分解的应用 2 简单计算SVD2.1 NumPy 计算 SVD2.2 scikit-learn 计算截断 SVD2.3 scikit-learn 计算随机 SVD 3 demo数据演示3.1 导入函数3.2 导入数据3.3 计算SVD 4 讨论 1 前言 1.1 奇异值分解 奇异值分解(Singular Valu…