Data Mining数据挖掘—2. Classification分类

3. Classification

Given a collection of records (training set)
– each record contains a set of attributes
– one of the attributes is the class (label) that should be predicted
Find a model for class attribute as a function of the values of other attributes
Goal: previously unseen records should be assigned a class as accurately as possible
Application area: Direct Marketing, Fraud Detection
Why need it?
Classic programming is not so easy due to missing knowledge and difficult formalization as an algorithm

3.1 Definition

Given: a set of labeled records, consisting of
• data fields (a.k.a. attributes or features)
• a class label (e.g., true/false)
Generate: a function which can be used for classifying previously unseen records
• input: a record
• output: a class label

3.2 k Nearest Neighbors

The k nearest neighbors of a record x are data points that have the k smallest distance to x.
require: stored records, distance metric, value of k
steps: Compute distance to each training record -> Identify k nearest neighbors –> Use class labels of nearest neighbors to determine the class label of unknown record by taking majority vote or weighing the vote according to distance

choose k
Rule of thumb: Test k values between 1 and 10.
k too small -> sensitive to noise point
k too large -> neighborhood may include points from other classes

It’s very accurate but slow as training data needs to be searched. Can handle decision boundaries that are not parallel to the axes.

3.3 Nearest Centroids = Rocchio classifier

assign each data point to nearest centroid (center of all points of that class) 每个类别视为一个点(质心),然后在测试阶段,根据未标记数据点与这些质心的距离来决定数据点所属的类别。Nearest Centroid is a simple classification algorithm that calculates the centroid (average position) of each class based on training data. During testing, it classifies new instances by assigning them to the class whose centroid is closest.

k-NN vs. Nearest Centroid
k-NN

  • slow at classification time (linear in number of data points)
  • requires much memory (storing all data points)
  • robust to outliers

Nearest Centroid

  • fast at classification time (linear in number of classes)
  • requires only little memory (storing only the centroids)
  • robust to label noise
  • robust to class imbalance

Which classifier is better? Strongly depends on the problem at hand!

3.4 Bayes Classifier

P(C|A): conditional probability (How likely is C, given that we observe A)
Conditional Probability and Bayes Theorem

Example1 - Bayes Throrem
Example2 - Bayes Throrem

Estimating the Prior Probability P(C )
counting the records in the training set that are labeled with class Cj, dividing the count by the overall number of records
Estimating the Conditional Probability P(A | C)
Naïve Bayes assumes that all attributes are statistically independent
The independence assumption allows the joint probability P(A|C) to be reformulated as the product of the individual probabilities P(Ai|Cj).
P(A1,A2,…A3|Cj) = P(A1|Cj) * P(A2|Cj) * … * P(An|Cj)
Estimating the Probabilities P(Ai|Cj)
count how often an attribute value co-occurs with class Cj, divide by the overall number of instances in class Cj
Bayes Theorem Example

Handling Numerical Attributes
1.Discretize numerical attributes before learning classifier.
2.Make assumption that numerical attributes have a normal distribution given the class.
Use training data to estimate parameters of the distribution (e.g., mean and standard deviation)
Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai|Cj)
Normal Distribution
Normal Distribution Example 1

Normal Distribution Example 2

Handling Missing Values
Missing values may occur in training and in unseen classification records. 一定要对缺失值进行处理

Zero Frequency Problem
one of the conditional probabilities is zero -> the entire expression becomes zero
It is not unlikely that an exactly same data point has not yet been observed. 有可能存在尚未观察到完全相同的数据点,因此概率有可能是0。
解决方法:Laplace Smoothing Laplace Smoothing

Decision Boundary of Naive Bayes Classifier
Decision Boundary of Naive Bayes Classifier

Summary

  • Robust to isolated noise points
  • Handle missing values by ignoring the instance during probability estimate calculations在概率估计计算过程中,通过忽略具有缺失值的实例来处理缺失值
  • Robust to irrelevant attributes [reasons: the probabilistic framework + conditional independence assumption + probability smoothing techniques]
  • Independence assumption may not hold for some attributes [Use other techniques such as Bayesian Belief Networks (BBN)]

Naïve Bayes works surprisingly well even if independence assumption is clearly violated [Reasons: Robustness to Violations + Effective Classification + Maximum Probability Assignment + Simple and Efficient]

Too many redundant attributes will cause problems. Solution: Select attribute subset as Naïve Bayes often works as well or better with just a fraction of all attributes.

Technical advantages:
(1) Learning Naïve Bayes classifiers is computationally cheap (probabilities are estimated in one pass over the training data)
(2) Storing the probabilities does not require a lot of memory

Redundant Variables
Violate independence assumption in Naive Bayes [Can, at large scale, skew the result]
May also skew the distance measures in k-NN, but the effect is not as drastic (Depends on the distance measure used)

Irrelevant Variables
For Naive Bayes: p(x=v|A) = p(x=v|B) for any value v, since it is random, it does not depend on the class variable. The overall result does not change

kNN vs. Naïve Bayes

kNNNaïve Bayes
computation\faster
dataless sensitive to outliersuse all data, less sensitive to label noise
redundant attributesless problematic\
irrelevant attributes\less problematic
pre-selectionyesyes

3.5 Lazy vs. Eager Learning

K-NN is a “lazy” methods.
They do not build an explicit model! “learning” is only performed on demand for unseen records.
Nearest Centroid and Naive Bayes are simple “eager” methods (also: decision tree, rule sets, …)
classify unseen instances, learn a model

3.6 Model Evaluation

3.6.1 Metrics for Performance Evaluation

Confusion MatrixConfusion Matrix

Accuracy & Error Rate
Accuracy & Error Rate

Baseline: naive guessing(always predict majority class)

3.6.2 Limitation of Accuracy: Unbalanced Data

1. Precision & Recall
Precision & Recall

2. F1-Measure越大越好
F1-Score combines precision and recall into one measure by using harmonic mean (tends to be closer to the smaller of the two)
For F1-value to be large, both p and r must be large. 当 Precision 和 Recall 都很高时,F1 score 也会趋向于高值,表示模型在正类别和负类别的预测中取得了较好的平衡。
F1-Score
F1-Measure Graph

confidence scores: how sure the algorithms is with its prediction
3. ROC Curves:

  1. Sort classifications according to confidence scores对于每个测试样本,模型输出一个置信度分数,例如,在朴素贝叶斯中可能是预测的概率。首先,将所有测试样本按照这些置信度分数从高到低排序。
  2. Evaluate
    correct prediction -> draw one step up如果模型的预测是正确的,将 ROC 曲线向上移动一步(增加真正例率)。
    incorrect prediction -> draw one step to the right如果模型的预测是错误的,将 ROC 曲线向右移动一步(增加假正例率)。
    Interpreting ROC Curves

False Positive Rate是指在所有实际为负例的样本中,被错误地判定为正例的比例,FPR = FP / (FP+TN)
True Positive Rate指在所有实际为正例的样本中,被正确地判定为正例的比例=召回率,TPR = TP / (TP+FN)
曲线越接近左上角: 表示模型性能越好,具有更高的TPR和更低的FPR。
4. Cost Matrixcost matrix
Computing Cost of Classification

3.7 Decision Tree Classifiers

Decision Tree Classifiers

There can be more than one tree that fits the same data!
Decision Boundary
Decision boundary: border line between two neighboring regions of different classes.
Decision boundary is parallel to axes because test condition involves a single attribute at-a-time
Finding an optimal decision tree is NP-hard
Tree building algorithms use a greedy, top-down, recursive partitioning strategy to induce a reasonable solution, also known as: divide and conquer. For example, Hunt’s Algorithm, ID3, CHAID, C4.5

3.7.1 Hunt’s Algorithm

Let Dt be the set of training records that reach a node t.
General Procedure:
If Dt contains only records that belong to the same class yt, then t is a leaf node labeled as yt
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
Recursively apply the procedure to each subset
Hunt's Algorithm

3.7.2 Split

Splitting Based on Nominal Attributes

Splitting Based on Nominal Attributes
Splitting Based on Ordinal Attributes
Splitting Based on Ordinal Attributes
Splitting Based on Continuous Attributes
Splitting Based on Continuous Attributes

Discretization to form an ordinal categorical attribute
equal-interval binning
equal-frequency binning
binning based on user-provided boundaries

Binary Decision: (A < v) or (A > v)
usually sufficient in practice
consider all possible splits
find the best cut (i.e., the best v) based on a purity measure
can be computationally expensive

3.7.3 Common measures of node impurity

How to determine the Best Split?
Nodes with homogeneous class distribution are preferred. Need a measure of node impurity.
1. Gini Index
GINI
Splitting Based on GINI

Binary Attributes
Binary Attributes: Computing GINI Index

Continuous Attributes
Continuous Attributes: Computing Gini Index

Continuous Attributes: Computing Gini Index

Continuous Attributes: Computing Gini Index

2. Entropy (Information Gain)
Entropy

Splitting Based on Information Gain

How to Find the Best Split

Decision Tree Advantages & Disadvantages
Advantages: Inexpensive to construct; Fast at classifying unknown records; Easy to interpret by humans for small-sized trees; Accuracy is comparable to other classification techniques for many simple data sets
Disadvantages: Decisions are based only one a single attribute at a time; Can only represent decision boundaries that are parallel to the axes

Decision Tree vs. k-NN

k-NNDecision Tree
Decision Boundariesarbitraryrectangular
Sensitivity to Scalesneed normalizationdoes not need normalization
Runtime & Memorycheap to train but expensive for classificationexpensive to train but cheap for classification

3.8 Overfitting

Overfitting: Good accuracy on training data, but poor on test data.
Symptoms: Tree too deep and too many branches
Typical causes of overfitting: too little training data, noise, poor learning algorithm
Which tree do you prefer?
Occam’s Razor
If you have two theories that explain a phenomenon equally well, choose the simpler one!

Learning Curve
Learning Curve

Holdout Method
The holdout method reserves a certain amount for testing and uses the remainder for training
Typical: one third for testing, the rest for training

For unbalanced datasets (few or none instances of some classes), samples might not be representative. -> Stratified sample: balances the data - Make sure that each class is represented with approximately equal proportions in both subsets. Other attributes may also be considered for stratification, e.g., gender, age, …

Leave One Out
Iterate over all examples
– train a model on all examples but the current one
– evaluate on the current one
每个样本都被当作测试集,而其余的样本组成训练集。这个过程会重复执行,直到每个样本都被作为测试集被验证过一次。
Yields a very accurate estimate but is computationally infeasible in most cases

Cross-Validation (k-fold cross-validation)
Compromise of Leave One Out and decent runtime
Cross-validation avoids overlapping test sets
Steps:
Step 1: Data is split into k subsets of equal size (Stratification may be applied)
Step 2: Each subset in turn is used for testing
and the remainder for training
The error estimates are averaged to yield an overall error estimate
Frequently used value for k : 10 (the gold standard for folds was long set to 10)

How to Address Overfitting?

  1. Pre-Pruning (Early Stopping Rule)
    Stop the algorithm before it becomes a fully-grown tree
    Typical stopping conditions for a node: Stop if all instances belong to the same class or Stop if all the attribute values are the same
    Less restrictive conditions: Stop if number of instances within a node is less than some user-specified threshold or Stop if expanding the current node only slightly improves the impurity measure (user-specified threshold)
  2. Post-pruning
    Grow decision tree to its entire size -> Trim the nodes of the decision tree in a bottom-up fashion [ using a validation data set or an estimate of the generalization error ] -> If generalization error improves after trimming [ replace sub-tree by a leaf node / Class label of leaf node is determined from majority class of instances in the sub-tree]

Training vs. Generalization Errors
Training error = resubstitution error = apparent error: errors made in training, misclassified training instances -> can be computed
Generalization error: errors made on unseen data, no apparent evidence -> must be estimated

Estimating the Generalization Error
Estimating the Generalization Error

Example of Post-Pruning
Example of Post-Pruning

3.9 Alternative Classification Methods

Some cases are not nicely expressible in trees and rule sets.(Example: if at least two of Employed, Owns House, and Balance Account are yes → Get Credit is yes)

3.9.1 Artificial Neural Networks (ANN)

ANN

General Structure of ANN

Algorithm for learning ANN

Decision Boundaries of ANN: Arbitrarily shaped objects & Fuzzy boundaries

3.9.2 Support Vector Machines

Find a linear hyperplane (decision boundary) that will separate the data.SVM

What is computed?
a separating hyper plane, defined by its support vectors (hence the name)

“support vectors” refer to the training data points that are crucial in defining the decision boundary (or hyperplane).
Challenges: Computing an optimal separation is expensive and it requires good approximations
Dealing with noisy data: introducing “slack variables” in margin computation

3.9.3 Nonlinear Support Vector Machines

  • Transform data into higher dimensional space将输入数据从原始特征空间映射到更高维的空间
  • Transformation in higher dimensional space
    Kernel function
    Different variants: polynomial function, radial basis function, …
  • Finding a hyperplane in higher dimensional space

3.10 Hyperparameter Selection

A hyperparameter is a parameter which influences the learning process and whose value is set before the learning begins. For example, pruning thresholds for trees and rules; gamma and C for SVMs; learning rate, hidden layers for ANNs.
parameters are learned from the training data. For example, weights in an ANN, probabilities in Naïve Bayes, splits in a tree.

How to determine good hyperparameters?
(1) manually play around with different hyperparameter settings
(2) have your machine automatically test many different settings(Hyperparameter Optimization)

Hyperparameter Optimization
Goal: Find the combination of hyperparameter values that results in learning the model with the lowest generalization error
How to determine the parameter value combinations to be tested?

  • Grid Search: Test all combinations in user-defined ranges
  • Random Search: Test combinations of random parameter values
  • Evolutionary Search: Keep specific parameter values that worked well

3.11 Model Selection

From all learned models M, select the model mbest that is expected to generalize best to unseen records

Model Selection Using a Validation Set

Model Selection using Cross-Validation
Model Evaluation using Nested Cross Validation

grid search for model selection
cross-validation for model evaluation

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/227424.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

vuepress-----14、保护私密信息

秘钥存储文件 使用秘钥 忽略提交

通过异步序列化提高图表性能 Diagramming for WPF

通过异步序列化提高图表性能 2023 年 12 月 6 日 MindFusion.Diagramming for WPF 4.0.0 添加了异步加载和保存文件的功能&#xff0c;从而提高了响应能力。 MindFusion.Diagramming for WPF 提供了一个全面的工具集&#xff0c;用于创建各种图表&#xff0c;包括组织结构图、图…

『PyTorch学习笔记』如何快速下载huggingface模型/数据—全方法总结

如何快速下载huggingface模型/数据—全方法总结 文章目录 一. 如何快速下载huggingface大模型1.1. IDM(Windows)下载安装连接1.2. 推荐 huggingface 镜像站1.3. 管理huggingface_hub cache-system(缓存系统) 二. 参考文献 一. 如何快速下载huggingface大模型 推荐 huggingface…

苹果mac电脑如何彻底删除卸载软件?

在苹果电脑上安装和使用软件非常容易&#xff0c;但是卸载软件却可能会变得复杂和困难。不像在Windows上&#xff0c;你不能简单地在控制面板中找到已安装的程序并卸载它们。因此&#xff0c;在这篇文章中&#xff0c;我们将讨论苹果电脑怎么彻底删除软件。 CleanMyMac X全新版…

通信线缆是什么

通信线缆 电子元器件百科 文章目录 通信线缆前言一、通信线缆是什么二、通信线缆的类别三、通信线缆应用实例四、通信线缆的作用原理总结前言 每种线缆都有其特定的特性和用途。通信线缆起到连接和传输信号的作用,是实现通信和数据传输的重要组成部分。 一、通信线缆是什么 …

高级搜索——ST表,离线RMQ问题

文章目录 前言可重复贡献问题ST表的定义ST表的存储结构ST表的预处理预处理的实现 ST表的区间查询对于k的获取区间查询的实现 OJ链接 前言 对于查询区间最值的方法&#xff0c;我们常用的就是线段树&#xff0c;树状数组&#xff0c;单调队列&#xff0c;而树状数组更适合用于快…

AI报告专题:创造性和生成式人工智能

今天分享的AI系列深度研究报告&#xff1a;《AI报告专题&#xff1a;创造性和生成式人工智能》。 &#xff08;报告出品方&#xff1a;Capgemini&#xff09; 报告共计&#xff1a;64页 AI一代 生成式人工智能 (AI)正在迅速改变我们与技术的交互方式&#xff0c;使机器能够创…

Java实现屏幕截图程序(一)

在Java中&#xff0c;可以使用Robot类来实现屏幕截图程序。Robot类提供了一组用于生成输入事件和控制鼠标和键盘的方法。 Java实现屏幕截图的步骤如下&#xff1a; 导入Robot类 import java.awt.Robot;创建Robot对象 Robot robot new Robot();获取屏幕分辨率信息 Dimensi…

redis-学习笔记(hash)

Redis 自身已经是 键值对 结构了 Redis 自身的键值对就是通过 哈希 的方式来组织的 把 key 这一层组织完成后, 到了 value 这一层, 还可以用 哈希类型 来组织 (简单的说就是哈希里面套哈希 [数组里面套数组 -> 二维数组] ) [ field value ] hset key field value [ field va…

urllib 异常、cookie、handler及代理(四)

目录 一、urllib异常 二、urllib cookie登录 三、urllib handler 处理器的基本使用 四、urllib 代理和代理池 参考 一、urllib异常 URLError/HTTPError 简介&#xff1a; 1.HTTPError类是URLError类的子类 2.导入的包urllib.error.HTTPError urllib.error.URLError 3.h…

如何将idea中导入的文件夹中的项目识别为maven项目

问题描述 大家经常遇到导入某个文件夹的时候&#xff0c;需要将某个子文件夹识别为maven项目 解决方案

XUbuntu22.04之8款免费UML工具(一百九十七)

简介&#xff1a; CSDN博客专家&#xff0c;专注Android/Linux系统&#xff0c;分享多mic语音方案、音视频、编解码等技术&#xff0c;与大家一起成长&#xff01; 优质专栏&#xff1a;Audio工程师进阶系列【原创干货持续更新中……】&#x1f680; 优质专栏&#xff1a;多媒…

【计算机组成体系结构】SRAM和DRAM

RAM — Random Access Memory 随机访问存储器 —指定某一存储单元地址的时候&#xff0c;存储单元的读取速度并不会因为存储单元的物理位置改变 SRAM即为 Static RAM 静态随机访问存储器 — 用于主存DRAM即为 Dynamic RAM 动态随机访问存储器 — 用于Cache 一、SRAM和DRAM的特…

oomall课堂笔记

一、项目分层结构介绍 controller层&#xff08;控制器层&#xff09;&#xff1a; 作用&#xff1a;负责输出和输入&#xff0c;接收前端数据&#xff0c;把结果返回给前端。 1.处理用户请求&#xff0c;接收用户参数 2.调用service层处理业务&#xff0c;返回响应 servi…

Javaweb之Maven仓库的详细解析

2.3 Maven仓库 仓库&#xff1a;用于存储资源&#xff0c;管理各种jar包 仓库的本质就是一个目录(文件夹)&#xff0c;这个目录被用来存储开发中所有依赖(就是jar包)和插件 Maven仓库分为&#xff1a; 本地仓库&#xff1a;自己计算机上的一个目录(用来存储jar包) 中央仓库&a…

利用R语言heatmap.2函数进行聚类并画热图

数据聚类然后展示聚类热图是生物信息中组学数据分析的常用方法&#xff0c;在R语言中有很多函数可以实现&#xff0c;譬如heatmap,kmeans等&#xff0c;除此外还有一个用得比较多的就是heatmap.2。最近在网上看到一个笔记文章关于《一步一步学heatmap.2函数》&#xff0c;在此与…

python 涉及opencv mediapipe知识,眨眼计数 供初学者参考

基本思路 我们知道正面侦测到人脸时&#xff0c;任意一只眼睛水平方向上的两个特征点构成水平距离&#xff0c;上下两个特征点构成垂直距离 当头像靠近或者远离摄像头时&#xff0c;垂直距离与水平距离的比值基本恒定 根据这一思路 当闭眼时 垂直距离变小 比值固定小于某一个…

主动而非被动:确保网络安全运营弹性的途径

金融部门处理威胁的经验对网络安全领域的任何人都有启发——没有什么可以替代提前摆脱潜在的风险和问题。 从狂野西部的银行劫匪到勒索软件即服务 (RaaS)&#xff0c;全球金融生态系统面临的威胁多年来发生了巨大变化。技术进步带动了金融业的快速发展&#xff0c;从现金交易到…

【开放集检测OSR】open-set recognition(OSR)开集识别概念辨析

开放集学习 Openset Learning 主动学习 Active Learning 例外检测 Out-of-Distribution open-set recognition(OSR)开集识别 anomaly detection和outlier detection 文章目录 OOD检测OSR开放集识别OSR开放集识别在训练和测试阶段的数据集使用数据分布似然函数OSR开放集识别的特…

2023人工智能和市场营销的融合报告:创造性合作的新时代需要新的原则

今天分享的人工智能系列深度研究报告&#xff1a;《2023人工智能和市场营销的融合报告&#xff1a;创造性合作的新时代需要新的原则》。 &#xff08;报告出品方&#xff1a;M&CSAATCHITHINKS&#xff09; 报告共计&#xff1a;11页 生成型人工智能的兴起和重要性 生成式…