Data Mining 2 Review Notes 6 - Optimization & Hyperparameter Tuning

6. Optimization & Hyperparameter Tuning

Why Hyperparameter Tuning?
There are many learning algorithms for classification, regression, etc. Many of them have hyperparameters: k and the distance function for k-nearest neighbors, splitting and pruning options in decision tree learning, …

But what is their effect?
Hard to tell in general and rules of thumb are rare.

Parameters vs. Hyperparameters
Parameters are learned during training
Typical examples: Coefficients in (linear) regression, Weights in neural networks, …
Training: find a set of parameters so that the objective function is minimized/maximized (on a holdout set)

Hyperparameters are fixed before training
Typical examples: Network layout and learning rate in neural networks, k in kNN, …
Training: find a set of parameters so that the objective function is minimized/maximized (on a holdout set), given a previously fixed set of hyperparameters

Hyperparameter Tuning – a Naive Approach

  1. run classification/regression algorithm
  2. look at the results (e.g., accuracy, RMSE, …)
  3. choose different parameter settings, go to 1

Questions: when to stop? how to select the next parameter setting to test?

Hyperparameter Tuning – Avoid Overfitting!
Recap overfitting: classifiers may over-adapt to the training data. The same holds for hyperparameter settings.
Possible danger: finding hyperparameters that work well on the training set but not on the test set
Remedy: train / test / validation split
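A minimal sketch of such a split, assuming scikit-learn, the Iris dataset, and kNN as the learner (all illustrative choices, not prescribed by the lecture): hyperparameters are selected on the validation set, and the test set is touched only once at the end.

```python
# Train/validation/test split for hyperparameter tuning (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# First split off the test set, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7, 9]:                                  # candidate hyperparameter values
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))      # select on validation, not test
    if acc > best_acc:
        best_k, best_acc = k, acc

# The test set is used only once, with the chosen hyperparameter.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("best k:", best_k, "test accuracy:", accuracy_score(y_test, final.predict(X_test)))
```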


6.1 Hyperparameter Tuning: Brute Force

Brute force: try all parameter combinations that exist → in practice, we need a better strategy than brute force!

Hyperparameter tuning is an optimization problem
Finding optimal values for N variables
Properties of the problem:

  • the underlying model is unknown, i.e., we do not know how changing a variable will influence the results
  • we can tell how good a solution is when we see it, i.e., by running a classifier with the given parameter set
  • but looking at each solution is costly

Related problem: feature subset selection
Given n features, brute force requires 2^n evaluations
e.g., for 20 features, that is already about one million evaluations → ten million with (10-fold) cross validation

Knapsack problem
Given a maximum weight you can carry and a set of items with different weights and monetary values, pack those items that maximize the total monetary value.

The problem is NP-hard, i.e., no deterministic algorithm is known that solves it in less than exponential time.
Note: feature subset selection for n features requires 2^n evaluations

Many optimization problems are NP hard
  • Routing problems (e.g., the Traveling Salesman Problem)
  • Integer factorization: hard enough to be used for cryptography
  • Resource use optimization, e.g., minimizing cutoff waste
  • Chip design: minimizing chip sizes

Properties of Brute Force search
guaranteed to find the best parameter setting, but too slow in most practical cases

6.1.1 Grid Search

Grid search performs a brute force search with equal-width steps over numerical (non-discrete) hyperparameters
(e.g., 10, 20, 30, …, 100)
For a hyperparameter with a wide range (e.g., 0.0001 to 1,000,000),
ten equal-width steps would each be about 100,000 wide,
but what if the optimum is around 0.1?
Logarithmic steps may perform better for some parameters.
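A small sketch (assuming NumPy; the range is the one from the example above) of why logarithmic steps help for wide ranges:

```python
# Contrast equal-width and logarithmic grids for a wide hyperparameter range.
import numpy as np

# Ten equal-width steps between 0.0001 and 1,000,000: the step width is ~100,000,
# so an optimum around 0.1 lies entirely between the first two grid points.
linear_grid = np.linspace(1e-4, 1e6, 10)

# Logarithmic steps, one per order of magnitude, do cover values around 0.1.
log_grid = np.logspace(-4, 6, 11)

print(linear_grid)
print(log_grid)
```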

Needed:
approaches that take less time/computation and often find the best, or at least a near-optimal, parameter setting

6.2 Hyperparameter Tuning: One After Another

Given n parameters with m degrees of freedom (possible values) each, brute force takes m^n runs of the base classifier

Simple tweak:

  1. start with default settings
  2. try all options for the first parameter
    2a. fix best setting for first parameter
  3. try all options for the second parameter
    3a. fix best setting for second parameter

This reduces the runtime to n*m
i.e., no longer exponential – but we may miss the best solution
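A minimal sketch of this strategy, assuming a generic `evaluate` function that stands in for running the learner with a given setting; the accuracy table below is purely illustrative and already hints at the interaction-effect problem discussed next.

```python
# Tune hyperparameters one after another: fix the best value for each
# parameter in turn instead of trying all combinations (n*m instead of m^n runs).
def tune_one_after_another(param_options, evaluate, defaults):
    setting = dict(defaults)
    for name, options in param_options.items():
        best_value, best_score = setting[name], evaluate(setting)
        for value in options:
            candidate = dict(setting, **{name: value})
            score = evaluate(candidate)
            if score > best_score:
                best_value, best_score = value, score
        setting[name] = best_value            # freeze this parameter and move on
    return setting

# Illustrative accuracy table with an interaction effect between p and q:
accuracy = {(0, 0): 0.5, (0, 1): 0.3, (0, 2): 0.4,
            (1, 0): 0.3, (1, 1): 0.2, (1, 2): 0.3,
            (2, 0): 0.4, (2, 1): 0.3, (2, 2): 0.7}
evaluate = lambda s: accuracy[(s["p"], s["q"])]

# Ends at p=0, q=0 (accuracy 0.5) although the global optimum is p=2, q=2 (0.7).
print(tune_one_after_another({"p": [0, 1, 2], "q": [0, 1, 2]}, evaluate, {"p": 0, "q": 0}))
```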

6.2.1 Interaction Effects

Interaction effects make parameter tuning hard, i.e., changing one parameter may change the optimal setting for another one.
Example: two parameters p and q, each with values 0, 1, and 2; a table (on the lecture slides) gives the classification accuracy for each combination. If we optimize one parameter after the other (first p, then q), we end at p=0, q=0 in six out of nine cases, investigating on average 2.3 solutions. The accuracy of 0.5 at p=0, q=0 is only a local optimum; the global optimum is 0.7.

6.3 Hill climbing with variations

6.3.1 Hill-Climbing Search (greedy local search)

“Like climbing Everest in thick fog with amnesia”: always move in the direction of the steepest ascent.
Problem: plain hill climbing can get stuck in local maxima; the variations below (and the sketch after them) address this.

6.3.2 Variations of Hill Climbing Search

  • Stochastic hill climbing
    random selection among the uphill moves
    the selection probability can vary with the steepness of the uphill move
  • First-choice hill climbing
    generate successors randomly until a better one is found, then pick that one
  • Random-restart hill climbing
    run hill climbing with different seeds
    tries to avoid getting stuck in local maxima
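A minimal sketch of random-restart hill climbing over a discrete hyperparameter grid; the neighbourhood definition, the parameter grid, and the toy scoring function are illustrative assumptions.

```python
# Hill climbing over hyperparameter settings: repeatedly move to the best
# neighbouring setting until no uphill move is left; restart from random
# starting points to reduce the risk of getting stuck in a local maximum.
import random

def neighbours(setting, options):
    """All settings differing from `setting` in exactly one parameter by one step."""
    result = []
    for name, values in options.items():
        i = values.index(setting[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                result.append(dict(setting, **{name: values[j]}))
    return result

def hill_climb(options, evaluate, restarts=5, seed=0):
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(restarts):
        current = {n: rng.choice(v) for n, v in options.items()}
        current_score = evaluate(current)
        while True:
            candidate = max(neighbours(current, options), key=evaluate)
            if evaluate(candidate) <= current_score:   # local maximum reached
                break
            current, current_score = candidate, evaluate(candidate)
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score

# Toy example: pretend the best setting is k=5, p=2.
options = {"k": [1, 3, 5, 7, 9], "p": [1, 2]}
print(hill_climb(options, lambda s: -abs(s["k"] - 5) - abs(s["p"] - 2)))
```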

6.4 Beam search

Local Beam Search
Keep track of k states rather than just one
Start with k randomly generated states
At each iteration, all the successors of all k states are generated
Select the k best successors from the complete list and repeat
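A minimal sketch of local beam search over the same kind of discrete hyperparameter grid; the successor definition mirrors the hill-climbing sketch above and is again an illustrative assumption.

```python
# Local beam search: keep k settings, generate all of their successors,
# and keep only the k best of the combined list in each iteration.
import random

def successors(setting, options):
    result = []
    for name, values in options.items():
        i = values.index(setting[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                result.append(dict(setting, **{name: values[j]}))
    return result

def beam_search(options, evaluate, k=3, iterations=10, seed=0):
    rng = random.Random(seed)
    # start with k randomly generated settings
    states = [{n: rng.choice(v) for n, v in options.items()} for _ in range(k)]
    for _ in range(iterations):
        candidates = states + [s for st in states for s in successors(st, options)]
        states = sorted(candidates, key=evaluate, reverse=True)[:k]   # keep the k best
    return states[0]

options = {"k": [1, 3, 5, 7, 9], "p": [1, 2]}
print(beam_search(options, lambda s: -abs(s["k"] - 5) - abs(s["p"] - 2)))
```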

6.5 Random search

Grid Search vs. Random Search
All the approaches discussed so far use fixed grids.
Challenges: some hyperparameters are quite sensitive (e.g., 0.02 is a good value, but 0 and 0.05 are not), others have little influence, and it is hard to know upfront which is which.
Grid search may therefore easily miss the best parameters, while random search often yields better results.
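A minimal sketch of random search; the sampling distributions (log-uniform learning rate, uniform integer k), the budget, and the toy scoring function are illustrative assumptions.

```python
# Random search: sample each hyperparameter independently from its range
# instead of evaluating a fixed grid; keep the best setting seen so far.
import random

def random_search(sample_setting, evaluate, budget=50, seed=0):
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(budget):
        setting = sample_setting(rng)
        score = evaluate(setting)       # one run of the learner per sampled setting
        if score > best_score:
            best, best_score = setting, score
    return best, best_score

# Sensitive parameters benefit from not being tied to fixed grid points:
sample = lambda rng: {"learning_rate": 10 ** rng.uniform(-4, 0), "k": rng.randint(1, 20)}
toy_score = lambda s: -abs(s["learning_rate"] - 0.02) - 0.01 * abs(s["k"] - 5)
print(random_search(sample, toy_score))
```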

6.6 Genetic Programming

Genetic algorithms are inspired by evolution:
use a population of individuals (solutions) → create new individuals by crossover → introduce random mutations → from each generation, keep only the best solutions (“survival of the fittest”)
Standard Genetic Algorithm (SGA)

6.6.1 SGA

Basic ingredients:

  • individuals: the solutions
    hyperparameter tuning: a hyperparameter setting
  • a fitness function
    hyperparameter tuning: performance of a hyperparameter setting (i.e., run learner with those parameters)
  • a crossover method
    hyperparameter tuning: create a new setting from two others
  • a mutation method
    hyperparameter tuning: change one parameter
  • survivor selection
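A minimal sketch of such a genetic algorithm applied to hyperparameter tuning; the representation, operator details, and rates are illustrative assumptions, and `evaluate` again stands in for running the learner with a given setting.

```python
# Simple genetic algorithm for hyperparameter tuning: individuals are
# hyperparameter settings, fitness is the (stand-in) performance of a setting.
import random

def genetic_search(options, evaluate, pop_size=10, generations=20, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    names = list(options)
    # initial population: random settings
    population = [{n: rng.choice(options[n]) for n in names} for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=evaluate, reverse=True)
        parents = scored[: pop_size // 2]              # survival of the fittest
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            # crossover: take each parameter from one of the two parents
            child = {n: rng.choice([a[n], b[n]]) for n in names}
            # mutation: occasionally replace one parameter by a random value
            if rng.random() < mutation_rate:
                m = rng.choice(names)
                child[m] = rng.choice(options[m])
            children.append(child)
        population = parents + children
    return max(population, key=evaluate)

options = {"k": [1, 3, 5, 7, 9], "learning_rate": [0.001, 0.01, 0.1, 1.0]}
print(genetic_search(options, lambda s: -abs(s["k"] - 5) - abs(s["learning_rate"] - 0.01)))
```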


Crossover OR Mutation?
Decades-long debate: which one is better / necessary …
Answer (at least, rather wide agreement): it depends on the problem, but
in general it is good to have both; each plays a different role.
A mutation-only EA is possible; a crossover-only EA would not work.

Exploration: Discovering promising areas in the search space, i.e. gaining information on the problem
Exploitation: Optimising within a promising area, i.e. using information

There is co-operation AND competition between them
Crossover is explorative, it makes a big jump to an area
somewhere “in between” two (parent) areas
Mutation is exploitative, it creates random small diversions, thereby staying near (in the area of) the parent

Crossover OR Mutation?

Only crossover can combine information from two parents
Remember: sample from entire value range
Only mutation can introduce new information (alleles)
To hit the optimum you often need a ‘lucky’ mutation

6.6.2 Genetic Feature Subset Selection

Feature Subset Selection can also be solved by Genetic Programming
Individuals: feature subsets
Representation: binary; 1 = feature is included, 0 = feature is not included
Fitness: classification performance
Crossover: combine selections of two subsets
Mutation: flip bits
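A minimal sketch of these operators on the binary representation; in a real run, the fitness of a child would be the classification performance of a learner trained on the selected features (omitted here).

```python
# Genetic operators for feature subset selection on bit-vector individuals
# (1 = feature included, 0 = feature excluded).
import random

def crossover(a, b, rng):
    """Single-point crossover: combine the selections of two subsets."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rng, rate=0.05):
    """Flip each bit with a small probability."""
    return [1 - x if rng.random() < rate else x for x in bits]

rng = random.Random(0)
parent1 = [1, 0, 1, 1, 0, 0]
parent2 = [0, 1, 1, 0, 1, 0]
child = mutate(crossover(parent1, parent2, rng), rng)
print(child)   # a new candidate feature subset
```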

6.6.3 Selecting a Learner by Meta Learning

So far, we have looked at finding good parameters for a learner – the learner was always fixed
A similar problem is selecting a learner for the task at hand
Again, we could go with search. Another approach is meta learning

Meta learning, i.e., learning about learning.
Goal: learn how well a learner will perform on a given dataset.
Features: dataset characteristics and the learning algorithm used.
Prediction target: accuracy, RMSE, …

Also known as AutoML
Basic idea: train a regression model

  • data points: individual datasets plus ML approach
  • target: expected accuracy/RMSE etc.

Examples for features: number of instances/attributes, fraction of nominal/numerical attributes, min/max/average entropy of attributes, skewness of classes, …
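A minimal sketch of the idea, assuming scikit-learn; the chosen meta-features and the regressor are illustrative, and the "observed" accuracies are faked random numbers purely to keep the sketch runnable.

```python
# Meta learning sketch: describe datasets by simple meta-features and train a
# regressor that predicts a learner's accuracy from them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor

def meta_features(X, y):
    """Simple dataset characteristics used as features for the meta learner."""
    n_instances, n_attributes = X.shape
    class_fractions = np.bincount(y) / len(y)
    return [n_instances, n_attributes, len(class_fractions), class_fractions.max()]

# Tiny meta-dataset: one row per dataset, target = accuracy some learner
# achieved on it (faked with random numbers for illustration only).
rng = np.random.RandomState(0)
rows, targets = [], []
for i in range(5):
    X, y = make_classification(n_samples=100 * (i + 1), n_features=5 + i, random_state=i)
    rows.append(meta_features(X, y))
    targets.append(rng.uniform(0.7, 0.95))

meta_model = RandomForestRegressor(random_state=0).fit(rows, targets)
print(meta_model.predict([rows[0]]))   # predicted accuracy for a "new" dataset
```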


Recap: search heuristics are good for problems where finding an optimal solution is difficult, evaluating a solution candidate is easy, and the search space of possible solutions is large
Possible solution: genetic programming

We have encountered such problems quite frequently
Example: learning an optimal decision tree from data

6.6.4 Genetic Decision Tree Learning

Population: candidate decision trees (initialization: e.g., trained on small subsets of data)
Create new decision trees by means of crossover & mutation
Fitness function: e.g., accuracy
Crossover: swap subtrees between two candidate trees; the swap can happen at different (randomly chosen) levels.

6.6.5 Combination of GP with other Learning Methods

Rule Learning (“Learning Classifier Systems”)
Population: set of rule sets (!)
Crossover: combining rules from two sets
Mutation: changing a rule

Artificial Neural Networks
Easiest solution: fixed network layout
The network is then represented as an ordered set (vector) of weights
e.g., [0.8, 0.2, 0.5, 0.1, 0.1, 0.2]
Crossover and mutation are straightforward.
Variant: AutoMLP, which searches for the best combination of hidden layers and learning rate.
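A minimal sketch of crossover and mutation on such a weight vector; the concrete vectors, the single-point crossover, and the Gaussian mutation are illustrative assumptions.

```python
# Genetic operators on a fixed-layout network represented as a weight vector.
import random

def crossover(w1, w2, rng):
    """Single-point crossover on the ordered weight vector."""
    point = rng.randrange(1, len(w1))
    return w1[:point] + w2[point:]

def mutate(weights, rng, rate=0.2, scale=0.05):
    """Perturb a few weights with small Gaussian noise."""
    return [w + rng.gauss(0, scale) if rng.random() < rate else w for w in weights]

rng = random.Random(0)
parent1 = [0.8, 0.2, 0.5, 0.1, 0.1, 0.2]
parent2 = [0.3, 0.7, 0.4, 0.9, 0.0, 0.6]
print(mutate(crossover(parent1, parent2, rng), rng))   # a new candidate network
```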


6.7 Hyperparameter learning

Hyperparameter tuning as a learning problem: given a hyperparameter setting H, predict the performance p of the resulting model. The prediction model is referred to as a surrogate model or oracle.
Rationale:
Training and evaluating an actual model is costly
Learning and predicting with the surrogate model is fast


Note:
we want to use only a few runs of the actual model, i.e., the surrogate model will have few training points, so a simple model should be used.
The most well-known approach is Bayesian optimization.
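A minimal sketch of the surrogate-model idea, assuming scikit-learn's Gaussian process regressor and a toy one-dimensional objective; a full Bayesian-optimization treatment would additionally use the model's uncertainty to balance exploration and exploitation, which is omitted here.

```python
# Surrogate-model-based tuning: fit a cheap regression model on the
# hyperparameter settings evaluated so far, then use it to pick the next
# setting that is actually run with the expensive learner.
import random
from sklearn.gaussian_process import GaussianProcessRegressor

def surrogate_tuning(evaluate, low, high, init=5, iterations=10, seed=0):
    rng = random.Random(seed)
    tried = [[rng.uniform(low, high)] for _ in range(init)]   # a few real (expensive) runs
    scores = [evaluate(x[0]) for x in tried]
    for _ in range(iterations):
        surrogate = GaussianProcessRegressor().fit(tried, scores)
        candidates = [[rng.uniform(low, high)] for _ in range(200)]
        predictions = surrogate.predict(candidates)            # cheap predictions
        best = candidates[int(predictions.argmax())]
        tried.append(best)
        scores.append(evaluate(best[0]))                       # one expensive run
    return tried[scores.index(max(scores))][0]

# Toy objective with its optimum at 0.1 (stands in for model performance):
print(surrogate_tuning(lambda x: -(x - 0.1) ** 2, low=0.0, high=1.0))
```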

Summary: grid search, random search, learning hyperparameters / Bayesian optimization

Grid search
Inefficient
Fixed grids may miss good parameters (a finer grid would be even less efficient!)

Random search
Often finds good solutions in less time

Learning hyperparameters / bayesian optimization
Successively tests hyperparameters close to local optima
Similar to hill climbing
Difference: explicit surrogate model

6.8 Summary

Hyperparameters need to be tuned for most learners, and brute force / grid search quickly becomes too expensive.
Search heuristics (hill climbing, beam search, random search, genetic algorithms) often find good or near-optimal settings with far fewer evaluations.
Learning-based approaches (meta learning, surrogate models / Bayesian optimization) predict promising settings instead of trying them blindly.

Hyperparameter Tuning: Criticism
