工具系列:TensorFlow决策森林_(5)使用文本和神经网络特征

文章目录

    • 设置
    • 使用原始文本作为特征
    • 使用预训练的文本嵌入
    • 同时训练决策树和神经网络
      • 构建模型
      • 训练和评估模型

欢迎来到 TensorFlow决策森林TF-DF)的 中级教程
在本文中,您将学习有关 TF-DF的一些更高级的功能,包括如何处理自然语言特征。

本文假设您已经熟悉在决策森林中介绍的概念,特别是关于TF-DF的安装。

在本文中,您将会:

  1. 训练一个原生地将文本特征作为分类集合的随机森林。

  2. 使用TensorFlow Hub模块训练一个使用文本特征的随机森林。在这种情况下(迁移学习),该模块已经在一个大型文本语料库上进行了预训练。

  3. 同时训练一个梯度提升决策树(GBDT)和一个神经网络。GBDT将使用神经网络的输出作为输入。

设置

# 安装 TensorFlow Decision Forests 库
!pip install tensorflow_decision_forests
Collecting tensorflow_decision_forests
  Using cached tensorflow_decision_forests-1.8.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.0 kB)
Requirement already satisfied: numpy in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow_decision_forests) (1.26.2)
Requirement already satisfied: pandas in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow_decision_forests) (2.1.3)
Requirement already satisfied: tensorflow~=2.15.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow_decision_forests) (2.15.0)
Requirement already satisfied: six in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow_decision_forests) (1.16.0)
Requirement already satisfied: absl-py in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow_decision_forests) (1.4.0)
Requirement already satisfied: wheel in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow_decision_forests) (0.41.2)
Collecting wurlitzer (from tensorflow_decision_forests)
  Using cached wurlitzer-3.0.3-py3-none-any.whl (7.3 kB)
Requirement already satisfied: astunparse>=1.6.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (1.6.3)
Requirement already satisfied: flatbuffers>=23.5.26 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (23.5.26)
Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (0.5.4)
Requirement already satisfied: google-pasta>=0.1.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (0.2.0)
Requirement already satisfied: h5py>=2.9.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (3.10.0)
Requirement already satisfied: libclang>=13.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (16.0.6)
Requirement already satisfied: ml-dtypes~=0.2.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (0.2.0)
Requirement already satisfied: opt-einsum>=2.3.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (3.3.0)
Requirement already satisfied: packaging in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (23.2)
Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (3.20.3)
Requirement already satisfied: setuptools in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (68.2.2)
Requirement already satisfied: termcolor>=1.1.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (2.3.0)
Requirement already satisfied: typing-extensions>=3.6.6 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (4.8.0)
Requirement already satisfied: wrapt<1.15,>=1.11.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (1.14.1)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (0.34.0)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (1.60.0rc1)
Requirement already satisfied: tensorboard<2.16,>=2.15 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (2.15.1)
Requirement already satisfied: tensorflow-estimator<2.16,>=2.15.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (2.15.0)
Requirement already satisfied: keras<2.16,>=2.15.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow~=2.15.0->tensorflow_decision_forests) (2.15.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from pandas->tensorflow_decision_forests) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from pandas->tensorflow_decision_forests) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from pandas->tensorflow_decision_forests) (2023.3)
Requirement already satisfied: google-auth<3,>=1.6.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (2.23.4)
Requirement already satisfied: google-auth-oauthlib<2,>=0.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (1.1.0)
Requirement already satisfied: markdown>=2.6.8 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (3.5.1)
Requirement already satisfied: requests<3,>=2.21.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (2.31.0)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (0.7.2)
Requirement already satisfied: werkzeug>=1.0.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (3.0.1)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (5.3.2)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (0.3.0)
Requirement already satisfied: rsa<5,>=3.1.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (4.9)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-auth-oauthlib<2,>=0.5->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (1.3.1)
Requirement already satisfied: importlib-metadata>=4.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from markdown>=2.6.8->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (6.8.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (2.1.0)
Requirement already satisfied: certifi>=2017.4.17 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (2023.11.17)
Requirement already satisfied: MarkupSafe>=2.1.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from werkzeug>=1.0.1->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (2.1.3)
Requirement already satisfied: zipp>=0.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (3.17.0)
Requirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (0.5.0)
Requirement already satisfied: oauthlib>=3.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<2,>=0.5->tensorboard<2.16,>=2.15->tensorflow~=2.15.0->tensorflow_decision_forests) (3.2.2)
Using cached tensorflow_decision_forests-1.8.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.3 MB)
Installing collected packages: wurlitzer, tensorflow_decision_forests
Successfully installed tensorflow_decision_forests-1.8.1 wurlitzer-3.0.3

Wurlitzer 是在 Colabs 中显示详细的训练日志所需的(当在模型构造函数中使用 verbose=2 时)。

# 安装wurlitzer模块,用于在Jupyter Notebook中显示C语言的输出结果
!pip install wurlitzer
Requirement already satisfied: wurlitzer in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (3.0.3)

导入必要的库。

# 导入所需的库

import tensorflow_decision_forests as tfdf  # 导入决策森林库
import os  # 导入操作系统库
import numpy as np  # 导入数值计算库
import pandas as pd  # 导入数据处理库
import tensorflow as tf  # 导入深度学习库
import math  # 导入数学库

2023-11-20 12:31:20.226021: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-20 12:31:20.226066: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-20 12:31:20.227643: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

隐藏的代码单元格在colab中限制了输出的高度。

#@title

# 导入所需的模块
from IPython.core.magic import register_line_magic
from IPython.display import Javascript
from IPython.display import display as ipy_display

# 定义一个魔术命令,用于设置单元格的最大高度
@register_line_magic
def set_cell_height(size):
  # 调用Javascript代码,设置单元格的最大高度
  ipy_display(
      Javascript("google.colab.output.setIframeHeight(0, true, {maxHeight: " +
                 str(size) + "})"))

使用原始文本作为特征

TF-DF可以原生地处理categorical-set特征。Categorical-sets将文本特征表示为词袋(或n-grams)。

例如:"The little blue dog" {"the", "little", "blue", "dog"}

在这个例子中,您将在Stanford Sentiment Treebank(SST)数据集上训练一个随机森林。该数据集的目标是将句子分类为positivenegative情感。您将使用在TensorFlow Datasets中精选的二分类版本的数据集。

注意: 训练categorical-set特征可能会很昂贵。在这本文中,我们将训练一个包含20棵树的小型随机森林。

# 安装 TensorFlow Datasets 包
!pip install tensorflow-datasets -U --quiet
# 导入tensorflow_datasets库
import tensorflow_datasets as tfds

# 加载数据集
all_ds = tfds.load("glue/sst2")

# 显示测试集中的前3个样例
for example in all_ds["test"].take(3):
  # 打印每个样例的属性名和属性值
  print({attr_name: attr_tensor.numpy() for attr_name, attr_tensor in example.items()})
{'idx': 163, 'label': -1, 'sentence': b'not even the hanson brothers can save it'}
{'idx': 131, 'label': -1, 'sentence': b'strong setup and ambitious goals fade as the film descends into unsophisticated scare tactics and b-film thuggery .'}
{'idx': 1579, 'label': -1, 'sentence': b'too timid to bring a sense of closure to an ugly chapter of the twentieth century .'}


2023-11-20 12:31:28.022927: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

数据集的修改如下:

  1. 原始标签是整数{-1, 1},但学习算法期望的是正整数标签,例如{0, 1}。因此,标签的转换如下:new_labels = (original_labels + 1) / 2
  2. 为了使数据集的读取更加高效,应用了批量大小为64。
  3. sentence属性需要进行分词,即"hello world" -> ["hello", "world"]

注意: 此示例不使用数据集的test拆分,因为它没有标签。如果test拆分有标签,可以将validation折叠连接到train中(例如all_ds["train"].concatenate(all_ds["validation"]))。

细节: 某些决策森林学习算法不需要验证数据集(例如随机森林),而其他一些算法则需要(例如某些情况下的梯度提升树)。由于TF-DF下的每个学习算法可以以不同的方式使用验证数据,TF-DF在内部处理训练/验证拆分。因此,当您有训练和验证集时,它们可以始终作为学习算法的输入进行连接。

# 定义函数prepare_dataset,用于处理数据集
# 参数example为输入的样本数据
def prepare_dataset(example):
    # 将label加1后除以2,得到标签值
    label = (example["label"] + 1) // 2
    # 将句子按空格进行分割,并返回分割后的结果作为"sentence"键的值
    return {"sentence" : tf.strings.split(example["sentence"])}, label

# 将训练数据集all_ds中的"train"部分进行批处理,每批100个样本,并使用prepare_dataset函数进行处理
train_ds = all_ds["train"].batch(100).map(prepare_dataset)

# 将验证数据集all_ds中的"validation"部分进行批处理,每批100个样本,并使用prepare_dataset函数进行处理
test_ds = all_ds["validation"].batch(100).map(prepare_dataset)

最后,像往常一样训练和评估模型。TF-DF会自动将多值分类特征识别为分类集合。

# 设置单元格高度为300
%set_cell_height 300
# 指定模型为随机森林模型,使用30棵树,verbose参数为2表示输出训练过程中的详细信息
model_1 = tfdf.keras.RandomForestModel(num_trees=30, verbose=2)

# 训练模型,使用train_ds作为训练数据集
model_1.fit(x=train_ds)
<IPython.core.display.Javascript object>


Warning: The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.


WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.


Use /tmpfs/tmp/tmpp9alip3z as temporary training directory
Reading training dataset...
Training tensor examples:
Features: {'sentence': tf.RaggedTensor(values=Tensor("data:0", shape=(None,), dtype=string), row_splits=Tensor("data_1:0", shape=(None,), dtype=int64))}
Label: Tensor("data_2:0", shape=(None,), dtype=int64)
Weights: None
Normalized tensor features:
 {'sentence': SemanticTensor(semantic=<Semantic.CATEGORICAL_SET: 4>, tensor=tf.RaggedTensor(values=Tensor("data:0", shape=(None,), dtype=string), row_splits=Tensor("data_1:0", shape=(None,), dtype=int64)))}
Training dataset read in 0:00:04.588912. Found 67349 examples.
Training model...
Standard output detected as not visible to the user e.g. running in a notebook. Creating a training log redirection. If training gets stuck, try calling tfdf.keras.set_training_logs_redirection(False).


[INFO 23-11-20 12:31:32.7845 UTC kernel.cc:771] Start Yggdrasil model training
[INFO 23-11-20 12:31:32.7845 UTC kernel.cc:772] Collect training examples
[INFO 23-11-20 12:31:32.7846 UTC kernel.cc:785] Dataspec guide:
column_guides {
  column_name_pattern: "^__LABEL$"
  type: CATEGORICAL
  categorial {
    min_vocab_frequency: 0
    max_vocab_count: -1
  }
}
default_column_guide {
  categorial {
    max_vocab_count: 2000
  }
  discretized_numerical {
    maximum_num_bins: 255
  }
}
ignore_columns_without_guides: false
detect_numerical_as_discretized_numerical: false

[INFO 23-11-20 12:31:32.7849 UTC kernel.cc:391] Number of batches: 674
[INFO 23-11-20 12:31:32.7849 UTC kernel.cc:392] Number of examples: 67349
[INFO 23-11-20 12:31:32.8290 UTC data_spec_inference.cc:305] 12816 item(s) have been pruned (i.e. they are considered out of dictionary) for the column sentence (2000 item(s) left) because min_value_count=5 and max_number_of_unique_values=2000
[INFO 23-11-20 12:31:32.8820 UTC kernel.cc:792] Training dataset:
Number of records: 67349
Number of columns: 2

Number of columns by type:
	CATEGORICAL_SET: 1 (50%)
	CATEGORICAL: 1 (50%)

Columns:

CATEGORICAL_SET: 1 (50%)
	1: "sentence" CATEGORICAL_SET has-dict vocab-size:2001 num-oods:10187 (15.1257%) most-frequent:"the" 27205 (40.3941%)

CATEGORICAL: 1 (50%)
	0: "__LABEL" CATEGORICAL integerized vocab-size:3 no-ood-item

Terminology:
	nas: Number of non-available (i.e. missing) values.
	ood: Out of dictionary.
	manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
	tokenized: The attribute value is obtained through tokenization.
	has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
	vocab-size: Number of unique values.

[INFO 23-11-20 12:31:32.8821 UTC kernel.cc:808] Configure learner
[INFO 23-11-20 12:31:32.8823 UTC kernel.cc:822] Training config:
learner: "RANDOM_FOREST"
features: "^sentence$"
label: "^__LABEL$"
task: CLASSIFICATION
random_seed: 123456
metadata {
  framework: "TF Keras"
}
pure_serving_model: false
[yggdrasil_decision_forests.model.random_forest.proto.random_forest_config] {
  num_trees: 30
  decision_tree {
    max_depth: 16
    min_examples: 5
    in_split_min_examples_check: true
    keep_non_leaf_label_distribution: true
    num_candidate_attributes: 0
    missing_value_policy: GLOBAL_IMPUTATION
    allow_na_conditions: false
    categorical_set_greedy_forward {
      sampling: 0.1
      max_num_items: -1
      min_item_frequency: 1
    }
    growing_strategy_local {
    }
    categorical {
      cart {
      }
    }
    axis_aligned_split {
    }
    internal {
      sorting_strategy: PRESORTED
    }
    uplift {
      min_examples_in_treatment: 5
      split_score: KULLBACK_LEIBLER
    }
  }
  winner_take_all_inference: true
  compute_oob_performances: true
  compute_oob_variable_importances: false
  num_oob_variable_importances_permutations: 1
  bootstrap_training_dataset: true
  bootstrap_size_ratio: 1
  adapt_bootstrap_size_ratio_for_maximum_training_duration: false
  sampling_with_replacement: true
}

[INFO 23-11-20 12:31:32.8826 UTC kernel.cc:825] Deployment config:
cache_path: "/tmpfs/tmp/tmpp9alip3z/working_cache"
num_threads: 32
try_resume_training: true

[INFO 23-11-20 12:31:32.8828 UTC kernel.cc:887] Train model
[INFO 23-11-20 12:31:32.8836 UTC random_forest.cc:416] Training random forest on 67349 example(s) and 1 feature(s).
[INFO 23-11-20 12:32:02.2437 UTC random_forest.cc:802] Training of tree  1/30 (tree index:13) done accuracy:0.738731 logloss:9.4171
[INFO 23-11-20 12:32:12.3428 UTC random_forest.cc:802] Training of tree  3/30 (tree index:27) done accuracy:0.754745 logloss:6.47525
[INFO 23-11-20 12:32:17.6546 UTC random_forest.cc:802] Training of tree  13/30 (tree index:20) done accuracy:0.801813 logloss:2.334
[INFO 23-11-20 12:32:18.5584 UTC random_forest.cc:802] Training of tree  23/30 (tree index:15) done accuracy:0.81742 logloss:0.942096
[INFO 23-11-20 12:32:21.9457 UTC random_forest.cc:802] Training of tree  30/30 (tree index:21) done accuracy:0.821274 logloss:0.854486
[INFO 23-11-20 12:32:21.9462 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.821274 logloss:0.854486
[INFO 23-11-20 12:32:21.9558 UTC kernel.cc:919] Export model in log directory: /tmpfs/tmp/tmpp9alip3z with prefix d2f2a624a65443d5
[INFO 23-11-20 12:32:21.9870 UTC kernel.cc:937] Save model in resources
[INFO 23-11-20 12:32:21.9901 UTC abstract_model.cc:881] Model self evaluation:
Number of predictions (without weights): 67349
Number of predictions (with weights): 67349
Task: CLASSIFICATION
Label: __LABEL

Accuracy: 0.821274  CI95[W][0.818828 0.8237]
LogLoss: : 0.854486
ErrorRate: : 0.178726

Default Accuracy: : 0.557826
Default LogLoss: : 0.686445
Default ErrorRate: : 0.442174

Confusion Table:
truth\prediction
       1      2
1  19593  10187
2   1850  35719
Total: 67349


[INFO 23-11-20 12:32:22.0155 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpp9alip3z/model/ with prefix d2f2a624a65443d5
[INFO 23-11-20 12:32:22.3248 UTC decision_forest.cc:660] Model loaded with 30 root(s), 43180 node(s), and 1 input feature(s).
[INFO 23-11-20 12:32:22.3249 UTC abstract_model.cc:1344] Engine "RandomForestGeneric" built
[INFO 23-11-20 12:32:22.3249 UTC kernel.cc:1061] Use fast generic engine


Model trained in 0:00:49.561739
Compiling model...
Model compiled.





<keras.src.callbacks.History at 0x7fd79650ec70>

在之前的日志中,注意到 sentence 是一个 CATEGORICAL_SET 特征。

模型的评估与往常一样:

# 对模型进行编译,指定评估指标为准确率
model_1.compile(metrics=["accuracy"])

# 对测试数据集进行评估,返回损失值和准确率
evaluation = model_1.evaluate(test_ds)

# 打印二元交叉熵损失值
print(f"BinaryCrossentropyloss: {evaluation[0]}")

# 打印准确率
print(f"Accuracy: {evaluation[1]}")
1/9 [==>...........................] - ETA: 4s - loss: 0.0000e+00 - accuracy: 0.8100
9/9 [==============================] - 1s 5ms/step - loss: 0.0000e+00 - accuracy: 0.7638
BinaryCrossentropyloss: 0.0
Accuracy: 0.7637614607810974

训练日志如下所示:

# 导入matplotlib.pyplot模块,用于绘制图表
import matplotlib.pyplot as plt

# 获取模型的训练日志
logs = model_1.make_inspector().training_logs()

# 绘制折线图,横坐标为日志中的树的数量,纵坐标为日志中的评估准确率
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])

# 设置横坐标的标签为"Number of trees"
plt.xlabel("Number of trees")

# 设置纵坐标的标签为"Out-of-bag accuracy"
plt.ylabel("Out-of-bag accuracy")

# 保持图表的原样,不做任何处理
pass

更多的树可能会有益处(我确定,因为我试过:p)。

使用预训练的文本嵌入

前面的例子使用原始文本特征训练了一个随机森林模型。这个例子将使用一个预训练的TF-Hub嵌入将文本特征转换为密集嵌入,并在其上训练一个随机森林模型。在这种情况下,随机森林模型只会“看到”嵌入的数值输出(即它不会看到原始文本)。

在这个实验中,我们将使用Universal-Sentence-Encoder。不同的预训练嵌入可能适用于不同类型的文本(例如不同的语言、不同的任务),也适用于其他类型的结构化特征(例如图像)。

**注意:**这个嵌入模块很大(1GB),因此最终模型的运行速度会比传统的决策树推断慢。

嵌入模块可以应用在两个地方:

  1. 在数据集准备阶段。
  2. 在模型的预处理阶段。

通常情况下,第二个选项更可取:将嵌入打包到模型中使得模型更容易使用(也更难被误用)。

首先安装TF-Hub:

# 安装tensorflow-hub库的最新版本
!pip install --upgrade tensorflow-hub
Requirement already satisfied: tensorflow-hub in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (0.15.0)
Requirement already satisfied: numpy>=1.12.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-hub) (1.26.2)
Requirement already satisfied: protobuf>=3.19.6 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-hub) (3.20.3)

与以前不同的是,您不需要对文本进行分词。

# 定义函数prepare_dataset,输入参数为example
def prepare_dataset(example):
    # 将label加1后除以2,得到label的值
    label = (example["label"] + 1) // 2
    # 返回一个字典,键为"sentence",值为example中的"sentence"对应的值,以及label的值
    return {"sentence" : example["sentence"]}, label

# 将all_ds中的"train"数据集按照batch size为100进行分批,并对每个batch应用prepare_dataset函数进行处理,得到train_ds数据集
train_ds = all_ds["train"].batch(100).map(prepare_dataset)

# 将all_ds中的"validation"数据集按照batch size为100进行分批,并对每个batch应用prepare_dataset函数进行处理,得到test_ds数据集
test_ds = all_ds["validation"].batch(100).map(prepare_dataset)
%set_cell_height 300

# 导入tensorflow_hub模块
import tensorflow_hub as hub
# 定义使用的模型为Universal Sentence Encoder,版本为4
hub_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
# 将模型转换为Keras层
embedding = hub.KerasLayer(hub_url)

# 定义输入层,输入为字符串类型的句子
sentence = tf.keras.layers.Input(shape=(), name="sentence", dtype=tf.string)
# 将句子转换为嵌入向量
embedded_sentence = embedding(sentence)

# 定义原始输入为句子,处理后的输入为嵌入向量
raw_inputs = {"sentence": sentence}
processed_inputs = {"embedded_sentence": embedded_sentence}
# 定义预处理模型,将原始输入转换为处理后的输入
preprocessor = tf.keras.Model(inputs=raw_inputs, outputs=processed_inputs)

# 定义随机森林模型,使用预处理模型进行数据预处理,树的数量为100
model_2 = tfdf.keras.RandomForestModel(
    preprocessing=preprocessor,
    num_trees=100)

# 使用训练数据进行模型训练
model_2.fit(x=train_ds)
<IPython.core.display.Javascript object>


Warning: The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.


WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.


Use /tmpfs/tmp/tmp2l8qenh8 as temporary training directory
Reading training dataset...
Training dataset read in 0:00:22.682140. Found 67349 examples.
Training model...


[INFO 23-11-20 12:33:16.6995 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmp2l8qenh8/model/ with prefix a883bbf674954d64


Model trained in 0:00:14.090027
Compiling model...


[INFO 23-11-20 12:33:18.4993 UTC decision_forest.cc:660] Model loaded with 100 root(s), 563552 node(s), and 512 input feature(s).
[INFO 23-11-20 12:33:18.4994 UTC abstract_model.cc:1344] Engine "RandomForestOptPred" built
[INFO 23-11-20 12:33:18.4996 UTC kernel.cc:1061] Use fast generic engine


Model compiled.





<keras.src.callbacks.History at 0x7fd690629e50>
# 编译模型
model_2.compile(metrics=["accuracy"])

# 评估模型
evaluation = model_2.evaluate(test_ds)

# 打印二元交叉熵损失
print(f"BinaryCrossentropyloss: {evaluation[0]}")

# 打印准确率
print(f"Accuracy: {evaluation[1]}")
1/9 [==>...........................] - ETA: 13s - loss: 0.0000e+00 - accuracy: 0.7800
4/9 [============>.................] - ETA: 0s - loss: 0.0000e+00 - accuracy: 0.8075 
7/9 [======================>.......] - ETA: 0s - loss: 0.0000e+00 - accuracy: 0.7886
9/9 [==============================] - 2s 18ms/step - loss: 0.0000e+00 - accuracy: 0.7798
BinaryCrossentropyloss: 0.0
Accuracy: 0.7798165082931519

注意,分类集合与密集嵌入在表示文本时有所不同,因此同时使用这两种策略可能会很有用。

同时训练决策树和神经网络

前面的例子使用了一个预训练的神经网络(NN)来处理文本特征,然后将它们传递给随机森林。这个例子将从头开始训练神经网络和随机森林。

TF-DF的决策森林不会反向传播梯度(尽管这是正在进行的研究的主题)。因此,训练分为两个阶段:

  1. 将神经网络作为标准分类任务进行训练:
示例 → [归一化] → [神经网络*] → [分类头] → 预测
*: 训练。
  1. 用随机森林替换神经网络的头(最后一层和软最大值)。像往常一样训练随机森林:
示例 → [归一化] → [神经网络] → [随机森林*] → 预测
*: 训练。

### 准备数据集

本示例使用[Palmer's Penguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html)数据集。有关详细信息,请参阅[初学者colab](beginner_colab.ipynb)。

首先,下载原始数据:


```python
# 下载penguins.csv文件并保存到指定路径/tmp/penguins.csv

!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv

将数据集加载到Pandas Dataframe中

# 读取csv文件,将数据存储在dataset_df中
dataset_df = pd.read_csv("/tmp/penguins.csv")

# 显示dataset_df中前3个样本的数据
dataset_df.head(3)
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
0AdelieTorgersen39.118.7181.03750.0male2007
1AdelieTorgersen39.517.4186.03800.0female2007
2AdelieTorgersen40.318.0195.03250.0female2007

准备训练数据集。

# 设置标签为"species"
label = "species"
# 将数据集中的数值NaN(表示Pandas Dataframe中的缺失值)替换为0。
# ...神经网络对数值NaN的处理效果不好。
for col in dataset_df.columns:
  # 如果数据集中的列的数据类型不是字符串或对象类型
  if dataset_df[col].dtype not in [str, object]:
    # 将该列中的NaN值替换为0
    dataset_df[col] = dataset_df[col].fillna(0)
# 将数据集拆分为训练集和测试集

def split_dataset(dataset, test_ratio=0.30):
  """将panda dataframe拆分为两个部分。"""
  # 生成一个与数据集长度相同的随机数组,元素值小于测试比例的为True,大于等于测试比例的为False
  test_indices = np.random.rand(len(dataset)) < test_ratio
  # 返回训练集和测试集
  return dataset[~test_indices], dataset[test_indices]

# 调用split_dataset函数将数据集拆分为训练集和测试集
train_ds_pd, test_ds_pd = split_dataset(dataset_df)
# 打印训练集和测试集的样本数量
print("{} 个样本用于训练,{} 个样本用于测试。".format(
    len(train_ds_pd), len(test_ds_pd)))

# 将数据集转换为tensorflow数据集
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)
248 examples in training, 96 examples for testing.

构建模型

接下来,使用Keras的函数式风格创建神经网络模型。

为了保持示例简单,该模型仅使用两个输入。

# 创建两个输入层
input_1 = tf.keras.Input(shape=(1,), name="bill_length_mm", dtype="float")  # 输入层1,表示企鹅嘴峰的长度,数据类型为浮点数
input_2 = tf.keras.Input(shape=(1,), name="island", dtype="string")  # 输入层2,表示企鹅所在的岛屿,数据类型为字符串

# 将两个输入层组合成一个列表
nn_raw_inputs = [input_1, input_2]  # 输入层列表,包含两个输入层

使用预处理层将原始输入转换为适合神经网络的输入。

# 正则化
Normalization = tf.keras.layers.Normalization
CategoryEncoding = tf.keras.layers.CategoryEncoding
StringLookup = tf.keras.layers.StringLookup

# 获取"bill_length_mm"列的值,并将其转换为二维数组
values = train_ds_pd["bill_length_mm"].values[:, tf.newaxis]
# 创建Normalization层实例
input_1_normalizer = Normalization()
# 对输入数据进行适应,计算均值和方差
input_1_normalizer.adapt(values)

# 获取"island"列的值
values = train_ds_pd["island"].values
# 创建StringLookup层实例,将字符串转换为整数索引
input_2_indexer = StringLookup(max_tokens=32)
# 对输入数据进行适应,构建索引映射关系
input_2_indexer.adapt(values)

# 创建CategoryEncoding层实例,将整数索引转换为二进制编码
input_2_onehot = CategoryEncoding(output_mode="binary", max_tokens=32)

# 对输入数据进行正则化处理
normalized_input_1 = input_1_normalizer(input_1)
# 将输入数据转换为整数索引,并进行二进制编码
normalized_input_2 = input_2_onehot(input_2_indexer(input_2))

# 组合处理后的输入数据
nn_processed_inputs = [normalized_input_1, normalized_input_2]
WARNING:tensorflow:max_tokens is deprecated, please use num_tokens instead.


WARNING:tensorflow:max_tokens is deprecated, please use num_tokens instead.

构建神经网络的主体部分:

# 创建一个Concatenate层,用于将nn_processed_inputs中的张量连接起来
y = tf.keras.layers.Concatenate()(nn_processed_inputs)

# 创建一个Dense层,输出维度为16,激活函数为relu6
y = tf.keras.layers.Dense(16, activation=tf.nn.relu6)(y)

# 创建一个Dense层,输出维度为8,激活函数为relu,命名为"last"
last_layer = tf.keras.layers.Dense(8, activation=tf.nn.relu, name="last")(y)

# 创建一个Dense层,输出维度为3,用于分类任务
classification_output = tf.keras.layers.Dense(3)(y)

# 创建一个模型,输入为nn_raw_inputs,输出为classification_output
nn_model = tf.keras.models.Model(nn_raw_inputs, classification_output)

这个 nn_model 直接产生分类的 logits。

接下来创建一个决策森林模型。它将在神经网络在分类头之前提取的高级特征上操作。


# 将神经网络模型去掉输出层,得到nn_without_head模型
# 将神经网络模型和决策森林模型组合成一个keras模型,即df_and_nn_model
nn_without_head = tf.keras.models.Model(inputs=nn_model.inputs, outputs=last_layer)
df_and_nn_model = tfdf.keras.RandomForestModel(preprocessing=nn_without_head)
Warning: The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.


WARNING:absl:The `num_threads` constructor argument is not set and the number of CPU is os.cpu_count()=32 > 32. Setting num_threads to 32. Set num_threads manually to use more than 32 cpus.


Use /tmpfs/tmp/tmpzwv9a980 as temporary training directory

训练和评估模型

该模型将分为两个阶段进行训练。首先,使用自己的分类头训练神经网络:

# 设置单元格高度为300

# 编译神经网络模型
nn_model.compile(
  optimizer=tf.keras.optimizers.Adam(),  # 使用Adam优化器
  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # 使用稀疏分类交叉熵损失函数
  metrics=["accuracy"]  # 评估指标为准确率
)

# 使用训练数据集进行训练,并使用测试数据集进行验证,训练10个epochs
nn_model.fit(x=train_ds, validation_data=test_ds, epochs=10)

# 打印神经网络模型的概要信息
nn_model.summary()
<IPython.core.display.Javascript object>


Epoch 1/10


/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/keras/src/engine/functional.py:642: UserWarning: Input dict contained keys ['bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year'] which did not match any model input. They will be ignored by the model.
  inputs = self._flatten_to_reference_inputs(inputs)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1700483606.110085  457876 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.



1/1 [==============================] - ETA: 0s - loss: 1.4043 - accuracy: 0.0161
1/1 [==============================] - 2s 2s/step - loss: 1.4043 - accuracy: 0.0161 - val_loss: 1.3502 - val_accuracy: 0.0208
Epoch 2/10

1/1 [==============================] - ETA: 0s - loss: 1.3963 - accuracy: 0.0161
1/1 [==============================] - 0s 21ms/step - loss: 1.3963 - accuracy: 0.0161 - val_loss: 1.3436 - val_accuracy: 0.0208
Epoch 3/10

1/1 [==============================] - ETA: 0s - loss: 1.3885 - accuracy: 0.0161
1/1 [==============================] - 0s 21ms/step - loss: 1.3885 - accuracy: 0.0161 - val_loss: 1.3371 - val_accuracy: 0.0208
Epoch 4/10

1/1 [==============================] - ETA: 0s - loss: 1.3809 - accuracy: 0.0161
1/1 [==============================] - 0s 21ms/step - loss: 1.3809 - accuracy: 0.0161 - val_loss: 1.3305 - val_accuracy: 0.0208
Epoch 5/10

1/1 [==============================] - ETA: 0s - loss: 1.3733 - accuracy: 0.0161
1/1 [==============================] - 0s 21ms/step - loss: 1.3733 - accuracy: 0.0161 - val_loss: 1.3241 - val_accuracy: 0.0208
Epoch 6/10

1/1 [==============================] - ETA: 0s - loss: 1.3658 - accuracy: 0.0161
1/1 [==============================] - 0s 20ms/step - loss: 1.3658 - accuracy: 0.0161 - val_loss: 1.3177 - val_accuracy: 0.0208
Epoch 7/10

1/1 [==============================] - ETA: 0s - loss: 1.3584 - accuracy: 0.0161
1/1 [==============================] - 0s 20ms/step - loss: 1.3584 - accuracy: 0.0161 - val_loss: 1.3113 - val_accuracy: 0.0208
Epoch 8/10

1/1 [==============================] - ETA: 0s - loss: 1.3511 - accuracy: 0.0081
1/1 [==============================] - 0s 20ms/step - loss: 1.3511 - accuracy: 0.0081 - val_loss: 1.3050 - val_accuracy: 0.0208
Epoch 9/10

1/1 [==============================] - ETA: 0s - loss: 1.3440 - accuracy: 0.0081
1/1 [==============================] - 0s 21ms/step - loss: 1.3440 - accuracy: 0.0081 - val_loss: 1.2988 - val_accuracy: 0.0208
Epoch 10/10

1/1 [==============================] - ETA: 0s - loss: 1.3369 - accuracy: 0.0121
1/1 [==============================] - 0s 21ms/step - loss: 1.3369 - accuracy: 0.0121 - val_loss: 1.2927 - val_accuracy: 0.0312
Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
==================================================================================================
 island (InputLayer)         [(None, 1)]                  0         []                            
                                                                                                  
 bill_length_mm (InputLayer  [(None, 1)]                  0         []                            
 )                                                                                                
                                                                                                  
 string_lookup (StringLooku  (None, 1)                    0         ['island[0][0]']              
 p)                                                                                               
                                                                                                  
 normalization (Normalizati  (None, 1)                    3         ['bill_length_mm[0][0]']      
 on)                                                                                              
                                                                                                  
 category_encoding (Categor  (None, 32)                   0         ['string_lookup[0][0]']       
 yEncoding)                                                                                       
                                                                                                  
 concatenate (Concatenate)   (None, 33)                   0         ['normalization[0][0]',       
                                                                     'category_encoding[0][0]']   
                                                                                                  
 dense (Dense)               (None, 16)                   544       ['concatenate[0][0]']         
                                                                                                  
 dense_1 (Dense)             (None, 3)                    51        ['dense[0][0]']               
                                                                                                  
==================================================================================================
Total params: 598 (2.34 KB)
Trainable params: 595 (2.32 KB)
Non-trainable params: 3 (16.00 Byte)
__________________________________________________________________________________________________

神经网络层在两个模型之间共享。因此,现在神经网络已经训练好了,决策森林模型将适应于神经网络层的训练输出。

# 使用df_and_nn_model模型对训练数据集train_ds进行拟合

df_and_nn_model.fit(x=train_ds)
<IPython.core.display.Javascript object>


Reading training dataset...
Training dataset read in 0:00:00.293304. Found 248 examples.
Training model...
Model trained in 0:00:00.045032
Compiling model...
Model compiled.


[INFO 23-11-20 12:33:27.2559 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpzwv9a980/model/ with prefix 3397b294ee2f42a4
[INFO 23-11-20 12:33:27.2721 UTC decision_forest.cc:660] Model loaded with 300 root(s), 5280 node(s), and 7 input feature(s).
[INFO 23-11-20 12:33:27.2721 UTC kernel.cc:1061] Use fast generic engine





<keras.src.callbacks.History at 0x7fd6a07d5ac0>

现在评估组合模型:

# 编译模型
df_and_nn_model.compile(metrics=["accuracy"])

# 打印模型评估结果
print("Evaluation:", df_and_nn_model.evaluate(test_ds))
1/1 [==============================] - ETA: 0s - loss: 0.0000e+00 - accuracy: 0.9479
1/1 [==============================] - 0s 162ms/step - loss: 0.0000e+00 - accuracy: 0.9479
Evaluation: [0.0, 0.9479166865348816]

与仅使用神经网络相比:

# 打印输出 "Evaluation :",并调用神经网络模型的 evaluate 方法对测试数据集进行评估
print("Evaluation :", nn_model.evaluate(test_ds))
1/1 [==============================] - ETA: 0s - loss: 1.2927 - accuracy: 0.0312
1/1 [==============================] - 0s 13ms/step - loss: 1.2927 - accuracy: 0.0312
Evaluation : [1.2926578521728516, 0.03125]

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/271080.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

uniapp中如何使用image图片

当在UniApp中使用图片时&#xff0c;可以通过<image>标签将图片显示在页面上。这个标签可以指定src属性来引用图片&#xff0c;并且可以通过mode属性来设置图片的显示模式。除此之外&#xff0c;还可以利用click事件来实现图片的点击事件。在编写代码时&#xff0c;要注意…

shell 编程中内置的变量(冷门又好用)

简介 分别盘点一下 shell 中的内置变量&#xff0c;真的巨好用&#xff01;&#xff01;&#xff01;包括&#xff1a;环境变量类、shell 变量类、终端设置类和其他一些变量。 常用的内置变量目录如下 1. 环境变量类 $MACHTYPE&#xff1a;机器类型 $OSTYPE&#xff1a;操作…

超声系统前端理论与模拟仿真-续

作者&#xff1a;蒋志强 本人同意他人对我的文章引用&#xff0c;但请在引用时注明出处&#xff0c;谢谢&#xff0e;作者&#xff1a;蒋志强 前言 近期整理了一下彩超前端及波束合成相关的内容&#xff0c;很早以前已经有过一次&#xff0c;这次把其它的内容总结一下&#xf…

【自定义磨砂动态背景】前端及pyqt6实现

如何实现一个自定义的磨砂动态背景呢&#xff1f; 这种效果看起来特别的高端&#xff0c;很新颖美观。 具体的效果可以看这里的演示&#xff1a;https://www.bilibili.com/video/BV1zj411H7wd/ 其实原理就是底层有多个多彩多边形在移动&#xff0c;然后再盖上一层模糊滤镜。 前…

测试服务器带宽(ubuntu)

apt install python3 python3-pippip3 install speedtest-clispeestest-cli

通过自然语言处理增强推荐系统:协同方法

一、介绍 自然语言处理 (NLP) 是人工智能的一个分支&#xff0c;专注于使机器能够以有意义且有用的方式理解、解释和响应人类语言。它包含一系列技术&#xff0c;包括情感分析、语言翻译和聊天机器人。 另一方面&#xff0c;推荐系统&#xff08;RecSys&#xff09;是旨在向用户…

机器视觉系统选型-避免畸变

在定位及高精度测量的系统中&#xff0c;镜头畸变的影响尤其重要 • 使用远心镜头 • 进行系统标定

【Unity+AI】SentisAI大模型植入Unity

相关教程 把AI模型放入Unity-手写数字识别【UnitySentis入门-1】_哔哩哔哩_bilibili Unity官方文档创建引擎以运行模型 |森蒂斯 |1.2.0-exp.2 (unity3d.com) 教程转载自&#xff1a;AI Tools for Game Dev 游戏开发人工智能工具/UnitySentis/01-Unity Sentis 入门.md chut…

听GPT 讲Rust源代码--src/tools(27)

File: rust/src/tools/clippy/clippy_lints/src/methods/suspicious_to_owned.rs 文件rust/src/tools/clippy/clippy_lints/src/methods/suspicious_to_owned.rs的作用是实施Clippy lint规则&#xff0c;检测产生潜在性能问题的字符转换代码&#xff0c;并给出相关建议。 在Rus…

机器人制作开源方案 | 森林管理员

​作者&#xff1a;李佳骏、常睿康、张智斌、李世斌、高华耸 单位&#xff1a;山西能源学院 指导老师&#xff1a;赵浩成、郜敏 1. 研究背景 森林作为地球上可再生自然资源及陆地生态的主体&#xff0c;在人类生存和发展的历史中起着不可代替的作用&#xff0c;它不仅能提供…

多功能视频编码标准 H.266/VVC

视频技术的最新进展潜在地增加了对流传输和广播视频内容的需求。在今日 人们依靠视频来相识并建立联系&#xff0c;借助超高清(Ultra High Definition&#xff0c;UHD)、虚 拟现实(Virtual Reality&#xff0c;VR)和 360 全景视频等技术&#xff0c;人们可以在线捕捉并分享最 快…

119. 杨辉三角 II(Java)

给定一个非负索引 rowIndex&#xff0c;返回「杨辉三角」的第 rowIndex 行。 在「杨辉三角」中&#xff0c;每个数是它左上方和右上方的数的和。 示例 1: 输入: rowIndex 3 输出: [1,3,3,1]示例 2: 输入: rowIndex 0 输出: [1]示例 3: 输入: rowIndex 1 输出: [1,1]提示…

彻底卸载Keil4

彻底卸载Keil4 双击 然后回到该软件的文件夹位置&#xff0c;把该文件夹删除即可&#xff0c;然后清一下回收站。

2024年软考电子商务设计师如何备考?考什么?

一、电子商务设计师概述&#xff1f; 电子商务设计师属于软考中级资格考试&#xff0c;软考是由国家人力资源和社会保障部&#xff08;原人事部&#xff09;、工业和信息化部&#xff08;原信息产业部&#xff09;领导的国家级考试&#xff0c;其目的是&#xff0c;科学、公正…

Hive05_DML 操作

1 DML 数据操作 1.1 数据导入 1.1.1 向表中装载数据&#xff08;Load&#xff09; 1&#xff09;语法 hive> load data [local] inpath 数据的 path [overwrite] into table student [partition (partcol1val1,…)];&#xff08;1&#xff09;load data:表示加载数据 &…

20231225在WIN10下使用SSH连接Ubuntu20.04.6

20231225在WIN10下使用SSH连接Ubuntu20.04.6 2023/12/25 23:03 https://jingyan.baidu.com/article/5552ef479e1856108ffbc9e3.html Win10怎么开启SSH功能 Win10怎么开启SSH功能,下面就一起来看看吧! 工具/原料 华硕天选4 Windows10 方法/步骤 点击左下角的开始菜单,打开Wind…

排序算法中In-place和Out-place

参考链接 排序方式中 In-place 和 Out-place&#xff1a; in-place 占用常数内存&#xff0c;不占用额外内存 假如问题规模是n&#xff0c;在解决问题过程中&#xff0c;只开辟了常数量的空间&#xff0c;与n无关&#xff0c;这是原址操作&#xff0c;就是In-place。 例 &…

【论文阅读】MCANet: Medical Image Segmentation with Multi-Scale Cross-Axis Attention

文章目录 摘要创新点总结实现效果总结 摘要 链接&#xff1a;https://arxiv.org/abs/2312.08866 医学图像分割是医学图像处理和计算机视觉领域的关键挑战之一。由于病变区域或器官的大小和形状各异&#xff0c;有效地捕捉多尺度信息和建立像素间的长距离依赖性至关重要。本文提…

学Java的第三天

一、自增自减运算符 &#xff0c;自增&#xff0c;变量的值加一 --&#xff0c;自减&#xff0c;变量的值减一 和-- 既可以放在变量的后边&#xff0c;也可以放在变量的前边。 单独使用的时候&#xff0c; 和 -- 放在变量的前边还是后边&#xff0c;结果是一样的。 参与操作…

【Hadoop】ZooKeeper数据模型Znode

ZooKeeper 数据模型ZnodeZooKeeper 中的时间ZooKeeper 节点属性 ZooKeeper 数据模型Znode 前面提过&#xff0c;Zookeeper相当于文件系统通知机制。既然是文件系统&#xff0c;那就涉及数据模型。 ZooKeeper 的数据模型在结构上和Unix标准文件系统非常相似&#xff0c;都是采用…