微调大型语言模型进行命名实体识别

大型语言模型的目标是理解和生成与人类语言类似的文本。它们经过大规模的训练,能够对输入的文本进行分析,并生成符合语法和语境的回复。这种模型可以用于各种任务,包括问答系统、对话机器人、文本生成、翻译等。

命名实体识别(Named Entity Recognition,简称NER)是一种常见的应用方法,可以让模型学会识别文本中的命名实体,如人名、地名、组织机构名等。

大型语言模型在训练时通过大量的文本数据学习了丰富的语言结构和上下文信息。这使得模型能够更好地理解命名实体在文本中的上下文,提高了识别的准确性。即使模型在训练过程中没有见过某个命名实体,它也可以通过上下文推断该实体的类别。这意味着模型可以处理新的、未知的实体,而无需重新训练。除此以外我们还能通过微调(fine-tuning)在特定任务上进行优化。这种迁移学习的方法使得在不同领域和任务上进行NER更加高效。

这篇文章总结了命名实体识别(NER)问题微调大型语言模型的经验。我们将以个人身份信息(PII)为例来介绍大型语言模型进行NER微调的方法。

个人可识别信息(Personal Identifiable Information,PII)

个人可识别信息(Personal Identifiable Information,PII)是指可以用于识别、联系或定位个人身份的数据或信息。这些信息可以单独使用或结合其他信息,使得可以辨认特定的个人。PII通常包括但不限于以下内容:

  • 全名
  • 电子邮件地址
  • 身份证号码
  • 驾驶证号码
  • 社会安全号码
  • 银行账号
  • 生日
  • 地址

这些信息的泄露可能会导致身份盗窃、个人隐私侵犯等问题,因此保护PII对于保障个人隐私和安全至关重要。

HIPAA隐私规定(Health Insurance Portability and Accountability Act Privacy Rules,简称HIPAA Privacy Rules)是一组法规,旨在保护医疗保健信息的隐私和安全。这些规定是由美国联邦政府颁布,适用于医疗保健提供者、健康计划、医疗支付者以及与这些实体交换医疗信息的其他组织和个人。

“Safe Harbor method” 是指在HIPAA(Health Insurance Portability and Accountability Act)隐私规定中的一种数据安全标准。这个方法允许医疗保健机构和其他涉及医疗信息的实体在某些条件下共享个人健康信息,而不会被认为违反HIPAA的隐私规定。

在Safe Harbor方法下,共享的个人健康信息必须经过匿名化处理,以使其不再能够识别特定的个人。HIPAA规定了一组特定的标识符,包括但不限于以下信息:

  • 医疗记录号码
  • 社会安全号码
  • 驾驶证号码
  • 信用卡号码

如果这些标识符被移除,或者通过某种方式使得个人健康信息无法与特定的个人相关联,那么这些信息就被视为符合Safe Harbor标准。

所以NER就派上了用武之地,可以对LLM进行微调,生成包含检测到的PII实体的结构良好的字符串,然后再进行匿名化处理来保证个人健康信息的安全性。

对大型语言模型进行微调

微调LLM主要有以下2个方面的挑战:

调优的LLM不应该产生命名实体的幻觉。应该从一组受控的实体标签中进行检测。

微调LLM应该生成结构良好的输出。LLM输出不应包含无关信息(例如,解释为什么检测到某些实体)。因为输出中的额外令牌导致每个输入的推理成本更高。并且下游任务也无法使用。

所以我们就先格式化训练数据

一个典型的NER输入和输出格式如下。

 # INPUT
 
 test_example="My name is John Doe and I can be contacted at 111-222-3334"
 
 
 # GROUND TRUTH NER DETECTIONS
 test_detections= [
   {
     "entity_type": "PERSON",
     "entity_value": "John Doe",
     "start_position": 11,
     "end_position": 19,
   },
   {
     "entity_type": "PHONE_NUMBER",
     "entity_value": "111-222-3334",
     "start_position": 46,
     "end_position": 58,
   }
 ]

输出数据可以通过多种方式进行格式化。对于典型的NER数据集,广泛采用BIO格式。

BIO 格式是命名实体识别(Named Entity Recognition,NER)任务中常用的标注格式,用于标记文本中的命名实体。BIO 格式包括三种标记:B、I 和 O。

  • B(Beginning):表示一个命名实体的开头。
  • I(Inside):表示一个命名实体的内部。
  • O(Outside):表示不是命名实体的词。
 ## BIO Tags for sentence - Alex is going to Los Angeles in California
 
 Alex I-PER
 is O
 going O
 to O
 Los I-LOC
 Angeles I-LOC
 in O
 California I-LOC

BIO格式是非常具体的。它需要理解标记是特定实体标签的“内部”(I)和“外部”(O)。并且还要标识出实体标签开始的字符,这会在制定LLM任务描述提示时增加不必要的复杂性。

我们可以尝试了以下两种输出格式

 # JSON encoded string with NER detections
 
 llm_output_str = "[{\"entity_type\": \"PERSON\",\"entity_value\": \"John Doe\",\"start_position\": 11,\"end_position\": 19,},{\"entity_type\": \"PHONE_NUMBER\",\"entity_value\": \"111-222-3334\",\"start_position\": 46,\"end_position\": 58}]"

这些字符串看起来与一般的NER数据完全相同。无需对LLM输出进行任何额外的数据处理。并且我们可以直接使用json.loads(llm_output_str)来读取模型的输出。

但是这里我们必须要保证模型输出必须为正确的JSON编码字符串,并且还要记录字符串在输入的位置,这对于模型来说是有点困难的。

或者我们让模型直接将NER的标签进行标记,比如:

 llm_output_str = "My name is <PERSON>John Doe</PERSON> and I can be contacted at <PHONE_NUMBER>111-222-3334</PHONE_NUMBER>"

让模型在输入中包含相关的<ENTITY_LABEL> </ENTITY_LABEL>标记,这样对于我们查看结果是非常方便的,但是对于编码来说还必须对LLM生成的输出进行后期处理,解析检测到的实体的实体以及开始和结束字符索引,这会增加我们的代码量。并且这种方法我们需要保证在输出时没有任何令牌产生幻觉,而且输入中的所有字符、标点和词序都需要保留,这对于LLM来说也有一些困难。

那么我们选择哪一个呢?在本文的最后,我们将看到哪种格式的输出字符串效果更好。

提示设计

现在我们有了输入和输出数据格式,下面就需要设计一个向LLM描述任务的提示符。提示设计是非常非常重要的,这回影响到LLM的输出

《 QUANTIFYING LANGUAGE MODELS’ SENSITIVITY TO SPURIOUS FEATURES IN PROMPT DESIGN or: How I learned to start worrying about prompt formatting》论文讨论了提示对于模型性能的变化,有兴趣的可以看看

对于任务描述,我们使用单独提示来生成json和格式的输出字符串。对于生成json字符串的模型,使用了以下任务描述

 You are given a user utterance that may contain Personal Identifiable 
 Information (PII). You are also given a list of entity types representing 
 personal identifiable information (PII). Your task is to detect and identify 
 all instances of the supplied PII entity types in the user utterance. Provide 
 a JSON output with keys: 'entity_type' (label of the detected entity), 
 'entity_value' (actual string value of the entity), 'start_position' 
 (start character index of the entity in the user utterance string), and 
 'end_position' (end character index of the entity in the user utterance string) 
 Ensure accuracy in identification of entities with correct start_position and 
 end_position character indices. Ensure that all entities are identified. Do 
 not perform false identifications.

对于输出字符串,我使用了以下任务描述

 You are given a user utterance that may contain Personal Identifiable 
 Information (PII). You are also given a list of entity types representing 
 Personal Identifiable Information (PII). Your task is to detect and identify 
 all instances of the supplied PII entity types in the user utterance. 
 The output must have the same content as the input. Only the tokens that match 
 the PII entities in the list should be enclosed within XML tags. The XML tag 
 comes from the PII entities described in the list below. For example, a name 
 of a person should be enclosed within <PERSON></PERSON> tags. Ensure that all 
 entities are identified. Do not perform false identifications.

提示还需要包含实体类型及其描述的列表,以确保模型只检测来自受控标签列表的实体。我选择了下面的模板

 List Of Entities
 PERSON: Name of a person
 Rx_NUMBER: Number identifying a medical prescription
 ORDER_NUMBER: Number identifying a retail order
 EMAIL_ADDRESS: Email address
 PHONE_NUMBER: Telephone or mobile number
 DATE_TIME: Dates and Times
 US_SSN: Social Security Number in the United States

我们针对上面的提示进行以下的测试

 ## Few shot example input
 "My name is John Doe and I can be contacted at 111-222-3334"
 ## Few shot example output
 "My name is <PERSON>John Doe</PERSON> and I can be contacted at <PHONE_NUMBER>111-222-3334</PHONE_NUMBER>"
 
 ## Actual input
 "My phone number is 222-333-4445 and my name is Ana Jones"
 ## Incorrect Model output - model rephrases the output to match closer to few shot example output
 "My name is <PERSON>Ana Jones</PERSON> and my phone number is <PHONE_NUMBER>222-333-4445</PHONE_NUMBER>"
 
 ## What model should have generated
 "My phone number is <PHONE_NUMBER>222-333-4445</PHONE_NUMBER> and my name is <PERSON>Ana Jones</PERSON>"

我们可以非常明显的看到模型输出中生成虚假的标记,实体都已经区分出来了,但是位置变了。我们可以在prompt中加入一些少样本的示例来让模型强制学习,但是这回增加prompt的输入令牌数。

在提示中加入Chain-Of-Thought

除了在会话中嵌入少样本示例外,我们还可以让模型以简洁的方式重新描述指令。这加强了模型对任务的理解,可以获得更好、更一致的格式化输出。我还让模型向我“解释”,给定任务描述,为什么示例输入和输出是有意义的。比如说以下的提示

 # First user message
 usr_msg1="""
 You are given a user utterance that may contain Personal Identifiable 
 Information (PII). You are also given a list of entity types representing 
 Personal Identifiable Information (PII). Your task is to detect and identify 
 all instances of the supplied PII entity types in the user utterance. 
 The output must have the same content as the input. Only the tokens that match 
 the PII entities in the list should be enclosed within XML tags. The XML tag 
 comes from the PII entities described in the list below. For example, a name 
 of a person should be enclosed within <PERSON></PERSON> tags. Ensure that all 
 entities are identified. Do not perform false identifications.
 
 List Of Entities
 PERSON: Name of a person
 Rx_NUMBER: Number identifying a medical prescription
 ORDER_NUMBER: Number identifying a retail order
 EMAIL_ADDRESS: Email address
 PHONE_NUMBER: Telephone or mobile number
 DATE_TIME: Dates and Times
 US_SSN: Social Security Number in the United States
 
 Are the instructions clear to you?
 """
 
 # First assistant message which is a reply to usr_msg1
 # I hardcode this msg once the model produced a resonably good response.
 # Note that the response comes from a non-fine-tuned version of the model.
 # The idea is to sample a good 'thought' from the base version of the model.
 asst_msg1="""
 Yes, the instructions are clear. I will identify and enclose within the 
 corresponding XML tags, all instances of the specified PII entity types in 
 the user utterance. For example, <PERSON><Name of a person></PERSON>, 
 <Rx_NUMBER><Number identifying a medical prescription></Rx_NUMBER>, etc. 
 leaving the rest of the user utterance unchanged.
 """
 
 # Here I hardcode a few shot example as a user message
 usr_msg2="""
 My name is John Doe, and can be contacted at 111-222-3334
 """
 
 # Here I hardcode the appropriate response from the assitant as the
 # correct output of the few shot example
 asst_msg2="""
 My name is <PERSON>John Doe</PERSON>, and can be contacted at <PHONE_NUMBER>111-222-3334</PHONE_NUMBER>
 """
 
 # Here I ask the assistant why its response to the last user message was the 
 # correct response
 usr_msg3="""
 Give a brief explanation of why your answer is correct.
 """
 
 # I hardcode the explaination from the model.
 # Note that the explaination comes from a non-fine-tuned version of the model
 asst_msg3="""
 I identified and enclosed within corresponding XML tags, all instances of 
 the specified PII entity types in the user utterance - a Person Name 
 "John Doe" within the <PERSON></PERSON> tag, and a Phone Number 
 "111-222-3334" within the <PHONE_NUMBER></PHONE_NUMBER> tag. 
 The rest of the user utterance was left unchanged as it did not contain any 
 other identified PII entities.
 """
 
 # The rest of the user and assistant msgs are hardcoded so that the model is
 # in a state where it expects another input from the user
 
 usr_msg4="""
 Great! I am now going to give you another user utterance. Please 
 detect PII entities in it according to the previous instructions. Do 
 not include an explanation in your answer.
 """
 
 asst_msg4="""
 Sure! Please give me the user utterance.
 """
 
 # usr_msg5 would be the actual input string on which we want to detect the 
 # PII entities

创建完整训练数据文件的样例函数如下所示

 def get_fine_tune_prompt_xml(
     rule_set: List[str], 
     input_str: str,
     label_str: str,
     tokenizer: PreTrainedTokenizerBase,
 ) -> torch.Tensor:
 
     """
     Args:
     rule_set (List[str]): List of strings representing entity labels and its
                           corresponding description
     input_str (str): Actual input string on which detections need to be
                      performed
     label_str (str): Expected output string corresponding to input_str
     tokenizer (PreTrainedTokenizerBase): A tokenizer corresponding to the model
                                          being fine-tuned
 
     Returns: 
     torch.Tensor: Tensor of tokenized input ids
     """
     rule_str = "\n".join(rule_set)
 
     usr_msg1 = "You are given a user utterance that may contain Personal Identifiable Information (PII). " \
         "You are also given a list of entity types representing Personal Identifiable Information (PII). " \
         "Your task is to detect and identify all instances of the supplied PII entity types in the user utterance. " \
         "The output must have the same content as the input. Only the tokens that match the PII entities in the " \
         "list should be enclosed within XML tags. The XML tag comes from the PII entities described in the list below. " \
         "For example, a name of a person should be enclosed within <PERSON></PERSON> tags." \
         "Ensure that all entities are identified. Do not perform false identifications." \
         f"""\n\nList Of Entities\n{rule_str}"""\
         "\n\n" \
         "Are the instructions clear to you?"
     
     asst_msg1 = "Yes, the instructions are clear. I will identify and enclose within the corresponding XML tags, " \
         "all instances of the specified PII entity types in the user utterance. For example, " \
         "<PERSON><Name of a person></PERSON>, <Rx_NUMBER><Number identifying a medical prescription></Rx_NUMBER>, etc. " \
         "leaving the rest of the user utterance unchanged."
     
     usr_msg2 = "My name is John Doe, and can be contacted at 111-222-3334"
 
     asst_msg2 = "My name is <PERSON>John Doe</PERSON>, and can be contacted at <PHONE_NUMBER>536-647-8464</PHONE_NUMBER>"
 
     usr_msg3 = "Give a brief explanation of why your answer is correct."
 
     asst_msg3 = "I identified and enclosed within corresponding XML tags, all instances of the specified PII " \
         "entity types in the user utterance - a Person Name \"John Doe\" within the <PERSON></PERSON> tag, and " \
         "a Phone Number \"536-647-8464\" within the <PHONE_NUMBER></PHONE_NUMBER> tag. The rest of the user " \
         "utterance was left unchanged as it did not contain any other identified PII entities."
     
     usr_msg4 = "Great! I am now going to give you another user utterance. Please detect PII entities in it " \
         "according to the previous instructions. Do not include an explanation in your answer."
     
     asst_msg4 = "Sure! Please give me the user utterance."
 
     messages = [
         {"role": "user", "content": usr_msg1},
         {"role": "assistant", "content": asst_msg1},
         {"role": "user", "content": usr_msg2},
         {"role": "assistant", "content": asst_msg2},
         {"role": "user", "content": usr_msg3},
         {"role": "assistant", "content": asst_msg3},
         {"role": "user", "content": usr_msg4},
         {"role": "assistant", "content": asst_msg4},
         {"role": "user", "content": input_str},
         {"role": "assistant", "content": label_str},
     ]
     
    encoded_input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
 
    return encoded_input_ids

函数的输出如下所示

 <s> [INST] You are given a user utterance that may contain Personal Identifiable Information (PII). You are also given a list of entity types representing Personal Identifiable Information (PII). Your task is to detect and identify all instances of the supplied PII entity types in the user utterance. The output must have the same content as the input. Only the tokens that match the PII entities in the list should be enclosed within XML tags. The XML tag comes from the PII entities described in the list below. For example, a name of a person should be enclosed within <PERSON></PERSON> tags. Ensure that all entities are identified. Do not perform false identifications.
 
 List Of Entities
 PERSON: Name of a person
 Rx_NUMBER: Number identifying a medical prescription
 ORDER_NUMBER: Number identifying a retail order
 EMAIL_ADDRESS: Email address
 PHONE_NUMBER: Telephone or mobile number
 DATE_TIME: Dates and Times
 US_SSN: Social Security Number in the United States
 
 Are the instructions clear to you? [/INST]Yes, the instructions are clear. I will identify and enclose within the corresponding XML tags, all instances of the specified PII entity types in the user utterance. For example, <PERSON><Name of a person></PERSON>, <Rx_NUMBER><Number identifying a medical prescription></Rx_NUMBER>, etc. leaving the rest of the user utterance unchanged.</s> [INST] My name is John Doe, and can be contacted at 111-222-3334 [/INST]My name is <PERSON>John Doe</PERSON>, and can be contacted at <PHONE_NUMBER>111-222-3334</PHONE_NUMBER></s> [INST] Give a brief explanation of why your answer is correct. [/INST]I identified and enclosed within corresponding XML tags, all instances of the specified PII entity types in the user utterance - a Person Name "John Doe" within the <PERSON></PERSON> tag, and a Phone Number "111-222-3334" within the <PHONE_NUMBER></PHONE_NUMBER> tag. The rest of the user utterance was left unchanged as it did not contain any other identified PII entities.</s> [INST] Great! I am now going to give you another user utterance. Please detect PII entities in it according to the previous instructions. Do not include an explanation in your answer. [/INST]Sure! Please give me the user utterance.</s> [INST] Hi! is Dr. Danielle Boyd at the clinic [/INST]Hi! is Dr. <PERSON>Danielle Boyd</PERSON> at the clinic</s>

tokenizer.apply_chat_template()负责将’ [INST] ‘和’ [/INST] ‘应用于用户消息,并将’ ‘(序列令牌结束)应用于辅助消息。还要注意,标记器负责将’ '(序列的开始)标记应用到提示符的开头。这些微小的细节对模型在微调过程中是否能有效地学习和收敛有巨大的影响。

自定义损失

自回归模型(像大多数llm一样)被训练来正确预测“下一个令牌”。给定我们刚刚创建的训练数据样本和微调训练设置,模型将学习预测文本所有部分的下一个标记,即任务描述、实体列表、样本示例、会话历史中硬编码的模型思维链等。

这将使模型除了学习预测正确的结果外,还学习任务描述中的令牌分布。这使得我们的学习任务有点繁琐。我们对LLM进行微调的主要目标是为给定的输入字符串生成结构良好且正确的检测。因此,我们应该只计算输出字符串中令牌的损失。所以在我们的示例训练数据中,模型应该只计算以下令牌的损失

 Hi! is Dr. <PERSON>Danielle Boyd</PERSON> at the clinic</s>

这将鼓励模型“忘记”之前的所有标记,只是“注意”主要标记并生成正确的输出字符串。我们可以使用HuggingFace的DataCollator API。

 fromdataclassesimportdataclass
 fromtransformers.tokenization_utils_baseimportPreTrainedTokenizerBase
 fromtransformers.utilsimportPaddingStrategy
 fromtypingimportAny, Callable, Dict, List, NewType, Optional, Tuple, Union
 
 @dataclass
 classCustomDataCollatorWithPadding:
     """
     Data collator that will dynamically pad the inputs received.
 
     Args:
         tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
             The tokenizer used for encoding the data.
         padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among:
 
             - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
               sequence is provided).
             - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
               acceptable input length for the model if that argument is not provided.
             - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
         max_length (`int`, *optional*):
             Maximum length of the returned list and optionally padding length (see above).
         pad_to_multiple_of (`int`, *optional*):
             If set will pad the sequence to a multiple of the provided value.
 
             This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
             7.5 (Volta).
         return_tensors (`str`, *optional*, defaults to `"pt"`):
             The type of Tensor to return. Allowable values are "np", "pt" and "tf".
     """
 
     tokenizer: PreTrainedTokenizerBase
     padding: Union[bool, str, PaddingStrategy] =True
     max_length: Optional[int] =None
     pad_to_multiple_of: Optional[int] =None
     return_tensors: str="pt"
 
     def__call__(self, features: List[Dict[str, Any]]) ->Dict[str, Any]:
         batch=self.tokenizer.pad(
             features,
             padding=self.padding,
             max_length=self.max_length,
             pad_to_multiple_of=self.pad_to_multiple_of,
             return_tensors=self.return_tensors,
         )
         labels=batch["input_ids"].clone()
         
         # Set loss mask for all pad tokens
         labels[labels==self.tokenizer.pad_token_id] =-100
         
         # Compute loss mask for appropriate tokens only
         foriinrange(batch['input_ids'].shape[0]):
             
             # Decode the training input
             text_content=self.tokenizer.decode(batch['input_ids'][i][1:])  # slicing from [1:] is important because tokenizer adds bos token
             
             # Extract substrings for prompt text in the training input
             # The training input ends at the last user msg ending in [/INST]
             prompt_gen_boundary=text_content.rfind("[/INST]") +len("[/INST]")
             prompt_text=text_content[:prompt_gen_boundary]
             
             # print(f"""PROMPT TEXT:\n{prompt_text}""")
             
             # retokenize the prompt text only
             prompt_text_tokenized=self.tokenizer(
                 prompt_text,
                 return_overflowing_tokens=False,
                 return_length=False,
             )
             # compute index where prompt text ends in the training input
             prompt_tok_idx=len(prompt_text_tokenized['input_ids'])
             
             # Set loss mask for all tokens in prompt text
             labels[i][range(prompt_tok_idx)] =-100
             
             # print("================DEBUGGING INFORMATION===============")
             # for idx, tok in enumerate(labels[i]):
             #     token_id = batch['input_ids'][i][idx]
             #     decoded_token_id = self.tokenizer.decode(batch['input_ids'][i][idx])
             #     print(f"""TOKID: {token_id} | LABEL: {tok} || DECODED: {decoded_token_id}""")
                     
         batch["labels"] =labels
         returnbatch

CustomDataCollatorWithPadding类可以像下面这样传递给SFTTrainer

 trainer=SFTTrainer(
     model=model,
     train_dataset=dataset["train"],
     eval_dataset=eval_dataset["val"],
     peft_config=peft_config,
     dataset_text_field="text",
     max_seq_length=max_seq_length,
     tokenizer=tokenizer,
     args=training_arguments,
     packing=packing,
     # Using custom data collator inside SFTTrainer
     data_collator=CustomDataCollatorWithPadding(
         tokenizer=tokenizer, 
         padding="longest", 
         max_length=max_seq_length, 
         return_tensors="pt"
     )
 )

通过将标签令牌id设置为-100,我们可以将这些令牌位置的损失归零。这意味着从序列令牌的开头()到最后一个用户msg(以[/INST]结尾)的所有令牌都不会参与损失的计算。

结果

用这个设置微调了mistral /Mistral-7B-Instruct-v0.2模型。我有大约800个训练数据样本,大约400个测试样本和大约400个验证样本。

训练了3轮的模型,并在测试集上取得了相当高的精度/召回率/F1(96%以上)。

这里说一个结果,使用字符串标注的方法超过了生成JSON编码的方法,虽然JSON的格式是正确的,但是正如我们前面所述的,在预测正确的’ start_position ‘和’ end_position '字符索引方面结果并不好。

这里我没有确认模型是否也能很好地处理BIO输出格式,我个人认为应该不会太好。

我们添加了自定义的损失掩码,这是否有助于模型更好地泛化到看不见的实体?这个也没有进行测试。

如果把7B的模型改为13B或者34B等更大的模型的性能如何变化?训练和推理的成本是否值得性能的提升?这都是我们可以继续研究的问题,如果你对NER感兴趣可以自行研究,我也会在有结果后分享我的发现。

https://avoid.overfit.cn/post/0cb82b2fee6440b6b762a23bbf85bf59

作者:Nitin Kumar

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/465366.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

Vue 3响应式系统详解:ref、toRefs、reactive及更多

&#x1f31f; 前言 欢迎来到我的技术小宇宙&#xff01;&#x1f30c; 这里不仅是我记录技术点滴的后花园&#xff0c;也是我分享学习心得和项目经验的乐园。&#x1f4da; 无论你是技术小白还是资深大牛&#xff0c;这里总有一些内容能触动你的好奇心。&#x1f50d; &#x…

图解Kafka架构学习笔记(一)

本文参考尚硅谷大数据技术之Kafka。 消息队列 &#xff08;1&#xff09;点对点模式&#xff08;一对一&#xff0c;消费者主动拉取数据&#xff0c;消息收到后消息清除&#xff09; 点对点模型通常是一个基于拉取或者轮询的消息传送模型&#xff0c;这种模型从队列中请求信息…

【JS】html字符转义

需求 将html转为字符串将html字符串转义&#xff0c;比如<div>转为<div> 码 /*** html标签字符转义* param {Stirng} str 要转换的html字符* returns String 返回转义的html字符串*/ const elToStr str > str.replaceAll(<, <).replaceAll(>, >)…

sadtalker-api/

———— 下载sadtalker工程文件&#xff0c;包括844个模型 。。。。。。。。。。。。。。。。 配置环境&#xff1a; pip源&#xff0c;设置&#xff1a; pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple anaconda prompt, 进入命令行 how在 …

8-图像缩放

其实&#xff0c;就是开辟一个zoomwidth&#xff0c;zoomheight的内存&#xff0c;再分别赋值即可。 void CDib::Scale(float xZoom, float yZoom) { //指向原图像指针 LPBYTE p_data GetData(); //指向原像素的指针 LPBYTE lpSrc; //指向缩放图像对应像素的指针 LPBYTE lpDs…

【Flink SQL】Flink SQL 基础概念(四):SQL 的时间属性

《Flink SQL 基础概念》系列&#xff0c;共包含以下 5 篇文章&#xff1a; Flink SQL 基础概念&#xff08;一&#xff09;&#xff1a;SQL & Table 运行环境、基本概念及常用 APIFlink SQL 基础概念&#xff08;二&#xff09;&#xff1a;数据类型Flink SQL 基础概念&am…

算法练习:前缀和

目录 1. 一维前缀和2. 二维前缀和3. 寻找数组中心下标4. 除自身以外数组的乘积5. !和为k的子数字6. !和可被k整除的子数组7. !连续数组8. 矩阵区域和 1. 一维前缀和 题目信息&#xff1a; 题目链接&#xff1a; 一维前缀和思路&#xff1a;求前缀和数组&#xff0c;sum dp[r] …

R语言:microeco:一个用于微生物群落生态学数据挖掘的R包,第六:trans_nullmodel class

近几十年来&#xff0c;系统发育分析和零模型的整合通过增加系统发育维度&#xff0c;更有力地促进了生态位和中性影响对群落聚集的推断。trans_nullmodel类提供了一个封装&#xff0c;包括系统发育信号、beta平均成对系统发育距离(betaMPD)、beta平均最近分类单元距离(betaMNT…

解决后端传给前端的日期问题

解决方式&#xff1a; 1). 方式一 在属性上加上注解&#xff0c;对日期进行格式化 但这种方式&#xff0c;需要在每个时间属性上都要加上该注解&#xff0c;使用较麻烦&#xff0c;不能全局处理。 2). 方式二&#xff08;推荐 ) 在WebMvcConfiguration中扩展SpringMVC的消息转…

专业120+总400+北京理工大学826信号处理导论考研经验北理工电子信息与通信工程,真题,大纲,参考书。

**今年专业课826信号处理导论&#xff08;信号系统和数字信号处理&#xff09;120&#xff0c;总分400&#xff0c;应群里同学需要&#xff0c;自己总结一下去年的复习经历&#xff0c;希望对大家复习有帮助。**专业课&#xff1a; 北京理工大学专业826是两门合一&#xff0c;…

Flutter开发进阶之使用工具效率开发

Flutter开发进阶之使用工具效率开发 软件开发团队使用Flutter开发的原因通常是因为Flutter开发性能高、效率高、兼容性好、可拓展性高&#xff0c;作为软件PM来说主要考虑的是范围管理、进度管理、成本管理、资源管理、质量管理、风险管理和沟通管理等&#xff0c;可以看到Flu…

微信小程序调用百度智能云API(菜品识别)

一、注册后生成应用列表创建应用 二、找到当前所需使用的api菜品识别文档 三、点链接看实例代码 这里需要使用到如下几个参数&#xff08;如下&#xff09;&#xff0c;其他的参数可以不管 client_id &#xff1a; 就是创建应用后的API Keyclient_secret&#xff1a; 就是创建…

Vue mqtt 附在线mqtt客户端地址 + 完整示例

mqtt&#xff1a;轻量级物联网消息推送协议。 目录 一、介绍 1、官方文档 1&#xff09;npm网 2) 中文网 MQTT中文网_MQTT 物联网接入平台-MQTT.CN 2、官方示例 二、准备工作 1、安装依赖包 2、示例版本 三、使用步骤 1、在单页面引入 mqtt 四、完整示例 tips 一、介…

正则表达式与re模块

目录 正则表达式 简介 语法&#xff1a; 常用元字符&#xff1a; 量词: 贪婪匹配和惰性匹配&#xff1a; re模块 简介&#xff1a; 常用的几个模块&#xff1a; 1.findall 2.search 3.finditer 4.compile 案例展示&#xff1a; 需求&#xff1a; 思路分析&#…

《论文阅读》E-CORE:情感相关性增强的移情对话生成 EMNLP 2023

《论文阅读》E-CORE:情感相关性增强的移情对话生成 EMNLP 2023 前言摘要模型架构图构建边的构建和初始化节点的初始化图更新情感相关性加强解码损失函数总结前言 亲身阅读感受分享,细节画图解释,再也不用担心看不懂论文啦~ 无抄袭,无复制,纯手工敲击键盘~ 今天为大家带来…

IDEA 多个git仓库项目放一个窗口

1、多个项目先通过新建module或者CtrlAltShiftS 添加module引入 2、重点是右下角有时候git 分支视图只有一个module的Repositories。这时候需要去设置把多个git仓库添加到同一个窗口才能方便提交代码。 3、如果Directory Mappings已经有相关项目配置&#xff0c;但是灰色的&…

浅谈JPA框架

JPA 前言概述ORM 映射元数据JPQLJPA API附Spring Data JPA 前言 了解 JPA 框架对后续使用 Spring Boot 是有很大帮助的&#xff0c;下面简单介绍 JPA 框架的基础知识。 概述 JPA&#xff08; Java 对象持久化 API &#xff0c;Java Persistence API &#xff09;&#xff0c…

物联网数据驾驶舱

在信息化时代&#xff0c;数据已经成为驱动企业发展的核心动力。特别是在物联网领域&#xff0c;海量数据的实时采集、分析和监控&#xff0c;对于企业的运营决策和业务优化具有至关重要的作用。HiWoo Cloud作为领先的物联网云平台&#xff0c;其数据监控功能以“物联网数据驾驶…

npm包、全局数据共享、分包

使用 npm 包 小程序对 npm 的支持与限制 目前&#xff0c;小程序中已经支持使用 npm 安装第三方包&#xff0c;从而来提高小程序的开发效率。但是&#xff0c;在小程序中使用npm 包有如下 3 个限制&#xff1a; ① 不支持依赖于 Node.js 内置库的包 ② 不支持依赖于浏览器内置…

C++——string

一学习string的原因 1.从个人理解角度上&#xff1a; 在刚开始学习之前&#xff0c;我只知道学习完string在以后的刷题中能提高做题效率&#xff0c;在对字符串的处理string库中也许有对应的接口去实现需求&#xff0c;不用自己去写函数的实现。 但在学string中改变了之前的…