Getting sentence-level embeddings after fine-tuning BERT

2023-12-23

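I came across this page: https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb#scrollTo=KVB3eOcjxxm1

1) I want to get sentence-level embeddings (the embedding given by the [CLS] token) after fine-tuning is done. How can I do that?

2) I also noticed that the code on that page takes a lot of time to return results on the test data. Why is that? Training the model took less time than getting the test predictions. From the code on that page, I did not use the following block: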

test_InputExamples = test.apply(lambda x: bert.run_classifier.InputExample(guid=None, 
                                                                       text_a = x[DATA_COLUMN], 
                                                                       text_b = None, 
                                                                       label = x[LABEL_COLUMN]), axis = 1)

test_features = bert.run_classifier.convert_examples_to_features(test_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)

test_input_fn = run_classifier.input_fn_builder(
        features=test_features,
        seq_length=MAX_SEQ_LENGTH,
        is_training=False,
        drop_remainder=False)

estimator.evaluate(input_fn=test_input_fn, steps=None)

Instead, I just used the function below on the whole test data:

def getPrediction(in_sentences):
  labels = ["Negative", "Positive"]
  input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, label = 0 is just a dummy label
  input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
  predictions = estimator.predict(predict_input_fn)
  return [(sentence, prediction['probabilities'], labels[prediction['labels']]) for sentence, prediction in zip(in_sentences, predictions)]

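3) How can I get the prediction probabilities? Is there a way to use the Keras predict method?

Update 1

Update to question 2: could you test the getPrediction function with 20000 training examples? It takes much longer for me, even longer than training the model on 20000 examples.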

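1) From the BERT documentation: https://aihub.cloud.google.com/p/products%2F2c1fe4d8-4ff3-4d4f-8ac4-45d445532a3b

The output dictionary contains:

pooled_output: the pooled output of the entire sequence, with shape [batch_size, hidden_size]. sequence_output: the representation of every token in the input sequence, with shape [batch_size, max_sequence_length, hidden_size].

I have added pooled_output to the returned values; it is the vector that corresponds to the [CLS] token.

3) You are receiving log probabilities. Just apply softmax to get normal probabilities.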
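As a quick sanity check of point 3, here is a minimal sketch in plain Python (no TensorFlow) using the two log probabilities from the example prediction in this answer. The softmax helper is written out here for illustration only; it is not part of the notebook. Since the values are already log probabilities, exponentiating them and applying softmax to them should give essentially the same result.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Log probabilities copied from the example prediction in this answer.
log_probs = [-4.0148855e-03, -5.5197663e+00]

# Exponentiating a log probability recovers the probability directly.
probs_from_exp = [math.exp(lp) for lp in log_probs]

# softmax(log p) = p / sum(p), which equals p when p already sums to 1.
probs_from_softmax = softmax(log_probs)

print(probs_from_exp)
print(probs_from_softmax)
```

In the model code itself the probabilities come from tf.nn.softmax(logits), so this only shows that the two outputs carry the same information.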

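Now all that is left is for the model to report it. I have left the log probabilities in, but they are no longer necessary.

Here are the code changes: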

def create_model(is_predicting, input_ids, input_mask, segment_ids, labels,
                 num_labels):
  """Creates a classification model."""

  bert_module = hub.Module(
      BERT_MODEL_HUB,
      trainable=True)
  bert_inputs = dict(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids)
  bert_outputs = bert_module(
      inputs=bert_inputs,
      signature="tokens",
      as_dict=True)

  # Use "pooled_output" for classification tasks on an entire sentence.
  # Use "sequence_output" for token-level output.
  output_layer = bert_outputs["pooled_output"]

  pooled_output = output_layer

  hidden_size = output_layer.shape[-1].value

  # Create our own layer to tune for politeness data.
  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):

    # Dropout helps prevent overfitting
    output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    probs = tf.nn.softmax(logits, axis=-1)

    # Convert labels into one-hot encoding
    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

    predicted_labels = tf.squeeze(tf.argmax(log_probs, axis=-1, output_type=tf.int32))
    # If we're predicting, we want predicted labels and the probabilities.
    if is_predicting:
      return (predicted_labels, log_probs, probs, pooled_output)

    # If we're train/eval, compute loss between predicted and actual label
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return (loss, predicted_labels, log_probs, probs, pooled_output)

Now add support for these values in model_fn_builder():

  # this should be changed in both places
  (predicted_labels, log_probs, probs, pooled_output) = create_model(
    is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)

  # return dictionary of all the values you wanted
  predictions = {
      'log_probabilities': log_probs,
      'probabilities': probs,
      'labels': predicted_labels,
      'pooled_output': pooled_output
  }

Adjust getPrediction() accordingly; in the end, your predictions will look like this:

('That movie was absolutely awful',
  array([0.99599314, 0.00400678], dtype=float32),  <= Probability
  array([-4.0148855e-03, -5.5197663e+00], dtype=float32), <= Log probability, same as previously
  'Negative', <= Label
  array([ 0.9181199 ,  0.7763732 ,  0.9999883 , -0.93533266, -0.9841384 ,
          0.78126144, -0.9918988 , -0.18764131,  0.9981035 ,  0.99999994,
          0.900716  , -0.99926263, -0.5078789 , -0.99417543, -0.07695035,
          0.9501321 ,  0.75836045,  0.49151263, -0.7886792 ,  0.97505844,
         -0.8931161 , -1.        ,  0.9318583 , -0.60531116, -0.8644371 ,
        ...
        and this is 768-d [CLS] vector (sentence embedding).
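One way getPrediction() could be adjusted to return tuples in that order is sketched below. This is only an illustration: format_predictions is a hypothetical helper name, and fake_predictions is a made-up stand-in for the iterable returned by estimator.predict(predict_input_fn), so the snippet runs without a trained estimator. The dictionary keys match the predictions dict added in model_fn_builder() above.

```python
def format_predictions(in_sentences, predictions, labels=("Negative", "Positive")):
    """Pair each sentence with the values from the extended predictions dict.

    `predictions` is any iterable of dicts with the keys defined in
    model_fn_builder(): 'probabilities', 'log_probabilities', 'labels',
    and 'pooled_output'.
    """
    results = []
    for sentence, prediction in zip(in_sentences, predictions):
        results.append((
            sentence,
            prediction["probabilities"],
            prediction["log_probabilities"],
            labels[int(prediction["labels"])],
            prediction["pooled_output"],  # [CLS]-based sentence embedding
        ))
    return results


# Made-up stand-in for estimator.predict(predict_input_fn); the numbers are
# placeholders for illustration, not real model output.
fake_predictions = [{
    "probabilities": [0.9, 0.1],
    "log_probabilities": [-0.105, -2.303],
    "labels": 0,
    "pooled_output": [0.0] * 768,
}]

formatted = format_predictions(["That movie was absolutely awful"], fake_predictions)
sentence, probs, log_probs, label, embedding = formatted[0]
print(label, len(embedding))
```

Inside getPrediction(), the last line would then presumably become a call such as format_predictions(in_sentences, estimator.predict(predict_input_fn)).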

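Regarding 2): in the end, training took about 5 minutes for me and testing about 40 seconds. Very reasonable.

UPDATE

For 20k samples, training took 12:48 and testing took 2:07 (minutes:seconds).

For 10k samples, the times were 8:40 and 1:07, respectively.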


python

tensorflow

Keras

Classification

bertlanguagemodel
