TensorFlow pad sequence feature column

2024-01-14

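How do I pad sequences in a feature column, and what is dimension in a feature_column?

I am using TensorFlow 2.0 and trying to implement a text summarization example. I am still new to machine learning, deep learning, and TensorFlow.

I came across feature_column and found it useful, since I think feature columns can be embedded in the model's processing pipeline.

In the classic scenario without feature_column, I can preprocess the text, tokenize it, convert it into sequences of numbers, and then pad those sequences to a maxlen of, say, 100 words. I cannot get this done when using feature_column.

Below is what I have written so far.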


train_dataset = tf.data.experimental.make_csv_dataset(
    'assets/train_dataset.csv', label_name=LABEL, num_epochs=1, shuffle=True, shuffle_buffer_size=10000, batch_size=1, ignore_errors=True)

vocabulary = ds.get_vocabulary()

def text_demo(feature_column):
    feature_layer = tf.keras.experimental.SequenceFeatures(feature_column)
    article, _ = next(iter(train_dataset.take(1)))

    tokenizer = tf_text.WhitespaceTokenizer()

    tokenized = tokenizer.tokenize(article['Text'])

    sequence_input, sequence_length = feature_layer({'Text':tokenized.to_tensor()})

    print(sequence_input)

def categorical_column(feature_column):
    dense_column = tf.keras.layers.DenseFeatures(feature_column)

    article, _ = next(iter(train_dataset.take(1)))

    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')
    lang_tokenizer.fit_on_texts(article)

    tensor = lang_tokenizer.texts_to_sequences(article)

    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post', maxlen=50)

    print(dense_column(tensor).numpy())


text_seq_vocab_list = tf.feature_column.sequence_categorical_column_with_vocabulary_list(key='Text', vocabulary_list=list(vocabulary))
text_embedding = tf.feature_column.embedding_column(text_seq_vocab_list, dimension=8)
text_demo(text_embedding)

numerical_voacb_list = tf.feature_column.categorical_column_with_vocabulary_list(key='Text', vocabulary_list=list(vocabulary))
embedding = tf.feature_column.embedding_column(numerical_voacb_list, dimension=8)
categorical_column(embedding)

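I am also confused about which one to use here: sequence_categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_list. The documentation does not explain SequenceFeatures either, although I know it is an experimental feature.

I also cannot understand what the dimension parameter does.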

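Actually, this

I am also confused about which one to use here: sequence_categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_list.

should be the first question, because it affects how the title of this topic is interpreted.

Also, it is not clear what you mean by text summarization. What type of model or layers are you going to pass the processed text into?

By the way, this matters because tf.keras.layers.DenseFeatures and tf.keras.experimental.SequenceFeatures are intended for different network architectures and approaches.

As the documentation of the SequenceFeatures layer (https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/experimental/SequenceFeatures) says, the output of a SequenceFeatures layer is supposed to be fed into a sequence network, such as an RNN.

DenseFeatures produces a dense tensor as output and is therefore suitable for other types of networks.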

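Since you perform tokenization in your code snippet, you are going to use embeddings in your model. Then you have two options:

Pass the learned embeddings forward into a Dense layer. This means that you will not analyze word order.
Pass the learned embeddings into a convolution, recurrent, average pooling, or LSTM layer, and so also use word order for learning.

The first option requires using: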

The tf.keras.layers.DenseFeatures with
one of tf.feature_column.categorical_column_*()
and tf.feature_column.embedding_column()

The second option requires using:

The tf.keras.experimental.SequenceFeatures with
one of tf.feature_column.sequence_categorical_column_*()
and tf.feature_column.embedding_column()
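
To make the difference between the two options concrete, here is a minimal plain-Python sketch (no TensorFlow involved). It assumes a tiny hypothetical embedding table, toy_embeddings, and it assumes that the token vectors are combined by averaging, which is the default combiner ('mean') of tf.feature_column.embedding_column(). Under these assumptions, option 1 collapses a sentence into a single vector, so word order is lost, while option 2 keeps one vector per token.

```python
# Hypothetical 2-dimensional embeddings for a toy vocabulary (illustration only).
toy_embeddings = {
    'dog':   [1.0, 0.0],
    'bites': [0.0, 1.0],
    'man':   [1.0, 1.0],
}

def dense_style(tokens):
    """Option 1: average the token vectors into one vector per sentence."""
    vectors = [toy_embeddings[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def sequence_style(tokens):
    """Option 2: keep one vector per token, preserving the order."""
    return [toy_embeddings[t] for t in tokens]

a = ['dog', 'bites', 'man']
b = ['man', 'bites', 'dog']

# The averaged vectors are identical, the sequences are not.
print(dense_style(a) == dense_style(b))
print(sequence_style(a) == sequence_style(b))
```

The two sentences contain the same words in a different order, so only the sequence-style representation can tell them apart.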

Here are the examples. The preprocessing and training parts are the same for both options:

import tensorflow as tf
print(tf.__version__)

from tensorflow import feature_column

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import text_to_word_sequence
import tensorflow.keras.utils as ku
from tensorflow.keras.utils import plot_model

import pandas as pd
from sklearn.model_selection import train_test_split

# Raw string, so that the backslashes in the Windows path are not treated as escapes
DATA_PATH = r'C:\SoloLearnMachineLearning\Stackoverflow\TextDataset.csv'

# It is just a two-column CSV, like:
# text;label
# A wiki is run using wiki software;0
# otherwise known as a wiki engine.;1

dataframe = pd.read_csv(DATA_PATH, delimiter = ';')
dataframe.head()

# Preprocessing before feature_column includes:
# - getting the vocabulary
# - tokenization, which here means only splitting the text into tokens.
#   Encoding sentences with the vocabulary will be done by feature_column!
# - padding
# - truncating

# Build vocabulary
vocab_size = 100
oov_tok = '<OOV>'

sentences = dataframe['text'].to_list()

tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)

tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Keep the actual vocabulary size (word_index may be shorter than the initial vocab_size)
vocab_size=len(word_index)
print("vocab_size = word_index = ",len(word_index))

# Split sentences into tokens; here a token is a word.
# text_to_word_sequence() has a good default filter for
# characters, including basic punctuation, tabs, and newlines
dataframe['text'] = dataframe['text'].apply(text_to_word_sequence)

dataframe.head()

max_length = 6

# Pad and truncate the sentences.
# Do that directly on the string tokens, without tokenizer.texts_to_sequences();
# the feature_column will convert the strings into numbers.
# Appending N empty strings and slicing to N both pads and truncates.
dataframe['text']=dataframe['text'].apply(lambda x, N=max_length: (x + N * [''])[:N])
dataframe.head()

# Define method to create tf.data dataset from Pandas Dataframe
def df_to_dataset(dataframe, label_column, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    #labels = dataframe.pop(label_column)
    labels = dataframe[label_column]

    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

# Split dataframe into train and validation sets
train_df, val_df = train_test_split(dataframe, test_size=0.2)

print(len(train_df), 'train examples')
print(len(val_df), 'validation examples')

batch_size = 32
ds = df_to_dataset(dataframe, 'label',shuffle=False,batch_size=batch_size)

train_ds = df_to_dataset(train_df, 'label',  shuffle=False, batch_size=batch_size)
val_ds = df_to_dataset(val_df, 'label', shuffle=False, batch_size=batch_size)

# and small batch for demo
example_batch = next(iter(ds))[0]
example_batch

# Helper methods to print example outputs for a given feature_column

def demo(feature_column):
    feature_layer = tf.keras.layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch).numpy())

def seqdemo(feature_column):
    sequence_feature_layer = tf.keras.experimental.SequenceFeatures(feature_column)
    print(sequence_feature_layer(example_batch))
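
The padding and truncation step from the preprocessing above can be looked at in isolation. The following is a small plain-Python sketch of the same list expression used in the lambda; the function name pad_or_truncate and the sample sentences are assumptions made for illustration.

```python
def pad_or_truncate(tokens, max_length, pad_token=''):
    """Return exactly max_length tokens: pad with pad_token, then cut."""
    # Appending max_length pad tokens guarantees the list is long enough,
    # and the slice then trims it to exactly max_length items.
    return (tokens + max_length * [pad_token])[:max_length]

short_sentence = ['a', 'wiki', 'engine']
long_sentence = ['a', 'wiki', 'is', 'run', 'using', 'wiki', 'software']

print(pad_or_truncate(short_sentence, 6))
print(pad_or_truncate(long_sentence, 6))
```

A short sentence is filled up with empty strings, and a long one is cut after the sixth token, so every row ends up with the same length that tf.keras.Input(shape=(max_length,)) expects.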

The first option, where we do not use word order for learning:

# Define a categorical column for our text feature,
# which is preprocessed into lists of tokens.
# Note that the key name should be the same as the original column name in the dataframe
text_column = feature_column.categorical_column_with_vocabulary_list(
    key='text',
    vocabulary_list=list(word_index))
# indicator_column produces a one-hot (multi-hot) encoding. This line is just to compare with the embedding
#demo(feature_column.indicator_column(text_column))

# The dimension argument here is exactly the dimension of the space in which tokens
# will be represented during the model's learning
# see the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings
text_embedding = feature_column.embedding_column(text_column, dimension=8)
# demo() prints the result itself, so it is not wrapped in print()
demo(text_embedding)

# Then define the layers and the model itself.
# This example uses the Keras Functional API instead of Sequential just for more generality

# Define DenseFeatures layer to pass feature_columns into Keras model
feature_layer = tf.keras.layers.DenseFeatures(text_embedding)

# Define inputs for each feature column.
# See https://github.com/tensorflow/tensorflow/issues/27416#issuecomment-502218673
feature_layer_inputs = {}

# Here we have just one column.
# It is important to define tf.keras.Input with a shape
# corresponding to the length of our sequence of words
feature_layer_inputs['text'] = tf.keras.Input(shape=(max_length,),
                                              name='text',
                                              dtype=tf.string)
print(feature_layer_inputs)

# Define the outputs of the DenseFeatures layer
# and actually use them as the first layer of the model
feature_layer_outputs = feature_layer(feature_layer_inputs)
print(feature_layer_outputs)

# Add subsequent layers.
# See https://keras.io/getting-started/functional-api-guide/
x = tf.keras.layers.Dense(256, activation='relu')(feature_layer_outputs)
x = tf.keras.layers.Dropout(0.2)(x)

# This example supposes binary classification, as labels are 0 or 1
x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.models.Model(inputs=[v for v in feature_layer_inputs.values()],
                              outputs=x)

model.summary()

# This example supposes binary classification, as labels are 0 or 1
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy']
              #run_eagerly=True
             )

# Note that the fit() method looks up features in train_ds and val_ds by the name given in
# tf.keras.Input(shape=(max_length,), name='text')

# This model will of course learn nothing, because the data is fake.

num_epochs = 5
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=num_epochs,
                    verbose=1
                    )

The second option, where we do care about word order when training our model:

# Define a sequence categorical column for our text feature,
# which is preprocessed into lists of tokens.
# Note that the key name should be the same as the original column name in the dataframe
text_column = feature_column.sequence_categorical_column_with_vocabulary_list(
    key='text',
    vocabulary_list=list(word_index))

# The dimension argument here is exactly the dimension of the space in
# which tokens will be represented during the model's learning
# see the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings
text_embedding = feature_column.embedding_column(text_column, dimension=8)
# seqdemo() prints the result itself, so it is not wrapped in print()
seqdemo(text_embedding)

# Then define the layers and the model itself.
# This example uses the Keras Functional API instead of Sequential
# just for more generality

# Define SequenceFeatures layer to pass feature_columns into Keras model
sequence_feature_layer = tf.keras.experimental.SequenceFeatures(text_embedding)

# Define inputs for each feature column.
# See https://github.com/tensorflow/tensorflow/issues/27416#issuecomment-502218673
feature_layer_inputs = {}
sequence_feature_layer_inputs = {}

# Here we have just one column

sequence_feature_layer_inputs['text'] = tf.keras.Input(shape=(max_length,),
                                                       name='text',
                                                       dtype=tf.string)
print(sequence_feature_layer_inputs)

# Define the outputs of the SequenceFeatures layer
# and actually use them as the first layer of the model

# Note that the SequenceFeatures layer produces a tuple of two tensors as output.
# We only need the first one to pass on.
sequence_feature_layer_outputs, _ = sequence_feature_layer(sequence_feature_layer_inputs)
print(sequence_feature_layer_outputs)
# Add subsequent layers. See https://keras.io/getting-started/functional-api-guide/

# Conv1D and MaxPooling1D will learn features from word order
x = tf.keras.layers.Conv1D(8,4)(sequence_feature_layer_outputs)
x = tf.keras.layers.MaxPooling1D(2)(x)
# Dense layers on top of the pooled features
x = tf.keras.layers.Dense(256, activation='relu')(x)
x = tf.keras.layers.Dropout(0.2)(x)

# This example supposes binary classification, as labels are 0 or 1
x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.models.Model(inputs=[v for v in sequence_feature_layer_inputs.values()],
                              outputs=x)
model.summary()

# This example supposes binary classification, as labels are 0 or 1
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy']
              #run_eagerly=True
             )

# Note that the fit() method looks up features in train_ds and val_ds by the name given in
# tf.keras.Input(shape=(max_length,), name='text')

# This model will of course learn nothing, because the data is fake.

num_epochs = 5
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=num_epochs,
                    verbose=1
                    )
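
It may help to check how many time steps survive the Conv1D and MaxPooling1D layers in the second model. The sketch below is plain Python and only does the shape arithmetic; it assumes the Keras defaults for both layers (padding='valid', a Conv1D stride of 1, and a pooling stride equal to the pool size). The helper names are made up for this illustration.

```python
def conv1d_valid_length(steps, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (steps - kernel_size) // stride + 1

def pool1d_valid_length(steps, pool_size, stride=None):
    """Output length of 1D pooling with 'valid' padding."""
    stride = pool_size if stride is None else stride
    return (steps - pool_size) // stride + 1

max_length = 6
after_conv = conv1d_valid_length(max_length, kernel_size=4)  # Conv1D(8, 4)
after_pool = pool1d_valid_length(after_conv, pool_size=2)    # MaxPooling1D(2)

print(after_conv, after_pool)
```

With max_length = 6 a single time step remains after pooling, so the final Dense layer yields one value per example here; a larger max_length would leave several time steps in the output.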

Please find the complete Jupyter notebooks with these examples on my GitHub:

Answer. Tensorflow pad sequence feature column. DenseFeatures.ipynb https://github.com/EgorBEremeev/SoloLearnML/blob/master/stackoverflow/Answer.%20Tensorflow%20pad%20sequence%20feature%20column.%20DenseFeatures.ipynb
Answer. Tensorflow pad sequence feature column. SequenceFeatures.ipynb https://github.com/EgorBEremeev/SoloLearnML/blob/master/stackoverflow/Answer.%20Tensorflow%20pad%20sequence%20feature%20column.%20SequenceFeatures.ipynb

The dimension argument of feature_column.embedding_column() is exactly the dimension of the space in which tokens are represented while the model learns. See the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings for a detailed explanation.
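
As a rough picture of what dimension controls, an embedding can be thought of as a table with one row per vocabulary entry and dimension numbers per row. The plain-Python sketch below is an illustration only: the function make_embedding_table, the random initial values, and the chosen token id are assumptions, and in a real model the rows are trainable weights managed by TensorFlow.

```python
import random

def make_embedding_table(vocab_size, dimension, seed=0):
    """Build a vocab_size x dimension table of small random placeholder values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dimension)]
            for _ in range(vocab_size)]

table = make_embedding_table(vocab_size=100, dimension=8)

token_id = 42              # hypothetical id of some token in the vocabulary
vector = table[token_id]   # looking up a token returns its row

print(len(table), len(vector))
```

So dimension=8 means that every token is represented by a vector of 8 numbers, regardless of how large the vocabulary is.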

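Also note that feature_column.embedding_column() is an alternative to tf.keras.layers.Embedding(). As you can see, feature_column takes over the encoding step of the preprocessing pipeline, but you still have to split, pad, and truncate the sentences manually.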
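
The encoding step that the categorical column takes over can be sketched in plain Python as a vocabulary lookup. This is an illustration, not the library code: the function encode and the sample vocabulary are assumptions, and the default of -1 for unknown tokens mirrors the documented default_value of categorical_column_with_vocabulary_list() when no out-of-vocabulary buckets are configured.

```python
def encode(tokens, vocabulary_list, default_value=-1):
    """Map each string token to its index in vocabulary_list."""
    index = {word: i for i, word in enumerate(vocabulary_list)}
    # Tokens that are not in the vocabulary, including the '' padding,
    # fall back to default_value.
    return [index.get(t, default_value) for t in tokens]

vocabulary_list = ['<OOV>', 'wiki', 'a', 'is']
padded_sentence = ['a', 'wiki', 'is', 'great', '', '']

print(encode(padded_sentence, vocabulary_list))
```

This is the part of the pipeline that tokenizer.texts_to_sequences() would otherwise perform, which is why the examples above pad and truncate the string tokens directly.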
