Assign 0 to certain words when the words are not present

2024-01-07

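This is my first post on stackoverflow and I am fairly new to coding, so please bear with me.

I am running an experiment with two sets of data documents. Doc1 looks like this: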

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464

TOPIC:topic_2 ....
.....
.....

TOPIC:topic_3 1066.0
say 0.062
word 0.182

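And so on, up to 100 topics.

In this document, some words occur in every topic and others in only a few. So I want a procedure that, whenever a word is absent from a topic, gives that word a value of 0 in that topic. For example, the word BBC is present in the second topic (topic_1) but not in the first (topic_0), so I would like my list to be: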

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427
Mr 0
s 0
president 0
tell 0
BBC 0

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398
site 0
Internet 0
online 0
web 0
image 0

I have to multiply these values with another set of values present in another document. For that, I have:

from collections import defaultdict

d = defaultdict(list)
with open("doc1") as f, open("doc2") as f2:
    # one weight per topic, read from doc2
    values = list(map(float, f2.read().split()))
    for line in f:
        # skip blank lines and the TOPIC header lines
        if line.strip() and not line.startswith("TOPIC"):
            name, val = line.split()
            d[name].append(float(val))

for k, v in d.items():
    print("Prob for {} is {}".format(k, sum(i * j for i, j in zip(v, values))))

My doc2 has the format:

  0.566667 0.0333333 0.133333 0 0 0  2.43333 0 0.13333......... till 100 values.

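Take the word "say" in the code above. It finds the word in 3 topics and collects its values in a list such as [0.015, 0.045, 0.062]. That list is then multiplied with the values from doc2: 0.015 by the 0th value in doc2, 0.045 by the 1st value, and 0.062 by the 2nd value. But that is not what I want. We can see that topic_2 does not contain the word "say", so the list here must be [0.015, 0.045, 0, 0.062]. Then, when these values are multiplied with the values at their respective positions in doc2, they give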

P(SAY) = (0.566667*0.015) + (0.0333333*0.045) + (0.133333 *0) + (0*0.062)

So the code works fine; it only needs this one modification.

The problem is that you are treating the topics as a single whole. If you want the individual sections, use groupby, as in the code from the original answer (https://stackoverflow.com/a/31506466/2141635). First get a set of all the names, then compare that set with the defaultdict keys to find the difference in each section:

from collections import defaultdict
from itertools import groupby

d = defaultdict(float)
with open("doc1") as f, open("doc2") as f2:
    # iterator over the per-topic weights in doc2
    values = iter(map(float, f2.read().split()))
    # find every word in every TOPIC
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)  # reset pointer
    # lambda x: not x.strip() will split into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            topic = next(v)
            # get matching float from values
            flt = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d[name] += float(val) * flt
            # get difference in all_words vs words in current TOPIC,
            # giving 0 as default for missing values
            for word in all_words - set(d):
                d[word] = 0
            for word, prob in d.items():
                print("Prob for {} is {}".format(word, prob))
            d = defaultdict(float)
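
To see what the groupby key is doing, here is a minimal sketch, assuming Python 3. The list lines is a hypothetical in-memory stand-in for the open doc1 file, and split_sections is an illustrative helper name, not part of the answer's code. Consecutive lines share a group as long as the key (blank or not blank) stays the same, so each run of non-blank lines is one topic section:

```python
from itertools import groupby

# Hypothetical stand-in for doc1; the answer's code iterates the open file instead.
lines = [
    "TOPIC:topic_0 5892.0\n",
    "site 0.0371690427699\n",
    "say 0.0159538357094\n",
    "\n",
    "TOPIC:topic_1 12366.0\n",
    "Mr 0.150331554262\n",
    "say 0.0451237263464\n",
]


def split_sections(lines):
    """Return one list of stripped lines per blank-line-separated section."""
    sections = []
    for is_blank, group in groupby(lines, key=lambda x: not x.strip()):
        if not is_blank:  # skip the runs of empty lines between topics
            sections.append([line.strip() for line in group])
    return sections


sections = split_sections(lines)
for section in sections:
    # the first line of each section is the TOPIC header, the rest are words
    print(section[0], "->", section[1:])
```

The first element of each group is the TOPIC header, which is why the code above calls next(v) before looping over the remaining word lines.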

To store all the output, you can append the dicts to a list:

from collections import defaultdict
from itertools import groupby

d = defaultdict(float)
with open("doc1") as f, open("doc2") as f2:
    values = iter(map(float, f2.read().split()))
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    out = []
    # lambda x: not x.strip() will split into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            topic = next(v)
            # get matching float from values
            flt = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d[name] += float(val) * flt
            for word in all_words - set(d):
                d[word] = 0
            out.append(d)
            d = defaultdict(float)

Then iterate over the list:

for top in out:
    for k, v in top.items():
        print("Prob for {} is {}".format(k, v))

Or forget the defaultdict and use dict.fromkeys:

from itertools import groupby

with open("doc1") as f, open("doc2") as f2:
    values = iter(map(float, f2.read().split()))
    all_words = [line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")]
    f.seek(0)
    # every word starts at 0.0, so missing words need no extra step
    out, d = [], dict.fromkeys(all_words, 0.0)
    # lambda x: not x.strip() will split into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            topic = next(v)
            # get matching float from values
            flt = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d[name] += float(val) * flt
            out.append(d)
            d = dict.fromkeys(all_words, 0.0)

If you always want the missing words at the end, use a collections.OrderedDict and, as in the first approach, add the missing values at the end of the dict:

from collections import OrderedDict
from itertools import groupby

with open("doc1") as f, open("doc2") as f2:
    values = iter(map(float, f2.read().split()))
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    # d must exist before the first topic is processed
    out, d = [], OrderedDict()
    # lambda x: not x.strip() will split into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            topic = next(v)
            # get matching float from values
            flt = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                d.setdefault(name, float(val) * flt)
            # words missing from this TOPIC are appended last with 0
            for word in all_words.difference(d):
                d[word] = 0
            out.append(d)
            d = OrderedDict()

for top in out:
    for k, v in top.items():
        print("Prob for {} is {}".format(k, v))

Finally, to store the results in order and by topic:

from collections import OrderedDict
from itertools import groupby

with open("doc1") as f, open("doc2") as f2:
    values = iter(map(float, f2.read().split()))
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    out = OrderedDict()
    # lambda x: not x.strip() will split into groups on the empty lines
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            topic = next(v).rstrip()
            # create OrderedDict for each topic
            out[topic] = OrderedDict()
            # get matching float from values
            flt = next(values)
            # iterate over the group
            for s in v:
                name, val = s.split()
                out[topic].setdefault(name, float(val) * flt)
            # find words missing from TOPIC and set to 0
            for word in all_words.difference(out[topic]):
                out[topic][word] = 0

for topic, probs in out.items():
    print(topic)  # each TOPIC
    for word, prob in probs.items():
        print("Prob for {} is {}".format(word, prob))  # the OrderedDict items
    print("\n")

doc1:

TOPIC:topic_0 5892.0
site 0.0371690427699
Internet 0.0261371350984
online 0.0229124236253
web 0.0218940936864
say 0.0159538357094
image 0.015105227427

TOPIC:topic_1 12366.0
Mr 0.150331554262
s 0.0517548115801
say 0.0451237263464
president 0.0153647096879
tell 0.0135856380398
BBC 0.0135856380398

doc2:

0.345 0.566667

Output:

TOPIC:topic_0 5892.0
Prob for site is 0.0128233197556
Prob for Internet is 0.00901731160895
Prob for online is 0.00790478615073
Prob for web is 0.00755346232181
Prob for say is 0.00550407331974
Prob for image is 0.00521130346231
Prob for BBC is 0
Prob for Mr is 0
Prob for s is 0
Prob for president is 0
Prob for tell is 0


TOPIC:topic_1 12366.0
Prob for Mr is 0.085187930859
Prob for s is 0.0293277438137
Prob for say is 0.0255701266375
Prob for president is 0.00870667394471
Prob for tell is 0.0076985327511
Prob for BBC is 0.0076985327511
Prob for web is 0
Prob for image is 0
Prob for online is 0
Prob for site is 0
Prob for Internet is 0

You can apply exactly the same logic with a regular for loop; groupby just does all the grouping work for you.
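
A minimal sketch of that regular for loop, assuming Python 3.7+ (where plain dicts keep insertion order) and well-formed input in which every word line follows a TOPIC header. The function name weighted_topics and the sample numbers are hypothetical, picked so the products are easy to verify by hand:

```python
def weighted_topics(doc1_lines, weights):
    """Plain-loop equivalent of the groupby approach.

    doc1_lines: iterable of lines in the doc1 format.
    weights: one float per topic, in topic order (the doc2 values).
    Returns a list of (topic_header, {word: weighted value}) pairs.
    """
    # Materialize once so two passes are possible even for a file object.
    doc1_lines = [line.strip() for line in doc1_lines]

    # First pass: collect every word, in order of first appearance.
    all_words = []
    for line in doc1_lines:
        if line and not line.startswith("TOPIC"):
            word = line.split()[0]
            if word not in all_words:
                all_words.append(word)

    # Second pass: a TOPIC header starts a new dict and consumes one weight.
    weights = iter(weights)
    out = []
    current, weight = None, 0.0
    for line in doc1_lines:
        if not line:
            continue
        if line.startswith("TOPIC"):
            current, weight = {}, next(weights)
            out.append((line, current))
        else:
            name, val = line.split()
            current[name] = float(val) * weight

    # Words missing from a topic are appended at the end with 0.
    for _, probs in out:
        for word in all_words:
            probs.setdefault(word, 0)
    return out


# Hypothetical sample data, not the real doc1/doc2 contents.
doc1 = """TOPIC:topic_0 5892.0
site 0.5
say 0.25

TOPIC:topic_1 12366.0
Mr 0.5
say 0.125
""".splitlines()

result = weighted_topics(doc1, [2.0, 4.0])
for topic, probs in result:
    print(topic)
    for word, p in probs.items():
        print("Prob for {} is {}".format(word, p))
```

The loop has to track the current topic and its weight by hand, which is the bookkeeping groupby otherwise hides.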

If you actually just want to write to a file, the code is even simpler:

from itertools import groupby

with open("doc1") as f, open("doc2") as f2, open("prob.txt", "w") as f3:
    values = iter(map(float, f2.read().split()))
    all_words = {line.split()[0] for line in f if line.strip() and not line.startswith("TOPIC")}
    f.seek(0)
    for k, v in groupby(f, key=lambda x: not x.strip()):
        if not k:
            # the header line keeps its newline, so it is written as-is
            topic, words = next(v), []
            flt = next(values)
            f3.write(topic)
            for s in v:
                name, val = s.split()
                words.append(name)
                f3.write("{} {}\n".format(name, float(val) * flt))
            # words missing from this TOPIC are written with 0
            for word in all_words.difference(words):
                f3.write("{} {}\n".format(word, 0))
            f3.write("\n")

prob.txt:

TOPIC:topic_0 5892.0
site 0.0128233197556
Internet 0.00901731160895
online 0.00790478615073
web 0.00755346232181
say 0.00550407331974
image 0.00521130346231
BBC 0
Mr 0
s 0
president 0
tell 0

TOPIC:topic_1 12366.0
Mr 0.085187930859
s 0.0293277438137
say 0.0255701266375
president 0.00870667394471
tell 0.0076985327511
BBC 0.0076985327511
web 0
image 0
online 0
site 0
Internet 0
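
For the single per-word total that the question describes, P(word) summed over all topics with 0 for every topic that lacks the word, here is a minimal sketch assuming Python 3. The function name word_probabilities and the sample numbers are hypothetical. A missing word contributes 0 times the topic weight, so indexing the weight by topic position and skipping absent words gives the same sum as padding the list with zeros:

```python
from collections import defaultdict


def word_probabilities(doc1_lines, weights):
    """Return {word: sum over topics of value_in_topic * weight_of_topic}.

    The weight is looked up by the position of the TOPIC the word appears
    in, so values stay aligned even when a word is absent from a topic.
    """
    totals = defaultdict(float)
    topic_index = -1
    for line in doc1_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("TOPIC"):
            topic_index += 1  # move on to the next weight
            continue
        name, val = line.split()
        totals[name] += float(val) * weights[topic_index]
    return dict(totals)


# Hypothetical sample: "say" is absent from topic_2, as in the question.
doc1 = """TOPIC:topic_0 1.0
say 0.5

TOPIC:topic_1 1.0
say 0.25

TOPIC:topic_2 1.0
BBC 0.25

TOPIC:topic_3 1.0
say 0.125
""".splitlines()
weights = [2.0, 4.0, 8.0, 16.0]

probs = word_probabilities(doc1, weights)
for word, p in sorted(probs.items()):
    print("Prob for {} is {}".format(word, p))
```

With these numbers, say comes out as 4.0 (0.5*2 + 0.25*4 + 0*8 + 0.125*16), whereas zipping the collapsed list [0.5, 0.25, 0.125] against the weights would give 3.0 because the third value is paired with the wrong weight.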
