python词云实现

2023-05-16

python的一个蛮酷炫的功能是可以轻松地实现词云。
github上有关于这个项目的开源代码：
https://github.com/amueller/word_cloud
注意跑例程时要删除里面的wordcloud文件夹
词云的功能有部分是基于NLP，有部分是基于图像的，
下面以一段github wordcloud上面的代码为例


from os import path
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS

d = path.dirname(__file__)

# Read the whole text.
text = open(path.join(d, 'alice.txt')).read()

# read the mask image
# taken from
# http://www.stencilry.org/stencils/movies/alice%20in%20wonderland/255fk.jpg
alice_mask = np.array(Image.open(path.join(d, "alice_mask.png")))

stopwords = set(STOPWORDS)
stopwords.add("said")

wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask,
               stopwords=stopwords)
# generate word cloud
wc.generate(text)

# store to file
wc.to_file(path.join(d, "alice.png"))

# show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.figure()
plt.imshow(alice_mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()

原图：
这里写图片描述
结果：

Alice与兔子的图片
其中：
text打开文档
alice_mask是以数组的形式加载图画
stopwords设置停止显示的词语
WordCloud设置词云的属性
generate生成词云
to_file储存图片
进入wordcloud.py可以看到WordCloud类的相关属性：

 """Word cloud object for generating and drawing.

    Parameters
    ----------
    font_path : string
        Font path to the font that will be used (OTF or TTF).
        Defaults to DroidSansMono path on a Linux machine. If you are on
        another OS or don't have this font, you need to adjust this path.

    width : int (default=400)
        Width of the canvas.

    height : int (default=200)
        Height of the canvas.

    prefer_horizontal : float (default=0.90)
        The ratio of times to try horizontal fitting as opposed to vertical.
        If prefer_horizontal < 1, the algorithm will try rotating the word
        if it doesn't fit. (There is currently no built-in way to get only vertical
        words.)

    mask : nd-array or None (default=None)
        If not None, gives a binary mask on where to draw words. If mask is not
        None, width and height will be ignored and the shape of mask will be
        used instead. All white (#FF or #FFFFFF) entries will be considerd
        "masked out" while other entries will be free to draw on. [This
        changed in the most recent version!]

    scale : float (default=1)
        Scaling between computation and drawing. For large word-cloud images,
        using scale instead of larger canvas size is significantly faster, but
        might lead to a coarser fit for the words.

    min_font_size : int (default=4)
        Smallest font size to use. Will stop when there is no more room in this
        size.

    font_step : int (default=1)
        Step size for the font. font_step > 1 might speed up computation but
        give a worse fit.

    max_words : number (default=200)
        The maximum number of words.

    stopwords : set of strings or None
        The words that will be eliminated. If None, the build-in STOPWORDS
        list will be used.

    background_color : color value (default="black")
        Background color for the word cloud image.

    max_font_size : int or None (default=None)
        Maximum font size for the largest word. If None, height of the image is
        used.

    mode : string (default="RGB")
        Transparent background will be generated when mode is "RGBA" and
        background_color is None.

    relative_scaling : float (default=.5)
        Importance of relative word frequencies for font-size.  With
        relative_scaling=0, only word-ranks are considered.  With
        relative_scaling=1, a word that is twice as frequent will have twice
        the size.  If you want to consider the word frequencies and not only
        their rank, relative_scaling around .5 often looks good.

        .. versionchanged: 2.0
            Default is now 0.5.

    color_func : callable, default=None
        Callable with parameters word, font_size, position, orientation,
        font_path, random_state that returns a PIL color for each word.
        Overwrites "colormap".
        See colormap for specifying a matplotlib colormap instead.

    regexp : string or None (optional)
        Regular expression to split the input text into tokens in process_text.
        If None is specified, ``r"\w[\w']+"`` is used.

    collocations : bool, default=True
        Whether to include collocations (bigrams) of two words.

        .. versionadded: 2.0

    colormap : string or matplotlib colormap, default="viridis"
        Matplotlib colormap to randomly draw colors from for each word.
        Ignored if "color_func" is specified.

        .. versionadded: 2.0

    normalize_plurals : bool, default=True
        Whether to remove trailing 's' from words. If True and a word
        appears with and without a trailing 's', the one with trailing 's'
        is removed and its counts are added to the version without
        trailing 's' -- unless the word ends with 'ss'.

    Attributes
    ----------
    ``words_`` : dict of string to float
        Word tokens with associated frequency.

        .. versionchanged: 2.0
            ``words_`` is now a dictionary

    ``layout_`` : list of tuples (string, int, (int, int), int, color))
        Encodes the fitted word cloud. Encodes for each word the string, font
        size, position, orientation and color.

    Notes
    -----
    Larger canvases with make the code significantly slower. If you need a
    large word cloud, try a lower canvas size, and set the scale parameter.

    The algorithm might give more weight to the ranking of the words
    than their actual frequencies, depending on the ``max_font_size`` and the
    scaling heuristic.
    """

其中：
font_path表示用到字体的路径
width和height表示画布的宽和高
prefer_horizontal可以调整词云中字体水平和垂直的多少
mask即掩膜，产生词云背景的区域
scale:计算和绘图之间的缩放
min_font_size设置最小的字体大小
max_words设置字体的多少
stopwords设置禁用词
background_color设置词云的背景颜色
max_font_size设置字体的最大尺寸
mode设置字体的颜色但设置为RGBA时背景透明
relative_scaling设置有关字体大小的相对字频率的重要性
regexp设置正则表达式
collocations 是否包含两个词的搭配
在generate函数中调试进去可以看到函数：
words=process_text（text）可以返回文本中的词频
generate_from_frequencies根据单词和词频创造一个词云
下面是generate_from_frequencies函数的实现步骤

    def generate_from_frequencies(self, frequencies, max_font_size=None):
        """Create a word_cloud from words and frequencies.

        Parameters
        ----------
        frequencies : dict from string to float
            A contains words and associated frequency.

        max_font_size : int
            Use this font-size instead of self.max_font_size

        Returns
        -------
        self

        """
        # make sure frequencies are sorted and normalized
        frequencies = sorted(frequencies.items(), key=item1, reverse=True)
        if len(frequencies) <= 0:
            raise ValueError("We need at least 1 word to plot a word cloud, "
                             "got %d." % len(frequencies))
        frequencies = frequencies[:self.max_words]

        # largest entry will be 1
        max_frequency = float(frequencies[0][1])

        frequencies = [(word, freq / max_frequency)
                       for word, freq in frequencies]

        if self.random_state is not None:
            random_state = self.random_state
        else:
            random_state = Random()

        if self.mask is not None:
            mask = self.mask
            width = mask.shape[1]
            height = mask.shape[0]
            if mask.dtype.kind == 'f':
                warnings.warn("mask image should be unsigned byte between 0"
                              " and 255. Got a float array")
            if mask.ndim == 2:
                boolean_mask = mask == 255
            elif mask.ndim == 3:
                # if all channels are white, mask out
                boolean_mask = np.all(mask[:, :, :3] == 255, axis=-1)
            else:
                raise ValueError("Got mask of invalid shape: %s"
                                 % str(mask.shape))
        else:
            boolean_mask = None
            height, width = self.height, self.width
        occupancy = IntegralOccupancyMap(height, width, boolean_mask)

        # create image
        img_grey = Image.new("L", (width, height))
        draw = ImageDraw.Draw(img_grey)
        img_array = np.asarray(img_grey)
        font_sizes, positions, orientations, colors = [], [], [], []

        last_freq = 1.

        if max_font_size is None:
            # if not provided use default font_size
            max_font_size = self.max_font_size

        if max_font_size is None:
            # figure out a good font size by trying to draw with
            # just the first two words
            if len(frequencies) == 1:
                # we only have one word. We make it big!
                font_size = self.height
            else:
                self.generate_from_frequencies(dict(frequencies[:2]),
                                               max_font_size=self.height)
                # find font sizes
                sizes = [x[1] for x in self.layout_]
                font_size = int(2 * sizes[0] * sizes[1] / (sizes[0] + sizes[1]))
        else:
            font_size = max_font_size

        # we set self.words_ here because we called generate_from_frequencies
        # above... hurray for good design?
        self.words_ = dict(frequencies)

        # start drawing grey image
        for word, freq in frequencies:
            # select the font size
            rs = self.relative_scaling
            if rs != 0:
                font_size = int(round((rs * (freq / float(last_freq))
                                       + (1 - rs)) * font_size))
            if random_state.random() < self.prefer_horizontal:
                orientation = None
            else:
                orientation = Image.ROTATE_90
            tried_other_orientation = False
            while True:
                # try to find a position
                font = ImageFont.truetype(self.font_path, font_size)
                # transpose font optionally
                transposed_font = ImageFont.TransposedFont(
                    font, orientation=orientation)
                # get size of resulting text
                box_size = draw.textsize(word, font=transposed_font)
                # find possible places using integral image:
                result = occupancy.sample_position(box_size[1] + self.margin,
                                                   box_size[0] + self.margin,
                                                   random_state)
                if result is not None or font_size < self.min_font_size:
                    # either we found a place or font-size went too small
                    break
                # if we didn't find a place, make font smaller
                # but first try to rotate!
                if not tried_other_orientation and self.prefer_horizontal < 1:
                    orientation = (Image.ROTATE_90 if orientation is None else
                                   Image.ROTATE_90)
                    tried_other_orientation = True
                else:
                    font_size -= self.font_step
                    orientation = None

            if font_size < self.min_font_size:
                # we were unable to draw any more
                break

            x, y = np.array(result) + self.margin // 2
            # actually draw the text
            draw.text((y, x), word, fill="white", font=transposed_font)
            positions.append((x, y))
            orientations.append(orientation)
            font_sizes.append(font_size)
            colors.append(self.color_func(word, font_size=font_size,
                                          position=(x, y),
                                          orientation=orientation,
                                          random_state=random_state,
                                          font_path=self.font_path))
            # recompute integral image
            if self.mask is None:
                img_array = np.asarray(img_grey)
            else:
                img_array = np.asarray(img_grey) + boolean_mask
            # recompute bottom right
            # the order of the cumsum's is important for speed ?!
            occupancy.update(img_array, x, y)
            last_freq = freq

        self.layout_ = list(zip(frequencies, font_sizes, positions,
                                orientations, colors))
        return self

比较遗憾的词云并没有用到opencv库，如果用到opencv库应该可以做到更加炫酷

这里写图片描述

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

python

词云实现

python词云实现的相关文章

使用Python开发Web应用程序

我一直在用 python 做一些工作但这都是针对独立应用程序的我很想知道 python 的任何分支是否支持 Web 开发有人还会建议一个好的教程或网站吗我可以从中学习一些使用 python 进行 Web 开发的基础知识既然大家都说
如何在刻度标签和轴之间添加空间

我已成功增加刻度标签的字体但现在它们距离轴太近了我想在刻度标签和轴之间添加一点呼吸空间如果您不想全局更改间距通过编辑 rcParams 并且想要更简洁的方法请尝试以下操作 ax tick params axis both whic
Pycharm Python 控制台不打印输出

我有一个从 Pycharm python 控制台调用的函数但没有显示输出 In 2 def problem1 6 for i in range 1 101 2 print i end In 3 problem1 6 In 4 另一方面像
如何在 Sublime Text 2 的 OSX 终端中显示构建结果

我刚刚从 TextMate 切换到 Sublime Text 2 我非常喜欢它让我困扰的一件事是默认的构建结果显示在 ST2 的底部我的程序产生一些很长的结果显示它的理想方式如在 TM2 中是并排查看它们如何在 Mac 操作系统
Spark的distinct()函数是否仅对每个分区中的不同元组进行洗牌

据我了解 distinct 哈希分区 RDD 来识别唯一键但它是否针对仅移动每个分区的不同元组进行了优化想象一个具有以下分区的 RDD 1 2 2 1 4 2 2 1 3 3 5 4 5 5 5 在此 RDD 上的不同键上所有重复键
__del__ 真的是析构函数吗？

我主要用 C 做事情其中析构函数方法实际上是为了销毁所获取的资源最近我开始使用python 这真的很有趣而且很棒我开始了解到它有像java一样的GC 因此没有过分强调对象所有权构造和销毁据我所知 init 方法对我来说在 py
如何使用装饰器禁用某些功能的中间件？

我想模仿的行为csrf exempt see here https docs djangoproject com en 1 11 ref csrf django views decorators csrf csrf exempt and h
从列表中的数据框列中搜索部分字符串匹配 - Pandas - Python

我有一个清单 things A1 B2 C3 我有一个 pandas 数据框其中有一列包含用分号分隔的值某些行将包含与上面列表中的一项的匹配它不会是完美的匹配因为它在其中包含字符串的其他部分该列例如该列中的一行可能有哇这里
在 NumPy 中获取 ndarray 的索引和值

我有一个 ndarrayA任意维数N 我想创建一个数组B元组数组或列表其中第一个N每个元组中的元素是索引最后一个元素是该索引的值A 例如 A array 1 2 3 4 5 6 Then B 0 0 1 0 1 2 0 2 3 1 0
feedparser 在脚本运行期间失败，但无法在交互式 python 控制台中重现

当我运行 eclipse 或在 iPython 中运行脚本时它失败了 ascii codec can t decode byte 0xe2 in position 32 ordinal not in range 128 我不知道为什么但
HTTPS 代理不适用于 Python 的 requests 模块

我对 Python 还很陌生我一直在使用他们的 requests 模块作为 PHP 的 cURL 库的替代品我的代码如下 import requests import json import os import urllib impor
如何将 numpy.matrix 提高到非整数幂？

The 运算符为numpy matrix不支持非整数幂 gt gt gt m matrix 1 0 0 5 0 5 gt gt gt m 2 5 TypeError exponent must be an integer 我想要的是 oct
Numpy 优化

我有一个根据条件分配值的函数我的数据集大小通常在 30 50k 范围内我不确定这是否是使用 numpy 的正确方法但是当数字超过 5k 时它会变得非常慢有没有更好的方法让它更快 import numpy as np N 5000
Nuitka 未使用 nuitka --recurse-all hello.py [错误] 编译 exe

我正在尝试通过 nuitka 创建一个简单的 exe 这样我就可以在我的笔记本电脑上运行它而无需安装 Python 我在 Windows 10 上并使用 Anaconda Python 3 我输入 nuitka recurse all h
为美国东部以外地区的 Cloudwatch 警报发送短信？

AWS 似乎没有为美国东部以外的 SNS 主题订阅者提供 SMS 作为协议我想连接我的 CloudWatch 警报并在发生故障时接收短信但无法将其发送到 SMS YES 经过一番挖掘后我能够让它发挥作用它比仅仅选择一个主题或输入闹钟
检查所有值是否作为字典中的键存在

我有一个值列表和一本字典我想确保列表中的每个值都作为字典中的键存在目前我正在使用两组来确定字典中是否存在任何值 unmapped set foo set bar keys 有没有更Pythonic的方法来测试这个感觉有点像黑客您的方
VSCode：调试配置中的 Python 路径无效

对 Python 和 VSCode 以及 stackoverflow 非常陌生直到最近我已经使用了大约 3 个月一切都很好当尝试在调试器中运行任何基本的 Python 程序时弹出窗口The Python path in your
对输入求 Keras 模型的导数返回全零

所以我有一个 Keras 模型我想将模型的梯度应用于其输入这就是我所做的 import tensorflow as tf from keras models import Sequential from keras layers imp
协方差矩阵的对角元素不是 1 pandas/numpy

我有以下数据框 A B 0 1 5 1 2 6 2 3 7 3 4 8 我想计算协方差 a df iloc 0 values b df iloc 1 values 使用 numpy 作为 cov numpy cov a b I get ar
Pandas 与 Numpy 数据帧

看这几行代码 df2 df copy df2 1 df 1 df 1 values 1 df2 ix 0 0 我们的教练说我们需要使用 values属性来访问底层的 numpy 数组否则我们的代码将无法工作我知道 pandas Data

随机推荐

DDR扫盲—-关于Prefetch(预取)与Burst(突发)的深入讨论

DDR扫盲关于Prefetch 预取与Burst 突发的深入讨论原文转自 xff1a DDR扫盲关于Prefetch与Burst的深入讨论 Felix 电子技术应用 AET 中国科技核心期刊最丰富的电子设计资源平台 chinaa
矢量与场论 | 哈密顿算子，哈密顿算子，散度点乘，旋度叉乘的计算过程以及以及定理

矢量与场论哈密顿算子三种重要的矢量场 xff08 有势场管形场调和场 xff09 有势场 xff1a 设有矢量场A M xff0c 若存在单值函数u M xff0c 满足 A 61 gradu xff0c 则称这个矢量场为有势场 x
multisum11查找TL431,在VOLTAGE-REFERENCE类中
铅酸电池三段式充电过程：恒流，恒压，涓流。锂电池四阶段：涓流，恒流，恒压，停止

1 第一阶段 xff0c 快充 bulk 以最大 xff08 100 xff09 的输出电流对电池快速充电 2 第二阶段 xff0c 均充 absorption xff0c 达到电池最大充电电压 xff0c 进行稳压 xff0c 此时电流会
链表的创建与基本操作(Python版)

链表的创建与基本操作 Python版 span class hljs comment usr bin python span span class hljs comment coding utf 8 span span class hljs
交流电路红的视在功率VA数值上是电压与电流的乘积等于有功功率的平方加上无功功率的平方，再开平方。无功功率将电感或电容元件与交流电源往复交换的功率。虽然无功元件整体做功是0.吸收和释放相等

一有功功率在交流电路中 xff0c 凡是消耗在电阻元件上功率不可逆转换的那部分功率 xff08 如转变为热能光能或机械能 xff09 称为有功功率 xff0c 简称有功 xff0c 用 P 表示 xff0c 单位是瓦 xff08
软件破解注册码

写在破解之前 xff1a xff1a xff1a 软件破解的目的是 xff1a 有些需要注册的软件 xff0c 可是找不到注册码 xff0c 将其破解之后 xff0c 输入任何注册码都会提示注册成功声明 xff1a 此贴适合从来没接触过软
keil中怎么添加自己的头文件,加入工程，保存路径。#include还用吗

keil中怎么添加自己的头文件 xff0c 例如 xff1a 添加 include lt led h gt 要把它写在哪里 xff0c 保存在哪里 xff0c 才能编译后 xff0c 显示 include lt reg51 h gt inc
面试题整理简历中深度学习机器学习相关的知识及linux操作系统命令

深度学习与机器学习都在整理关于后台的 xff0c 被问到后忘了 xff0c 尴尬的确是我的问题 xff0c 基本的机器学习知识还是要整理一波 o inception 网络 xff1a 主要应用了深度可分离卷积 xff1a 主要用了大尺度
面试可能遇到的问题野指针等解决方法

空指针 xff1a 一般声明一个指针变量赋值为NULL xff0c 这就是空指针 xff0c 各个类型的空指针都存在确确实实的内存地址 xff0c 但是不会指向任何有效的值的内存地址 xff0c 对空指针操作 xff0c 例如访问属性和方法
大规模分布式储存系统笔记(一)

分布式储存系统的特性 xff1a 1 可扩展性可按集群规模增长 xff0c 系统性能线性增长 xff1b 2 低成本系统自动容错 xff0c 自动负载均衡 xff0c 运维方便 3 高性能 4 易用性对外提供接口数据类型 xff1a
用MATLAB实现对运动物体识别与跟踪

不得不说MATLAB的图像处理函数有点多 xff0c 但速度有时也是出奇的慢还是想c的指针 xff0c 虽然有点危险 xff0c 但速度那是杠杠的第二个MATLAB程序 xff0c 对运动物体的识别与追踪这里我们主要运用帧差法实现运动
PS 开启GPU加速图片处理

还认为你的电脑的速度效果比不上苹果吗 xff1f 还在嫌电脑渲染速度慢吗 xff1f 试一下 xff0c 电脑开启GPU硬件加速吧 xff01 只要有独显轻松加速 xff08 毕竟苹果笔记本要配独显电脑的价格基本上在15000以上 xff0
管道鸟cortex-M4（TM4C1294）

看到满屏的贪吃蛇 xff0c 我也来开源一个Ti开发板 xff08 TM4C1294 xff09 的游戏将简化版的管道鸟 xff0c 根据自己玩的经历 xff0c 在cortexm4开发板上重新撸了一边 xff0c 设计思路 xff1a
C#连接MYSQL数据库并进行查询

之前用MFC开发结果界面太难看被pass了要求用C 重新来开发 gt lt 不过终于摆脱VC6 0的蛋疼操作了Y 先来连接数据库 xff08 1 xff09 用c 连接MYSQL数据库需要用到mysql connector net xff
binascii.Error: Incorrect padding 报错解决

输入的base64编码字符串必须符合base64的padding规则当原数据长度不是3的整数倍时如果最后剩下两个输入数据 xff0c 在编码结果后加1个 61 xff1b 如果最后剩下一个输入数据 xff0c 编码结果后加2个 61 x
通过过滤器链了解spring security + oauth2实现单点登录的过程

一系统注意部署在同一机器 xff08 localhost xff09 上的三个应用 xff0c 为了防止存放在cookie中的JSESSIONID不被覆盖 xff0c 需要设置不同的path xff0c 可以在配置文件中指定不同的上下文
jetson tx2开箱上电

期待已久的jetson tx2终于到了 xff0c 来做一个开箱 jetson tx2是英伟达的第三代GPU嵌入式开发板前两代分别是jetson tk1和jetson tx1 jetson tk1 xff1a 绿色的版板子接口丰富 jet
Jetson tx2刷机过程中的坑

暑假各种事忙得差不多后 xff0c 终于有时间拿出早就申请到的tx2 xff0c 开始刷机教程 xff0c 这两天几乎踩边了所有的坑第一个坑 xff0c 虚拟机一般在安装VMware虚拟机时 xff0c 建议的安装空间20GB xff0
python词云实现

python的一个蛮酷炫的功能是可以轻松地实现词云 github上有关于这个项目的开源代码 xff1a https github com amueller word cloud 注意跑例程时要删除里面的wordcloud文件夹词云的功能有

python词云实现

python词云实现 的相关文章

随机推荐

热门标签

python词云实现的相关文章