Writing a Hadoop MapReduce Program in Python

2023-05-16

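Reposted from: "使用Python实现Hadoop MapReduce程序" (Writing a Hadoop MapReduce Program in Python)

English original: Writing an Hadoop MapReduce Program in Python

Based on the two articles above, this is the process as I ran it on my own Ubuntu machine. Most of the text is taken from the blog post "使用Python实现Hadoop MapReduce程序", since typing everything out again would be a waste of time.

In this walkthrough I will show how to write a simple MapReduce program for Hadoop in Python.

Although the Hadoop framework is written in Java, programs for Hadoop do not have to be written in Java; they can also be developed in languages such as C++ or Python. The example program on the official Hadoop website is written in Jython and packaged into a jar file, which is clearly inconvenient. It does not have to be done that way: we can program against Hadoop in plain Python. Take a look at the example in /src/examples/python/WordCount.py and you will see what I mean.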

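What we want to do

We will write a simple MapReduce program in CPython, rather than writing it in Jython and packaging it into a jar.
Our example mimics WordCount, implemented in Python: it reads text files and counts how often each word occurs. The result is also written as text, with each line containing a word and its count, separated by a tab.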

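Prerequisites

Before writing this program you need a Hadoop cluster up and running, so that you are not left stranded later on. If you have not set one up yet, the tutorial referenced below explains how to do it on Ubuntu Linux (it also applies to other Linux distributions and to Unix).

For how to set up Hadoop in single-node mode and pseudo-distributed mode on Ubuntu Linux, see the blog post "Ubuntu上搭建Hadoop环境（单机模式+伪分布模式）".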

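Python MapReduce code

The trick behind writing MapReduce code in Python is that we use HadoopStreaming to pass data between the Map and Reduce steps via STDIN (standard input) and STDOUT (standard output). We simply read input from Python's sys.stdin and write output to sys.stdout; HadoopStreaming takes care of everything else. It really does, believe it or not!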
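
The stdin/stdout contract can be tried out without Hadoop at all. The following is a minimal sketch that is not part of the original article: `stream_filter` is a hypothetical helper, Python 3 is assumed, and `io.StringIO` objects stand in for the `sys.stdin` and `sys.stdout` streams that HadoopStreaming would connect to a script.

```python
import io


def stream_filter(infile, outfile):
    """Read lines from infile and write 'line<TAB>length' to outfile.

    Hypothetical example step: any function with this shape can act as a
    streaming mapper or reducer, because it only touches two text streams.
    """
    for line in infile:
        line = line.strip()
        if line:  # skip empty lines
            outfile.write('%s\t%d\n' % (line, len(line)))


# In-memory stand-ins for sys.stdin and sys.stdout.
fake_stdin = io.StringIO('hello\n\nhadoop streaming\n')
fake_stdout = io.StringIO()
stream_filter(fake_stdin, fake_stdout)
print(fake_stdout.getvalue(), end='')
```

In a real streaming script the same function would be called as `stream_filter(sys.stdin, sys.stdout)`.
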
Map: mapper.py

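Save the following code as /usr/local/hadoop/mapper.py. It reads data from STDIN, splits each line into words, and writes one line per word that maps the word to its (intermediate) count.
Note: make sure the script has execute permission (chmod +x mapper.py).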

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        # (parenthesized print works on both Python 2 and 3)
        print('%s\t%s' % (word, 1))

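This script does not compute the total number of occurrences of a word. Instead it emits "<word> 1" immediately, even if <word> appears several times in the input, and leaves the summing to the subsequent Reduce step (or program). Of course, you can change the coding style to suit your own preferences.

Reduce: reducer.py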

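Save the following code as /usr/local/hadoop/reducer.py. The script reads the output of mapper.py from STDIN, sums the occurrences of each word, and writes the result to STDOUT.

Again, mind the script permissions: chmod +x reducer.py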

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
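
The comment in reducer.py about Hadoop sorting the map output is worth seeing in action. The sketch below is an illustration I added, not code from the original article: `reduce_pairs` is a hypothetical function that repeats the reducer's "current word" logic on an in-memory list, so the effect of unsorted input becomes visible.

```python
def reduce_pairs(pairs):
    """Sum the counts of consecutive (word, count) pairs that share a word.

    Same idea as reducer.py: a running total is flushed whenever the word
    changes, which is only correct if equal words arrive next to each other.
    """
    results = []
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        results.append((current_word, current_count))
    return results


unsorted_pairs = [('foo', 1), ('bar', 1), ('foo', 1)]
print(reduce_pairs(unsorted_pairs))          # 'foo' is reported twice
print(reduce_pairs(sorted(unsorted_pairs)))  # one entry per word
```

This is why the local test pipeline needs the `sort` step between the mapper and the reducer.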

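Test your code (cat data | map | sort | reduce)

I recommend testing your mapper.py and reducer.py scripts by hand before running them as a MapReduce job; otherwise the job may simply fail to return any results.

Here are some ideas on how to test the Map and Reduce functionality: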

hadoop@derekUbun:/usr/local/hadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
hadoop@derekUbun:/usr/local/hadoop$ echo "foo foo quux labs foo bar quux" |./mapper.py | sort |./reducer.py
bar 1
foo 3
labs 1
quux 2
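
To check the pipeline output against an independent computation, something like the following can help. It is a rough sketch under my own assumptions: `parse_reducer_output` and `matches_reference` are hypothetical helpers, and the output string is typed in by hand from the session above (with tabs between word and count) rather than captured from the scripts.

```python
from collections import Counter


def parse_reducer_output(output):
    """Turn 'word<TAB>count' lines into a {word: count} dict."""
    counts = {}
    for line in output.splitlines():
        if not line:
            continue
        word, count = line.split('\t', 1)
        counts[word] = int(count)
    return counts


def matches_reference(text, output):
    """Compare reducer output with a plain Counter over the same text."""
    return parse_reducer_output(output) == dict(Counter(text.split()))


text = 'foo foo quux labs foo bar quux'
pipeline_output = 'bar\t1\nfoo\t3\nlabs\t1\nquux\t2\n'
print(matches_reference(text, pipeline_output))
```

Comparing dictionaries rather than lines avoids depending on the order in which `sort` arranges the words.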

# using one of the ebooks as example input
# (see below on where to get the ebooks)

hadoop@derekUbun:/usr/local/hadoop$ cat book/book.txt |./mapper.py
subscribe 1
to 1
our 1
email 1
newsletter 1
to 1
hear 1
about 1
new 1
eBooks. 1

Running the Python scripts on Hadoop

For this example we need an e-book; place it at /usr/local/hadoop/book/book.txt:

hadoop@derekUbun:/usr/local/hadoop$ ls -l book
总用量 636
-rw-rw-r-- 1 derek derek 649669 3月 12 12:22 book.txt

Copy local data to HDFS

Before we run the MapReduce job, we need to copy the local files to HDFS:

hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -copyFromLocal /usr/local/hadoop/book book
hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -ls
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2013-03-12 15:56 /user/hadoop/book

Run the MapReduce job

Now that everything is ready, we can run the Python MapReduce job on the Hadoop cluster. As mentioned above, we use HadoopStreaming to pass data between Map and Reduce via STDIN and STDOUT.

hadoop@derekUbun:/usr/local/hadoop$ hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar \
    -mapper /usr/local/hadoop/mapper.py \
    -reducer /usr/local/hadoop/reducer.py \
    -input book/* \
    -output book-output

If you want to change Hadoop settings for a run, such as increasing the number of Reduce tasks, you can use the "-jobconf" option:

hadoop@derekUbun:/usr/local/hadoop$ hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar \
    -jobconf mapred.reduce.tasks=4 \
    -mapper /usr/local/hadoop/mapper.py \
    -reducer /usr/local/hadoop/reducer.py \
    -input book/* \
    -output book-output

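If both commands above fail, try the form below, which also ships the scripts with the job via -file. Note that before re-running a job you must delete its output directory from HDFS.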

bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar \
    -mapper task1/mapper.py \
    -file task1/mapper.py \
    -reducer task1/reducer.py \
    -file task1/reducer.py \
    -input url \
    -output url-output \
    -jobconf mapred.reduce.tasks=3
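
Since a re-run needs the old output directory removed first, it can be handy to keep both commands together. The sketch below only builds and prints the argument lists and does not execute anything. `streaming_command` and `cleanup_command` are hypothetical helper names, and the `dfs -rmr` form is an assumption based on the Hadoop 1.x shell used in this article; check the shell documentation for your version.

```python
def cleanup_command(output_path):
    """Arguments to remove an old output directory (assumes Hadoop 1.x `dfs -rmr`)."""
    return ['hadoop', 'dfs', '-rmr', output_path]


def streaming_command(jar, mapper, reducer, input_path, output_path,
                      reduce_tasks=None):
    """Arguments for a streaming job, mirroring the commands shown above."""
    cmd = ['hadoop', 'jar', jar]
    if reduce_tasks is not None:
        cmd += ['-jobconf', 'mapred.reduce.tasks=%d' % reduce_tasks]
    cmd += ['-mapper', mapper, '-reducer', reducer,
            '-input', input_path, '-output', output_path]
    return cmd


job = streaming_command(
    jar='contrib/streaming/hadoop-streaming-1.1.2.jar',
    mapper='/usr/local/hadoop/mapper.py',
    reducer='/usr/local/hadoop/reducer.py',
    input_path='book/*',
    output_path='book-output',
    reduce_tasks=4,
)
print(' '.join(cleanup_command('book-output')))
print(' '.join(job))
```

The two lists could be handed to `subprocess.call` one after the other, with the cleanup step first.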

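An important note: Hadoop does not honor mapred.map.tasks beyond treating it as a hint. The job reads all files in the HDFS directory book, processes them, and stores the results in separate result files under the HDFS directory book-output. The output of my earlier run was: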

hadoop@derekUbun:/usr/local/hadoop$ hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -jobconf mapred.reduce.tasks=4 -mapper /usr/local/hadoop/mapper.py -reducer /usr/local/hadoop/reducer.py -input book/* -output book-output
13/03/12 16:01:05 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/usr/local/hadoop/tmp/hadoop-unjar4835873410426602498/] [] /tmp/streamjob5047485520312501206.jar tmpDir=null
13/03/12 16:01:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/12 16:01:06 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/12 16:01:06 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/12 16:01:06 INFO streaming.StreamJob: getLocalDirs(): [/usr/local/hadoop/tmp/mapred/local]
13/03/12 16:01:06 INFO streaming.StreamJob: Running job: job_201303121448_0010
13/03/12 16:01:06 INFO streaming.StreamJob: To kill this job, run:
13/03/12 16:01:06 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201303121448_0010
13/03/12 16:01:06 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201303121448_0010
13/03/12 16:01:07 INFO streaming.StreamJob: map 0% reduce 0%
13/03/12 16:01:10 INFO streaming.StreamJob: map 100% reduce 0%
13/03/12 16:01:17 INFO streaming.StreamJob: map 100% reduce 8%
13/03/12 16:01:18 INFO streaming.StreamJob: map 100% reduce 33%
13/03/12 16:01:19 INFO streaming.StreamJob: map 100% reduce 50%
13/03/12 16:01:26 INFO streaming.StreamJob: map 100% reduce 67%
13/03/12 16:01:27 INFO streaming.StreamJob: map 100% reduce 83%
13/03/12 16:01:28 INFO streaming.StreamJob: map 100% reduce 100%
13/03/12 16:01:29 INFO streaming.StreamJob: Job complete: job_201303121448_0010
13/03/12 16:01:29 INFO streaming.StreamJob: Output: book-output
hadoop@derekUbun:/usr/local/hadoop$

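As the output above shows, Hadoop also provides a basic web interface for statistics and job information.
While the Hadoop cluster is running, you can open http://localhost:50030/ in a browser:

Check that the results were written to the HDFS directory book-output: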

hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -ls book-output
Found 6 items
-rw-r--r-- 2 hadoop supergroup 0 2013-03-12 16:01 /user/hadoop/book-output/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2013-03-12 16:01 /user/hadoop/book-output/_logs
-rw-r--r-- 2 hadoop supergroup 33 2013-03-12 16:01 /user/hadoop/book-output/part-00000
-rw-r--r-- 2 hadoop supergroup 60 2013-03-12 16:01 /user/hadoop/book-output/part-00001
-rw-r--r-- 2 hadoop supergroup 54 2013-03-12 16:01 /user/hadoop/book-output/part-00002
-rw-r--r-- 2 hadoop supergroup 47 2013-03-12 16:01 /user/hadoop/book-output/part-00003
hadoop@derekUbun:/usr/local/hadoop$
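
The four part-0000N files correspond to the four reduce tasks requested with mapred.reduce.tasks=4: every key is routed to exactly one reducer, so each word ends up in exactly one file. The following sketch only illustrates that idea. `partition` and `split_by_reducer` are hypothetical names, and `zlib.crc32` is a stand-in hash of my choosing, so the partition numbers it produces will not match the ones Hadoop actually assigns.

```python
import zlib


def partition(word, num_reducers):
    """Pick a reducer index for a key (illustrative hash, not Hadoop's)."""
    digest = zlib.crc32(word.encode('utf-8')) & 0x7fffffff
    return digest % num_reducers


def split_by_reducer(words, num_reducers):
    """Group words into one list per reducer, like one part file each."""
    parts = [[] for _ in range(num_reducers)]
    for word in words:
        parts[partition(word, num_reducers)].append(word)
    return parts


words = 'foo foo quux labs foo bar quux'.split()
for index, part in enumerate(split_by_reducer(words, 4)):
    print('part-%05d: %s' % (index, part))
```

Because the assignment depends only on the key, all occurrences of a word reach the same reducer and can be summed there.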

You can inspect the contents of an output file with the dfs -cat command:

hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -cat book-output/part-00000
about 1
eBooks. 1
the 1
to 2
hadoop@derekUbun:/usr/local/hadoop$
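
Once the part files have been fetched, a few lines of Python are enough to rank the words. This is a small sketch rather than anything from the original article: `top_words` is a hypothetical helper, and the sample lines are retyped from the listing above with an assumed tab between word and count.

```python
import heapq


def top_words(lines, n):
    """Return the n most frequent (word, count) pairs from 'word<TAB>count' lines."""
    counts = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        word, count = line.split('\t', 1)
        counts.append((int(count), word))
    # nlargest compares (count, word) tuples, so ties are broken by word.
    return [(word, count) for count, word in heapq.nlargest(n, counts)]


sample = ['about\t1\n', 'eBooks.\t1\n', 'the\t1\n', 'to\t2\n']
print(top_words(sample, 2))
```

Lines from several part files can simply be concatenated before the call, since each word appears in only one of them.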

Below are the improved versions of mapper.py and reducer.py from the author of the original English article:

mapper.py

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""
import sys


def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()


def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))


if __name__ == "__main__":
    main()

reducer.py

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""
from itertools import groupby
from operator import itemgetter
import sys


def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)


def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    # current_word - string containing a word (the key)
    # group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print("%s%s%d" % (current_word, separator, total_count))
        except ValueError:
            # count was not a number, so silently discard this item
            pass


if __name__ == "__main__":
    main()
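
The heart of this reducer is the `groupby(data, itemgetter(0))` call. As a quick illustration (the input list is made up by me, shaped like what `read_mapper_output` yields after sorting), the grouping can be tried on an in-memory list:

```python
from itertools import groupby
from operator import itemgetter

# Sorted ["<word>", "<count>"] items, as the reducer would receive them.
pairs = [['bar', '1'], ['foo', '1'], ['foo', '1'], ['foo', '1'],
         ['quux', '1'], ['quux', '1']]

totals = []
for word, group in groupby(pairs, itemgetter(0)):
    # group lazily yields every item whose first field equals word
    totals.append((word, sum(int(count) for _, count in group)))

print(totals)
```

`groupby` only merges neighbouring items with the same key, so it relies on sorted input just as the simpler reducer does.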
