TFIDF算法Hadoop实现

2023-11-13

程序说明：利用MapReduce计算框架，计算一组英文文档中各个单词的TFIDF。某单词在某文档的TFIDF=该单词中该文档的TF×该单词IDF。其中，
- TF(i,j)：单词i在文档j中出现的频率（Term Frequency）。TF(i,j)=N(i,j)/N(j)，N(i,j)是单词i中文档j中出现的次数，N(j)是文档j的单词总数。
- IDF(i)：单词i的逆文件频率（Inverse Document Frequency）。IDF(i)=LOG((M+1)/M(i))，M是文件总数，M(i)是包含单词i的文件数。M+1，为了避免M与M(i)相等，导致对数为零的情况。

程序结构：程序由3个Job，按照chain mapreduce进行计算。其中，
- Job1：包括Tfmapper、Tfreducer、Tfgroup，计算单词在文档的TF。
- Job2：包括Idfmapper、Idfreducer、Idfsort，计算单词的IDF。
- Job3：包括Tfidf_tfmapper、Tfidf_idfmapper、Tfidfreducer、Tfidfsort、Tfidfgroup，计算单词的TFIDF。

代码参考：
- http://blog.csdn.net/jackydai987/article/details/6303459

- blog.csdn.net/ididcan/article/details/6657977

TFIDF源代码内容：

1）编写Docword类，把相关单词、对应文件名、相关指标（如IDF、TFIDF等等）作为属性。

<span style="font-size:18px;">package tfidf;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class Docword implements WritableComparable<Docword> {
    String docname;
    String word;
    double index;
    static final String DELIMITER=",";
    public Docword() {
        super();
        // TODO Auto-generated constructor stub
    }

    public String getDocname() {
        return docname;
    }

    public void setDocname(String docname) {
        this.docname = docname;
    }

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }

    public double getIndex() {
        return index;
    }

    public void setIndex(double index) {
        this.index = index;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // TODO Auto-generated method stub
        out.writeUTF(docname);
        out.writeUTF(word);
        out.writeDouble(index);
    }

    @Override
    public String toString() {
        return docname+DELIMITER+word;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // TODO Auto-generated method stub
        docname=in.readUTF();
        word=in.readUTF();
        index=in.readDouble();
    }

    @Override
    public int compareTo(Docword arg0) {
        // TODO Auto-generated method stub
        int result=0;
        result=(result!=0)?result:docname.compareTo(arg0.getDocname());
        result=(result!=0)?result:word.compareTo(arg0.getWord());
        return result;
    }

}</span>

2）TFIDF代码。已有注释，具体如下。

<span style="font-size:18px;">public class Tfidf extends Configured implements Tool {

    static final String DELIMITER=",";

    /* =======================Job1：Tfidf-tf=======================
     * 程序说明：计算文档中各单词TF。程序主要思路借鉴了MapReduce Tutorial的WordCount2.0代码，
     * - 利用Local Resource，将待删除的字符串（如逗号、句号等）读入内存，对split中每个record删除特定字符串。
     * - 设置”全局“变量（Map类的属性），统计文档的单词总数。注意，文档可能由于超过split大小，需要被分为多个split处理。所以Map中”全局“变量仅统计该split中各文档单词数，实际数需要中reduce中按照filename进行合并。
     *
     * 程序输入：HDFS某目录下所有Text英文文档。
     * 程序输出：各文档中单词的TF，按照“filename, wordname, termfreq”格式输出。
     * （Shuffle）排序规则：1）filename。2）wordname。
     * （Shuffle）合并规则：1）filename。2）wordname。
     */

    public static class Tfmapper extends
            Mapper<LongWritable, Text, Docword, IntWritable> {
        Docword outputKey=new Docword();
        IntWritable outputValue=new IntWritable(1);//拆分每个单词，按频次为1，在reduce中分文件名进行合并
        HashSet<String> skipstr=new HashSet<String>(); //存储待删除的字符串
        int termnumindoc=0;//每个split中各文档中单词总数。

        @Override
        protected void cleanup(Context context) throws IOException,
                InterruptedException {
            // TODO Auto-generated method stub
            FileSplit filesplit=(FileSplit)context.getInputSplit();
            String filename=filesplit.getPath().getName();
            outputKey.setDocname(filename);
            outputKey.setWord("!termnum");//!termnum标识该记录是文件总单词数，由于map中已经对记录删除”！“，所以不会重复。
            context.write(outputKey, new IntWritable(termnumindoc));//按照”filename,!termnum,termnumindoc“格式输出。由于!的ASCII码中所有字母之前，按照Docword.compareTo()，经过shuffle后，!termnum记录会出现中该filename所有记录的第一位。
        }
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            String str=value.toString().toLowerCase();
            for(String skip:skipstr){
                str=str.replaceAll(skip, "");//对每个redcord，删除特定字符串
            }
            String []words=StringUtils.split(str, ' ');
            FileSplit filesplit=(FileSplit)context.getInputSplit();//获取InputSplit。每个InputSplit只属于一个InputFile，每个InputFile至少有一个InputSplit。
            String filename=filesplit.getPath().getName();//利用FileSplit，提取输入文件名。

            for(String word:words){
                if(word.trim().isEmpty())//删除空行和空字符串。也可以放在待删除字符串，用正则表达式实现。
                    continue;
                if(word.charAt(0)<'a' || word.charAt(0)>'z')//删除非字母开头的单词。也可以放在待删除字符串，用正则表达式实现。
                    continue;
                termnumindoc++;//文档的单词数加1.
                outputKey.setDocname(filename);//文件名
                outputKey.setWord(word);//单词名
                context.write(outputKey, outputValue);//按照”filename,wordname,1“格式输出到本地文件系统，进行shuffle
            }
        }
        @Override
        protected void setup(Context context) throws IOException,
                InterruptedException {
            // TODO Auto-generated method stub
            String line="";
            BufferedReader fs=new BufferedReader(new FileReader("skipstr"));//读入待剔除的字符串
            while((line=fs.readLine())!=null){
                skipstr.add(line);//字符串逐个加入内存中的HashSet
            }
        }

    }
    public static class Tfcombiner extends
            Reducer<Docword, IntWritable, Docword, IntWritable> {
        IntWritable outputValue=new IntWritable();
        @Override
        protected void reduce(Docword key, Iterable<IntWritable> values,
                Context context)
                throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            int sumall=0;
            for(IntWritable value:values){
                sumall+=value.get();
            }
            outputValue.set(sumall);
            context.write(key, outputValue);
        }

    }
    public static class Tfreducer extends
            Reducer<Docword, IntWritable, Docword, DoubleWritable> {
        DoubleWritable outputValue=new DoubleWritable();
        int termnumindoc=0;//各文档中单词总数。由于文档可能超过split尺寸大小，被拆分在多个split被多个Map统计单词数。在Reduce中对各个Map统计的单词总数进行汇总。
        @Override
        protected void reduce(Docword key, Iterable<IntWritable> values,
                Context context)
                throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            int sumall=0;
            for(IntWritable value:values){
                sumall+=value.get();
            }
            if(key.getWord().equals("!termnum")){//单词名是!termnum，标记该记录是文件的单词总数。
                termnumindoc=sumall;
            }
            else{//单词名不是!termnum，标记该记录是文件实际单词。
                // 由于!的ASCII码中所有字母之前，按照Docword.compareTo()，经过shuffle后，!termnum记录会出现中该filename所有记录的第一位。
                // 计算TF时，分母”文件中单词总数“termnumindoc已有数据。
                outputValue.set((double)1*sumall/termnumindoc);
                context.write(key, outputValue);//按照”filename,wordname,termfreq“输出到HDFS，作为Job2的输入
            }
        }

    }

    public static class Tfpartition extends Partitioner<Docword, IntWritable> {

        @Override
        public int getPartition(Docword key, IntWritable value,
                int numPartitions) {
            // TODO Auto-generated method stub
            return Math.abs((key.getDocname()).hashCode())%numPartitions;
        }

    }

    /* =======================Job2：Tfidf-idf=======================
     * 程序说明：计算各单词idf。程序读入Job1的输出数据，按照"wordname,filename"格式，对各个wordname进行合并，计算word的IDF。
     * - 在run()中，利用HDFS API获取文件总数，并传递给Reduce。
     * - 由于Job1的输出数据按照filename进行排序，顺序读入记录并比较前后两个记录的filename有没有改变。如果没有改变，则文件总数不变；如果有改变，则文件总数加1。
     *
     * 程序输入：Job1的输出文件，格式为“filename, wordname, termfreq”
     * 程序输出：单词的Idf，格式为“wordname, idf”。
     * （Shuffle）排序规则：wordname。
     * （Shuffle）合并规则：wordname。
     */

    public static class Idfmapper extends
            Mapper<LongWritable, Text, Docword, IntWritable> {
        Docword outputKey=new Docword();
        IntWritable outputValue=new IntWritable(1);//拆分单词，按频次为1，在reduce中分文件名进行合并
        int filenum=0;
        String filename="";//文件名。

        @Override
        protected void map(LongWritable key, Text value,
                Context context)
                throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            String []words=StringUtils.split(value.toString(), ',');
            outputKey.setDocname(words[0]);
            outputKey.setWord(words[1]);
            context.write(outputKey, outputValue);
        }

    }
    public static class Idfcombiner extends
            Reducer<Docword, IntWritable, Docword, IntWritable> {
        IntWritable outputValue=new IntWritable();

        @Override
        protected void reduce(Docword key, Iterable<IntWritable> values,
                Context context)
                throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            int sumall=0;
            for(IntWritable value:values){
                sumall+=value.get();
            }
            outputValue.set(sumall);
            context.write(key, outputValue);
        }

    }
    public static class Idfreducer extends
                Reducer<Docword, IntWritable, Text, DoubleWritable> {
            DoubleWritable outputValue=new DoubleWritable();
            Text outputKey=new Text();
            int alldoc=0;//文件总数。

            @Override
            protected void setup(Context context) throws IOException,
                    InterruptedException {
                // TODO Auto-generated method stub
                //将run()中传递的变量读入内存，获得文件总数。
                alldoc=Integer.parseInt(context.getConfiguration().get("filesnum"));
            }

            @Override
            protected void reduce(Docword key, Iterable<IntWritable> values,
                    Context context)
                    throws IOException, InterruptedException {
                // TODO Auto-generated method stub
                //由于Job1的输出数据按照filename进行排序，顺序读入记录并比较前后两个记录的filename有没有改变。
                // 如果没有改变，则文件总数不变；如果有改变，则文件总数加1。
                int termdocnum=0;
                for(IntWritable value:values){
                        termdocnum+=value.get();//单词对应文件数加1。
                }
                outputKey.set(key.getWord());
                outputValue.set((double)Math.log((double)(alldoc+1)/termdocnum));
                context.write(outputKey, outputValue);//输出idf计算结果，输出格式“wordname,idf”
            }
        }
    public static class Idfpartition extends Partitioner<Docword, IntWritable> {

        @Override
        public int getPartition(Docword key, IntWritable value,
                int numPartitions) {
            // TODO Auto-generated method stub
            return Math.abs((key.getWord().hashCode()))%numPartitions;
        }

    }
    public static class Idfsort extends WritableComparator {
        //在shuffle中，所有记录按照wordname进行排序，按照wordname进行合并。
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            // TODO Auto-generated method stub
            Docword lhs=(Docword)a;
            Docword rhs=(Docword)b;
            return lhs.getWord().compareTo(rhs.getWord());
        }

        public Idfsort() {
            super(Docword.class,true);
            // TODO Auto-generated constructor stub
        }
    }

    /* =======================Job3：Tfidf-tfidf=======================
     * 程序说明：计算各单词tfidf。程序利用MultipleInputs分别读入Job1、Job2的输出数据，类似ReduceSide Join在Reduce中进行汇总计算。
     * - MultipleInputs：配置Map，分别读入Job1、Job2的输出数据。其中，读入job2的输入数据后，设置filename为“!alldoc”。由于!的ASCII码小于所有字母，所以同一个word的job2记录在shuffle中排在job1记录前面。
     * - Reduce：设置sortComparator，按照wordname（高优先级）、filename（低优先级）进行排序；设置groupComparator，按照wordname进行合并记录。
     *
     * 程序输入：Job1的输出文件，格式为“filename, wordname, termfreq”；Job2的输出文件，格式为“wordname,idf”。
     * 程序输出：单词的tfIdf，格式为“filename,wordname, idf”。
     * （Shuffle）排序规则：1）wordname。2）filename。
     * （Shuffle）合并规则：wordname。
     */
    public static class Tfidf_tfmapper extends
            Mapper<LongWritable, Text, Docword, Docword> {
        Docword outputKey=new Docword();
        Docword outputValue=new Docword();
        @Override
        protected void map(LongWritable key, Text value,
                Context context)
                throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            String []words=StringUtils.split(value.toString(), ',');
            outputKey.setWord(words[1]);
            outputKey.setDocname(words[0]);
            outputValue.setDocname(words[0]);
            outputValue.setWord(words[1]);
            outputValue.setIndex(Double.parseDouble(words[2]));
            context.write(outputKey, outputValue);//读入Job1的输出文件，格式为“filename, wordname, termfreq”
        }
    }

    public static class Tfidf_idfmapper extends
            Mapper<LongWritable, Text, Docword, Docword> {
        Docword outputValue=new Docword();
        Docword outputKey=new Docword();
        @Override
        protected void map(LongWritable key, Text value,
                Context context)
                throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            String []words=StringUtils.split(value.toString(), ',');
            outputValue.setDocname("!alldoc");
            outputValue.setWord(words[0]);
            outputValue.setIndex(Double.parseDouble(words[1]));
            outputKey.setWord(words[0]);
            outputKey.setDocname("!alldoc");
            context.write(outputKey, outputValue);//读入Job2的输出文件，格式为“wordname,idf”。
        }
    }

    public static class Tfidfreducer extends
            Reducer<Docword, Docword, Text, DoubleWritable> {
        Text outputKey=new Text();
        DoubleWritable outputValue=new DoubleWritable();
        @Override
        protected void reduce(Docword key, Iterable<Docword> values,
                Context context)
                throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            double termidf=0.0,termfq=0.0;
            for(Docword value:values){
                //读入job2的输入数据后，设置filename为“!alldoc”。由于!的ASCII码小于所有字母，所以同一个word的job2记录在shuffle中排在job1记录前面。
                if(value.getDocname().equals("!alldoc")){
                    termidf=value.getIndex();
                }else{
                    termfq=value.getIndex();
                    outputKey.set(value.getDocname()+","+value.getWord());
                    outputValue.set(termidf*termfq);
                    context.write(outputKey, outputValue);
                }
            }
        }
    }
    public static class Tfidfsort extends WritableComparator {

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            // TODO Auto-generated method stub
            Docword lhs=(Docword)a;
            Docword rhs=(Docword)b;
            int result=0;
            result=(result!=0)?result:lhs.getWord().compareTo(rhs.getWord());
            result=(result!=0)?result:lhs.getDocname().compareTo(rhs.getDocname());
            return result;//按照wordname、filename进行排序。
        }

        public Tfidfsort() {
            super(Docword.class,true);
            // TODO Auto-generated constructor stub
        }

    }
    public static class Tfidfpartition extends Partitioner<Docword, Docword> {

        @Override
        public int getPartition(Docword key, Docword value,
                int numPartitions) {
            // TODO Auto-generated method stub
            return Math.abs((key.getWord().hashCode()))%numPartitions;
        }

    }
    public static class Tfidfgroup extends WritableComparator {

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            // TODO Auto-generated method stub
            Docword lhs=(Docword)a;
            Docword rhs=(Docword)b;
            return lhs.getWord().compareTo(rhs.getWord());//按照wordname进行合并。
        }

        public Tfidfgroup() {
            super(Docword.class,true);
            // TODO Auto-generated constructor stub
        }

    }
    public int run(String []args) throws Exception{
        Path in1=new Path("data/wordcount");//输入文件路径
        Path out1=new Path("output/tfidf-tf");//Job1的tf结果输出路径
        Path out2=new Path("output/tfidf-idf");//Job2的idf结果输出路径
        Path out3=new Path("output/tfidf-tfidf");//Job3的tfidf结果输出路径
        URI skipstr=new URI("data/skipstr");//Job1的待删除字符串路径

        //============Job1配置============
        Job job1=Job.getInstance(getConf(), "tfidf-tf");
        Configuration conf1=job1.getConfiguration();
        job1.setJarByClass(getClass());

        FileInputFormat.setInputPaths(job1, in1);

        out1.getFileSystem(conf1).delete(out1, true);
        FileOutputFormat.setOutputPath(job1, out1);
        conf1.set(TextOutputFormat.SEPERATOR, DELIMITER);

        job1.setInputFormatClass(TextInputFormat.class);
        job1.setOutputFormatClass(TextOutputFormat.class);

        job1.setMapperClass(Tfmapper.class);
        job1.setMapOutputKeyClass(Docword.class);
        job1.setMapOutputValueClass(IntWritable.class);
        job1.setCombinerClass(Tfcombiner.class);
        job1.setReducerClass(Tfreducer.class);
        job1.setOutputKeyClass(Docword.class);
        job1.setOutputValueClass(DoubleWritable.class);
        job1.setPartitionerClass(Tfpartition.class);
        job1.addCacheFile(skipstr);
        job1.setNumReduceTasks(3);

        if(job1.waitForCompletion(true)==false)
            return 1;

        //============Job2配置============
        Job job2=Job.getInstance(getConf(), "tfidf-idf");
        Configuration conf2=job2.getConfiguration();
        job2.setJarByClass(getClass());

        FileInputFormat.setInputPaths(job2, out1);
        out2.getFileSystem(conf2).delete(out2, true);

        //利用HDFS API接口，获得输入文件总数，并通过变量filesum传递给Job2。
        FileSystem hdfs=FileSystem.get(conf2);
        FileStatus p[]=hdfs.listStatus(in1);
        conf2.set("filesnum", Integer.toString(p.length));

        FileOutputFormat.setOutputPath(job2, out2);
        conf2.set(TextOutputFormat.SEPERATOR, DELIMITER);
        job2.setInputFormatClass(TextInputFormat.class);
        job2.setOutputFormatClass(TextOutputFormat.class);

        job2.setSortComparatorClass(Idfsort.class);
        job2.setGroupingComparatorClass(Idfsort.class);

        job2.setMapperClass(Idfmapper.class);
        job2.setMapOutputKeyClass(Docword.class);
        job2.setMapOutputValueClass(IntWritable.class);
        job2.setCombinerClass(Idfcombiner.class);
        job2.setReducerClass(Idfreducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(DoubleWritable.class);
        job2.setNumReduceTasks(3);
        job2.setPartitionerClass(Idfpartition.class);

        if(job2.waitForCompletion(true)==false)
            return 1;

        //============Job3配置============
        Job job3=Job.getInstance(getConf(), "tfidf-tfidf");
        Configuration conf3=job3.getConfiguration();
        job3.setJarByClass(getClass());

        out3.getFileSystem(conf3).delete(out3, true);
        FileOutputFormat.setOutputPath(job3, out3);
        conf3.set(TextOutputFormat.SEPERATOR, DELIMITER);
        job3.setOutputFormatClass(TextOutputFormat.class);

        //利用MultipleInputs，配置Map分别读取Job1、Job2的输出文件
        MultipleInputs.addInputPath(job3, out1, TextInputFormat.class, Tfidf_tfmapper.class);
        MultipleInputs.addInputPath(job3, out2, TextInputFormat.class, Tfidf_idfmapper.class);

        job3.setMapOutputKeyClass(Docword.class);
        job3.setMapOutputValueClass(Docword.class);

        job3.setReducerClass(Tfidfreducer.class);
        job3.setOutputKeyClass(Text.class);
        job3.setOutputValueClass(DoubleWritable.class);
        job3.setNumReduceTasks(3);
        job3.setSortComparatorClass(Tfidfsort.class);
        job3.setGroupingComparatorClass(Tfidfgroup.class);
        job3.setPartitionerClass(Tfidfpartition.class);
        return job3.waitForCompletion(true)?0:1;

    }
    public static void main(String []args) throws Exception{
        int result=0;
        try{
            result=ToolRunner.run(new Configuration(), new Tfidf(), args);
        }catch(Exception e){
            e.printStackTrace();
        }
        System.exit(result);
    }

}</span>

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

TFIDF算法Hadoop实现的相关文章

名称节点处于安全模式

我提到了这些问题名称节点处于安全模式无法离开 https stackoverflow com questions 15803266 name node is in safe mode not able to leave and SafeM
无法在 Hadoop Map-Reduce 作业中加载 OpenNLP 句子模型

我正在尝试将 OpenNLP 集成到 Hadoop 上的 Map Reduce 作业中从一些基本的句子分割开始在地图函数中运行以下代码 public AnalysisFile analyze String content InputS
Spark 和 Python 使用自定义文件格式/生成器作为 RDD 的输入

我想问一下 Spark 中输入的可能性我可以看到从http spark apache org docs latest programming guide html http spark apache org docs latest pro
从 HDFS 传出文件

我想将文件从 HDFS 传输到另一台服务器的本地文件系统该服务器不在 hadoop 集群中而是在网络中我本可以这样做 hadoop fs copyToLocal
如何找到 JAR：/home/hadoop/contrib/streaming/hadoop-streaming.jar

我正在练习有关 Amazon EMR 的复数视角视频教程我被困住了因为我收到此错误而无法继续 Not a valid JAR home hadoop contrib streaming hadoop streaming jar 请注意
如何直接将一个mapper-reducer的输出发送到另一个mapper-reducer而不将输出保存到hdfs中

问题最终解决检查底部的我的解决方案最近我尝试运行 Mahout in Action 的第 6 章列表 6 1 6 4 中的推荐示例但我遇到了一个问题我用谷歌搜索但找不到解决方案问题是我有一对映射器减速器 public fina
是否可以使用 Java 读写 Parquet，而不依赖 Hadoop 和 HDFS？

我一直在寻找这个问题的解决方案在我看来如果不引入对 HDFS 和 Hadoop 的依赖就无法在 Java 程序中嵌入读写 Parquet 格式它是否正确我想在 Hadoop 集群之外的客户端计算机上进行读写我开始对 Apache
更改 Hadoop 中的数据节点数量

如何改变数据节点的数量即禁用和启用某些数据节点来测试可扩展性说得更清楚一点我有4个数据节点我想一一实验1 2 3 4个数据节点的性能是否可以只更新名称节点中的从属文件临时停用节点的正确方法创建一个排除文件这列出了您想要删除
在蜂巢中出现错误

当我连接到 ireport 时如果说在 hive shell 中显示表则会出现此错误元数据错误 java lang RuntimeException 无法实例化 org apache hadoop hive metastore Hiv
公平调度器和容量调度器有什么区别？

我是 Hadoop 世界的新手想了解公平调度程序和容量调度程序之间的区别另外我们什么时候应该使用每一个请简单地回答一下因为我在网上读了很多东西但从中得到的不多公平调度是一种为作业分配资源的方法使得所有作业随着时间的推移平均获得
Python 包安装：pip 与 yum，还是两者一起安装？

我刚刚开始管理 Hadoop 集群我们使用 Bright Cluster Manager 直至操作系统级别 CentOS 7 1 然后使用 Ambari 以及适用于 Hadoop 的 Hortonworks HDP 2 3 我不断收到安装
无法在 Presto 中读取数据 - 在 Hive 中可以读取数据

我有一个 Hive DB 我创建了一个与 Parquet 文件类型兼容的表 CREATE EXTERNAL TABLE default table date date udid string message token string PAR
MongoDB/PyMongo：如何在 Map 函数中使用点表示法？

我正在尝试计算每个邮政编码中找到的记录数在我的 MongoDB 中嵌入了邮政编码使用点表示法它位于 a res z a 代表地址 res 代表住宅 z 代表邮政编码例如这工作得很好 db NY count a res z 141
以不同用户身份运行 MapReduce 作业

我有一个与 Hadoop 交互的 Web 应用程序 Cloudera cdh3u6 特定的用户操作应在集群中启动新的 MapReduce 作业该集群不是一个安全集群但它使用简单的组身份验证因此如果我以自己的身份通过 ssh 连接到它
在 Apache Spark 上下文中，内存数据存储意味着什么？

我读到 Apache Spark 将数据存储在内存中然而 Apache Spark 旨在分析大量数据又称大数据分析在这种情况下内存数据存储的真正含义是什么它可以存储的数据是否受到可用 RAM 的限制它的数据存储与使用HDFS的A
为什么在我的例子中 For 循环比 Map、Reduce 和 List 理解更快

我编写了一个简单的脚本来测试速度这就是我发现的结果实际上 for 循环在我的例子中是最快的这真的让我感到惊讶请查看下面正在计算平方和这是因为它在内存中保存列表还是有意为之谁能解释一下这一点 from functools imp
MapReduce 中的分区到底是如何工作的？

我认为我总体上对 MapReduce 编程模型有一定的了解但即使在阅读了原始论文和其他一些来源之后我仍然不清楚许多细节特别是关于中间结果的分区我将快速总结到目前为止我对 MapReduce 的理解我们有一个可能非常大的输入数据集
Hadoop 超立方体

嘿我正在启动一个基于 hadoop 的超立方体具有灵活的维度数有人知道这方面现有的方法吗我刚刚发现PigOLAP草图 http wiki apache org pig PigOLAPSketch 但没有代码可以使用它另一种方法是Z
如何修复“任务尝试_201104251139_0295_r_000006_0 未能报告状态 600 秒”。

我编写了一个 MapReduce 作业来从数据集中提取一些信息该数据集是用户对电影的评分用户数量约25万电影数量约30万地图的输出是
匿名类上的 NotSerializedException

我有一个用于过滤项目的界面 public interface KeyValFilter extends Serializable public static final long serialVersionUID 7069537470113

随机推荐

SpringCache -- Redis --- 配置与缓存使用--配置过期时间

写在前面学redis 还是得搭配SpringCache来玩一玩前置内容 win安装 redis基础 springboot使用redis 文章目录导入依赖配置cache 使用 Cacheable CachePut CacheEvict
imx6ull: 从内核、buildroot配置实现ffmpeg+nginx+rtmp+USB摄像头

前言根据正点原子的教程在官方提供的出厂根文件系统下很容易就实现了 I MX6U 嵌入式 Linux C 应用编程指南第三十四章的视频监控项目但是想自己从内核根文件系统自己配置来实现整个流程以便于在其他平台下能够迁移而且也算是熟
Visual Studio Code+phpstudy(WampServer、LNMP...)搭建PHP开发环境

VS Code是微软近年推出的一款文本编辑器相关下载 https code visualstudio com Download phpstudy是将Apache Nginx PHP MySQ等等整合在一块的一个软件为搭建软件开发运行环
Python数据可视化的例子——直方图（hist）和核密度曲线（kde）

直方图一般用来观察数据的分布形态横坐标代表数值的均匀分段纵坐标代表每个段内的观测数量频数一般直方图都会与核密度图搭配使用目的是更加清晰地掌握数据的分布特征下面将详细介绍该类型图形的绘制 1 matplotlib模块 matplo
【达内课程】Android自动化测试框架Robotium

文章目录 Robotium中各个类的用途小试牛刀测试1 测试2 测试3 测试4 测试5 Github地址相关jar包下载 Robotium中各个类的用途类用途方法 By 查询条件类类似于UIAutomator的By类只是 R
LeetCode介绍

力扣 LeetCode 是领扣网络旗下专注于程序员技术成长和企业技术人才服务的品牌源自美国硅谷力扣为全球程序员提供了专业的IT 技术职业化提升平台有效帮助程序员实现快速进步和长期成长此外力扣 LeetCode 致力于解决程序员技术
点滴记录——使用Ganglia监控Openstack Swift状态

转载请说明出处 http blog csdn net cywosp article details 42304487 在官方文档中有对StatsD来对Swift状态进行监控的描述 http docs openstack org develo
Android Studio中AndroidManifest.xml文件中application标签

AndroidManifest xml 是每个android程序中必须的文件它位于整个项目的根目录描述了package中暴露的组件 activities services 等等他们各自的实现类各种能被处理的数据和启动位置 Andro
maven 常见命令学习笔记（一）之 -pl -am -amd

假设现有项目结构如下 dailylog parent dailylog common dailylog web 三个文件夹处在同级目录中 dailylog web依赖dailylog common dailylog parent管理dail
react native打包apk时配置gradle阿里云maven仓库加速依赖下载

前言使用react native进行打包apk时因为maven仓库的原因会导致某些依赖和包没有添加成功会导致一些问题所以做法就是将gradle中的仓库地址进行配置而且配置过程中有一些注意事项要注意问题详解进入android目录
xshell和xsftp学生版下载链接

xshell https www netsarang com zh downloading token X3loQWFwNVBtWWRnaFpZazRIQ0RnQUBVWWxIT3c4VHRfTEFOdGs5Z3Y1N093 有效期 Sep
实战教程：如何将自己的Python包发布到PyPI上

1 PyPi的用途 Python中我们经常会用到第三方的包默认情况下用到的第三方工具包基本都是从Pypi org里面下载我们举个栗子如果你希望用Python实现一个金融量化分析工具目前比较好用的金融数据来源是 Yahoo 和 Go
原生js+html+css实现从表单（form）动态加数据到表格（table）

css部分 html部分
TensorFlow制作自己的数据集，并用神经网络来训练自己制作的数据集【上】

参考文章将数据导入TensorFlow 使用tensorflow训练自己的数据集一制作数据集用Tensorflow处理自己的数据制作自己的TFRecords数据集在用tensorflow来进行网络模型的训练时我们总是需要先输入
突破github的100M单个大文件上传限制

偶尔把几本电子书传到github上其中一本关于c 的有147M 在本地磁盘占用了150M空间使用普通的方式最终会被github 服务器拒绝在github 官网上也有了相关说明具体请查看 https help github com
Python量化分析（1）——Tushare的使用介绍

1 Tushare简介 Tushare是国内免费库中最好的财经数据获取接口数据包含股票基金期货债券外汇行业大数据同时包括了数字货币行情等区块链数据的全数据品类的金融大数据最重要的是免费免费免费虽然初始注册账户的积分只
虚拟机Linux：ping不通外网，但是宿主机可以ping的通；ip、网关配置都没什么问题

查看vi etc sysconfig network scripts ifcfg ens33的配置也没有什么问题但是还是ping不通外网所以我将拷贝自己没有问题的虚拟机 etc sysconfig network scripts ifc
数据结构-用单链表实现集合的并运算和交运算

问题描述有A B两个集合分别用两个单链表存放假设集合中无重复的元素要求编写两个独立的函数分别实现集合的并运算和交运算运算结果存放在第3个链表中运算不能改变原来的A B链表假设单链表中的元素值均为正整数建立链表时输入 1时停
解决有时候加载不出img标签图片

在vue前端浏览器加载图片时其他任何地方都能加载出就唯独一个地方显示无法载入此图像完全无法理解解决方法是在在图片显示的界面把meta referrer标签改为never 或者在img标签上加上referrerpolicy no re
TFIDF算法Hadoop实现

程序说明利用MapReduce计算框架计算一组英文文档中各个单词的TFIDF 某单词在某文档的TFIDF 该单词中该文档的TF 该单词IDF 其中 TF i j 单词i在文档j中出现的频率 Term Frequency TF i j N

TFIDF算法Hadoop实现

TFIDF算法Hadoop实现 的相关文章

随机推荐

热门标签

TFIDF算法Hadoop实现的相关文章