Twitter 数据 - 查找 MongoDB 中被提及最多的用户

2024-04-22

假设我有来自 Twitter API 的流数据,并且将数据作为文档存储在 MongoDB 中。我想要找到的是计数screen_name under entities.user_mentions.

{
    "_id" : ObjectId("50657d5844956d06fb5b36c7"),
    "contributors" : null,
    "text" : "",
    "entities" : {
        "urls" : [ ],
        "hashtags" : [
            {
                "text" : "",
                "indices" : [
                    26,
                    30
                ]
            },
            {
                "text" : "",
                "indices" : []
            }
        ],
        "user_mentions" : [ 
                {
                    "name":"Twitter API", 
                    "indices":[4,15], 
                    "screen_name":"twitterapi", 
                    "id":6253282, "id_str":"6253282"
                }]
    },
    ...

我尝试使用地图减少:

map = function() {
    if (!this.entities.user_mentions.screen_name) {
        return;
    }

    for (index in this.entities.user_mentions.screen_name) {
        emit(this.entities.user_mentions.screen_name[index], 1);
    }
}

reduce = function(previous, current) {
    var count = 0;

    for (index in current) {
        count += current[index];
    }

    return count;
}

result = db.runCommand({
    "mapreduce" : "twitter_sample",
    "map" : map,
    "reduce" : reduce,
    "out" : "user_mentions"
});

但它不太有效...


Since entities.user_mentions是一个数组,您希望为其中的每个 screen_name 发出一个值map():

var map = function() {
    this.entities.user_mentions.forEach(function(mention) {
        emit(mention.screen_name, { count: 1 });
    })
};

然后通过唯一的 screen_name 来计算值reduce():

var reduce = function(key, values) {
    // NB: reduce() uses same format as results emitted by map()
    var result = { count: 0 };

    values.forEach(function(value) {
        result.count += value.count;
    });

    return result;
};

注意:要调试你的map/reduce JavaScript函数,你可以使用print() and printjson()命令。输出将出现在您的mongod log.

编辑:为了比较,这里是一个使用新的示例聚合框架 http://docs.mongodb.org/manual/reference/aggregation/在 MongoDB 2.2 中:

db.twitter_sample.aggregate(
    // Project to limit the document fields included
    { $project: {
        _id: 0,
        "entities.user_mentions" : 1
    }},

    // Split user_mentions array into a stream of documents
    { $unwind: "$entities.user_mentions" },

    // Group and count the unique mentions by screen_name
    { $group : {
        _id: "$entities.user_mentions.screen_name",
        count: { $sum : 1 }
    }},

    // Optional: sort by count, descending
    { $sort : {
        "count" : -1
    }}
)

最初的 Map/Reduce 方法最适合大型数据集,正如 Twitter 数据所暗示的那样。有关 Map/Reduce 与聚合框架限制的比较,请参阅 StackOverflow 问题的相关讨论MongoDB group()、$group 和 MapReduce https://stackoverflow.com/questions/12337319/mongodb-group-group-and-mapreduce/12340283#12340283.

本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

Twitter 数据 - 查找 MongoDB 中被提及最多的用户 的相关文章

随机推荐