Read this https://www.compose.com/articles/first-steps-of-an-analytics-platform-with-mongodb/,并尝试进行实验。
下面是我的架构。
SCHEMA:
{
"s" : "CB",
"c" : "REQ_RCV",
"e" : "sms_click",
"st" : "i",
"b" : "2",
"a" : "1",
"u" : "b1_h1_d1_m1_user_2",
"c#" : "b1_h1_d1_m1_cr-2",
"@" : ISODate("2016-10-01T06:03:00.000Z"), //Indexed
"@h" : "16100106", //hourly bucket
"@d" : "161001", //Indexed
"@m" : "1610"
}
以下是解释计划:
> 2017-01-22T13:43:47.764+0530 I COMMAND [conn34] command test-analytics.template3 appName: "MongoDB Shell" command: aggregate {
> aggregate: "template3", pipeline: [ { $match: { @: { $gte: new
> Date(1483228800000), $lte: new Date(1483315199000) } } }, { $group: {
> _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1.0 } } }, { $project: { _id: 0.0, BUCKET: "$_id.b", SERVICE: "$_id.s", TIME:
> "$_id.HOURLY", count: 1.0 } }, { $sort: { SERVICE: 1.0, BUCKET: 1.0,
> TIME: 1.0 } } ], cursor: {} } planSummary: IXSCAN { @: 1.0 }
> keysExamined:106888 docsExamined:106888 hasSortStage:1
> cursorExhausted:1 numYields:925 nreturned:96 reslen:7095 locks:{
> Global: { acquireCount: { r: 1860 } }, Database: { acquireCount: { r:
> 930 } }, Collection: { acquireCount: { r: 929 } } }
> protocol:op_command **3499ms**
> 2017-01-22T13:44:24.825+0530 I COMMAND [conn34] command test-analytics.template3 appName: "MongoDB Shell" command: aggregate {
> aggregate: "template3", pipeline: [ { $match: { @d: "170101" } }, {
> $group: { _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1.0
> } } }, { $project: { _id: 0.0, BUCKET: "$_id.b", SERVICE: "$_id.s",
> TIME: "$_id.HOURLY", count: 1.0 } }, { $sort: { SERVICE: 1.0, BUCKET:
> 1.0, TIME: 1.0 } } ], cursor: {} } planSummary: IXSCAN { @d: 1.0 } keysExamined:106888 docsExamined:106888 hasSortStage:1
> cursorExhausted:1 numYields:865 nreturned:96 reslen:7095 locks:{
> Global: { acquireCount: { r: 1740 } }, Database: { acquireCount: { r:
> 870 } }, Collection: { acquireCount: { r: 869 } } }
> protocol:op_command **1294ms**
问题:
- 尽管这两个查询都检查了相同数量的文档
为什么输出有时间差?
- Are
$lte
, $gte
运算符仅在日期范围上变慢,甚至在数字比较上也变慢?
- 由于分桶可以提供更快的响应,那么如何使用分桶进行范围查询?我可以进行多个聚合时间桶调用来支持范围查询,但这会增加往返时间,有什么建议吗?
- 是否可以使用
$and
两个人$match
在聚合查询中支持范围存储?目前聚合接受多个$match
但第一个输出$match
被赋予第二个$match
,但我想要的是添加/分组个人$match
结果到下一个管道。
第四季度的可能答案:
db.template3.aggregate([
{
$match: {
$or: [
{"@d":"170301"},
{"@d":"170306"},
{"@d":"170202"},
{"@d":"170303"},
{"@d":"170304"},
{"@d":"170305"}
]
}
},
{ $project: { _id: 0, "b": 1, "s": 1, "@h": 1 } },
{
$group: {
_id: {"b": "$b", "HOURLY": "$@h", "s": "$s" },
count: { $sum: 1 }
}
},
{
$project: {
_id: 0,
"BUCKET": "$_id.b",
"SERVICE": "$_id.s",
"TIME": "$_id.HOURLY",
count: 1
}
},
{ $sort: { "SERVICE": 1, "BUCKET": 1, "TIME": 1 } }
]);
在这个答案中,我们可以使用桶的混合(每日和每月),但这仍然会使用它自己的索引。读https://docs.mongodb.com/manual/reference/operator/query/or/#or-clauses-and-indexes https://docs.mongodb.com/manual/reference/operator/query/or/#or-clauses-and-indexes.
示例查询:
db.template3.aggregate([
{$match:{"@h":{$gte : 17020511, $lte : 17030511}, "st":"i"}},
{$project : {"_id":0, "@h":1,"c":1, "@m":1}},
{$group:{_id:{ "HOURLY":"$@h", "c":"$c"}, count:{$sum:1}}},
{$project : {_id:0, "COUNTER":"$_id.c","TIME":"$_id.HOURLY", count:1}},
{$sort:{"COUNTER":1,"TIME":1}}
]);
Output:
{ "count" : 2255, "COUNTER" : "REQ_RCVD", "TIME" : 17020511 }
{ "count" : 28888, "COUNTER" : "REQ_RCVD", "TIME" : 17020600 }
{ "count" : 37613, "COUNTER" : "REQ_RCVD", "TIME" : 17020601 }
{ "count" : 6723, "COUNTER" : "REQ_RCVD", "TIME" : 17020602 }
{ "count" : 14057, "COUNTER" : "REQ_RCVD", "TIME" : 17020603 }
{ "count" : 12405, "COUNTER" : "REQ_RCVD", "TIME" : 17020604 }
{ "count" : 2392, "COUNTER" : "REQ_RCVD", "TIME" : 17020611 }
{ "count" : 28784, "COUNTER" : "REQ_RCVD", "TIME" : 17020700 }
{ "count" : 37494, "COUNTER" : "REQ_RCVD", "TIME" : 17020701 }
{ "count" : 6697, "COUNTER" : "REQ_RCVD", "TIME" : 17020702 }
{ "count" : 13930, "COUNTER" : "REQ_RCVD", "TIME" : 17020703 }
{ "count" : 12493, "COUNTER" : "REQ_RCVD", "TIME" : 17020704 }
{ "count" : 2225, "COUNTER" : "REQ_RCVD", "TIME" : 17020711 }
{ "count" : 28821, "COUNTER" : "REQ_RCVD", "TIME" : 17020800 }
{ "count" : 37949, "COUNTER" : "REQ_RCVD", "TIME" : 17020801 }
{ "count" : 6676, "COUNTER" : "REQ_RCVD", "TIME" : 17020802 }
{ "count" : 14039, "COUNTER" : "REQ_RCVD", "TIME" : 17020803 }
{ "count" : 12349, "COUNTER" : "REQ_RCVD", "TIME" : 17020804 }
{ "count" : 2332, "COUNTER" : "REQ_RCVD", "TIME" : 17020811 }
{ "count" : 28379, "COUNTER" : "REQ_RCVD", "TIME" : 17020900 }
优化
我觉得由于读取非索引字段而花费了更多时间。
因此docsExamined: 106888
当我使用时有一定的改进$project
before $group
我变了"@h"
数据类型来自String
to Integer
(NumberInt
),我认为它会进一步改进。