Map POI Search, Part 3 (Tuning ES Relevance Score Parameters)

2023-11-16

1. Problem Recap

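In Part 1 we introduced the basic workflow of map point-of-interest (POI) retrieval and built a simple demo with Elasticsearch and the IK analyzer. When we ran the demo and searched for "通州区万达广场" (Wanda Plaza, Tongzhou District), the top result turned out to be "建国路万达广场" (Wanda Plaza on Jianguo Road), which is located in Chaoyang District. In Part 2 we explored how ES computes relevance scores and got an overall picture of its scoring strategy. In this part we use the interfaces ES provides to adjust the scoring rules so that the search results match our expectations.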

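First, let's use the ES explain parameter to output the scoring details and analyze why the second-ranked address, which is clearly the more sensible match, scores lower.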

GET http://localhost:9200/idx_default/_search?explain=true

{
  "query": {
    "match": {
      "address": {
        "query": "通州区万达广场"
      }
    }
  }
}

The results are as follows (only the top two hits are shown):

{
                "_shard": "[idx_default][0]",
                "_node": "Crj7_cZOQT6w9sG0ryBbzQ",
                "_index": "idx_default",
                "_type": "_doc",
                "_id": "138069",
                "_score": 17.299044,
                "_source": {
                    "address": "建国路万达广场",
                    "name": "恒大山水城",
                    "location": "39.90867476611688,116.46468505121267"
                },
                "_explanation": {
                    "value": 17.299044,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 10.175069,
                            "description": "weight(address:万达 in 138410) [PerFieldSimilarity], result of:",
                            "details": [
                                {
                                    "value": 10.175069,
                                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                                    "details": [
                                        {
                                            "value": 2.2,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 7.7361317,
                                            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details": [
                                                {
                                                    "value": 89,
                                                    "description": "n, number of documents containing term",
                                                    "details": []
                                                },
                                                {
                                                    "value": 204918,
                                                    "description": "N, total number of documents with field",
                                                    "details": []
                                                }
                                            ]
                                        },
                                        {
                                            "value": 0.59784806,
                                            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details": [
                                                {
                                                    "value": 1.0,
                                                    "description": "freq, occurrences of term within document",
                                                    "details": []
                                                },
                                                {
                                                    "value": 1.2,
                                                    "description": "k1, term saturation parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 0.75,
                                                    "description": "b, length normalization parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 3.0,
                                                    "description": "dl, length of field",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.245098,
                                                    "description": "avgdl, average length of field",
                                                    "details": []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        },
                        {
                            "value": 7.1239743,
                            "description": "weight(address:广场 in 138410) [PerFieldSimilarity], result of:",
                            "details": [
                                {
                                    "value": 7.1239743,
                                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                                    "details": [
                                        {
                                            "value": 2.2,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 5.416376,
                                            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details": [
                                                {
                                                    "value": 910,
                                                    "description": "n, number of documents containing term",
                                                    "details": []
                                                },
                                                {
                                                    "value": 204918,
                                                    "description": "N, total number of documents with field",
                                                    "details": []
                                                }
                                            ]
                                        },
                                        {
                                            "value": 0.59784806,
                                            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details": [
                                                {
                                                    "value": 1.0,
                                                    "description": "freq, occurrences of term within document",
                                                    "details": []
                                                },
                                                {
                                                    "value": 1.2,
                                                    "description": "k1, term saturation parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 0.75,
                                                    "description": "b, length normalization parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 3.0,
                                                    "description": "dl, length of field",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.245098,
                                                    "description": "avgdl, average length of field",
                                                    "details": []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            },
{
                "_shard": "[idx_default][0]",
                "_node": "Crj7_cZOQT6w9sG0ryBbzQ",
                "_index": "idx_default",
                "_type": "_doc",
                "_id": "28730",
                "_score": 16.216942,
                "_source": {
                    "address": "北京市通州区新华西街58号万达广场F2",
                    "name": "手寓工坊(万达广场店)",
                    "location": "39.904175142894765,116.63712318703388"
                },
                "_explanation": {
                    "value": 16.216942,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 2.879858,
                            "description": "weight(address:通州区 in 28165) [PerFieldSimilarity], result of:",
                            "details": [
                                {
                                    "value": 2.879858,
                                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                                    "details": [
                                        {
                                            "value": 2.2,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 2.8400025,
                                            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details": [
                                                {
                                                    "value": 11972,
                                                    "description": "n, number of documents containing term",
                                                    "details": []
                                                },
                                                {
                                                    "value": 204918,
                                                    "description": "N, total number of documents with field",
                                                    "details": []
                                                }
                                            ]
                                        },
                                        {
                                            "value": 0.46092433,
                                            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details": [
                                                {
                                                    "value": 1.0,
                                                    "description": "freq, occurrences of term within document",
                                                    "details": []
                                                },
                                                {
                                                    "value": 1.2,
                                                    "description": "k1, term saturation parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 0.75,
                                                    "description": "b, length normalization parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.0,
                                                    "description": "dl, length of field",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.245098,
                                                    "description": "avgdl, average length of field",
                                                    "details": []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        },
                        {
                            "value": 7.844697,
                            "description": "weight(address:万达 in 28165) [PerFieldSimilarity], result of:",
                            "details": [
                                {
                                    "value": 7.844697,
                                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                                    "details": [
                                        {
                                            "value": 2.2,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 7.7361317,
                                            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details": [
                                                {
                                                    "value": 89,
                                                    "description": "n, number of documents containing term",
                                                    "details": []
                                                },
                                                {
                                                    "value": 204918,
                                                    "description": "N, total number of documents with field",
                                                    "details": []
                                                }
                                            ]
                                        },
                                        {
                                            "value": 0.46092433,
                                            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details": [
                                                {
                                                    "value": 1.0,
                                                    "description": "freq, occurrences of term within document",
                                                    "details": []
                                                },
                                                {
                                                    "value": 1.2,
                                                    "description": "k1, term saturation parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 0.75,
                                                    "description": "b, length normalization parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.0,
                                                    "description": "dl, length of field",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.245098,
                                                    "description": "avgdl, average length of field",
                                                    "details": []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        },
                        {
                            "value": 5.4923873,
                            "description": "weight(address:广场 in 28165) [PerFieldSimilarity], result of:",
                            "details": [
                                {
                                    "value": 5.4923873,
                                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                                    "details": [
                                        {
                                            "value": 2.2,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 5.416376,
                                            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                            "details": [
                                                {
                                                    "value": 910,
                                                    "description": "n, number of documents containing term",
                                                    "details": []
                                                },
                                                {
                                                    "value": 204918,
                                                    "description": "N, total number of documents with field",
                                                    "details": []
                                                }
                                            ]
                                        },
                                        {
                                            "value": 0.46092433,
                                            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                            "details": [
                                                {
                                                    "value": 1.0,
                                                    "description": "freq, occurrences of term within document",
                                                    "details": []
                                                },
                                                {
                                                    "value": 1.2,
                                                    "description": "k1, term saturation parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 0.75,
                                                    "description": "b, length normalization parameter",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.0,
                                                    "description": "dl, length of field",
                                                    "details": []
                                                },
                                                {
                                                    "value": 7.245098,
                                                    "description": "avgdl, average length of field",
                                                    "details": []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            }

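The results show that **"建国路万达广场"** (hereafter the "Jianguo Road address") scores 17.299044, while **"北京市通州区新华西街58号万达广场F2"** (hereafter the "Tongzhou address") scores only 16.216942. As noted in the previous part, the Jianguo Road address is in Chaoyang District, which is clearly far from what the query asks for, whereas the Tongzhou address is the expected match. The reason for this ranking can be found inside _explanation; below we analyze it using the scoring model covered in the previous part.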

2. Root Cause Analysis

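Let's first look at the structure of _explanation. It is a JSON object with three properties, "value", "description" and "details", which hold the score, the formula used, and the values of the variables in that formula. details is an array whose elements are JSON objects of the same shape. Conceptually these objects form four levels: level 1 is the overall score; level 2 is the score of each term (in the output above, the weight(...) node and the score(...) node nested inside it carry the same value, so they are counted as one level here); level 3 is a component of the term score, such as the idf of a term; level 4 holds the variables of that component, such as the value of N in a term's idf formula. The overall score is computed as: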
$$\text{total score} = \sum_{i=1}^{n} \text{score of term } i$$
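The summation can be checked directly against the explain output. The following is a minimal sketch, assuming the `_explanation` object of the first hit has already been parsed into a Python dict; the dict below is a trimmed copy that keeps only values shown in the output above, and the helper name `term_scores` is hypothetical.

```python
import re

# Trimmed copy of the first hit's _explanation (values copied from the output above).
explanation = {
    "value": 17.299044,
    "description": "sum of:",
    "details": [
        {"value": 10.175069,
         "description": "weight(address:万达 in 138410) [PerFieldSimilarity], result of:",
         "details": []},
        {"value": 7.1239743,
         "description": "weight(address:广场 in 138410) [PerFieldSimilarity], result of:",
         "details": []},
    ],
}

def term_scores(expl):
    """Map each term to its score, reading the weight(field:term in doc) children."""
    scores = {}
    for child in expl["details"]:
        match = re.match(r"weight\((\w+):(.+?) in \d+\)", child["description"])
        if match:  # children that are not per-term weights are skipped
            scores[match.group(2)] = child["value"]
    return scores

scores = term_scores(explanation)
total = sum(scores.values())
print(scores)
# The per-term scores add up to the top-level value, up to float rounding.
print(abs(total - explanation["value"]) < 1e-5)
```

The sketch only handles the "sum of:" shape seen here, where several query terms match the document.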
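Now let's look at a single term, taking "万达" in "建国路万达广场" as the example. We locate the JSON object for "万达"; the description of its details entry is "score(freq=1.0), computed as boost * idf * tf from:", which requires three values: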

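**boost** is a weight applied to the query. A boost can be assigned to a specific field, either through the mapping when the index is created or, as newer Elasticsearch versions recommend, in the query itself, so that different fields carry different weights in a multi-field query.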

**idf** is the inverse document frequency, described as "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:". See the previous part for its meaning.

**tf** is the term frequency, described as "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:". See the previous part for its meaning.
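To make the three factors concrete, the score of "万达" in the Jianguo Road address can be recomputed by hand. This is a small sketch that transcribes the two formulas quoted in the descriptions and plugs in the variable values from the explain output; the function names `bm25_idf` and `bm25_tf` are made up for this illustration.

```python
import math

def bm25_idf(n, N):
    """idf as described in the output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, k1, b, dl, avgdl):
    """tf as described in the output: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

# Variable values copied from the "万达" node of the Jianguo Road address.
boost = 2.2
idf = bm25_idf(n=89, N=204918)
tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=3.0, avgdl=7.245098)
score = boost * idf * tf

print(f"idf={idf:.4f} tf={tf:.4f} score={score:.3f}")
```

The printed values should agree with the 7.7361317, 0.59784806 and 10.175069 reported by ES, apart from small differences caused by single-precision arithmetic on the ES side.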

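Now that we understand how ES computes scores and what the output means, let's analyze why the Jianguo Road address scores higher than the Tongzhou address. The JSON output is converted into the term score table below, where each row gives the score of the term on the left in each of the two addresses. The term "通州区" appears only in the Tongzhou address, yet even with one extra matching term the Tongzhou address still scores lower.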

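| Term | Jianguo Road address | Tongzhou address |
| --- | --- | --- |
| 通州区 | (not present) | 2.879858 |
| 万达 | 10.175069 | 7.844697 |
| 广场 | 7.1239743 | 5.4923873 |
| Total | 17.299044 | 16.216942 |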

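A closer look at the scores of "万达" and "广场" in the two addresses reveals the specific cause. The table below lists each component of the score for these two terms. The Tongzhou address clearly has a lower tf for both terms, while the other components are identical.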

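| Term | Component | Jianguo Road address | Tongzhou address |
| --- | --- | --- | --- |
| 万达 | boost | 2.2 | 2.2 |
| 万达 | idf | 7.7361317 | 7.7361317 |
| 万达 | tf | 0.59784806 | 0.46092433 |
| 广场 | boost | 2.2 | 2.2 |
| 广场 | idf | 5.416376 | 5.416376 |
| 广场 | tf | 0.59784806 | 0.46092433 |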

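Looking further into how tf is computed, the only difference is dl, whose description is "dl, length of field", i.e. the length of the address field, counted in terms rather than characters.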

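| tf variable | Jianguo Road address | Tongzhou address |
| --- | --- | --- |
| freq | 1.0 | 1.0 |
| k1 | 1.2 | 1.2 |
| b | 0.75 | 0.75 |
| dl | 3.0 | 7.0 |
| avgdl | 7.245098 | 7.245098 |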

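Recall the tf formula introduced in the previous part. It differs slightly from the one ES uses by default: the ES version omits the k+1 factor in the numerator. Since that factor is the same constant for every term and every document, the ranking is unaffected (in the explain output above it appears to be folded into boost, as 2.2 = 1 × (1.2 + 1)).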
$$TF = \frac{(k+1)\cdot f_i}{k\cdot\left(1-b+b\cdot\frac{dl}{avg(dl)}\right)+f_i}$$
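Here dl is the length of the current document and avg(dl) is the average document length in the corpus. avg(dl) is the same for every document, and the larger dl is, the lower tf becomes. So the cause we arrive at is that the Tongzhou address, "北京市通州区新华西街58号万达广场F2", is too long. It matches more terms (the extra "通州区"), but dl lowers the score of every term. Next, let's see which parameter can be tuned to reduce the influence of dl.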
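The effect of dl on tf is easy to see numerically. Below is a minimal sketch using the ES form of the formula (without the k+1 factor) and the k1, b and avgdl values from the explain output; the helper name `bm25_tf` and the extra dl values in the loop are only illustrative.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """ES-style tf: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

AVGDL = 7.245098  # average field length reported in the explain output

tf_short = bm25_tf(1.0, dl=3.0, avgdl=AVGDL)  # Jianguo Road address, dl = 3
tf_long = bm25_tf(1.0, dl=7.0, avgdl=AVGDL)   # Tongzhou address, dl = 7

# tf shrinks steadily as the field gets longer, even though freq stays at 1.
for dl in (3, 5, 7, 10, 15):
    print(dl, round(bm25_tf(1.0, dl, AVGDL), 4))
```

For dl = 3 and dl = 7 this reproduces the two tf values in the explain output, roughly 0.598 and 0.461.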

3. Tuning the Parameter

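At the end of the previous part we introduced the parameter b in the tf formula: it is the factor BM25 gives us to control how much document length matters. When b=0 the denominator becomes k+f_i, which removes the influence of document length entirely; the higher b is, the more length affects the TF score. Here we clearly want to reduce, or even eliminate, the influence of length: the addresses in the address database do not differ much in length, so we want them to compete on equal terms, with whichever matches more terms scoring higher.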
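Before touching the index, the expected effect of b=0 can be estimated offline. The sketch below rescores the two addresses with b=0.75 and b=0, under the assumption that idf, avgdl and boost stay exactly as in the explain output; after a real re-index these statistics may change, so the absolute numbers are only indicative. The names `IDF`, `DOCS` and `doc_score` are hypothetical.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """ES-style tf: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

BOOST = 2.2
AVGDL = 7.245098
# idf per term, copied from the explain output.
IDF = {"通州区": 2.8400025, "万达": 7.7361317, "广场": 5.416376}
# Matched terms and field length for each address, also from the explain output.
DOCS = {
    "jianguo": {"dl": 3.0, "terms": ["万达", "广场"]},
    "tongzhou": {"dl": 7.0, "terms": ["通州区", "万达", "广场"]},
}

def doc_score(doc, b):
    """Sum boost * idf * tf over the matched terms (freq is 1 for every term here)."""
    tf = bm25_tf(1.0, doc["dl"], AVGDL, b=b)
    return sum(BOOST * IDF[term] * tf for term in doc["terms"])

for b in (0.75, 0.0):
    print(b, {name: round(doc_score(doc, b), 3) for name, doc in DOCS.items()})
```

With b=0.75 the Jianguo Road address wins, as observed. With b=0 the tf factor is identical for both addresses, so the score is driven by the summed idf of the matched terms and the Tongzhou address comes out ahead (about 15.99 versus 13.15 under these assumptions).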

ES provides a very convenient interface for this: we only need to define the value of b inside settings when creating the index. The request is:

PUT http://localhost:9200/idx_default

{
    "settings": {
        "index": {
            "similarity": {
                "BM25_b_0": {
                    "type": "BM25",
                    "b": "0.0"
                }
            }
        }
    },
    "mappings": {
        "poipo": {
            "properties": {
                "location": {
                    "type": "geo_point"
                },
                "address": {
                    "type": "text",
                    "similarity": "BM25_b_0"
                }
            }
        }
    }
}

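BM25_b_0 is the similarity model we define: type specifies that it is a BM25 model, and b says that we override this variable and set it to 0. In mappings below it, we then set the similarity of the address field to the new model. (poipo is a mapping type name; Elasticsearch 7.x accepts a typed mapping in this request only with include_type_name=true, and 8.x removes mapping types.) This completes the new index; after re-importing the data we run the query again. The results are: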

{
                "_index": "idx_default",
                "_type": "poipo",
                "_id": "56963",
                "_score": 17.46982,
                "_source": {
                    "address": "北京市通州区新华街道建国路93号院万达广场11号楼",
                    "location": {
                        "lon": 116.6574382584145,
                        "lat": 39.92313729883979
                    }
                }
            },
            {
                "_index": "idx_default",
                "_type": "poipo",
                "_id": "87454",
                "_score": 16.99757,
                "_source": {
                    "address": "北京市通州区北苑街道手寓工坊(万达广场店)",
                    "location": {
                        "lon": 116.64295933891906,
                        "lat": 39.905244856754514
                    }
                }
            }
...

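Only the top two results are listed here. Both are Wanda Plaza locations in Tongzhou District, which shows that the parameter change has taken effect.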
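To experiment with several values of b, the index-creation request from this section can be generated programmatically. The following is a rough sketch using only the Python standard library: `build_index_body` mirrors the request body shown earlier (including the poipo type name), while `create_index` and the similarity naming scheme are hypothetical helpers, and the URL is the local one used throughout this article.

```python
import json
import urllib.request

def build_index_body(b, similarity_name="BM25_b_0"):
    """Build the create-index body with a custom BM25 similarity on the address field."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    similarity_name: {"type": "BM25", "b": str(b)}
                }
            }
        },
        "mappings": {
            "poipo": {
                "properties": {
                    "location": {"type": "geo_point"},
                    "address": {"type": "text", "similarity": similarity_name},
                }
            }
        },
    }

def create_index(url, body):
    """Send the PUT request; requires a reachable ES node, so it is not called here."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

body = build_index_body(0.0)
print(json.dumps(body, indent=2))
# Example call against a local node (assumed URL):
# create_index("http://localhost:9200/idx_default", body)
```

Because the similarity is attached to the field at index creation, each value of b tried this way needs a fresh index and a re-import of the data, as done above.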

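In this part we used an example to show how to inspect ES query results and their scoring details, and found the cause of the wrong ranking by analyzing how the scores are computed. Finally, we modified the model through the parameter interface ES provides. This tuning example removes the length factor rather bluntly; in later parts we will start from term priority and explore finer-grained tuning strategies.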
