Confused about `search_as_you_type` ngram subfields

2023-12-31

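I'm trying to add "search as you type" functionality to an Elasticsearch field named email_address. My understanding from the documentation https://www.elastic.co/guide/en/elasticsearch/reference/7.7/search-as-you-type.html is that if I create a search_as_you_type field, it should automatically create ngram subfields that are optimized for finding partial matches.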

However, it doesn't seem to work the way I expect, and I don't appear to be getting the benefits I expected from this special field type.

First, I created an index:

$ curl -s -H 'Content-Type: application/json' -XPUT http://localhost:9200/mytestindex -d '
{
  "mappings": {
    "properties": {
      "email_address": {"type": "search_as_you_type"}
    }
  }
}
'

When I request the mapping of the newly created email_address field, this is what I see:

$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_mapping/field/email_address | json_pp
{
   "mytestindex" : {
      "mappings" : {
         "email_address" : {
            "full_name" : "email_address",
            "mapping" : {
               "email_address" : {
                  "max_shingle_size" : 3,
                  "type" : "search_as_you_type"
               }
            }
         }
      }
   }
}

Finally, I populated it with some sample data:

$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_doc -d '
{"email_address": "[email protected]"}'

$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_doc -d '
{"email_address": "[email protected]"}'

$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_doc -d '
{"email_address": "[email protected]"}'

$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_doc -d '
{"email_address": "[email protected]"}'

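The official documentation suggests using a multi_match query of type bool_prefix that targets the following fields: email_address, email_address._2gram, and email_address._3gram. I was curious about the subfields, so I tested a search that included only them, but I couldn't get any results back: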

$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_search -d '
{
  "query": {
    "multi_match": {
      "query": "sa",
      "type": "bool_prefix",
      "fields": [
        "email_address._2gram",
        "email_address._3gram"
      ]
    }
  }
}
' | json_pp

{
   "hits" : {
      "hits" : [],
      "max_score" : null,
      "total" : {
         "value" : 0,
         "relation" : "eq"
      }
   },
   "took" : 4,
   "_shards" : {
      "skipped" : 0,
      "successful" : 1,
      "total" : 1,
      "failed" : 0
   },
   "timed_out" : false
}

I've tried partial queries of various lengths (s, sa, sam, etc.), but I never get any results.

When I run the same search but include only the email_address field itself, I get all the results I expect:

$ curl -s -H 'Content-Type: application/json' http://localhost:9200/mytestindex/_search -d '
{
  "query": {
    "multi_match": {
      "query": "sa",
      "type": "bool_prefix",
      "fields": [
        "email_address"
      ]
    }
  }
}
' | json_pp
{
   "timed_out" : false,
   "hits" : {
      "max_score" : 1,
      "total" : {
         "relation" : "eq",
         "value" : 3
      },
      "hits" : [
         {
            "_index" : "mytestindex",
            "_id" : "gEbkCXUBC6_J-EeLAygM",
            "_score" : 1,
            "_type" : "_doc",
            "_source" : {
               "email_address" : "[email protected]"
            }
         },
         {
            "_index" : "mytestindex",
            "_source" : {
               "email_address" : "[email protected]"
            },
            "_score" : 1,
            "_type" : "_doc",
            "_id" : "gUbkCXUBC6_J-EeLWigu"
         },
         {
            "_index" : "mytestindex",
            "_id" : "jUb5CXUBC6_J-EeL1ij1",
            "_type" : "_doc",
            "_score" : 1,
            "_source" : {
               "email_address" : "[email protected]"
            }
         }
      ]
   },
   "took" : 2,
   "_shards" : {
      "failed" : 0,
      "skipped" : 0,
      "successful" : 1,
      "total" : 1
   }
}

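As a result, I don't understand what benefit the _2gram and _3gram subfields provide. Have I set this up incorrectly, or am I confused about what these fields are actually for?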


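The search-as-you-type field type (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html) is a text-like field that is optimized to provide out-of-the-box support for queries that serve an as-you-type completion use case.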

Here is a working example with index data, mapping, search query, and search result.

Index mapping:

{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type"
      }
    }
  }
}

Index data:

{"title": "how shingles are actually used"}

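Analyze API

The default tokenizer in Elasticsearch is the "standard tokenizer", which uses grammar-based tokenization.

The individual tokens generated for the text are: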

{
  "tokens": [
    {
      "token": "how",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "shingles",
      "start_offset": 4,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "are",
      "start_offset": 13,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "actually",
      "start_offset": 17,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "used",
      "start_offset": 26,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}

Generating 3-word shingles:

POST /_analyze

{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "shingle",
      "min_shingle_size": 3,
      "max_shingle_size": 3,
      "output_unigrams":false
    }
  ],
  "text": "how shingles are actually used"
}

The generated tokens are:

{
  "tokens": [
    {
      "token": "how shingles are",
      "start_offset": 0,
      "end_offset": 16,
      "type": "shingle",
      "position": 0
    },
    {
      "token": "shingles are actually",
      "start_offset": 4,
      "end_offset": 25,
      "type": "shingle",
      "position": 1
    },
    {
      "token": "are actually used",
      "start_offset": 13,
      "end_offset": 30,
      "type": "shingle",
      "position": 2
    }
  ]
}
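To make the shingle step concrete, here is a minimal plain-Python sketch of what a word shingle filter of size 3 does to a token stream. The `shingles` helper is a hypothetical name used only for illustration, and whitespace splitting stands in for the standard tokenizer (an assumption that only holds for simple, lowercase, punctuation-free text like this sentence). It is not Elasticsearch's analysis chain.

```python
def shingles(tokens, size):
    """Return every run of `size` consecutive tokens, joined by a space."""
    # When there are fewer than `size` tokens the range is empty,
    # so no shingle is produced at all.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Whitespace splitting is a stand-in for the standard tokenizer here.
tokens = "how shingles are actually used".split()
three_word_shingles = shingles(tokens, 3)

for shingle in three_word_shingles:
    print(shingle)
```

The three printed strings correspond to the `token` values in the _analyze response above; offsets and positions are left out of the sketch.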

Search query:

title._3gram wraps the analyzer of the title field with a shingle token filter of shingle size 3.

{
  "query": {
    "multi_match": {
      "query": "shingles are actually",
      "type": "bool_prefix",
      "fields": [
        "title._3gram"
      ]
    }
  }
}

Search result:

"hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "title": "how shingles are actually used"
        }
      }
    ]

In your case, given "text": "[email protected]", the individual tokens generated are samantha and example.com.

When creating 2-word shingles, the generated token is:

{
  "tokens": [
    {
      "token": "samantha example.com",
      "start_offset": 0,
      "end_offset": 20,
      "type": "shingle",
      "position": 0
    }
  ]
}
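The same arithmetic can be sketched for both shingle subfields at once. The snippet below starts from the two tokens named above and uses the max_shingle_size of 3 reported by the mapping in the question. `subfield_terms` is a hypothetical helper written for this illustration, and the code only models word shingling in plain Python, not what Elasticsearch actually stores or how it rewrites queries.

```python
def subfield_terms(tokens, max_shingle_size=3):
    """Map each shingle subfield suffix to the word shingles of that size."""
    terms = {}
    for size in range(2, max_shingle_size + 1):
        # A shingle of `size` words needs at least `size` tokens.
        terms[f"_{size}gram"] = [
            " ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)
        ]
    return terms


# Tokens the answer lists for one of the sample addresses.
indexed = subfield_terms(["samantha", "example.com"])
# A single partial word, as typed in the failing search.
typed = subfield_terms(["sa"])

print(indexed)
print(typed)
```

With only two tokens there is exactly one 2-word shingle and no 3-word shingle, and a single typed word is too short to form a shingle of either size.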

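So when you search for sa on the shingle subfields, nothing matches, because no token corresponding to it was generated. When you run the multi_match query with the bool_prefix type on the email_address field itself, it matches because of "type": "bool_prefix". To learn more, read about the match_bool_prefix query: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/query-dsl-match-bool-prefix-query.html#query-dsl-match-bool-prefix-query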
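As the match_bool_prefix documentation describes it, the analyzed query terms are turned into a bool query in which every term except the last becomes a term query and the last becomes a prefix query. The sketch below builds that equivalent query body as a Python dict. `bool_prefix_clauses` is a hypothetical helper, and lowercasing plus whitespace splitting is an assumed stand-in for the field's real analyzer.

```python
def bool_prefix_clauses(field, query_text):
    """Build a bool query shaped like the one match_bool_prefix is documented to produce."""
    # Stand-in for analysis: lowercase and split on whitespace (an assumption).
    terms = query_text.lower().split()
    if not terms:
        return {"bool": {"should": []}}
    # Every term except the last is matched exactly ...
    clauses = [{"term": {field: term}} for term in terms[:-1]]
    # ... and the last term is matched as a prefix.
    clauses.append({"prefix": {field: terms[-1]}})
    return {"bool": {"should": clauses}}


single = bool_prefix_clauses("email_address", "sa")
multi = bool_prefix_clauses("email_address", "quick brown f")
print(single)
print(multi)
```

For the one-word input sa, the whole query reduces to a single prefix clause on email_address, which is why an indexed term such as samantha can match it.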

If you want to query for sa and get all the results, you can use the completion suggester: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html. You could even go with the UAX URL email tokenizer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-uaxurlemail-tokenizer.html
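For the completion suggester route, a rough sketch of the two request bodies might look like the following, built as Python dicts so they can be serialized and sent with any HTTP client. The field name `email_suggest` and the suggestion name `email-suggest` are hypothetical and do not appear in the question's mapping. The address would also have to be indexed into that separate completion field, which is not shown here.

```python
import json

# Hypothetical field name, chosen only for this illustration.
SUGGEST_FIELD = "email_suggest"

# Mapping body: a dedicated field of type "completion".
mapping_body = {
    "mappings": {
        "properties": {
            SUGGEST_FIELD: {"type": "completion"}
        }
    }
}


def suggest_body(prefix, name="email-suggest"):
    """Build a _search body that asks the completion suggester for `prefix`."""
    return {
        "suggest": {
            name: {
                "prefix": prefix,
                "completion": {"field": SUGGEST_FIELD},
            }
        }
    }


print(json.dumps(mapping_body, indent=2))
print(json.dumps(suggest_body("sa"), indent=2))
```

The first body would go in the index creation request and the second in a _search request against the same index.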
