How to get per-term statistics in Elasticsearch

Posted by Sebastian Lore in Dev

Sebastian Lore

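I need to implement the following feature in the backend: the user types a query and gets back the hits along with statistics for those hits. Below is a simplified example.

Suppose the query is Grif. The user then gets back (random words, for example):

Griffith
Griffin
Grif
Grift
Griffins

plus the frequency and the number of documents each term occurs in, for example:

Griffith (frequency 10, 3 documents)
Griffin (frequency 17, 9 documents)
Grif (frequency 6, 3 documents)
Grift (frequency 9, 5 documents)
Griffins (frequency 11, 4 documents)

I'm new to Elasticsearch, so I'm not sure where to start implementing something like this. Which type of query is best suited for it? What can I use to get this kind of statistics? Any other advice would be appreciated too.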

Joe Sorocin

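There are multiple layers to this. You'd need:

n-gram / partial / search-as-you-type matching
a way to group the matched keywords by their original form
a mechanism to reverse-look up the document and term frequencies.

I'm not aware of any way to achieve this in one go, but here's my take on it.

You could start off with a special, n-gram-powered analyzer, as explained in my other answer. There's the original content field, plus a multi-field mapping that uses said analyzer, plus a keyword field to aggregate on: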

PUT my-index
{
  "settings": {
    "index": {
      "max_ngram_diff": 20
    },
    "analysis": {
      "tokenizer": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "my_ngrams_analyzer": {
          "tokenizer": "my_ngrams",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "analyzed": {
            "type": "text",
            "analyzer": "my_ngrams_analyzer"
          },
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
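
To make the matching step more tangible, here is a rough Python sketch of what an n-gram analyzer configured like the one above would emit for a piece of text. It is only an approximation for illustration: the function name ngrams is made up here, and str.isalnum() stands in for the letter and digit token_chars.

```python
def ngrams(text, min_gram=3, max_gram=20):
    """Approximate the lowercased character n-grams produced for `text`.

    Hypothetical helper: it mimics the idea behind `my_ngrams_analyzer`,
    not Elasticsearch's exact implementation.
    """
    grams = []
    token = []
    # The trailing space acts as a sentinel that flushes the last token.
    for ch in text + " ":
        if ch.isalnum():
            token.append(ch.lower())
            continue
        word = "".join(token)
        token = []
        # Emit every substring whose length lies in [min_gram, max_gram].
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    grams.append(word[start:start + size])
    return grams


print("grif" in ngrams("Griffith"))  # True
print("grif" in ngrams("Great"))     # False
```

Because the index stores lowercased grams, a term query against content.analyzed has to be lowercased as well, which is why the search further down looks for grif rather than Grif.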

Next, bulk-insert some sample docs containing text in the content field. Note that each doc has an _id too; you'll need those later on.

POST _bulk
{"index":{"_index":"my-index", "_id":1}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":2}}
{"content":"Griffin"}
{"index":{"_index":"my-index", "_id":3}}
{"content":"Grif"}
{"index":{"_index":"my-index", "_id":4}}
{"content":"Grift"}
{"index":{"_index":"my-index", "_id":5}}
{"content":"Griffins"}
{"index":{"_index":"my-index", "_id":6}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":7}}
{"content":"Griffins"}
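
If the sample documents come from application code rather than the console, the same bulk body can be assembled programmatically. The following is a minimal sketch using only the standard library; build_bulk_body is a hypothetical helper name, and sending the payload (for instance with an HTTP client and the application/x-ndjson content type) is left out.

```python
import json


def build_bulk_body(index, contents):
    """Build a newline-delimited _bulk payload: one action line followed by
    one source line per document, with sequential ids starting at 1."""
    lines = []
    for doc_id, content in enumerate(contents, start=1):
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"content": content}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


words = ["Griffith", "Griffin", "Grif", "Grift",
         "Griffins", "Griffith", "Griffins"]
print(build_bulk_body("my-index", words), end="")
```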

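Search for n-grams in the .analyzed field and group the matched documents by their original terms via a terms aggregation. At the same time, retrieve the _id of one of the bucketed documents through a top_hits aggregation. By the way, it doesn't matter which _id is returned in a given bucket: every document in a bucket contains the same term.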

POST my-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.doc_count,aggregations.*.buckets.*.hits.hits._id
{
  "size": 0, 
  "query": {
    "term": {
      "content.analyzed": "grif"
    }
  },
  "aggs": {
    "full_terms": {
      "terms": {
        "field": "content.keyword",
        "size": 10
      },
      "aggs": {
        "top_doc": {
          "top_hits": {
            "size": 1,
            "_source": false
          }
        }
      }
    }
  }
}

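Observe the response. The filter_path URL parameter from the previous request reduces the response to just the attributes we need: the untouched, original full_terms plus one of the underlying IDs: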

{
  "aggregations" : {
    "full_terms" : {
      "buckets" : [
        {
          "key" : "Griffins",
          "doc_count" : 2,
          "top_doc" : {
            "hits" : {
              "hits" : [
                {
                  "_id" : "5"
                }
              ]
            }
          }
        },
        {
          "key" : "Griffith",
          "doc_count" : 2,
          "top_doc" : {
            "hits" : {
              "hits" : [
                {
                  "_id" : "1"
                }
              ]
            }
          }
        },
        {
          "key" : "Grif",
          "doc_count" : 1,
          "top_doc" : {
            "hits" : {
              "hits" : [
                {
                  "_id" : "3"
                }
              ]
            }
          }
        },
        {
          "key" : "Griffin",
          "doc_count" : 1,
          "top_doc" : {
            "hits" : {
              "hits" : [
                {
                  "_id" : "2"
                }
              ]
            }
          }
        },
        {
          "key" : "Grift",
          "doc_count" : 1,
          "top_doc" : {
            "hits" : {
              "hits" : [
                {
                  "_id" : "4"
                }
              ]
            }
          }
        }
      ]
    }
  }
}
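
On the backend, the IDs can be pulled out of that response with a few lines of code. This is a small sketch assuming the response has already been parsed into a Python dict with exactly the shape shown above; bucket_ids is a hypothetical helper name.

```python
def bucket_ids(search_response):
    """Return (term, doc_count, representative _id) for every bucket,
    in the order the buckets were returned."""
    buckets = search_response["aggregations"]["full_terms"]["buckets"]
    return [
        (b["key"], b["doc_count"], b["top_doc"]["hits"]["hits"][0]["_id"])
        for b in buckets
    ]


# Trimmed-down copy of the response above, for illustration only.
sample = {
    "aggregations": {
        "full_terms": {
            "buckets": [
                {"key": "Griffins", "doc_count": 2,
                 "top_doc": {"hits": {"hits": [{"_id": "5"}]}}},
                {"key": "Grif", "doc_count": 1,
                 "top_doc": {"hits": {"hits": [{"_id": "3"}]}}},
            ]
        }
    }
}

print(bucket_ids(sample))  # [('Griffins', 2, '5'), ('Grif', 1, '3')]
```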

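Now comes the interesting part.

There's a specialized Elasticsearch API called Term Vectors which does exactly what you're after: it retrieves field and term stats from the whole index. In order to hand these stats over to you, it requires document IDs, which you'll have obtained from the above aggregation!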

Finally, since you're dealing with multiple term vectors, you can use the Multi term vectors API like so, again condensing the response through filter_path. Listing the IDs in the same order as the buckets above keeps the response in that order too:

POST /my-index/_mtermvectors?filter_path=docs.term_vectors.*.*.*.doc_freq,docs.term_vectors.*.*.*.term_freq
{
  "docs": [
    {
      "_id": "5",
      "fields": [
        "content.keyword"
      ],
      "payloads": false,
      "positions": false,
      "offsets": false,
      "field_statistics": false,
      "term_statistics": true
    },
    {
      "_id": "1",
      "fields": [
        "content.keyword"
      ],
      "payloads": false,
      "positions": false,
      "offsets": false,
      "field_statistics": false,
      "term_statistics": true
    },
    {
      "_id": "3",
      "fields": [
        "content.keyword"
      ],
      "payloads": false,
      "positions": false,
      "offsets": false,
      "field_statistics": false,
      "term_statistics": true
    },
    {
      "_id": "2",
      "fields": [
        "content.keyword"
      ],
      "payloads": false,
      "positions": false,
      "offsets": false,
      "field_statistics": false,
      "term_statistics": true
    },
    {
      "_id": "4",
      "fields": [
        "content.keyword"
      ],
      "payloads": false,
      "positions": false,
      "offsets": false,
      "field_statistics": false,
      "term_statistics": true
    }
  ]
}
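
Since every entry in docs repeats the same options, the request body lends itself to being generated from the list of IDs. Below is a minimal sketch of that idea; build_mtermvectors_body is a hypothetical helper, and the option values simply mirror the request above.

```python
import json


def build_mtermvectors_body(doc_ids, field="content.keyword"):
    """Build an _mtermvectors request body with one entry per document id,
    preserving the order of `doc_ids`."""
    return {
        "docs": [
            {
                "_id": doc_id,
                "fields": [field],
                "payloads": False,
                "positions": False,
                "offsets": False,
                "field_statistics": False,
                "term_statistics": True,
            }
            for doc_id in doc_ids
        ]
    }


# Ids in bucket order, as returned by the aggregation above.
body = build_mtermvectors_body(["5", "1", "3", "2", "4"])
print(json.dumps(body, indent=2))
```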

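The results can be post-processed in your backend to form the autocomplete response. You've got A) the full terms, B) the number of matching docs (doc_freq), and C) the term frequency (term_freq, counted within the returned document; with term_statistics enabled the API also computes an index-wide total, ttf, which this filter_path drops):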

{
  "docs" : [
    {
      "term_vectors" : {
        "content.keyword" : {
          "terms" : {
            "Griffins" : {      |      term
              "doc_freq" : 2,   | <--  # of docs
              "term_freq" : 1   |      term frequency
            }
          }
        }
      }
    },
    {
      "term_vectors" : {
        "content.keyword" : {
          "terms" : {
            "Griffith" : {
              "doc_freq" : 2,
              "term_freq" : 1
            }
          }
        }
      }
    },
    {
      "term_vectors" : {
        "content.keyword" : {
          "terms" : {
            "Grif" : {
              "doc_freq" : 1,
              "term_freq" : 1
            }
          }
        }
      }
    },
    {
      "term_vectors" : {
        "content.keyword" : {
          "terms" : {
            "Griffin" : {
              "doc_freq" : 1,
              "term_freq" : 1
            }
          }
        }
      }
    },
    {
      "term_vectors" : {
        "content.keyword" : {
          "terms" : {
            "Grift" : {
              "doc_freq" : 1,
              "term_freq" : 1
            }
          }
        }
      }
    }
  ]
}
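
The post-processing step could look roughly like the following sketch, which flattens the response above into a list of per-term records. It assumes the response is already a parsed dict of that shape; term_stats and the record keys are names chosen here for illustration.

```python
def term_stats(mtermvectors_response, field="content.keyword"):
    """Flatten an _mtermvectors response into per-term records,
    keeping the order of the returned docs."""
    stats = []
    for doc in mtermvectors_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, info in terms.items():
            stats.append({
                "term": term,
                "doc_freq": info["doc_freq"],
                "term_freq": info["term_freq"],
            })
    return stats


# Trimmed-down copy of the response above, for illustration only.
sample_vectors = {
    "docs": [
        {"term_vectors": {"content.keyword": {"terms": {
            "Griffins": {"doc_freq": 2, "term_freq": 1}}}}},
        {"term_vectors": {"content.keyword": {"terms": {
            "Grif": {"doc_freq": 1, "term_freq": 1}}}}},
    ]
}

for row in term_stats(sample_vectors):
    print(f'{row["term"]} (term_freq {row["term_freq"]}, {row["doc_freq"]} docs)')
```

Documents without term vectors for the field are skipped rather than raising, which seems a reasonable default for an autocomplete endpoint.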