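The termvectors API retrieves token information for an Elasticsearch document. Its behavior on nested fields, however, needs special attention: the API cannot directly return statistics for nested documents (see the Issue). There are two workarounds for collecting term information on nested fields: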

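  • Enable the include_in_root option on the nested field, which duplicates the content of the nested field into the parent document.
  • On the field inside the nested object whose term vectors are needed, set the term_vector option to no, explicitly turning off term vector storage. In this case the _termvectors endpoint computes the term vectors at run time.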

Some tests

Index mapping used for the test

{
  "mappings": {
    "properties": {
      "f4": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads"
      },
      "f1": {
        "type": "nested",
        "include_in_root": true,
        "properties": {
          "text": {
            "type": "text",
            "term_vector": "with_positions_offsets_payloads"
          }
        }
      },
      "f2": {
        "type": "nested",
        "include_in_root": false,
        "properties": {
          "text": {
            "type": "text",
            "term_vector": "no"
          }
        }
      },
      "f3": {
        "type": "nested",
        "include_in_root": false,
        "properties": {
          "text": {
            "type": "text",
            "term_vector": "with_positions_offsets_payloads"
          }
        }
      }
    }
  }
}

Document used for the test

{
  "f1": {
    "text": "hello world"
  },
  "f2": {
    "text": "hello world"
  },
  "f3": {
    "text": "hello world"
  },
  "f4": "hello world"
}

Test result:

{
  "_index": "foo",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "f2.text": {
      "field_statistics": {
        "sum_doc_freq": 2,
        "doc_count": 1,
        "sum_ttf": 2
      },
      "terms": {
        "hello": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5
            }
          ]
        },
        "world": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 11
            }
          ]
        }
      }
    },
    "f4": {
      "field_statistics": {
        "sum_doc_freq": 2,
        "doc_count": 1,
        "sum_ttf": 2
      },
      "terms": {
        "hello": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5
            }
          ]
        },
        "world": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 11
            }
          ]
        }
      }
    },
    "f1.text": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 2,
        "sum_ttf": 4
      },
      "terms": {
        "hello": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5
            }
          ]
        },
        "world": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 11
            }
          ]
        }
      }
    }
  }
}
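
To read a response like this without scanning it by eye, the returned fields can be compared against the requested ones. The following is a minimal sketch, not part of the original test: the helper names `returned_terms` and `missing_fields` are hypothetical, and `sample_response` is a trimmed copy of the result above with token positions and offsets left out.

```python
def returned_terms(response):
    """Map each field in the response to the sorted list of its terms."""
    term_vectors = response.get("term_vectors", {})
    return {
        field: sorted(info.get("terms", {}))
        for field, info in term_vectors.items()
    }


def missing_fields(response, requested):
    """List the requested fields that the response did not return."""
    returned = response.get("term_vectors", {})
    return [name for name in requested if name not in returned]


# Trimmed copy of the response above (token details omitted for brevity).
sample_response = {
    "_index": "foo",
    "_id": "1",
    "found": True,
    "term_vectors": {
        "f2.text": {"terms": {"hello": {"term_freq": 1}, "world": {"term_freq": 1}}},
        "f4": {"terms": {"hello": {"term_freq": 1}, "world": {"term_freq": 1}}},
        "f1.text": {"terms": {"hello": {"term_freq": 1}, "world": {"term_freq": 1}}},
    },
}

# Assumption: these four fields are the ones that were asked for.
requested = ["f1.text", "f2.text", "f3.text", "f4"]

print(returned_terms(sample_response))
print(missing_fields(sample_response, requested))
```

With the trimmed sample, the only field reported as missing is f3.text.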

Analysis

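  • f1.text is returned, because of include_in_root.
  • f2.text is returned, because termvectors computes it on the fly.
  • f3.text is not returned, which matches the odd termvectors behavior on nested fields.
  • f4 is returned, as expected.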
  • Note that the fields must be declared explicitly when calling termvectors.
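
The note about declaring fields explicitly can be sketched as a small request builder. The original does not show the actual request, so the field list and the helper name `build_termvectors_request` are assumptions; the index name `foo` and document id `1` are taken from the response above. The sketch only builds the path and JSON body and sends nothing.

```python
import json


def build_termvectors_request(index, doc_id, fields):
    """Build the path and JSON body for a _termvectors call.

    The fields are listed explicitly rather than relying on defaults.
    """
    path = f"/{index}/_termvectors/{doc_id}"
    body = {
        "fields": list(fields),    # explicit field list
        "positions": True,         # include token positions
        "offsets": True,           # include start and end offsets
        "field_statistics": True,  # include doc_count, sum_doc_freq, sum_ttf
    }
    return path, json.dumps(body, indent=2)


# Assumed field list: every text field defined in the test mapping.
path, body = build_termvectors_request(
    "foo", "1", ["f1.text", "f2.text", "f3.text", "f4"]
)
print("GET", path)
print(body)
```

The printed path and body can be pasted into any HTTP client that talks to the cluster.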