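The termvectors API retrieves token information for a document in Elasticsearch. Its behavior on nested fields, however, needs special attention.
The API cannot directly return statistics for nested documents (see the Issue). There are two ways to get term statistics for a nested field: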
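- Enable the include_in_root option on the nested field, which duplicates the nested field's content into the parent document.
- On the field inside the nested object whose term vectors you need, set the term_vector option to no, explicitly turning off term vector storage. In that case the _termvectors API computes the term vectors at request time.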
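The two workarounds above can be expressed as a small check over a mapping. The sketch below is only an illustration of the rule as this post states it, not Elasticsearch's actual logic; the function name `nested_fields_visible_to_termvectors` is hypothetical, and it assumes the documented defaults of `include_in_root: false` and `term_vector: "no"`.

```python
def nested_fields_visible_to_termvectors(mapping):
    """Return nested sub-field paths that _termvectors should return,
    according to the two workarounds described in this post (a heuristic)."""
    visible = []
    properties = mapping["mappings"]["properties"]
    for name, spec in properties.items():
        if spec.get("type") != "nested":
            continue  # only nested fields are affected
        copied_to_root = spec.get("include_in_root", False)
        for sub_name, sub_spec in spec.get("properties", {}).items():
            # term_vector "no" means nothing is stored, so the API
            # falls back to computing term vectors at request time.
            computed_on_the_fly = sub_spec.get("term_vector", "no") == "no"
            if copied_to_root or computed_on_the_fly:
                visible.append(f"{name}.{sub_name}")
    return visible


stored = "with_positions_offsets_payloads"
example_mapping = {
    "mappings": {
        "properties": {
            "f1": {"type": "nested", "include_in_root": True,
                   "properties": {"text": {"type": "text", "term_vector": stored}}},
            "f2": {"type": "nested", "include_in_root": False,
                   "properties": {"text": {"type": "text", "term_vector": "no"}}},
            "f3": {"type": "nested", "include_in_root": False,
                   "properties": {"text": {"type": "text", "term_vector": stored}}},
        }
    }
}
print(nested_fields_visible_to_termvectors(example_mapping))
```

With this mapping, only the field that uses neither workaround (`f3.text`) is left out.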
Some tests
Index mapping used for the tests
{
  "mappings": {
    "properties": {
      "f4": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads"
      },
      "f1": {
        "type": "nested",
        "include_in_root": true,
        "properties": {
          "text": {
            "type": "text",
            "term_vector": "with_positions_offsets_payloads"
          }
        }
      },
      "f2": {
        "type": "nested",
        "include_in_root": false,
        "properties": {
          "text": {
            "type": "text",
            "term_vector": "no"
          }
        }
      },
      "f3": {
        "type": "nested",
        "include_in_root": false,
        "properties": {
          "text": {
            "type": "text",
            "term_vector": "with_positions_offsets_payloads"
          }
        }
      }
    }
  }
}
Test document
{
  "f1": {
    "text": "hello world"
  },
  "f2": {
    "text": "hello world"
  },
  "f3": {
    "text": "hello world"
  },
  "f4": "hello world"
}
Test result:
{
  "_index": "foo",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "f2.text": {
      "field_statistics": {
        "sum_doc_freq": 2,
        "doc_count": 1,
        "sum_ttf": 2
      },
      "terms": {
        "hello": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5
            }
          ]
        },
        "world": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 11
            }
          ]
        }
      }
    },
    "f4": {
      "field_statistics": {
        "sum_doc_freq": 2,
        "doc_count": 1,
        "sum_ttf": 2
      },
      "terms": {
        "hello": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5
            }
          ]
        },
        "world": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 11
            }
          ]
        }
      }
    },
    "f1.text": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 2,
        "sum_ttf": 4
      },
      "terms": {
        "hello": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5
            }
          ]
        },
        "world": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 11
            }
          ]
        }
      }
    }
  }
}
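One way to spot a silently dropped field in a response like this is to compare the keys under `term_vectors` with the fields that were requested. The helper below is a minimal sketch; the name `missing_fields` is hypothetical, and the response dict is a trimmed copy of the result above with the per-field details left out.

```python
def missing_fields(requested, response):
    """Return the requested fields that have no entry under term_vectors."""
    returned = response.get("term_vectors", {})
    return [field for field in requested if field not in returned]


# Trimmed version of the test result: only the field names matter here.
response = {
    "_index": "foo",
    "_id": "1",
    "found": True,
    "term_vectors": {"f2.text": {}, "f4": {}, "f1.text": {}},
}
requested = ["f1.text", "f2.text", "f3.text", "f4"]
print(missing_fields(requested, response))  # prints ['f3.text']
```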
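Analysis
- f1.text is returned, because of include_in_root.
- f2.text is returned, because termvectors computes it on the fly.
- f3.text is not returned, which matches the odd termvectors behavior described above.
- f4 is returned, as expected.
- Note that fields must be declared explicitly when calling termvectors.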
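The original request is not shown in this post, so the following is only a sketch of how a call with explicitly declared fields might be assembled. The helper name `build_termvectors_request` is hypothetical; the index name `foo` and document id `1` are taken from the response above, and the body uses the `fields`, `positions`, `offsets`, and `field_statistics` parameters of the term vectors API. Sending the request is left to whichever HTTP client you use.

```python
import json


def build_termvectors_request(index, doc_id, fields):
    """Build the method, path, and JSON body for a _termvectors call
    that lists every wanted field explicitly."""
    path = f"/{index}/_termvectors/{doc_id}"
    body = {
        "fields": list(fields),    # nested sub-fields must be named in full
        "positions": True,
        "offsets": True,
        "field_statistics": True,
    }
    return "GET", path, json.dumps(body)


method, path, body = build_termvectors_request(
    "foo", "1", ["f1.text", "f2.text", "f3.text", "f4"]
)
print(method, path)
print(body)
```

Listing `f3.text` here does not make it appear in the response; per the analysis above it is still omitted.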