The Index Plan

Published: 2020-07-05 03:45:10  Author: 努力的C
Source: web

In order to index the CSV, we want to take two fields from each row, title and description, and turn them into suitable terms. For straightforward textual search we don’t need document values.

Because we’re dealing with free text, and because we know the whole dataset is in English, we can use stemming so that, for instance, searches for “sundial” and “sundials” match the same documents. This way people don’t need to worry too much about exactly which words to use in their query.

Finally, we want a way of separating the two fields. In Xapian this is done using term prefixes: short strings placed at the beginning of a term to indicate which field that term indexes. As well as prefixed terms, we also want to generate unprefixed terms, so that you can search for text in any field as well as within a specific field.

Some prefixes are conventional, which is helpful if you ever need to interoperate with omega (a web-based search engine) or other compatible systems. Following those conventions, we’ll use ‘S’ as the prefix for title (it stands for ‘subject’), and ‘XD’ for description. A full list of conventional prefixes is given at the top of the omega documentation on term prefixes.
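As a rough illustration of what prefixing means, here is a toy sketch in plain Python. It is not Xapian code: the helper name make_terms is hypothetical, the tokenizer is a simple regular expression, and stemming is left out entirely. It only shows how the same words become different terms depending on the field prefix.

```python
import re


def make_terms(text, prefix=""):
    """Lowercase the text, split it into words, and prepend a field prefix.

    Toy stand-in for what a term generator does; no stemming is applied.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [prefix + word for word in words]


# Title terms get the conventional 'S' prefix, description terms get 'XD'.
title_terms = make_terms("The Saints", "S")
description_terms = make_terms("Keep reading", "XD")

# Unprefixed terms allow matching the text in any field.
any_field_terms = make_terms("The Saints") + make_terms("Keep reading")
```

With these inputs, title_terms is ['Sthe', 'Ssaints'] and description_terms is ['XDkeep', 'XDreading'], so a search restricted to the title only ever looks at terms beginning with ‘S’.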

When you’re indexing multiple fields like this, the term positions used for each field when indexed unprefixed need to be kept apart. Say you have a title of “The Saints”, and description “Don’t like rabbits? Keep reading.” If you index those fields without a gap, the phrase search “Saints don’t like rabbits” will match, even though it shouldn’t. Usually a gap of 100 between fields is enough.
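The effect of the gap can be reproduced with a small plain-Python model. This is only a sketch under simplifying assumptions: the functions tokenize, assign_positions and phrase_matches are hypothetical helpers, not part of Xapian, and a phrase is taken to match when its words occupy consecutive positions.

```python
import re


def tokenize(text):
    """Split text into lowercase words, keeping straight apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def assign_positions(fields, gap=0):
    """Give each word a position; skip `gap` extra positions between fields."""
    postings = []
    pos = 0
    for i, text in enumerate(fields):
        if i:
            pos += gap
        for word in tokenize(text):
            pos += 1
            postings.append((word, pos))
    return postings


def phrase_matches(postings, phrase):
    """Return True if the phrase's words appear at consecutive positions."""
    words = tokenize(phrase)
    if not words:
        return False
    index = {}
    for word, pos in postings:
        index.setdefault(word, set()).add(pos)
    return any(
        all(start + k in index.get(word, ()) for k, word in enumerate(words))
        for start in index.get(words[0], ())
    )


fields = ["The Saints", "Don't like rabbits? Keep reading."]
no_gap = assign_positions(fields)
with_gap = assign_positions(fields, gap=100)
```

Here phrase_matches(no_gap, "Saints don't like rabbits") is True, because “saints” at position 2 is immediately followed by “don't” at position 3. With gap=100 the description starts at position 103, so the same phrase no longer matches, while a phrase inside one field such as “like rabbits” still does.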

To write to a database, we use the WritableDatabase class, which allows us to create, update or overwrite a database.

To create terms, we use Xapian’s TermGenerator, a built-in class to make turning free text into terms easier. It will split into words, apply stemming, and then add term prefixes as needed. It can also take care of term positions, including the gap between different fields.
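Putting the plan together, a minimal sketch of an indexer using the Xapian Python bindings might look like the following. It assumes the bindings are installed and that each row is a dict with 'title' and 'description' keys; those key names, the function name index_rows and the FIELD_PREFIXES list are illustrative choices, not something the text above prescribes. Storing the row as JSON in the document data is likewise just one option.

```python
import json

# Field name -> term prefix, following the conventions described above.
# The dict keys 'title' and 'description' are assumptions about the rows.
FIELD_PREFIXES = [("title", "S"), ("description", "XD")]


def index_rows(dbpath, rows):
    """Index an iterable of dicts into a Xapian database at dbpath."""
    # Imported here so this module still loads where the bindings are absent.
    import xapian

    # Create the database, or open it if it already exists.
    db = xapian.WritableDatabase(dbpath, xapian.DB_CREATE_OR_OPEN)

    # The term generator splits text into words and applies English stemming.
    termgenerator = xapian.TermGenerator()
    termgenerator.set_stemmer(xapian.Stem("en"))

    for row in rows:
        doc = xapian.Document()
        termgenerator.set_document(doc)

        # Prefixed terms, for searching within a single field.
        for field, prefix in FIELD_PREFIXES:
            termgenerator.index_text(row[field], 1, prefix)

        # Unprefixed terms, for searching across all fields, with a
        # position gap between fields so phrases cannot span them.
        for i, (field, _prefix) in enumerate(FIELD_PREFIXES):
            if i:
                termgenerator.increase_termpos()
            termgenerator.index_text(row[field])

        # Keep the original row so it can be shown in search results.
        doc.set_data(json.dumps(row))
        db.add_document(doc)

    db.commit()
    db.close()
```

A call such as index_rows("db", [{"title": "The Saints", "description": "Keep reading."}]) would then create a database in the directory db. Called without arguments, increase_termpos() advances the position by 100, which matches the gap suggested earlier.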

