
Search tokenization

Tokens are the individual terms that Elasticsearch produces by dividing up input text. Tokens are stored in Elasticsearch and used to find matches against search terms.

Documents are the data structures used in Elasticsearch indexes. Each document is stored in JSON format and has a collection of fields, which are the data in key-value pairs.

Tokenization typically occurs:

  1. When a document (e.g. a post) is stored or updated in Elasticsearch.
  2. During a search, when the search term is broken down into tokens.

Search term tokens are then compared against the document tokens to find a match. The tokenization process applies a more sophisticated set of rules during a search, which eliminates the need for full-text search and also returns more relevant document results.


Retrieve settings

The wp vip-search get-index-settings CLI command retrieves the current settings of an index:

vip @example-app.develop -- wp vip-search get-index-settings <indexable_type> 

The default analyzer will be included in the results returned by the command.

In this example, the get-index-settings command is run against the application and environment @example-app.develop, and the output is piped to jq to extract the specific section of the settings that sets the default analyzer. This example requires jq to be installed on the local machine:

$ vip @example-app.develop -- wp vip-search get-index-settings post --format=json | jq '.[][].settings.index.analysis.analyzer.default'
{
  "filter": [
    "ep_synonyms_filter",
    "ewp_word_delimiter",
    "lowercase",
    "stop",
    "ewp_snowball"
  ],
  "char_filter": [
    "html_strip"
  ],
  "language": "english",
  "tokenizer": "standard"
}

Unless the predefined behavior has been modified, the configuration of a site’s index default analyzer will be the same as the example above.

Retrieve settings for a specific filter

For users with jq installed on their local machine, settings for individual filters can be queried with the get-index-settings command. In this example, settings for the ep_synonyms_filter are queried:

$ vip @example-app.develop -- wp vip-search get-index-settings post --format=json | jq '.[][].settings.index.analysis.filter.ep_synonyms_filter'

The settings returned for the example command above:

{
  "type": "synonym_graph",
  "lenient": "true",
  "synonyms": [
    "sneakers, tennis shoes, trainers, runners",
    "shoes => sneaker, sandal, boots, high heels"
  ]
}

The example output above shows that the built-in synonym_graph filter serves as the base for ep_synonyms_filter, along with examples of defined synonyms.

This example demonstrates two important things:

  1. The collection of tokens is not a flat list, but rather a graph that can branch out and connect back in.
  2. Filters are not limited to one-to-one conversions. For example, the two tokens "tennis" "shoes" can become one token "sneakers", and vice versa.

Filters

Filters—specifically token filters—are the most useful tool for increasing search relevancy.

Token filters parse the tokens generated by the previous filter and produce new tokens based on their settings. Because every filter receives the result of the previous one, the order of filters will have an impact on results in most cases.

In an earlier example, the filter settings returned by get-index-settings were:

"filter": [
    "ep_synonyms_filter",
    "ewp_word_delimiter",
    "lowercase",
    "stop",
    "ewp_snowball"
  ],

The filter keyword sets the filters in the settings, and the filters are listed in the order in which they are applied by the analyzer.

The filters listed in the example above and the filters defined below do not represent the complete list of all available filters. Elasticsearch has several additional built-in filters that can be configured to define a custom filter.

ep_synonyms_filter

The ep_synonyms_filter custom filter allows the analyzer to handle synonyms, including multi-word synonyms.

{
  "type": "synonym_graph",
  "lenient": "true",
  "synonyms": [
    "sneakers, tennis shoes, trainers, runners",
    "shoes => sneaker, sandal, boots, high heels"
  ]
}

In the above example, the built-in synonym_graph is set as a base for ep_synonyms_filter. The tokens "green" "tennis" "shoes" transform into "green" ("sneakers"|"tennis" "shoes"|"trainers"|"runners").

The above settings allow a search query for blue sneakers to return the same results as a query for blue tennis shoes.

ewp_word_delimiter

The ewp_word_delimiter custom filter uses the built-in word_delimiter filter as its base and serves as a tool to break down compound terms or words into tokens based on several rules.

{
  "type": "word_delimiter",
  "preserve_original": "true"
}

As an example, "WordPress" will be broken down into two terms "Word" "Press". This allows the search term "Word" to match a "WordPress" document.

lowercase

Lowercase is a built-in filter that converts all letters to lowercase, making search case-insensitive.

As an example of the importance of the order of filters, if the lowercase filter was applied before ewp_word_delimiter, the term "WordPress" would not be split into "Word" "Press". The lowercase filter would convert "WordPress" to "wordpress" before it was passed to the ewp_word_delimiter filter, so the rule to split tokens at letter case transitions would not apply.

stop

The stop filter removes stop words from a predefined stop word list when applied; predefined lists are available for several languages. For the English language, stop words include a and the, for example. Removing these words from the token collection helps documents that are more relevant to a search term score higher than documents that contain large numbers of stop words.

ewp_snowball

The ewp_snowball custom filter is based on the built-in snowball filter, which stems words into their basic form. For example, "jumping" "fox" will be converted to "jump" "fox".

{
  "type": "snowball",
  "language": "english"
}

The result of this filter, if applied, is that a document containing "jumping bear" and a document containing "high jump" will both score the same for the search term "jumping".

Customizing the analyzer

An analyzer is the main tool Elasticsearch uses to produce tokens and is a collection of rules that govern how tokenization will be executed. Enterprise Search defines a default custom analyzer that will be used if a field in mappings does not have an explicit analyzer.

Several ElasticPress filters are available for customizing the way the default analyzer operates.

Note

Added customizations are only applied when an index is initially created or when an existing index is versioned.

Changing default filters

The ep_default_analyzer_filters WordPress filter returns an array of filters that will be applied to the default analyzer.

This filter can be used to modify the list of token filters that will be applied to the default analyzer. For example, to make the search case-sensitive, remove the lowercase filter:

add_filter( 'ep_default_analyzer_filters', function ( $filters ) {
    if ( ( $key = array_search( 'lowercase', $filters ) ) !== false ) {
        unset( $filters[ $key ] );
    }
    return $filters;
} );

Be aware that ElasticPress adds the ep_synonyms_filter token filter as the first item in the array of filters returned by ep_default_analyzer_filters.

So for example the following code:

add_filter( 'ep_default_analyzer_filters', function ( $filters ) {
    return array( 'lowercase', 'stop' );
} );

will actually return the array: array( 'ep_synonyms_filter', 'lowercase', 'stop' )
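
Because filters are applied in the order they appear in this array, a customization may need to place a filter at a specific position rather than append it to the end. Below is a minimal sketch, assuming the default filter list shown earlier, that inserts the built-in asciifolding token filter directly after lowercase; the function name vip_insert_asciifolding is illustrative:

add_filter( 'ep_default_analyzer_filters', 'vip_insert_asciifolding' );

function vip_insert_asciifolding( $filters ) {
	$position = array_search( 'lowercase', $filters, true );

	if ( false !== $position ) {
		// Splice the new filter in directly after lowercase.
		array_splice( $filters, $position + 1, 0, 'asciifolding' );
	} else {
		// Fall back to appending if lowercase is not present.
		$filters[] = 'asciifolding';
	}

	return $filters;
}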

Changing language

Many of the filters used for customization are language-dependent. The ep_analyzer_language WordPress filter changes the language used by the stemming Snowball token filter, and only a limited number of languages are accepted.

For example, to switch language to French:

add_filter( 'ep_analyzer_language', function( $language, $context ) {
    return 'french';
}, 10, 2 );
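
If the analyzer language should follow the site's configured locale rather than be hard-coded, the same filter can derive it from get_locale(). Below is a minimal sketch; the locale-to-language map and the function name vip_locale_analyzer_language are illustrative, and only languages accepted by the Snowball token filter will work:

add_filter( 'ep_analyzer_language', 'vip_locale_analyzer_language', 10, 2 );

function vip_locale_analyzer_language( $language, $context ) {
	// Illustrative mapping of WordPress locales to Snowball language names.
	$map = array(
		'fr_FR' => 'french',
		'de_DE' => 'german',
		'es_ES' => 'spanish',
	);

	$locale = get_locale();

	// Fall back to the default language when the locale is not mapped.
	return isset( $map[ $locale ] ) ? $map[ $locale ] : $language;
}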

Define synonyms

A list of synonyms can be defined by implementing the ep_synonyms filter.

Note that each group of synonyms is a single string, with terms separated by commas (,).

add_filter( 'ep_synonyms', function() {
	return array(
		'vip,automattic',
		'red,green,blue'
	);
} );
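
If synonym groups are easier to maintain as nested arrays, they can be flattened into the comma-separated strings that ep_synonyms expects. Below is a minimal sketch; the $synonym_groups data and the function name vip_example_synonyms are illustrative:

add_filter( 'ep_synonyms', 'vip_example_synonyms' );

function vip_example_synonyms() {
	// Each inner array is one group of equivalent terms.
	$synonym_groups = array(
		array( 'vip', 'automattic' ),
		array( 'red', 'green', 'blue' ),
	);

	// Flatten each group into a single comma-separated string.
	return array_map(
		function ( $group ) {
			return implode( ',', $group );
		},
		$synonym_groups
	);
}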

To disable synonyms, return an empty array. The ep_synonyms_filter will then not be created or used in the default analyzer.

add_filter( 'ep_synonyms', '__return_empty_array' );

Define custom filters

Custom filters can be added by extending an existing filter.

The ep_<indexable>_mapping WordPress filter allows mappings to be modified after they are generated, but before they are published.

This is a kind of catch-all filter that can be used to make any adjustments that do not already have a custom WordPress filter defined.

In the following example, the ep_post_mapping filter is used to define the custom token filter my_custom_word_delimiter. This variant of the filter will not split on case change, nor on _ or -. Adding it to ep_default_analyzer_filters ensures that the filter is added to the default analyzer and is applied in the correct position in the list of token filters.

add_filter( 'ep_post_mapping', function ( $mapping ) {
	$mapping['settings']['analysis']['filter']['my_custom_word_delimiter'] = [
		'type'                 => 'word_delimiter_graph',
		'preserve_original'    => true,
		'split_on_case_change' => false,
		'type_table'           => array( '_ => ALPHA', '- => ALPHA' ),
	];

	return $mapping;
} );

add_filter( 'ep_default_analyzer_filters', function() {
	return array( 'my_custom_word_delimiter', 'lowercase', 'stop', 'kstem' );
} );
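
In this example, word_delimiter_graph (the graph-aware variant of the base word_delimiter filter) is used so that multi-position tokens are handled correctly, and the type_table entries mark _ and - as ALPHA characters so that the filter treats them as letters rather than as split points.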

Language agnostic

To make search insensitive to diacritics, add asciifolding to the list of filters returned by ep_default_analyzer_filters to convert Unicode characters to their ASCII counterparts:

add_filter( 'ep_default_analyzer_filters', 'vip_add_asciifolding_filter' );

function vip_add_asciifolding_filter( $filters ) {
    $filters[] = 'asciifolding';

    return $filters;
}

For example, Elasticsearch considers “Español” and “Espanol” to be different terms and will yield different search results. Once asciifolding has been applied, “Español” and “Espanol” will be treated as the same term.
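
If the original accented form should also remain searchable as an exact token, the asciifolding filter accepts a preserve_original option. Below is a minimal sketch that follows the same pattern as the custom filter example above; the filter name my_asciifolding_preserve and the anonymous callbacks are illustrative:

add_filter( 'ep_post_mapping', function ( $mapping ) {
	// Register a custom asciifolding variant that also keeps the original token.
	$mapping['settings']['analysis']['filter']['my_asciifolding_preserve'] = [
		'type'              => 'asciifolding',
		'preserve_original' => true,
	];

	return $mapping;
} );

add_filter( 'ep_default_analyzer_filters', function ( $filters ) {
	// Use the custom variant in place of the plain asciifolding filter.
	$filters[] = 'my_asciifolding_preserve';

	return $filters;
} );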

Last updated: August 19, 2024

Relevant to

  • WordPress