Setting up Full Text Search

The goal of the full-text search is to provide a "Google-like" search: a single search field that searches "intelligently" through an entire document, without the user needing to specify which fields to search.

To achieve this, Wedia aggregates all the data from a file into a single index, the "full-text" index, which allows the user to search multiple attributes in a single request. Other search indexes exist, but they use attributes to carry out searches in a specific field and not in an entire file.

The information stored in the full-text index is also reprocessed in order to understand the user’s request better.

This processing can make results feel less precise. When a search must be perfectly precise (a product code, for example), a search in the specific field storing the product code will be more relevant.

This article explains how the Wedia full-text engine works to provide users with more help in their search.

Understanding how media files are stored in the full-text index

Understanding how the Wedia search engine works means understanding how it indexes documents and how it finds them.

What is indexed for a file is defined in the configuration: not everything is necessarily indexed, nor will everything stand out in the full-text search. In the following example, the text fields and tags are indexed, as are the type of view and the type of rights: this means that you will find "BMW 6 external" in the full-text search.

 

As soon as a file is created or modified, its contents are stored in the index. Before being stored, however, the text passes through several filters that standardize it in order to improve the relevance of the results.

By default, Wedia analyzes the text using rules adapted to French, which are as follows:

  • Removing capital letters:
    A search for BMW or BmW will provide the same result.

  • Removing punctuation:
    All periods, commas, apostrophes and slashes are deleted. This rule may then affect searches for references like "1212.A.US" or "121UA/FR", which are no longer searchable as is.

  • Standardizing accented characters:
    é, à, ç, ð, æ, å are stored as e, a, c, o, ae, a.

  • Separating text into several words:
    This filter uses punctuation to index each word of a text separately. Take the text "To be or not to be.That is the question", in which there is no space after the period. Without this filter, Wedia would index "be.That" as a single word: "bethat". With this filter, it indexes "be" and "that" separately. This filter also acts on dashes: "up-to-date" is indexed as "up", "to" and "date". Likewise, a code with dashes such as "1212-AUS" is indexed as two separate terms, "1212" and "AUS".

  • Stop words:
    This filter consists of a list of insignificant words that are removed from the document before the indexing process begins. It is used to avoid indexing words like "and", "from", "to", "that", "who", etc. The list is specific to each language: some words are significant in one language and not in another. In German, the word "man" is removed, yet it is very significant in English. In French, the word "car" (which means "because") is removed, yet it is very significant in English. If English text is stored in German fields by mistake, or vice versa, the stop word filter can create confusion. As a result, it is very important to set the language of each attribute correctly when configuring.

  • Stemmer:
    The term is then passed through a filter that searches for its root: a stemmer (stem in the sense of "root" in English) uses certain rules to determine the root of a word. For example, the words "developer", "develop", "development", etc. will be processed into "develop".

  • Elision:
    The elision filter can be very important in some languages (such as French) but rather less important in others (like English). It removes insignificant "words" before indexing. For example, "I’m waiting for you to call me" will be indexed as "wait for you to call" (in the end, "for" and "to" will probably be removed by the stopword filter). As you can see, the words "I" and "me" have been removed because they are part of the elision filter list.

  • Synonyms:
    When you include thesauruses in Wedia, you can state that one term is synonymous with another. In these cases, Wedia stores synonymous terms to improve searches.
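The filters above can be sketched as a small pipeline. This is only an illustration in Python, not Wedia's implementation: the function names, the short stop word list, the toy suffix-stripping stemmer and the choice to split on every non-alphanumeric character are all assumptions, and elision and synonyms are left out.

```python
import re
import unicodedata

# Hypothetical, deliberately short lists; a real analyzer uses per-language lists.
STOP_WORDS = {"and", "from", "to", "that", "who"}
SUFFIXES = ("ment", "er", "s")  # toy stemmer, not a real stemming algorithm


def fold_accents(text):
    """Replace accented characters with their unaccented form."""
    text = text.replace("æ", "ae")  # no Unicode decomposition, so map it by hand
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(word):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, fold accents, split on punctuation, drop stop words, stem."""
    text = fold_accents(text.lower())
    # Assumption: every character that is not a letter or digit separates words.
    tokens = re.findall(r"[a-z0-9]+", text)
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return [stem(token) for token in tokens]


print(analyze("1212-AUS"))                       # ['1212', 'aus']
print(analyze("ABD12-AND/SMALL"))                # ['abd12', 'small']
print(analyze("developer development develop"))  # ['develop', 'develop', 'develop']
```

The second call shows why references are fragile in the full-text index: the code is split into pieces and one of the pieces is discarded as a stop word.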

This default analysis is applied to non-language fields. Fields supporting a language are analyzed with an analyzer compatible with that language if possible. The difference is usually in the stemming of words and in a list of stop words specific to that language.

The mechanism for searching a multi-language document corpus is explained later in this document.

You can refine the analysis of certain fields by clarifying specific analysis rules such as:

  • disable stemming and stop words;

  • completely disable the splitting of text into words (tokenization), for example to keep references intact.

Remember that filters alter the way in which terms are stored.
In particular, when they act on references, they can distort them completely. A search for "ABD12-AND/SMALL" will not work optimally: the search engine only knows a file containing the words "ABD12" and "SMALL", because "AND" was removed as a stop word.

Word combinations: the "or" by default

When a search is initiated for a phrase such as "blue boat", the Wedia search engine translates this as all texts including "blue" or "boat". It’s a broad search, similar to Google’s.

You can modify the default search by making it an inclusive search: when this method is configured, it will only search for documents containing both "blue" and "boat".

Ask your project manager to show you how the default search in your Wedia project is configured. By default, Wedia is delivered with an "or" search, just like Google.

When you need more precise results, you can give the search engine more detail. This is done simply by using operators in the query.

For example, in the search for "blue boat", suppose we only want documents that contain both terms, "blue" and "boat".

In this case, we can tell the search engine by adding a "+" in front of each mandatory word: "+blue +boat" will search for documents that contain both "blue" and "boat".

We can also exclude a term using the "-" operator: "+blue +boat -sail" will only find documents that contain the words "blue" and "boat" and never "sail" (or "sailing", etc.).

A desired but not mandatory word can be prefixed with "~". For example, if you want to search for blue or red sailboats, you can write the following query: "+boat +sail ~blue ~red".

Note: this is what the search engine does by default. When you enter the phrase "blue boat", it translates it internally as "~blue ~boat".
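To make the three operators concrete, here is a minimal Python sketch of how a query could be split into mandatory, optional and excluded terms and checked against the words of a document. The function names and the matching rule (when no term is mandatory, at least one optional term must be present) are assumptions for illustration. Wedia's actual parser is not described here, and the sketch skips stemming, so it would not exclude "sailing" for "-sail".

```python
def parse_query(query):
    """Sort query words into mandatory (+), excluded (-) and optional (~ or bare)."""
    parsed = {"must": [], "should": [], "must_not": []}
    for word in query.lower().split():
        if word.startswith("+"):
            key, term = "must", word[1:]
        elif word.startswith("-"):
            key, term = "must_not", word[1:]
        else:
            # A bare word is treated like "~word" (the default "or" search).
            key, term = "should", word.lstrip("~")
        if term:
            parsed[key].append(term)
    return parsed


def matches(parsed, document_words):
    """Return True if the document satisfies the parsed query."""
    words = set(document_words)
    if any(term in words for term in parsed["must_not"]):
        return False
    if not all(term in words for term in parsed["must"]):
        return False
    if parsed["must"]:
        return True  # optional terms would only influence ranking
    # Assumption: without mandatory terms, one optional term is enough.
    return any(term in words for term in parsed["should"])


query = parse_query("+boat +sail ~blue ~red")
print(query)
print(matches(query, ["red", "sail", "boat"]))   # True
print(matches(query, ["red", "boat"]))           # False, "sail" is mandatory
```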

Phrase searches

A word search finds documents containing these words anywhere in the document. A search for blue boat will therefore find documents containing "blue boat" as well as documents containing "boating holiday under a blue sky".

You can make more demanding searches using the position of search terms by enclosing "linked" words using quotation marks to search for a phrase.

In the previous example, you can find just the first document by typing the quotation marks as part of the query: "blue boat". Text analysis is still applied to this kind of search, so searching for "blue boats" returns the same result.
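The difference between a word search and a phrase search comes down to word positions. As a rough sketch (the function name and the pre-analyzed word lists are hypothetical), a phrase matches only when its words appear next to each other and in the same order:

```python
def contains_phrase(document_words, phrase_words):
    """Return True if phrase_words occur consecutively in document_words."""
    length = len(phrase_words)
    if length == 0:
        return False
    return any(
        document_words[start:start + length] == phrase_words
        for start in range(len(document_words) - length + 1)
    )


# Word lists as they might look after analysis (stemmed, stop words removed).
first = ["blue", "boat"]
second = ["boat", "holiday", "under", "blue", "sky"]

print(contains_phrase(first, ["blue", "boat"]))   # True
print(contains_phrase(second, ["blue", "boat"]))  # False, both words are present but not adjacent
```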

Begins with, contains or ends with

If you want to find documents containing words that start with a given string, you can use the "*" operator.

For example, "a green boa*" will return documents containing "boat", "boa constrictor", "boar", etc.

For performance reasons, searches for "contains" or "ends with" are not supported. Searches for "bl*e" or "*lue" will return nothing. If you want to do this type of search, you will have to do it in the index of a specific field (the product code for example), and not through the full-text index.
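One common reason prefix searches are cheap while "contains" and "ends with" searches are not is that search indexes usually keep their terms sorted. The sketch below assumes such a sorted term list. Whether Wedia's engine works exactly this way is an assumption, and the function name is hypothetical.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Find all terms starting with prefix in an alphabetically sorted list."""
    # Binary search jumps straight to the first candidate term.
    start = bisect_left(sorted_terms, prefix)
    result = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        result.append(term)
    return result


index_terms = sorted(["green", "boat", "boa", "boar", "boot", "blue"])
print(terms_with_prefix(index_terms, "boa"))  # ['boa', 'boar', 'boat']
```

A query such as "*lue" gets no help from the sort order: every term in the index would have to be examined.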

Number searches

The search works on numbers: BMW 6 Series will find all documents with a "6" in them.

There is one exception: if we have a file in which a field is defined as a numeric attribute (a weight, for example), we can choose never to index it so that it will not spoil the results. If it is absolutely imperative to index it, then a specific configuration is necessary.

See: https://crossmedia.atlassian.net/wiki/spaces/WD/pages/11665908/Indexation#SKU%2C-product-references%2C-identifier-indexation

Searching for document content

If you wish, you can index the content of Word, Excel, PowerPoint, PDF and InDesign documents that are inserted into Wedia.

By default, the full-text search also looks in the indexed content of these files. You can disable this behavior by unchecking the corresponding option under the search field, as shown in the screenshot below.

Multilingual searches

When Wedia is installed in a multilingual environment, full-text search applies to all indexed languages by default.

It is possible to limit the search to a few languages, such as the languages contributed by the connected user, by means of a simple configuration during the initial setup.

During a multilingual search, the searched text is analyzed according to the indexed fields.

For example, if the search is limited to users' contributed languages and a French/English contributor searches for ~"red sailboat" ~"red boat", the text is analyzed by the English analyzer before being searched for in English fields and analyzed in French for French fields. If some fields are configured using simpler rules, the text is also analyzed using these rules to search in these fields (product reference, email, etc.).
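As a rough illustration of per-field analysis, the Python sketch below analyzes the same query once for each field, using that field's language. The field names, the tiny stop word lists and the function names are hypothetical, and only stop word removal is shown (no stemming).

```python
# Hypothetical stop word lists; real analyzers use much longer ones.
STOP_WORDS = {
    "en": {"the", "and", "to"},
    "fr": {"car", "le", "la", "et"},
    "de": {"man", "und", "der"},
}


def analyze_for_language(text, language):
    """Lowercase the text and drop the stop words of the given language."""
    stop_words = STOP_WORDS.get(language, set())  # no language: keep every word
    return [word for word in text.lower().split() if word not in stop_words]


def analyze_query_per_field(query, field_languages):
    """Analyze the query once per field, using that field's language."""
    return {
        field: analyze_for_language(query, language)
        for field, language in field_languages.items()
    }


fields = {"title_en": "en", "title_fr": "fr", "reference": None}
print(analyze_query_per_field("red car", fields))
# {'title_en': ['red', 'car'], 'title_fr': ['red'], 'reference': ['red', 'car']}
```

The word "car" survives in the English field and in the field without a language, but is removed for the French field, which is why setting the language of each attribute correctly matters.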

Sorting results by relevance

By default, the results returned by a full-text search are sorted by relevance. The relevance calculation is complex but follows a few simple principles.

In an "OR" search, the more of the searched words a document contains, the more relevant it is. For example, if you search for ~red ~sailboat, the documents containing both "red" and "sailboat" will be returned before those containing only "red" or only "sailboat".

A match in the text of a document is more relevant than a match in a categorization attribute. To use the previous example once again, if our documents have a categorization attribute of "color", documents containing "red sailboat" in their title will be returned before those containing only "sailboat" and categorized by the color "red".

The rarer a word is, the more relevant it is in a text. If we look for "sailboat", the document titled "recreational sailboat" is more relevant than the one titled "sailboat, sailboat, sailboat, etc."
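The following Python sketch shows one way these principles could combine into a score. It is not Wedia's formula, which is not described here: the field weights, the rarity measure (an inverse document frequency) and all names are assumptions chosen for illustration.

```python
import math

# Hypothetical weights: a match in the text counts more than a category match.
FIELD_WEIGHTS = {"title": 2.0, "color": 1.0}


def rarity(term, corpus):
    """Score a term higher when fewer documents contain it."""
    containing = sum(
        1 for doc in corpus if any(term in words for words in doc.values())
    )
    return math.log(1 + len(corpus) / (1 + containing))


def score(query_terms, doc, corpus):
    """Sum, for each matched term, its rarity times the best matching field weight."""
    total = 0.0
    for term in query_terms:
        weights = [
            FIELD_WEIGHTS.get(field, 1.0)
            for field, words in doc.items()
            if term in words
        ]
        total += rarity(term, corpus) * max(weights, default=0.0)
    return total


documents = {
    "A": {"title": ["red", "sailboat"], "color": []},
    "B": {"title": ["sailboat"], "color": ["red"]},
    "C": {"title": ["sailboat"], "color": []},
}
corpus = list(documents.values())
ranking = sorted(
    documents,
    key=lambda name: score(["red", "sailboat"], documents[name], corpus),
    reverse=True,
)
print(ranking)  # ['A', 'B', 'C']
```

Document A matches both words in its title, B matches "red" only through its category, and C matches a single word, which gives the order described above.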

 

Enhancing Full-Text Search with Property Boosting

In full-text search implementations, it can be beneficial to adjust the significance of specific properties over others to refine the search results. For instance, a document's title might be more relevant than the photographer's name when executing a search. This customization is referred to as "term boosting."

Default Behavior: By default, all properties contribute equally to the full-text search process.

Activating Term Boosting: To implement term boosting, you must enable the "WXM_FulltextSandbox" plugin. Once the plugin is active, you can configure the "fieldsBoost" property to modify the search relevance of certain properties.

 

Configuration Syntax: Term boosting is configured by supplying a JSON object where the key is the property name and the value is the boost factor. The syntax is as follows: "property_name": boost_value.

Example: To boost the importance of the name property of an asset by a factor of 42, the JSON configuration would be: { "asset.name": 42 }.

Using Wildcards: Boost factors can also be applied to multiple properties using wildcards. For example, { "*.name": 42 } applies a boost factor of 42 to the name property across all indexed objects.

Tagging for Boosting: Another approach to boosting involves tagging properties with a free-format tag, such as "#boost10". To boost all properties tagged with #boost10 by a factor of 10, include in the JSON object: {"*.#boost10": 10}. This will enhance the importance of all tagged properties accordingly.
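The three kinds of key can be combined in a single "fieldsBoost" object. The Python sketch below builds such an object and shows how its keys could be matched against a property. Only the JSON shape comes from this article: the helper name, the default factor of 1 and the rule of keeping the highest matching boost are assumptions, since how Wedia resolves overlapping rules is not stated here.

```python
import json
from fnmatch import fnmatchcase

# Keys taken from the examples above: an exact property and a tag wildcard.
fields_boost = {"asset.name": 42, "*.#boost10": 10}


def effective_boost(object_type, property_name, tags, config):
    """Return the boost that would apply to one property (illustrative rule)."""
    # Names under which the property can be addressed in the configuration.
    candidates = [f"{object_type}.{property_name}"]
    candidates += [f"{object_type}.{tag}" for tag in tags]
    matching = [
        boost
        for pattern, boost in config.items()
        if any(fnmatchcase(candidate, pattern) for candidate in candidates)
    ]
    # Assumption: the highest matching boost wins; unboosted properties get 1.
    return max(matching, default=1)


print(json.dumps(fields_boost))  # value to supply for "fieldsBoost"
print(effective_boost("asset", "name", [], fields_boost))                # 42
print(effective_boost("asset", "caption", ["#boost10"], fields_boost))   # 10
print(effective_boost("asset", "photographer", [], fields_boost))        # 1
```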

Configuration specific to your project

In order to match the full-text search to your needs in the best way possible, you can fine-tune your search during your project implementation stage.

The table below shows current configurations, and you can ask your project manager to explain to you how your Wedia is set up.

Project configuration ("X" marks the option selected for the project):

  • Default search:
    [X] OR (Google-like search: a search for "blue boat" returns documents containing "blue", documents containing "boat", and documents containing both).
    [ ] AND (strict search: a search for "blue boat" returns only documents containing both "blue" and "boat").

  • Limiting search languages by user:
    [X] By contributed language (if the user has the right to contribute files in English and French, searches will be made in English and French fields, but not in Spanish ones).
    [ ] Without restriction (users can search in all languages).
    [ ] Restricted to the interface language (users with a Spanish interface may only search for texts translated into Spanish).
    [ ] Other (the rule is customized in the planning stage).

  • Customizing analyzers:
    [X] Analyzers (stop words, stemmers, etc.) are not customized.
    [ ] Analyzers have been customized in the project stage.