Sinequa's recent announcement of the availability of its enterprise search application on Azure focused heavily on the ability of cloud services to cope with the challenges of building and maintaining a search index. In this column I want to outline how a search index is created using a content processing pipeline, and touch on a few index management issues.
The Role of an Inverted File
The underlying technology of search is referred to as an inverted file. This is a list, not a database. The easiest way to think of it is as the index of a book: each entry points to the places where it appears. Every content item is given a number and is then broken into pieces called tokens, which are not the same thing as words. The key to compiling an index is the content processing pipeline. Natural language processing (NLP) plays a key role here: extracting entities, disambiguating personal names and normalizing dates are just a few examples. PDF documents, PowerPoint presentations and tables or Excel spreadsheets all need careful curation, as do metadata tags.
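To make the book-index analogy concrete, here is a minimal sketch of an inverted file in Python. It is an illustration under simplifying assumptions, not how any particular search product builds its index: the function names `tokenize` and `build_inverted_index` are hypothetical, the sample documents are invented, and the tokenization rule (lowercase, split on anything that is not a letter or digit) stands in for the far richer content processing pipeline described above.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    This is a deliberately naive rule (an assumption for illustration);
    a real pipeline would also handle entities, names, dates and so on.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each token to the sorted list of document numbers containing it."""
    postings = defaultdict(set)
    # Every content item is given a number, starting at 1.
    for doc_id, text in enumerate(documents, start=1):
        for token in tokenize(text):
            postings[token].add(doc_id)
    return {token: sorted(doc_ids) for token, doc_ids in postings.items()}


# Invented sample content items, purely for illustration.
documents = [
    "Search index on Azure",
    "Building a search index",
    "Azure cloud services",
]

index = build_inverted_index(documents)
print(index["search"])
print(index["azure"])
```

Looking up a token is then a single dictionary access that returns the numbers of the content items containing it, much as a book index returns page numbers.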