API reference

This reference summarizes the data types, functions, and operators that the extension provides.

The following data type is used to store token IDs and their respective texts:

Data type	Description
bm25vector	A specialized data type that stores a representation of token IDs and their respective text for BM25 indexing. It is designed to work efficiently with the BM25 ranking algorithm.

The following functions initialize a tokenizer and convert text into token IDs:

Component	Description
create_tokenizer(tokenizer_name text, config jsonb)	Initializes a tokenizer with the specified name and configuration. The configuration can include parameters such as the pre-trained model to use (e.g., BERT).
tokenize(string_text text, tokenizer_name text)	Converts the input text into a `bm25vector` using the specified tokenizer. The function processes the text and generates the token IDs and frequencies needed for BM25 indexing.
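
As a minimal sketch of how the two functions might be combined, the statements below create a tokenizer and then tokenize a sentence with it. The tokenizer name `doc_tokenizer` is hypothetical, and the `model` key and its value are assumptions made for illustration; check the keys that the configuration actually accepts before relying on them.

```sql
-- Hypothetical tokenizer name. The "model" key and its value are assumptions,
-- not documented parameters.
SELECT create_tokenizer('doc_tokenizer', '{"model": "bert_base_uncased"}'::jsonb);

-- Convert a piece of text into a bm25vector with the tokenizer created above.
SELECT tokenize('PostgreSQL is a powerful open-source database', 'doc_tokenizer');
```

The same `tokenize` call can be used to populate a `bm25vector` column from a text column, so that documents and queries are tokenized by the same tokenizer.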

The following functions and operators are used to calculate document relevance:

Component	Description
to_bm25query(index_name regclass, query_vector bm25vector)	Converts a `bm25vector` into a `bm25query` format. Requires the index name to ensure the query is compatible with the index structure.
`<&>`	A binary operator that takes a `bm25vector` (the document) and a `bm25query` (the query) and returns the BM25 score as a `float4`. The score is negative, and more negative values indicate higher relevance, so sorting by the score in ascending order returns the most relevant documents first.
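
The following sketch shows how `to_bm25query` and `<&>` might fit together in a ranking query. It assumes that a table `documents` with a `bm25vector` column named `embedding`, a BM25 index named `documents_embedding_bm25` on that column, and a tokenizer named `doc_tokenizer` already exist. All of these names are hypothetical.

```sql
-- Assumes (hypothetical names): table documents(id, passage, embedding bm25vector),
-- a BM25 index documents_embedding_bm25 on embedding, and a tokenizer doc_tokenizer.
SELECT id,
       passage,
       embedding <&> to_bm25query(
           'documents_embedding_bm25',
           tokenize('open-source database', 'doc_tokenizer')
       ) AS score
FROM documents
ORDER BY score   -- ascending: the most negative (most relevant) scores come first
LIMIT 10;
```

Ordering in ascending order is deliberate here: because the operator returns negated scores, the default sort direction already places the best matches at the top.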