API reference
This reference describes the key data types, functions, and operators that the extension provides.
Data types
The following data type is used to store token IDs and their respective texts:
| Data type | Description |
|---|---|
| bm25vector | A specialized data type that stores a representation of token IDs and their respective text for BM25 indexing. It is designed to work efficiently with the BM25 ranking algorithm. |
Tokenization functions
The following functions are used to initialize a tokenizer and convert text into token IDs:
| Component | Description |
|---|---|
| create_tokenizer(tokenizer_name text, config jsonb) | Initializes a tokenizer with the specified name and configuration. The configuration can include parameters such as the pre-trained model to use (e.g., BERT). |
| tokenize(string_text text, tokenizer_name text) | Converts a string text into a bm25vector format using the specified tokenizer. This function processes the input text and generates the necessary token IDs and frequencies for BM25 indexing. |
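To make "token IDs and frequencies" concrete, here is a minimal Python sketch. It is not the extension's implementation: the whitespace tokenizer, the toy vocabulary, and the helper names `build_vocab` and `toy_tokenize` are all hypothetical, and the internal layout of bm25vector may differ.

```python
from collections import Counter


def build_vocab(corpus):
    """Assign an integer ID to each distinct lowercase token (toy vocabulary)."""
    vocab = {}
    for text in corpus:
        for token in text.lower().split():
            # setdefault keeps the first ID assigned to a token.
            vocab.setdefault(token, len(vocab))
    return vocab


def toy_tokenize(text, vocab):
    """Return {token_id: frequency} for the tokens found in the vocabulary."""
    ids = [vocab[t] for t in text.lower().split() if t in vocab]
    return dict(Counter(ids))


corpus = ["postgres is a database", "a database stores data"]
vocab = build_vocab(corpus)
vector = toy_tokenize("a database is a database", vocab)
```

Here `vector` maps each token ID to the number of times that token occurs in the input, which is the kind of information a BM25 index needs for each document.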
Search and scoring operators
The following functions and operators are used to calculate document relevance:
| Component | Description |
|---|---|
| to_bm25query(index_name regclass, query_vector bm25vector) | Converts a bm25vector into a bm25query format. Requires the index name to ensure the query is compatible with the index structure. |
| <&> | A binary operator that takes a bm25vector and a bm25query and returns a float4 score. The score is negative, where more negative values indicate higher relevance. This operator is used to calculate the BM25 score for a given document against a query. |
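The sign convention described above can be illustrated with a small Python sketch. It uses a common BM25 formulation with conventional parameter values (`k1 = 1.2`, `b = 0.75`) and a plain whitespace tokenizer. These are assumptions made for illustration, and the extension's exact formula and parameters may differ. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Return one negated BM25 score per document (more negative = more relevant)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        # Negate so that ascending order puts the best match first.
        scores.append(-score)
    return scores


docs = [
    "postgres is a relational database",
    "bm25 ranks documents by relevance",
    "postgres supports bm25 ranking in postgres extensions",
]
scores = bm25_scores(docs, "postgres bm25")
ranked = sorted(range(len(docs)), key=lambda i: scores[i])
```

Because the scores are negated, sorting in ascending order places the third document, which matches both query terms, first. The same reasoning applies when ordering query results by a negative score.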