
Product Quantization (PQ)

Posted: Sun Jan 26, 2025 5:05 am
by Fgjklf
Another way to create an index is product quantization (PQ), a lossy compression technique for high-dimensional vectors. It splits the original vector into smaller chunks, simplifies each chunk by assigning it a representative "code", and then puts the chunks back together, without losing the information that is vital for similarity operations. The PQ process can be broken down into four steps: splitting, training, encoding, and querying.

Product quantization
Splitting: Vectors are divided into segments.
Training: We build a "codebook" for each segment. Simply put, the algorithm generates a pool of potential "codes" that could be assigned to a vector. In practice, this codebook is composed of the center points of the clusters created by performing k-means clustering on each segment of the vector. The segment's codebook has as many values as the k we used for k-means clustering.
Encoding: The algorithm assigns a specific code to each segment. In practice, once training is complete, we find the codebook value closest to each vector segment; the PQ code for the segment is the identifier of that value. We can also use more than one PQ code per segment, meaning several codebook values can represent each segment.
Querying: When we query, the algorithm decomposes the query vector into subvectors and quantizes them using the same codebooks. It then uses the indexed codes to find the vectors closest to the query vector.
The number of representative vectors in the codebook is a trade-off between the accuracy of the representation and the computational cost of searching the codebook: more representative vectors give a more accurate representation of the vectors in the subspace but make the codebook more expensive to search, while fewer representative vectors give a coarser representation at a lower cost.
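To make the four steps concrete, here is a minimal Python sketch of PQ, assuming NumPy and scikit-learn are available. The segment count M, codebook size K, and all function names are illustrative assumptions, not a production implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

M, K = 8, 256  # illustrative: 8 segments, 256 codes per segment codebook

def train_codebooks(vectors, m=M, k=K):
    # Splitting + training: run k-means on each segment; the centroids
    # become that segment's codebook (dim must be divisible by m).
    segments = np.split(vectors, m, axis=1)
    return [KMeans(n_clusters=k, n_init=10).fit(seg).cluster_centers_
            for seg in segments]

def encode(vectors, codebooks):
    # Encoding: replace each segment with the id of its nearest centroid.
    codes = []
    for seg, book in zip(np.split(vectors, len(codebooks), axis=1), codebooks):
        d2 = ((seg[:, None, :] - book[None, :, :]) ** 2).sum(axis=2)
        codes.append(d2.argmin(axis=1))
    return np.stack(codes, axis=1)  # shape (n_vectors, m), small integers

def search(q, codes, codebooks, top=5):
    # Querying: precompute squared distances from each query subvector to
    # every centroid, then sum table lookups per stored code (no decoding).
    tables = [((book - q_seg) ** 2).sum(axis=1)
              for q_seg, book in zip(np.split(q, len(codebooks)), codebooks)]
    dist2 = sum(table[codes[:, i]] for i, table in enumerate(tables))
    return np.argsort(dist2)[:top]  # indices of the closest stored vectors
```

With these illustrative numbers, a 64-dimensional float32 vector (256 bytes) compresses to 8 one-byte codes, which is where the memory savings come from.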

Locality Sensitive Hashing (LSH)
Locality-Sensitive Hashing (LSH) is an indexing technique for approximate nearest neighbor search. It is optimized for speed while still providing an approximate, non-exhaustive result. LSH maps similar vectors into "buckets" using a set of hash functions.


To find the nearest neighbors of a given query vector, we apply the same hash functions that were used to group similar vectors into the hash tables. The query vector hashes to a particular bucket and is then compared to the other vectors in that same bucket to find the closest matches. This method is much faster than searching the entire dataset, because each bucket holds far fewer vectors than the whole space.

It is important to remember that LSH is an approximate method and the quality of the approximation depends on the properties of the hash functions. In general, the more hash functions used, the better the quality of the approximation. However, using a large number of hash functions can be computationally expensive and may not be feasible for large data sets.
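Below is a minimal sketch of one common LSH family, random hyperplanes for cosine similarity, in Python with NumPy. The class name, the number of planes, and the single-table design are illustrative assumptions; real systems typically use several tables to raise recall.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

class CosineLSH:
    """Random-hyperplane LSH: vectors pointing in similar directions tend
    to land in the same bucket, since they fall on the same side of each
    random hyperplane."""
    def __init__(self, dim, n_planes=12):
        self.planes = rng.normal(size=(n_planes, dim))
        self.buckets = defaultdict(list)

    def _key(self, v):
        # One bit per hyperplane: which side of the plane v falls on.
        return tuple((self.planes @ v) > 0)

    def add(self, idx, v):
        self.buckets[self._key(v)].append((idx, v))

    def query(self, q, top=5):
        # Compare only against vectors in q's bucket, not the whole dataset.
        candidates = self.buckets.get(self._key(q), [])
        candidates.sort(key=lambda iv: -np.dot(iv[1], q) /
                        (np.linalg.norm(iv[1]) * np.linalg.norm(q) + 1e-9))
        return [idx for idx, _ in candidates[:top]]
```

Each extra hyperplane doubles the number of possible buckets, making buckets smaller (faster search) but raising the chance that true neighbors get split apart, which mirrors the trade-off described above.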

Hierarchical Navigable Small World (HNSW)
The HNSW algorithm is a method used to perform efficient searches in high-dimensional vector databases. It was proposed as an alternative to binary search trees and k-d trees, which perform poorly in high-dimensional spaces.

The HNSW algorithm is based on the idea of building a hierarchical structure that organizes data into different levels. Each level is a graph whose nodes are connected to each other, and these nodes represent the data points in the vector database.
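The way a query travels through this hierarchy can be illustrated with a toy greedy descent. This is only the core idea, not full HNSW, which also keeps a candidate list (commonly called ef) at each layer; the data layout here is an assumption made for the sketch.

```python
import numpy as np

def greedy_search(layers, points, entry, query):
    # layers: adjacency dicts {node_id: [neighbor_ids]}, top (sparse) first;
    # points: {node_id: vector}; entry: id of the entry node in the top layer.
    current = entry
    for graph in layers:
        improved = True
        while improved:  # hill-climb: hop to any neighbor closer to the query
            improved = False
            for nb in graph.get(current, []):
                if (np.linalg.norm(points[nb] - query)
                        < np.linalg.norm(points[current] - query)):
                    current, improved = nb, True
        # then drop to the next, denser layer, starting from the best node found
    return current  # approximate nearest neighbor at the bottom layer
```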

The construction process of the HNSW algorithm consists of the following steps:

An initial graph is created with a node representing a data point randomly selected from the database.
More nodes are then added to this graph. Each new node is connected to a set of existing nodes, using a proximity function that measures the similarity between the feature vectors of the data points.
The process of adding nodes is repeated until the graph at that level is fully built.
The next level of the hierarchy is then created, using a neighbor selection function to connect nodes at one level to those at the level below. This function keeps the higher levels sparse and hierarchical, which allows fast searching through the hierarchy.
The level-building process is repeated until the lowest level is reached, which contains all the data points and is the most densely connected graph in the hierarchy.
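In practice, rather than building these graphs by hand, a library such as hnswlib can be used. A minimal sketch follows; the parameter values are illustrative.

```python
import numpy as np
import hnswlib

dim, n = 128, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)  # "cosine" and "ip" also supported
index.init_index(max_elements=n, ef_construction=200, M=16)  # M = links per node
index.add_items(data, np.arange(n))

index.set_ef(50)  # search-time candidate-list size: higher = more accurate, slower
labels, distances = index.knn_query(data[:5], k=3)  # 3 nearest neighbors each
```

Here M controls how many links each node keeps (graph density), while ef_construction and ef control the candidate-list size at build and query time, which is the main accuracy/speed knob.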