VectorStore DB settings
Sparse vs Dense Index
Dense Index
- What it is:
Dense indexing is used for vectors where nearly every element carries a value; these represent continuous, high‑dimensional embeddings, typically produced by neural networks (a short sketch follows this list).
- Usage:
Ideal for applications where semantic meaning is captured in every dimension of the vector. For example, text embeddings where each number contributes some notion of context or meaning.
- Analogy:
Imagine a dense index as a full‑colour image where every pixel holds a piece of the picture.
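As a concrete, minimal sketch: dense embeddings can be produced with an off‑the‑shelf text embedder. The sentence-transformers library and the all-MiniLM-L6-v2 model below are illustrative choices, not something this document prescribes.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Illustrative model choice: a small general-purpose text embedder.
model = SentenceTransformer("all-MiniLM-L6-v2")

embedding = model.encode("Vector databases store embeddings for similarity search.")

# The result is a dense vector: nearly every element is non-zero.
print(embedding.shape)  # (384,) for this particular model
print(embedding[:5])    # real-valued entries such as [ 0.01, -0.08, ... ]
```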
Why Dense Indexes for LLM Embeddings?
- Continuous, high‑dimensional space
Neural embedders output real‑valued vectors (e.g. [0.12, –0.03, 1.27, …]) in which every dimension carries subtle semantic signals.
- Smooth similarity landscape
Nearby points in this space interpolate meaning smoothly, e.g. vec("king") – vec("man") + vec("woman") ≈ vec("queen"). Approximate nearest‑neighbour (ANN) structures like HNSW or IVF‑PQ are optimised for these dense vectors (a minimal HNSW sketch follows this list).
- Semantic arithmetic
Real‑valued dimensions support vector arithmetic and permit fine‑grained semantic shifts along learned axes (gender, tense, topic, …).
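Since HNSW comes up above as a typical ANN structure for dense vectors, here is a minimal sketch using the hnswlib library. The random vectors stand in for real model embeddings, and the parameters (M, ef_construction, ef) are illustrative values rather than tuned recommendations.

```python
# pip install hnswlib numpy
import hnswlib
import numpy as np

dim, num_elements = 128, 1000

# Stand-in for real embeddings: random dense vectors.
data = np.random.rand(num_elements, dim).astype(np.float32)

# Build an HNSW index using cosine distance.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# Query: find the 5 approximate nearest neighbours of one vector.
index.set_ef(50)  # recall vs. query-speed trade-off
labels, distances = index.knn_query(data[0], k=5)
print(labels, distances)
```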
Sparse Index
- What it is:
Sparse indexing is designed for vectors that contain many zero (or near‑zero) values, with only a few non‑zero entries carrying information. This is common in representations like bag‑of‑words or TF‑IDF (see the sketch after this list).
- Usage:
Best for scenarios where only a handful of discrete features matter, e.g. keyword matching in search engines.
- Analogy:
Think of a sparse index as a dot‑to‑dot drawing where only specific points matter and the majority of the canvas remains blank.
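To see the sparsity directly, here is a minimal sketch using scikit-learn's TfidfVectorizer on three invented toy documents. Each row of the resulting matrix is a sparse vector in which only the terms present in that document are non‑zero.

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "vector databases index embeddings",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # a SciPy sparse matrix

# Most entries are zero: only terms that occur in a document are stored.
print(X.shape)  # (3, vocabulary size)
print(f"non-zero entries: {X.nnz} of {X.shape[0] * X.shape[1]}")
```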
Similarity Metrics
When querying vector databases like Pinecone, you choose a metric that determines how similarity between vectors is computed.
| Metric | Description | When to Use | Example |
|---|---|---|---|
| Cosine | Computes the cosine of the angle between two vectors. Emphasises direction rather than magnitude, useful when vectors are normalised. | When you care more about orientation or semantic similarity regardless of scale. | Comparing sentence embeddings to find articles about the same topic, irrespective of length or word count. |
| Euclidean | Measures the straight‑line (L2‑norm) distance between two vectors. Sensitive to magnitude differences. | When absolute distance is important, as with spatial coordinates or raw feature‑space distances. | Locating the nearest stores to a customer on a map (latitude/longitude embeddings). |
| Dot Product | Calculates the inner product of two vectors, combining magnitude and direction. Identical to cosine similarity when vectors are unit‑normalised. | When vector magnitude carries meaning (e.g. popularity, confidence). | Recommending products where popularity (magnitude) and similarity both matter; higher‑rated items get a bigger boost. |
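The three metrics in the table reduce to one‑liners in code. The sketch below uses NumPy on two toy vectors purely to show how the formulas differ; note that once vectors are unit‑normalised, dot product and cosine produce the same ranking.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])

# Dot product: combines magnitude and direction.
dot = np.dot(a, b)

# Cosine similarity: dot product of the unit-normalised vectors (direction only).
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line (L2) distance, sensitive to magnitude.
euclidean = np.linalg.norm(a - b)

print(f"dot={dot:.3f} cosine={cosine:.3f} euclidean={euclidean:.3f}")
```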
Dimension
- Definition:
The dimension of a vector refers to the number of elements (or features) it contains. For example, a 256‑dimensional embedding has 256 features.
- Impact of Dimension Variations:
- Lower dimension (e.g. 256):
Captures less detail but is more computationally efficient and less storage intensive.
- Higher dimension (e.g. 1024):
Captures more nuanced features, potentially improving accuracy, at the cost of more storage and compute and the risk of the curse of dimensionality.
- AWS Titan Text Embeddings V2 Example:
Choosing between 256, 512 or 1024 dimensions is a trade‑off (see the sketch at the end of this section):
- 256‑dim: Faster, lighter, but may miss subtle signals.
- 1024‑dim: Richer semantics, but heavier on resources.
- Analogy:
Consider dimension as the number of pixels in an image: more pixels (higher dimension) provide a higher resolution, while fewer pixels yield a simpler, coarser image.
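Finally, to show where the dimension choice actually surfaces in practice, here is a minimal sketch of requesting a 256‑dimensional embedding from Titan Text Embeddings V2 on Amazon Bedrock. It assumes boto3 is installed with credentials and model access configured, the region is an illustrative choice, and the request fields (inputText, dimensions, normalize) follow the Titan V2 request schema.

```python
# pip install boto3  (assumes AWS credentials and Bedrock model access are set up)
import json

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "inputText": "Vector dimension is a trade-off between detail and cost.",
    "dimensions": 256,   # Titan V2 supports 256, 512 or 1024
    "normalize": True,   # unit-normalise, convenient for cosine / dot product
})

response = client.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=body,
)

result = json.loads(response["body"].read())
embedding = result["embedding"]
print(len(embedding))  # 256
```

Swapping `"dimensions": 256` for 512 or 1024 is all it takes to move along the trade‑off described above: the same text yields a richer but heavier vector, and every stored record and query grows with it.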