Embeddings and Semantic Search

Embeddings convert text into vectors so machines can compare meaning using math.

Why embeddings are useful

Find relevant content even when the query and the document use different words.
Support natural language search in large document sets.
Power recommendation and clustering systems.

flowchart LR
    A[Query] --> B[Query embedding]
    C[Document chunks] --> D[Chunk embeddings]
    B --> E[Vector similarity]
    D --> E
    E --> F[Top matches]
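
The flow above can be sketched in a few lines. This is a minimal illustration, not a production retriever: the 3-dimensional vectors, the chunk ids, and the helper names `normalize` and `top_matches` are all made up for the example, and a real system would get its embeddings from a model and usually store them in a vector index.

```python
import heapq
import math

def normalize(vec):
    """Scale a nonzero vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_matches(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks whose embeddings are most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for chunk_id, vec in chunk_vecs.items():
        c = normalize(vec)
        score = sum(a * b for a, b in zip(q, c))
        scored.append((score, chunk_id))
    best = heapq.nlargest(k, scored)  # highest similarity first
    return [chunk_id for _, chunk_id in best]

# Hypothetical embeddings, chosen by hand for illustration only.
CHUNK_VECS = {
    "cloud-spend": [0.9, 0.1, 0.0],
    "hiring-guide": [0.0, 0.2, 0.9],
    "budget-review": [0.7, 0.3, 0.1],
}
QUERY_VEC = [1.0, 0.2, 0.0]  # stands in for the embedding of a cost-related query

print(top_matches(QUERY_VEC, CHUNK_VECS, k=2))
```

With these toy numbers the two cost-related chunks rank ahead of the hiring guide.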

Keyword vs semantic search

Keyword: looks for exact words.
Semantic: looks for meaning similarity.

Example: the query "reduce infra cost" can match a document about "cloud spend optimization" even though the two share no words.
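
A small check makes the keyword side of this concrete. The sketch below assumes the simplest possible keyword matcher (lowercase, whitespace tokenization, shared-word test); the function name `keyword_overlap` is hypothetical, and real keyword engines add stemming and ranking on top of this.

```python
def keyword_overlap(query, document):
    """Return the set of lowercase words that appear in both the query and the document."""
    return set(query.lower().split()) & set(document.lower().split())

QUERY = "reduce infra cost"
DOC = "cloud spend optimization"

shared = keyword_overlap(QUERY, DOC)
print(shared)       # no words in common
print(bool(shared)) # an exact-word matcher would therefore miss this document
```

Because the overlap is empty, only a meaning-based comparison of embeddings can connect the two phrases.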

One-hot encoding limitations

Historically, each vocabulary word mapped to a sparse vector of length |V| with a single 1. Problems:

Extreme sparsity: a 100k-word vocabulary needs 100k-dimensional vectors, each all zeros except for a single 1.
No semantic similarity: the dot product between any two distinct one-hot vectors is 0, so “cat” is as far from “dog” as from “refrigerator”.
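
Both problems show up even with a three-word vocabulary. The sketch below is only an illustration; `VOCAB` and the helper names are chosen for the example.

```python
import math

VOCAB = ["cat", "dog", "refrigerator"]

def one_hot(word, vocab=VOCAB):
    """Return a vector of length |V| with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat, dog, fridge = (one_hot(w) for w in VOCAB)

# Every pair of distinct words has dot product 0 ...
print(dot(cat, dog), dot(cat, fridge))
# ... and exactly the same Euclidean distance, so no pair is "closer" than another.
print(euclidean(cat, dog) == euclidean(cat, fridge))
```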

Distributed embeddings (Word2Vec-style)

Dense vectors (often 100–300 dims) encode latent features learned from context. Words that appear in similar contexts end up close together in the vector space, enabling analogies like king − man + woman ≈ queen.
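
The analogy arithmetic can be demonstrated with hand-built vectors. Note the assumption: the 2-D vectors below are invented so that one axis means "royalty" and the other "gender"; learned embeddings have hundreds of dimensions with no such clean labels, and the analogy holds only approximately. The names `VECS`, `cosine`, and `analogy` are illustrative.

```python
import math

# Hand-made 2-D vectors with axes (royalty, gender). NOT learned embeddings.
VECS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vecs=VECS):
    """Solve a - b + c by nearest cosine neighbour, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vecs[w]))

print(analogy("king", "man", "woman"))
```

Excluding the input words from the candidates is a common convention, since the nearest vector to the result is otherwise often one of the inputs.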

Cosine similarity vs Euclidean distance

Cosine similarity is the cosine of the angle between two vectors, so it compares direction and ignores magnitude:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

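We prefer cosine over raw Euclidean distance when word frequency stretches vector magnitude: two vectors that point the same way stay maximally similar under cosine even if one is much longer. We avoid tangent-based metrics because tan(θ) grows without bound as vectors approach orthogonality (θ → 90°).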
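
A short comparison shows the difference. The vectors and their names (`rare`, `frequent`, `other`) are made up for illustration, on the assumption that a frequent word's vector has been stretched to ten times the length of a rare word's vector with the same direction.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| |B|); assumes both vectors are nonzero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rare = [1.0, 2.0]
frequent = [10.0, 20.0]  # same direction as `rare`, 10x the magnitude
other = [2.0, 1.0]       # different direction, same magnitude as `rare`

# Cosine treats `rare` and `frequent` as (approximately) identical in direction.
print(cosine_similarity(rare, frequent), cosine_similarity(rare, other))
# Euclidean distance instead reports `other` as the closer vector.
print(euclidean_distance(rare, frequent), euclidean_distance(rare, other))
```

If all vectors are first scaled to unit length, the two measures rank neighbours identically, since the squared Euclidean distance then equals 2 − 2·cos(θ).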

Best practices

Use a good chunking strategy (chunk size plus overlap).
Store metadata alongside each chunk so results can be filtered.
Re-rank the top results before passing them to the LLM.
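
As one way to picture "size + overlap", here is a minimal word-based chunker. It is a sketch under simple assumptions: text is split on whitespace, sizes are counted in words rather than tokens, and the function name `chunk_words` and its defaults are hypothetical. Real pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text, size=5, overlap=2):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks

TEXT = "one two three four five six seven eight"
for chunk in chunk_words(TEXT, size=5, overlap=2):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.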