Embeddings and Semantic Search

Embeddings convert text into vectors so machines can compare meaning using math.

Why embeddings are useful

flowchart LR
    A[Query] --> B[Query embedding]
    C[Document chunks] --> D[Chunk embeddings]
    B --> E[Vector similarity]
    D --> E
    E --> F[Top matches]
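
The pipeline in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not a production retriever: the three-dimensional vectors are hand-made stand-ins for real model output, and the names `cosine`, `top_matches`, `chunk_vecs`, and `query_vec` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query_vec, chunk_vecs, k=2):
    """Score every chunk against the query and return the k best (name, score) pairs."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Assumption: hand-made 3-d vectors standing in for embeddings from a real model.
chunk_vecs = {
    "cloud spend optimization": [0.9, 0.1, 0.0],
    "office party planning": [0.0, 0.2, 0.9],
    "server rightsizing guide": [0.7, 0.3, 0.1],
}
query_vec = [0.8, 0.2, 0.0]  # pretend embedding of "reduce infra cost"

results = top_matches(query_vec, chunk_vecs, k=2)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

With these toy vectors the two infrastructure-related chunks rank above the unrelated one. In a real system the vectors would come from an embedding model, and the linear scan would usually be replaced by a vector index.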

Keyword vs semantic search

Keyword: looks for exact words.
Semantic: looks for meaning similarity.

Example: the query "reduce infra cost" can match the document "cloud spend optimization" even though the two share no words.
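
A quick way to see why keyword matching misses this pair is to count shared words. The sketch below uses a deliberately naive whitespace tokenizer, and `keyword_overlap` is a hypothetical helper, not part of any search library.

```python
def keyword_overlap(query, doc):
    """Return the set of lowercase words that appear in both strings."""
    return set(query.lower().split()) & set(doc.lower().split())

query = "reduce infra cost"
doc = "cloud spend optimization"

shared = keyword_overlap(query, doc)
print(shared)  # no words in common, so an exact-word search finds nothing
```

A semantic search would instead compare the embeddings of the two strings, which can be close even when the overlap is empty.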

One-hot encoding limitations

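Historically, each vocabulary word mapped to a sparse vector of length |V| with a single 1. Problems: the vectors grow with vocabulary size and are almost entirely zeros, and any two distinct words are orthogonal, so the encoding says nothing about which words are similar.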
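
A small sketch makes the one-hot scheme concrete. The four-word vocabulary and the helpers `one_hot` and `dot` are made up for illustration; the point is that the dot product between any two different words is always zero, so "cat" looks no closer to "dog" than to "car".

```python
def one_hot(word, vocab):
    """Return a vector of length |V| with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Assumption: a toy vocabulary; real vocabularies have tens of thousands of entries.
vocab = ["cat", "dog", "car", "truck"]
cat, dog, car = (one_hot(w, vocab) for w in ("cat", "dog", "car"))

print(cat)            # vector length equals the vocabulary size
print(dot(cat, dog))  # 0: related words look unrelated
print(dot(cat, car))  # 0: unrelated words look exactly the same
```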

Distributed embeddings (Word2Vec-style)

Dense vectors (often 100–300 dims) encode latent features learned from context. Similar words cluster in space, enabling analogies like king − man + woman ≈ queen.
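
The analogy arithmetic can be illustrated with hand-picked two-dimensional vectors. These are not learned embeddings: the axes (roughly "royalty" and "gender") and all values are assumptions chosen so that the arithmetic works out cleanly, and the extra words are distractors.

```python
import math

# Assumption: hand-picked vectors, axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.9, 0.8],
    "apple": [-0.5, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Exclude the three input words, the usual convention in analogy evaluations.
candidates = {w: v for w, v in vectors.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)
```

With learned embeddings the match is only approximate, which is why the nearest neighbor by cosine similarity is taken rather than an exact equality.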

Cosine similarity vs Euclidean distance

Cosine similarity measures the cosine of the angle between two vectors, so it depends only on direction and normalizes away length and frequency skew:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

We prefer cosine over raw Euclidean distance when word frequency stretches vector magnitude. We avoid tangent-based metrics because tan(θ) grows without bound as the vectors approach orthogonality (θ → 90°).
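
The difference between the metrics shows up with two vectors that point the same way but differ in length. The sketch below uses made-up values; `cosine` and `euclidean` are illustrative helpers written from the standard definitions.

```python
import math

def cosine(a, b):
    """cos(theta) = (A . B) / (|A| |B|)"""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0]
b = [10.0, 20.0]  # same direction as a, ten times the magnitude

cos_ab = cosine(a, b)
dist_ab = euclidean(a, b)
print(f"cosine    = {cos_ab:.3f}")   # direction is identical
print(f"euclidean = {dist_ab:.3f}")  # magnitude gap dominates

# The tangent of the angle blows up as vectors approach orthogonality.
tangents = [math.tan(math.radians(deg)) for deg in (80.0, 89.0, 89.9)]
print([round(t, 1) for t in tangents])
```

Cosine treats the two vectors as identical in meaning, while Euclidean distance reports them as far apart purely because of scale.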

Best practices