Retrieval Systems AI

I was building a docs assistant and kept seeing two-tower retrieval and RAG used as if they were the same thing.
They are related, but they sit at different layers of the stack, so this is the quick mental model I use.

1. Two-tower embedding model (dual encoder)

# Query tower and document tower produce vectors in the same space.
q_vec = query_tower.encode("how to reset password")
doc_vecs = document_tower.encode_batch(documents)

# Fast retrieval via vector similarity.
scores = cosine_similarity(q_vec, doc_vecs)
top_k_docs = top_k(documents, scores, k=5)

A two-tower model is about efficient retrieval at scale.
I can precompute document embeddings once, then only encode the query at runtime and run nearest-neighbor search.
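As a concrete sketch of that precompute-then-search split (a minimal example assuming sentence-transformers and numpy; the model name and sample documents are illustrative, and a single bi-encoder stands in for both towers):

# Minimal sketch: precompute document vectors offline, encode only the query online.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

documents = [
    "Reset your password from the account settings page.",
    "Rotate storage keys in the portal under the security settings.",
]

# Offline: compute and store document embeddings once.
doc_vecs = model.encode(documents, normalize_embeddings=True)

# Online: encode the query, then run nearest-neighbor search.
q_vec = model.encode("how to reset password", normalize_embeddings=True)
scores = doc_vecs @ q_vec            # cosine similarity (vectors are normalized)
top_idx = np.argsort(-scores)[:5]    # indices of the best-matching documents
top_docs = [documents[i] for i in top_idx]

At larger scale, the brute-force dot product would be replaced by an approximate nearest-neighbor index, but the shape of the computation stays the same.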

Typical use cases:

  1. Search candidate retrieval
  2. Recommendations (products, ads, videos)
  3. First-stage ranking over very large corpora

Docs: Vector search overview (Azure AI Search)

2. RAG (Retrieval-Augmented Generation)

# 1) Retrieve relevant chunks
chunks = retriever.search("how do I rotate storage keys?", k=5)

# 2) Ground the prompt with retrieved context
prompt = build_prompt(question="how do I rotate storage keys?", context=chunks)

# 3) Generate answer with an LLM
answer = llm.generate(prompt)

RAG is a system pattern: retrieve context first, then generate an answer grounded in that context.
It improves factuality and makes answers reflect your own documents instead of relying only on what the model learned during pretraining.
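The grounding step is just prompt assembly. A minimal sketch of a build_prompt like the one in the snippet above could look like this (the wording and chunk formatting are my own illustration, not a fixed recipe):

# Sketch of prompt assembly: number the retrieved chunks and ask the model
# to answer only from them.
def build_prompt(question: str, context: list[str]) -> str:
    context_block = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(context)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )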

Typical use cases:

  1. Chatbots that answer questions over unstructured data:
     • Internal knowledge chat
     • Question answering over private docs
  2. Support assistants grounded in current company data

Docs: Retrieval Augmented Generation (RAG) in Azure AI Search

3. How they fit together

User query
-> Two-tower retrieval (fast candidate search)
-> Optional reranker
-> RAG prompt assembly
-> LLM answer

A practical setup is to use a two-tower model for fast retrieval and then pass the best chunks into a RAG pipeline.
So: two-tower is mostly about finding relevant content quickly, and RAG is about using that content to produce better answers.
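Here is a rough end-to-end sketch of that combined flow, reusing the illustrative pieces from the earlier snippets (model, doc_vecs, documents, and build_prompt are the hypothetical objects defined above; llm.generate stands in for whatever LLM client you actually use):

# Sketch: two-tower retrieval feeding a RAG pipeline.
import numpy as np

def answer(question: str, k: int = 5) -> str:
    # 1) Two-tower retrieval: encode the query, score against precomputed vectors.
    q_vec = model.encode(question, normalize_embeddings=True)
    scores = doc_vecs @ q_vec
    chunks = [documents[i] for i in np.argsort(-scores)[:k]]

    # 2) Optional reranker would go here (e.g. a cross-encoder over the top chunks).

    # 3) RAG: ground the prompt in the retrieved chunks and generate.
    prompt = build_prompt(question=question, context=chunks)
    return llm.generate(prompt)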

Topic            Two-tower model                RAG
Main goal        Fast relevance retrieval       Grounded answer generation
Core output      Ranked docs/items              Natural language answer
Runtime pattern  Encode query + vector search   Retrieve + prompt + generate
Common pairing   Used as retriever              Uses retriever output

Docs: What is Azure AI Search? and Use embeddings and vector search