Single-Page Reference · Production GenAI Stack

GenAI Engineer cheatsheet

Everything on one page, from fundamentals through cloud platforms. Cloud blocks reflect 2026 naming (AgentCore, Microsoft Foundry, Gemini Enterprise Agent Platform) with the older names kept where interviewers still use them.

Foundations

LLM · core

LLM Fundamentals

A decoder-only transformer predicting the next token, autoregressively.

  • Lifecycle: pretraining (next-token on web scale) → SFT (instruction tuning) → alignment via RLHF or DPO.
  • Decoding knobs: temperature scales randomness, top-p (nucleus) caps cumulative prob mass, top-k caps candidate count.
  • Output is a probability distribution (logits → softmax); sampling strategy turns it into text.
  • MoE models route each token to a few expert sub-networks: more total parameters, but per-token compute similar to a much smaller dense model.
Interview gotcha: temperature=0 is near-deterministic, not guaranteed deterministic; kernels/hardware can still vary.
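The decoding knobs above can be sketched in plain Python. This is a minimal illustration, assuming raw logits arrive as a list of floats; `sample_next_token` and its parameters are hypothetical names, not any provider's API, and real inference stacks do this on tensors.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index from raw logits using temperature, top-k and top-p."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature rescales logits before softmax (lower = sharper distribution).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank candidates from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens

    if top_p is not None:
        kept, mass = [], 0.0
        for i in ranked:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:  # stop once cumulative probability reaches p
                break
        ranked = kept

    # Sample among the survivors; random.choices normalises the weights.
    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]
```

Passing `rng=random.Random(0)` makes this sketch reproducible, which is a property of the toy code only; it does not remove the kernel-level variation a hosted model can show.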
tokens · cost

Tokens & Tokenization

Models read tokens (subword units), not words or characters.

  • Built with BPE/subword algorithms — frequent strings become single tokens, rare words split.
  • Rule of thumb: ~0.75 words per token in English (~4 chars/token). Code & non-English inflate counts.
  • Tokens drive cost (priced per token) and limits (context = token budget, not word budget).
  • Input + output tokens both bill; output usually costs more.
Interview gotcha: JSON, whitespace and rare names burn tokens fast, which is why structured prompts can be pricier than they look.
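A rough cost estimate follows directly from the rules of thumb above. The sketch below assumes the ~4 characters per token heuristic and takes prices as arguments; the prices used in any call are placeholders, not a real provider's rate card, and an actual tokenizer should be used when accuracy matters.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Heuristic token count for English text (assumption: ~4 chars per token)."""
    return max(1, round(len(text) / chars_per_token))

def estimate_cost(input_tokens, output_tokens, in_price_per_million, out_price_per_million):
    """Cost of one call when input and output tokens are priced separately."""
    return (input_tokens * in_price_per_million
            + output_tokens * out_price_per_million) / 1_000_000

# Hypothetical prices: 1.00 per million input tokens, 4.00 per million output tokens.
prompt_tokens = estimate_tokens("a" * 4000)
cost = estimate_cost(prompt_tokens, 500, 1.00, 4.00)
```

Because output is usually priced higher, capping the response length is often a cheaper lever than shaving the prompt.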
context

Context Window

The max tokens (prompt + response) a model can attend to in one call.

  • Bigger ≠ free: attention cost scales ~quadratically with sequence length; latency & memory grow.
  • "Lost in the middle": models recall the start and end of long context better than the middle — order matters.
  • Manage it: summarize history, retrieve only relevant chunks, trim, or use a rolling/episodic memory.
  • Long-context ≠ replacement for RAG — retrieval still wins on cost, freshness, and provenance.
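One way to "trim" history is to keep the system prompt plus the most recent turns that fit a token budget. This is a minimal sketch: `trim_history` is a hypothetical helper, messages are assumed to be dicts with `role` and `content`, and the default token counter is the rough 4-characters-per-token heuristic rather than a real tokenizer.

```python
def trim_history(messages, budget_tokens, count_tokens=lambda text: max(1, len(text) // 4)):
    """Keep system messages, then the newest turns that still fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    used = sum(count_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if used + cost > budget_tokens:
            break  # older turns are dropped (or summarised elsewhere)
        kept.append(message)
        used += cost

    return system + list(reversed(kept))  # restore chronological order
```

A rolling summary would replace the dropped turns with one short message instead of discarding them.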
transformer · attention

Attention & Transformers

Self-attention lets every token weigh every other token.

  • Each token projects to Query, Key, Value; attention = softmax(Q·Kᵀ/√d_k)·V, where d_k is the key dimension.
  • Multi-head: several attention subspaces run in parallel, then concatenate.
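  • Positional info (RoPE, ALiBi, learned) injects order, since self-attention on its own is order-blind (permutation-equivariant).
  • KV cache stores past K/V, so each new token computes only its own Q/K/V and attends over the cached keys: O(n) work per token instead of re-running the whole prefix. The cache itself is the main inference memory cost.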
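The attention formula can be checked on tiny inputs with plain lists. This is an illustrative single-head sketch with no causal mask and no learned projections; `attention` and `softmax` are names chosen here, and production code would use a tensor library.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output
```

Calling `attention` with a single query row against stored `K` and `V` has the same shape as the work done for one generated token when past keys and values are kept in a cache.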

Prompting & RAG

prompting

Prompting Patterns

Shaping behaviour without touching weights.

  • Zero / few-shot: instructions only vs instructions + examples.
  • Chain-of-Thought: "think step by step" — better reasoning, more tokens/latency.
  • ReAct: interleave reasoning + tool actions (the agent loop).
  • Structured output: force JSON via schema/function-calling; validate with Pydantic.
  • Put durable rules in the system prompt; keep task-specific detail in the user turn.
embeddings · search

Embeddings & Retrieval

Text → dense vectors where semantic closeness = proximity.

  • Similarity via cosine/inner product; search via ANN (HNSW, IVF) for scale.
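  • Query & doc embeddings must share a model; know symmetric (query and document look alike, e.g. sentence vs sentence) vs asymmetric (short query against long passages) retrieval.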
  • Hybrid search = dense + sparse (BM25) — sparse nails exact terms, numbers, names.
  • Rerank (cross-encoder) the top-k for a precision boost before generation.
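Cosine similarity and top-k retrieval can be shown with a brute-force scan. The vectors below are hand-written toys standing in for real embeddings, and `top_k` is a hypothetical helper; at scale an ANN index such as HNSW replaces the linear scan.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the k document ids most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy 2-d "embeddings" (assumption: real ones have hundreds of dimensions).
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = top_k([1.0, 0.0], docs, k=2)
```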
RAG · retrieval

RAG Pipeline

Ground generation in retrieved external knowledge.

  • Flow: parse → chunk → embed → index → retrieve → (rerank) → generate w/ citations.
  • Chunking: structure-aware; never split tables/clauses; ~10–20% overlap.
  • Advanced: multi-hop (decompose / iterate), adaptive routing (simple vs complex), Graph RAG for relationship-heavy corpora.
  • Small-to-big: retrieve precise chunks, return larger parent for context.
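The overlap idea in the chunking bullet can be sketched as a sliding window over words. This is deliberately naive: `chunk_words` is a hypothetical helper that ignores document structure, so a structure-aware splitter would still be needed to keep tables and clauses intact.

```python
def chunk_words(text, chunk_size=200, overlap=0.15):
    """Split text into word windows where neighbours share `overlap` of a chunk."""
    words = text.split()
    overlap_words = round(chunk_size * overlap)
    step = max(1, chunk_size - overlap_words)  # how far each window advances

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

With `chunk_size=10` and `overlap=0.2`, each chunk repeats the last two words of the previous one, so a sentence cut at a boundary still appears whole in one of them.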
evals

Evaluation

Score retrieval and generation separately.

  • Retrieval: context precision/recall, hit rate, MRR, NDCG.
  • Generation: faithfulness (grounded, no hallucination), answer relevance, correctness.
  • Tooling: RAGAS, LLM-as-judge, custom rubrics.
  • Run a golden set as a regression gate on every prompt/retrieval change; wire scores to traces.
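Two of the retrieval metrics above are short enough to compute by hand. The sketch assumes each query has a ranked list of retrieved ids and a set of relevant ids; the function names are illustrative, not taken from RAGAS or any other tool.

```python
def hit_rate_at_k(results, relevant, k):
    """Share of queries with at least one relevant document in the top k."""
    hits = sum(1 for ranked, rel in zip(results, relevant)
               if any(doc in rel for doc in ranked[:k]))
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant document (0 when none is found)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for position, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / position
                break
    return total / len(results)

# Toy golden set: three queries with their retrieved ids and relevant ids.
results = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
relevant = [{"a"}, {"z"}, {"q"}]
```

On this toy set the first query hits at rank 1, the second at rank 3, and the third misses, which is the kind of per-query breakdown worth attaching to traces.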
OCR · Document AI

OCR & Document AI

Turning messy PDFs/scans into structured, model-ready data.

  • Text-layer PDFs: pdfplumber / PyMuPDF extract text & tables directly — no OCR needed.
  • Scanned/image PDFs: pytesseract (Tesseract) or cloud OCR (Textract, Azure Document Intelligence, Google Document AI).
  • Layout-aware models: LayoutLM / Donut understand structure (tables, key-value) not just raw text.
  • Pattern: extract → layout-aware parse → LLM (Llama-3) for field extraction → Pydantic schema validation → confidence/field-accuracy metric.
Project proof: This is your AR invoice pipeline (pdfplumber + pytesseract + Llama-3 → 96% field accuracy); lead with the metric and the validation layer.
vector DB · compare

Vector DB Comparison

Picking the store: library vs database, self-host vs managed.

  • FAISS: in-process library — fast, embeddable, you own persistence/filtering. Prototypes & single-tenant (your NewsRAG).
  • pgvector: vectors inside Postgres — one DB, transactional, great when you already run Postgres.
  • Qdrant / Weaviate / Milvus: purpose-built, rich metadata filtering, hybrid search, self-host or managed.
  • Pinecone: fully managed, zero-ops, scales — pay for convenience. Chroma: lightweight, dev-friendly.
Decision axes: Scale · metadata-filtering needs · write/update frequency · hybrid search · ops burden. Most "which DB?" answers come down to these five.

Agents & Frameworks

agents

Agents

An LLM in a loop with tools, state, and a goal.

  • Loop: reason → act (tool) → observe → repeat until done (ReAct).
  • Components: tools, memory/state, planning, a stopping condition.
  • Multi-agent: specialized agents (retriever, analyst, reviewer) coordinated by a supervisor/graph.
  • Guardrails: step limits, HITL gates, tracing — agents are non-deterministic and loop-prone.
  • Async matters: parallelize independent tool calls / retrievals to cut latency.
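The reason → act → observe loop with a step limit fits in a few lines. In this sketch the "LLM" is a scripted `policy` function and the only tool is a toy adder; both are stand-ins invented for illustration, as is the action dict format.

```python
def run_agent(policy, tools, goal, max_steps=5):
    """Run a tool-using loop until the policy answers or the step limit is hit."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        action = policy(history)  # reason: decide the next move from the history
        if action["type"] == "final":
            return action["answer"], history  # stopping condition
        tool = tools.get(action["tool"])
        if tool is None:
            observation = f"unknown tool: {action['tool']}"
        else:
            observation = tool(action["input"])  # act
        history.append((action["tool"], observation))  # observe
    return None, history  # guardrail: give up instead of looping forever

def scripted_policy(history):
    # Stand-in for an LLM call: use the adder once, then answer with its result.
    if len(history) == 1:
        return {"type": "tool", "tool": "add", "input": "2+3"}
    return {"type": "final", "answer": history[-1][1]}

tools = {"add": lambda expr: str(sum(int(part) for part in expr.split("+")))}
answer, trace = run_agent(scripted_policy, tools, "What is 2+3?")
```

The returned `trace` is the raw material for tracing: one entry per step, in order.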
LangChain

LangChain

Composable building blocks for LLM apps.

  • LCEL (Expression Language): pipe components with | into Runnables — sync/async/stream/batch for free.
  • Core pieces: prompts, models, output parsers, retrievers, tools, document loaders, memory.
  • Great for linear chains & quick assembly; for branching/looping/state, reach for LangGraph.
Framing: "LangChain for composition, LangGraph for control flow" is the clean one-liner.
LangGraph · core

LangGraph

Agents as stateful graphs — nodes, edges, shared state.

  • State: a typed dict passed between nodes; nodes return partial updates.
  • Reducers: define how updates merge vs overwrite — essential for message accumulation & parallel nodes.
  • Edges: conditional edges route based on state (e.g. tool-call? → tools : end).
  • Checkpointer: persists state → enables HITL interrupts, resume, time-travel.
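from typing import Annotated, TypedDict
from langgraph.graph import END
from langgraph.graph.message import add_messages
class State(TypedDict):
    messages: Annotated[list, add_messages]  # reducer: append new messages, don't overwrite
# Assumes graph = StateGraph(State) and route(state) returning "tools" or "end".
graph.add_conditional_edges("agent", route, {"tools": "tools", "end": END})
graph.add_edge("tools", "agent")  # loop back to the agent after tools run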
MCP

MCP (Model Context Protocol)

Open standard for exposing tools/data to LLM apps.

  • Server advertises tools/resources; client (agent) discovers & calls them over one protocol.
  • Turns N bespoke integrations into one uniform contract — swappable, discoverable.
  • Decouples tool implementation from the agent (the "OpenAPI moment" for agent tooling).
  • Now first-class in cloud agent runtimes (Bedrock AgentCore Gateway, Vertex, Foundry tools).
LlamaIndex · RAG

LlamaIndex

A data framework purpose-built for RAG / retrieval over your data.

  • Core abstractions: Documents → Nodes (chunks), Indexes, Retrievers, Query Engines, ingestion pipelines.
  • Strong at the data/ingestion + retrieval layer — many readers, node parsers, and built-in advanced retrieval (auto-merging, recursive, sub-question).
  • LlamaParse handles complex document/table parsing for RAG ingestion.
Framing: "LlamaIndex is retrieval-first / data-centric; LangChain is orchestration-first / general-purpose. They compose: LlamaIndex as the retriever inside a LangChain/LangGraph app."
HuggingFace · transformers

HuggingFace Transformers

The standard library for loading, running & fine-tuning open models.

  • AutoModel / AutoTokenizer load any Hub model; pipeline() is the one-liner for inference.
  • Ecosystem: datasets, PEFT (LoRA/QLoRA), TRL (SFT/DPO), accelerate, bitsandbytes (quantization), safetensors.
  • This is the layer under your LoRA/QLoRA work and any self-hosted open-weight (Llama-3) inference.
  • For high-throughput serving you graduate from raw transformers to vLLM/TGI.

Fine-tuning

PEFT · LoRA

LoRA / QLoRA / PEFT

Adapt a model by training a tiny fraction of parameters.

  • LoRA: freeze base, inject low-rank adapters (ΔW = B·A, rank r); train ~0.1–1% of params, swappable per task.
  • QLoRA: LoRA on a 4-bit (NF4) quantized base + paged optimizers → fits big models on one GPU.
  • Full FT: max quality, max cost; rarely needed when PEFT suffices.
  • Key hyperparams: rank r, alpha, target modules, learning rate.
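The "tiny fraction" claim is simple arithmetic for a single weight matrix. The sketch assumes one d_out × d_in matrix adapted with rank r, where the adapter is two matrices of shape d_out × r and r × d_in; the overall fraction for a real model depends on which target modules are adapted.

```python
def lora_param_counts(d_out, d_in, r):
    """Trainable adapter params vs frozen params for one adapted weight matrix."""
    full = d_out * d_in           # frozen base weights
    adapter = r * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return adapter, full, adapter / full

# Example: a 4096 x 4096 projection with rank 8.
adapter, full, fraction = lora_param_counts(4096, 4096, 8)
```

For this one matrix the adapter is 65,536 parameters against 16,777,216 frozen ones, about 0.39%; doubling r doubles the adapter size.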
alignment

Alignment: RLHF vs DPO

Teach a model human preferences over outputs.

  • RLHF: train a reward model on preference pairs, then optimize the policy with PPO — powerful, complex, unstable.
  • DPO: optimize directly on preference pairs, no separate reward model or RL loop — simpler, popular default.
  • Both come after SFT; they shape tone, safety, helpfulness — not new facts.
Decision: Fine-tune for form/behaviour; use RAG for facts. Combine when you need both.
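The DPO objective for a single preference pair is compact enough to write out. This sketch assumes summed log-probabilities of the chosen and rejected responses are already available from the policy and from a frozen reference model; the argument names and the default `beta` are illustrative, and libraries such as TRL handle batching and masking.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen gain - rejected gain))."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp        # how much the policy upweights the preferred answer
    rejected_gain = policy_rejected_logp - ref_rejected_logp  # same for the dispreferred answer
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference the loss is log 2; it falls as the policy shifts probability toward the chosen response relative to the reference.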

Production & Reliability

security · injection

Prompt Injection & Security

Untrusted text overriding the system's intended instructions.

  • Direct: user types "ignore previous instructions". Indirect: malicious instructions hidden in retrieved docs/web pages the agent reads.
  • Risks: jailbreak, data exfiltration, unauthorized tool calls. Prompt injection is LLM01, the first entry in the OWASP Top 10 for LLM Applications.
  • Defenses: separate trusted vs untrusted content, input/output filtering, least-privilege tools, sandboxing, human approval for sensitive actions, allow-lists.
  • There's no perfect prompt-level fix — treat it as a systems/permissions problem, not just a prompt one.
Interview gotcha: Indirect injection is the scary one for RAG/agents, because the payload arrives inside data you retrieved, not from the user.
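Treating injection as a permissions problem can look like a gate in front of every tool call. The sketch below is illustrative only: the tool names, the three-way decision, and the delimiter tags are assumptions, and wrapping untrusted text in tags marks it as data but does not by itself stop an injection.

```python
ALLOWED_TOOLS = {"search_docs", "get_invoice"}  # hypothetical read-only tools
NEEDS_APPROVAL = {"send_email"}                 # hypothetical sensitive action

def authorize_tool_call(tool_name, approved_by_human=False):
    """Least-privilege gate: allow-list first, human approval for sensitive tools."""
    if tool_name in ALLOWED_TOOLS:
        return "allow"
    if tool_name in NEEDS_APPROVAL:
        return "allow" if approved_by_human else "needs_approval"
    return "deny"  # anything not explicitly listed is refused

def wrap_untrusted(text):
    """Mark retrieved content as data so it is kept apart from trusted instructions."""
    return "<untrusted_document>\n" + text + "\n</untrusted_document>"
```

The gate runs outside the model, so a retrieved page saying "ignore previous instructions" cannot widen what the agent is permitted to do.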
caching · cost

Caching

Avoid recomputing what you've already paid for.

  • Prompt / context caching: provider caches a long static prefix (system prompt, docs) → big cost & latency cut on repeat calls.
  • Semantic cache: embed the query; if a near-duplicate was answered, return the cached response.
  • KV cache: the per-token generation cache inside inference (memory-bound).
  • Embedding cache: don't re-embed unchanged documents.
Interview gotcha: Semantic caching needs a similarity threshold + invalidation strategy, or you serve stale/wrong answers.
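A semantic cache with a threshold and an explicit invalidation hook can be sketched as follows. `SemanticCache` is a hypothetical class, the embedding function is supplied by the caller, and the 0.9 threshold is a placeholder that would need tuning against real traffic.

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # caller-supplied: text -> list of floats
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (vector, response)

    def get(self, query):
        """Return the cached response of the closest past query, or None."""
        query_vec = self.embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = _cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

    def invalidate(self):
        """Drop everything, e.g. after the underlying documents change."""
        self.entries.clear()
```

A linear scan is fine for a sketch; a real cache would keep the vectors in an ANN index and usually add a TTL per entry.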
observability

Observability & Tracing

Per-step visibility into non-deterministic, multi-step runs.

  • A trace = spans, one per step: prompt, output, tool calls, retrieval hits, latency, tokens, cost.
  • Tools: LangSmith (native LangGraph), Langfuse, Phoenix, OpenTelemetry + OpenLLMetry (vendor-neutral).
  • Senior move: link traces to eval scores and harvest prod inputs into the eval set.
  • Track: tool error rate, retrieval quality, token/cost per request, loop detection.
deployment

Deployment & Serving

Where and how the system runs.

  • Serverless (Lambda, Cloud Run): scale-to-zero, spiky traffic, orchestration/API tier calling hosted models. Cons: cold starts, time/memory caps, weak GPU story.
  • Kubernetes: GPU scheduling, long-running model servers, HPA/KEDA autoscaling; needed for self-hosted models. Cons: ops overhead.
  • Self-host serving: vLLM / TGI with continuous batching + PagedAttention for throughput.
  • Inference quantization (INT8/4-bit) shrinks memory & cost at small quality loss.
Pattern: Serverless API front door + GPU inference (K8s or managed endpoint) behind it is the common hybrid.
FastAPI · reliability

API Layer & Reliability

Serving LLM calls without falling over.

  • FastAPI async shines on I/O-bound LLM calls; never call a blocking lib inside async def.
  • Pydantic validates LLM JSON into typed objects; extra="forbid" rejects hallucinated fields.
  • Retries: exponential backoff + jitter on 429 and 5xx only; never on other 4xx (the request itself is wrong and will fail again).
  • Timeouts on every call; circuit breaker for sustained failure; stream tokens for UX.
  • Mind idempotency — don't retry non-idempotent side-effects (double-charge risk).
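The retry rule above can be sketched without any HTTP library. `HTTPStatusError` and `call_with_retries` are hypothetical names, the delays are placeholder values, and the "full jitter" variant used here is one common choice rather than the only one.

```python
import random
import time

class HTTPStatusError(Exception):
    """Stand-in for whatever error a real HTTP client raises."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def is_retryable(status):
    return status == 429 or 500 <= status < 600  # rate limit or server error

def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying retryable failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except HTTPStatusError as err:
            if not is_retryable(err.status) or attempt == max_attempts:
                raise  # other 4xx, or out of attempts: surface the error
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))  # full jitter spreads out retry storms
```

Only wrap calls that are safe to repeat; for side-effecting requests, an idempotency key on the server side is what makes a retry harmless.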
cost · latency

Cost & Latency Levers

The dials you turn when the bill or p95 hurts.

  • Model routing: cheap/small model for easy queries, escalate only hard ones.
  • Prompt caching for repeated prefixes; semantic cache for repeated queries.
  • Trim tokens: tighter prompts, retrieve less, summarize history.
  • Batching for throughput; streaming for perceived latency; distillation for steady high volume.
async · concurrency

Async & Concurrency

Doing many slow I/O things at once without threads.

  • GenAI work is I/O-bound (network waits on LLMs/DBs) — perfect for asyncio.
  • await asyncio.gather(*calls) fans out independent LLM/tool/retrieval calls in parallel — big latency win for multi-agent / multi-hop.
  • One blocking call stalls the whole event loop → use async clients or run_in_executor.
  • CPU-bound work (embedding locally, parsing) needs processes, not asyncio: the GIL blocks true CPU parallelism for pure-Python code. A thread pool helps only for blocking I/O or native libraries that release the GIL.
Interview gotcha: concurrency = dealing with many things at once (async); parallelism = doing many at once (multiprocessing). Async ≠ faster CPU work.
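Fan-out with `asyncio.gather` and offloading a blocking call can be shown with simulated waits. In this sketch `fake_call` stands in for an LLM, tool, or retrieval request and `blocking_parse` for a synchronous library call; both are invented for illustration.

```python
import asyncio
import time

async def fake_call(name, delay):
    await asyncio.sleep(delay)  # stands in for a network wait
    return f"{name}: done"

def blocking_parse(text):
    return text.upper()  # stands in for a synchronous, blocking library call

async def main():
    start = time.perf_counter()
    # Three independent waits overlap, so the total is close to the longest one.
    results = await asyncio.gather(
        fake_call("retriever", 0.2),
        fake_call("llm", 0.2),
        fake_call("tool", 0.2),
    )
    elapsed = time.perf_counter() - start

    # Keep the event loop free by running the blocking call in the default executor.
    loop = asyncio.get_running_loop()
    parsed = await loop.run_in_executor(None, blocking_parse, "raw text")
    return results, parsed, elapsed

results, parsed, elapsed = asyncio.run(main())
```

`gather` returns results in the order the awaitables were passed, regardless of which finished first, which keeps downstream code simple.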
streaming · real-time

Streaming & Real-time

Showing tokens as they generate, not after.

  • SSE (Server-Sent Events): one-way server→client stream over HTTP — the default for LLM token streaming; simple, auto-reconnect.
  • WebSockets: full-duplex — use when you need bidirectional (voice, live collaboration, interruption).
  • Key metric: TTFT (time-to-first-token) — streaming slashes perceived latency even if total time is unchanged.
  • FastAPI: StreamingResponse over an async generator yielding chunks.
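The SSE wire format is just text frames, which an async generator can produce. This sketch has no web framework in it: `token_stream` fakes a model's token iterator, and the closing `[DONE]` frame is a convention assumed here, not part of the SSE format itself.

```python
import asyncio

async def token_stream(tokens, delay=0.0):
    for token in tokens:
        await asyncio.sleep(delay)  # stands in for waiting on the model
        yield token

def sse_event(data):
    """Format one Server-Sent Event: 'data:' lines ended by a blank line."""
    return "".join(f"data: {line}\n" for line in data.split("\n")) + "\n"

async def sse_body(tokens):
    async for token in token_stream(tokens):
        yield sse_event(token)  # the client sees each chunk as soon as it is sent
    yield sse_event("[DONE]")   # assumed end-of-stream marker

async def collect(tokens):
    return [frame async for frame in sse_body(tokens)]

frames = asyncio.run(collect(["Hel", "lo"]))
```

In FastAPI, a generator like `sse_body` can be handed to `StreamingResponse` with `media_type="text/event-stream"`.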
MLOps · MLflow

MLOps & Experiment Tracking

Making GenAI changes measurable, versioned & repeatable.

  • MLflow: log params (prompt version, model, chunk size, k), metrics (eval scores, latency, cost), artifacts (prompts, eval sets); compare runs side-by-side.
  • Model/Prompt Registry: version & stage-promote prompts and models — "which prompt is in prod?" must be answerable.
  • CI/CD for AI: eval suite as a gate in the pipeline; promote only if regression checks pass.
  • Monitor in prod: drift, quality decay, cost/latency trends — feed prod data back into the eval set.
guardrails · governance

Guardrails & Governance

Keeping outputs safe, compliant & auditable.

  • Input guardrails: injection/jailbreak detection, off-topic + PII filtering before the model sees it.
  • Output guardrails: schema/grounding checks, toxicity/PII redaction, hallucination (contextual grounding) checks.
  • Governance: data residency, access control, audit trails of who saw/approved what — the backbone of regulated domains.
  • Tooling: Bedrock Guardrails, NeMo Guardrails, Guardrails AI, or custom validators + HITL gates.
Project proof: Your LRA compliance-review system is AI governance in practice (HITL approval + provenance/citations + audit trail). That maps straight onto "enterprise AI governance" JD lines.
system design · capstone

GenAI System Design

The end-to-end reference architecture to whiteboard.

  • Ingestion: layout-aware parse (tables intact) → structure-aware chunking → embed → hybrid index (vector + BM25 + metadata).
  • Retrieval: adaptive routing → multi-hop loop → rerank → grounded generation with citations.
  • Control: LangGraph state graph, HITL interrupts, guardrails on I/O.
  • Serving: FastAPI async + Pydantic, retries/timeouts, streaming; serverless API tier, GPU/K8s only if self-hosting.
  • Closing the loop: RAGAS eval gate + LangSmith tracing wired to MLflow runs.
Framing: "This is the LRA architecture generalised"; use it to turn the design question into a credibility statement.

Cloud Platforms (2026)

AWS · Bedrock

AWS Bedrock

A uniform model surface over many providers, on AWS primitives.

  • Models: Claude, Llama, Mistral, Cohere, Amazon Nova/Titan, DeepSeek, gpt-oss — one request shape, your IAM/KMS/VPC. Provider never sees your traffic.
  • Converse API: unified chat + tool-use interface across models.
  • Knowledge Bases: managed RAG (RetrieveAndGenerate).
  • AgentCore (was "Agents for Bedrock"): Runtime, Gateway (exposes Lambdas as MCP tools), Identity, Memory, Observability.
  • Guardrails: content filters, prompt-injection detection, denied topics, PII redaction, contextual grounding + Automated Reasoning hallucination checks; model-agnostic ApplyGuardrail API.
  • Bedrock Data Automation: OCR/extraction pipelines.
Azure · Foundry

Azure → Microsoft Foundry

Azure OpenAI + AI Studio consolidated into one platform (brand drifting to "Microsoft Foundry").

  • Azure OpenAI SKU still exists for single-model GPT workloads; Foundry adds non-OpenAI models, agents, observability.
  • Models: GPT family, Claude, Gemma, plus Cohere/DeepSeek/Meta/Mistral/xAI on one Azure bill.
  • Foundry Agent Service (GA) + Responses API — replaces the Assistants API (retires 2026-08-26).
  • Microsoft Agent Framework (GA): multi-agent orchestration SDK (.NET/Python).
  • Grounding/RAG: Azure AI Search. Unified observability across MS Agent Framework + LangChain/LangGraph/OpenAI SDK.
  • Enterprise: Entra RBAC, private networking, MCP/tool connectivity over private paths.
GCP · Vertex

Vertex AI → Gemini Enterprise Agent Platform

Google's agent platform (rebranded from Vertex AI Agent Builder at Cloud Next 2026). API is still aiplatform.googleapis.com.

  • ADK: code-first Agent Development Kit (Python). Agent Studio: low-code builder. Agent Garden: samples.
  • Agent Engine (now "Deployments"): managed agent runtime.
  • Memory: Memory Bank (long-term, cross-session) + Sessions (within-session state).
  • RAG / grounding: RAG Engine, Vector Search, and Search (formerly Vertex AI Search).
  • Models: Model Garden — Gemini (3.x), Claude, 200+ models. MCP servers + Agent Registry for governance.
Your gap note: This is the platform you flagged as thin. The IAM/service-account grounding setup is the part teams actually struggle with, worth a sentence in interviews.