GenAI Engineer · Single-Page Cheatsheet

Foundations

LLM · core

LLM Fundamentals

A decoder-only transformer predicting the next token, autoregressively.

Lifecycle: pretraining (next-token on web scale) → SFT (instruction tuning) → alignment via RLHF or DPO.
Decoding knobs: temperature scales randomness, top-p (nucleus) caps cumulative prob mass, top-k caps candidate count.
Output is a probability distribution (logits → softmax); sampling strategy turns it into text.
MoE models route each token to a few expert sub-networks — more params, similar compute.

Interview gotcha: temperature=0 is near-deterministic, not guaranteed deterministic; kernels/hardware can still vary.
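
The decoding knobs above can be sketched in plain Python. This is a minimal illustration, not any provider's implementation: the function name sample_next_token and its parameters are assumptions, and real stacks operate on tensors rather than lists.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick a token id from raw logits using temperature, top-k and top-p."""
    rng = rng or random.Random()
    if temperature <= 0:
        # Greedy decoding: take the argmax instead of dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature rescales logits before softmax: <1 sharpens, >1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = sorted(((p / total, i) for i, p in enumerate(exps)), reverse=True)

    if top_k is not None:
        probs = probs[:top_k]  # keep only the k most likely candidates
    if top_p is not None:
        kept, mass = [], 0.0
        for p, i in probs:
            kept.append((p, i))
            mass += p
            if mass >= top_p:  # smallest prefix whose mass reaches top_p
                break
        probs = kept

    # random.choices renormalises the surviving weights before drawing.
    weights = [p for p, _ in probs]
    ids = [i for _, i in probs]
    return rng.choices(ids, weights=weights, k=1)[0]
```

Usage: sample_next_token(logits, temperature=0.7, top_p=0.9) is a typical chat-style setting; temperature=0 falls back to argmax in this sketch.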

tokens · cost

Tokens & Tokenization

Models read tokens (subword units), not words or characters.

Built with BPE/subword algorithms — frequent strings become single tokens, rare words split.
Rule of thumb: ~0.75 words per token in English (~4 chars/token). Code & non-English inflate counts.
Tokens drive cost (priced per token) and limits (context = token budget, not word budget).
Input + output tokens both bill; output usually costs more.

Interview gotcha: JSON, whitespace and rare names burn tokens fast, which is one reason structured prompts can be pricier than they look.
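
A back-of-envelope cost estimate using the ~4 chars/token rule of thumb. This is a rough sketch: the function names and the per-1k prices in the usage note are hypothetical, and a real tokenizer should replace the heuristic whenever accuracy matters.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token count from the ~4 chars/token English heuristic."""
    return round(len(text) / chars_per_token)

def estimate_cost(prompt, expected_output_tokens, price_in_per_1k, price_out_per_1k):
    """Input and output tokens are billed separately; prices are per 1k tokens."""
    tokens_in = estimate_tokens(prompt)
    cost = (tokens_in / 1000 * price_in_per_1k
            + expected_output_tokens / 1000 * price_out_per_1k)
    return tokens_in, cost
```

Usage: estimate_cost(prompt, 500, price_in_per_1k=1.0, price_out_per_1k=3.0) with made-up prices shows how quickly output tokens dominate the bill. Expect the heuristic to undercount for code and non-English text.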

context

Context Window

The max tokens (prompt + response) a model can attend to in one call.

Bigger ≠ free: attention cost scales ~quadratically with sequence length; latency & memory grow.
"Lost in the middle": models recall the start and end of long context better than the middle — order matters.
Manage it: summarize history, retrieve only relevant chunks, trim, or use a rolling/episodic memory.
Long-context ≠ replacement for RAG — retrieval still wins on cost, freshness, and provenance.
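
One way to sketch the "trim" option: keep the system message plus the newest turns that fit a token budget. The message shape (dicts with role/content) and the count_tokens callback are assumptions; summarising the dropped turns would be a natural extension.

```python
def trim_history(messages, budget_tokens, count_tokens):
    """Keep system messages plus the newest turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m["content"]) for m in system)
    kept = []
    for m in reversed(turns):  # walk newest to oldest
        cost = count_tokens(m["content"])
        if used + cost > budget_tokens:
            break  # everything older is dropped as well
        kept.append(m)
        used += cost
    return system + list(reversed(kept))  # restore chronological order
```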

transformer · attention

Attention & Transformers

Self-attention lets every token weigh every other token.

Each token projects to Query, Key, Value; attention = softmax(Q·Kᵀ/√d)·V.
Multi-head: several attention subspaces run in parallel, then concatenate.
Positional info (RoPE, ALiBi, learned embeddings) injects order, since attention on its own is order-blind (permutation-equivariant).
KV cache stores past K/V so each new token only attends over cached entries (O(n) work per step) instead of recomputing K/V for the whole prefix; the cache is the main inference memory cost.
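
The formula softmax(Q·Kᵀ/√d)·V written out for a single head in plain Python. A minimal sketch for intuition only: it omits the causal mask, multi-head splitting and batching, and the function names are illustrative.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of row vectors."""
    d = len(K[0])  # key dimension used for the 1/sqrt(d) scaling
    out = []
    for q in Q:
        # One score per key: how strongly this query matches each token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With a KV cache, K and V grow by one row per generated token and only the newest query row is computed at each step.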

Prompting & RAG

prompting

Prompting Patterns

Shaping behaviour without touching weights.

Zero / few-shot: instructions only vs instructions + examples.
Chain-of-Thought: "think step by step" — better reasoning, more tokens/latency.
ReAct: interleave reasoning + tool actions (the agent loop).
Structured output: force JSON via schema/function-calling; validate with Pydantic.
Put durable rules in the system prompt; keep task-specific detail in the user turn.
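
Validating structured output, sketched with the standard library only. This is a stand-in for the Pydantic step, not Pydantic itself; the SCHEMA fields and the function name are hypothetical.

```python
import json

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"vendor": str, "total": float, "currency": str}

def parse_structured(raw, schema=SCHEMA):
    """Parse model output as JSON and reject missing, extra or mistyped fields."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    extra = set(data) - set(schema)
    missing = set(schema) - set(data)
    if extra or missing:
        raise ValueError(f"extra={sorted(extra)} missing={sorted(missing)}")
    for key, typ in schema.items():
        value = data[key]
        # Accept JSON integers where a float is expected (but not booleans).
        if typ is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, typ):
            raise ValueError(f"{key}: expected {typ.__name__}")
        data[key] = value
    return data
```

On a ValueError, a common follow-up is to re-prompt with the error message appended rather than silently patching the output.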

embeddings · search

Embeddings & Retrieval

Text → dense vectors where semantic closeness = proximity.

Similarity via cosine/inner product; search via ANN (HNSW, IVF) for scale.
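Query & doc embeddings must come from the same model; know symmetric (query and doc look alike, e.g. duplicate detection) vs asymmetric (short query against long passages) retrieval.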
Hybrid search = dense + sparse (BM25) — sparse nails exact terms, numbers, names.
Rerank (cross-encoder) the top-k for a precision boost before generation.
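
Cosine similarity with exact top-k search, as a baseline sketch. The function names and the dict-of-vectors layout are assumptions; at scale an ANN index (HNSW, IVF) replaces the linear scan.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, doc_vecs, k=3):
    """Brute-force search: score every document, return the k best."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, reverse=True)[:k]
```

If vectors are L2-normalised at index time, cosine similarity and inner product give the same ranking.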

RAG · retrieval

RAG Pipeline

Ground generation in retrieved external knowledge.

Flow: parse → chunk → embed → index → retrieve → (rerank) → generate w/ citations.
Chunking: structure-aware; never split tables/clauses; ~10–20% overlap.
Advanced: multi-hop (decompose / iterate), adaptive routing (simple vs complex), Graph RAG for relationship-heavy corpora.
Small-to-big: retrieve precise chunks, return larger parent for context.
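
The overlap part of chunking, sketched as fixed-size windows. This is deliberately not structure-aware: a real pipeline would split on headings, tables and clauses first and apply a window like this only within each section. The function name and defaults are illustrative.

```python
def chunk_with_overlap(tokens, size=200, overlap_ratio=0.15):
    """Fixed-size windows; each chunk repeats the tail of the previous one."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    overlap = int(size * overlap_ratio)
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks
```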

evals

Evaluation

Score retrieval and generation separately.

Retrieval: context precision/recall, hit rate, MRR, NDCG.
Generation: faithfulness (grounded, no hallucination), answer relevance, correctness.
Tooling: RAGAS, LLM-as-judge, custom rubrics.
Run a golden set as a regression gate on every prompt/retrieval change; wire scores to traces.
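
The retrieval metrics above, sketched for binary relevance. Function names are illustrative; ranked is the retriever's ordered list of doc ids and relevant is the golden set for that query.

```python
import math

def hit_rate(ranked, relevant, k):
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant doc (0.0 if none was retrieved)."""
    for pos, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / pos
    return 0.0

def mean_reciprocal_rank(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg(ranked, relevant, k):
    """Binary-relevance NDCG@k: discounted gain divided by the ideal ordering."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(pos + 1)
                for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Averaging these over the golden set gives the numbers to gate on; generation metrics such as faithfulness need a judge model or rubric and are not covered here.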

OCR · Document AI

OCR & Document AI

Turning messy PDFs/scans into structured, model-ready data.

Text-layer PDFs: pdfplumber / PyMuPDF extract text & tables directly — no OCR needed.
Scanned/image PDFs: pytesseract (Tesseract) or cloud OCR (Textract, Azure Document Intelligence, Google Document AI).
Layout-aware models: LayoutLM / Donut understand structure (tables, key-value) not just raw text.
Pattern: extract → layout-aware parse → LLM (Llama-3) for field extraction → Pydantic schema validation → confidence/field-accuracy metric.

Project proof: This is your AR invoice pipeline (pdfplumber + pytesseract + Llama-3 → 96% field accuracy). Lead with the metric and the validation layer.

vector DB · compare

Vector DB Comparison

Picking the store: library vs database, self-host vs managed.

FAISS: in-process library — fast, embeddable, you own persistence/filtering. Prototypes & single-tenant (your NewsRAG).
pgvector: vectors inside Postgres — one DB, transactional, great when you already run Postgres.
Qdrant / Weaviate / Milvus: purpose-built, rich metadata filtering, hybrid search, self-host or managed.
Pinecone: fully managed, zero-ops, scales — pay for convenience. Chroma: lightweight, dev-friendly.

Decision axes: Scale · metadata-filtering needs · write/update frequency · hybrid search · ops burden. Most "which DB?" answers come down to these five.

Agents & Frameworks

agents

Agents

An LLM in a loop with tools, state, and a goal.

Loop: reason → act (tool) → observe → repeat until done (ReAct).
Components: tools, memory/state, planning, a stopping condition.
Multi-agent: specialized agents (retriever, analyst, reviewer) coordinated by a supervisor/graph.
Guardrails: step limits, HITL gates, tracing — agents are non-deterministic and loop-prone.
Async matters: parallelize independent tool calls / retrievals to cut latency.
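
The reason → act → observe loop with a step limit, sketched without any framework. The policy callable stands in for the LLM, and its return shape (("call", tool, arg) or ("finish", answer)) is an assumption made for this example.

```python
def run_agent(goal, policy, tools, max_steps=5):
    """Run a ReAct-style loop until the policy finishes or the step cap is hit.

    policy(goal, history) returns ("call", tool_name, arg) or ("finish", answer).
    """
    history = []
    for _ in range(max_steps):
        decision = policy(goal, history)             # reason
        if decision[0] == "finish":
            return decision[1], history              # stopping condition
        _, name, arg = decision
        try:
            observation = tools[name](arg)           # act
        except Exception as exc:
            # Feed failures back as observations so the policy can recover.
            observation = f"tool error: {exc}"
        history.append((name, arg, observation))     # observe
    return None, history  # step limit reached: escalate, e.g. to a human
```

The history list doubles as a crude trace: one entry per tool call, which is what a loop detector or tracing backend would consume.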

LangChain

Composable building blocks for LLM apps.

LCEL (Expression Language): pipe components with | into Runnables — sync/async/stream/batch for free.
Core pieces: prompts, models, output parsers, retrievers, tools, document loaders, memory.
Great for linear chains & quick assembly; for branching/looping/state, reach for LangGraph.

Framing: "LangChain for composition, LangGraph for control flow" is the clean one-liner.

LangGraph · core

LangGraph

Agents as stateful graphs — nodes, edges, shared state.

State: a typed dict passed between nodes; nodes return partial updates.
Reducers: define how updates merge vs overwrite — essential for message accumulation & parallel nodes.
Edges: conditional edges route based on state (e.g. tool-call? → tools : end).
Checkpointer: persists state → enables HITL interrupts, resume, time-travel.

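from typing import Annotated
from langgraph.graph import END
from langgraph.graph.message import add_messages
messages: Annotated[list, add_messages]  # State field; reducer appends rather than overwrites
# route(state) returns "tools" or "end"; the mapping picks the next node
graph.add_conditional_edges("agent", route,
   {"tools": "tools", "end": END})
graph.add_edge("tools", "agent")  # loop back to the agent after tools run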

MCP

MCP (Model Context Protocol)

Open standard for exposing tools/data to LLM apps.

Server advertises tools/resources; client (agent) discovers & calls them over one protocol.
Turns N bespoke integrations into one uniform contract — swappable, discoverable.
Decouples tool implementation from the agent (the "OpenAPI moment" for agent tooling).
Now first-class in cloud agent runtimes (Bedrock AgentCore Gateway, Vertex, Foundry tools).

LlamaIndex · RAG

LlamaIndex

A data framework purpose-built for RAG / retrieval over your data.

Core abstractions: Documents → Nodes (chunks), Indexes, Retrievers, Query Engines, ingestion pipelines.
Strong at the data/ingestion + retrieval layer — many readers, node parsers, and built-in advanced retrieval (auto-merging, recursive, sub-question).
LlamaParse handles complex document/table parsing for RAG ingestion.

Framing: "LlamaIndex is retrieval-first / data-centric; LangChain is orchestration-first / general-purpose. They compose, with LlamaIndex as the retriever inside a LangChain/LangGraph app."

HuggingFace · transformers

HuggingFace Transformers

The standard library for loading, running & fine-tuning open models.

AutoModel / AutoTokenizer load any Hub model; pipeline() is the one-liner for inference.
Ecosystem: datasets, PEFT (LoRA/QLoRA), TRL (SFT/DPO), accelerate, bitsandbytes (quantization), safetensors.
This is the layer under your LoRA/QLoRA work and any self-hosted open-weight (Llama-3) inference.
For high-throughput serving you graduate from raw transformers to vLLM/TGI.

Fine-tuning

PEFT · LoRA

LoRA / QLoRA / PEFT

Adapt a model by training a tiny fraction of parameters.

LoRA: freeze base, inject low-rank adapters (ΔW = B·A, rank r); train ~0.1–1% of params, swappable per task.
QLoRA: LoRA on a 4-bit (NF4) quantized base + paged optimizers → fits big models on one GPU.
Full FT: max quality, max cost; rarely needed when PEFT suffices.
Key hyperparams: rank r, alpha, target modules, learning rate.
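
The arithmetic behind "a tiny fraction of parameters", assuming the common convention where A has shape r × d_in and B has shape d_out × r. The function names and the 4096 × 4096 example are illustrative.

```python
def lora_trainable_params(d_out, d_in, rank):
    """A is rank x d_in and B is d_out x rank, so the adapter adds rank * (d_in + d_out)."""
    return rank * (d_in + d_out)

def trainable_fraction(d_out, d_in, rank):
    """Adapter parameters relative to the frozen d_out x d_in weight matrix."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)
```

For a 4096 × 4096 projection at rank 8, that is 65,536 trainable parameters against 16,777,216 frozen ones, about 0.39% for that matrix. The whole-model fraction depends on which target modules get adapters.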

alignment

Alignment: RLHF vs DPO

Teach a model human preferences over outputs.

RLHF: train a reward model on preference pairs, then optimize the policy with PPO — powerful, complex, unstable.
DPO: optimize directly on preference pairs, no separate reward model or RL loop — simpler, popular default.
Both come after SFT; they shape tone, safety, helpfulness — not new facts.

Decision: Fine-tune for form/behaviour; use RAG for facts. Combine when you need both.
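
What "optimize directly on preference pairs" means for DPO, written as the per-pair loss. A minimal sketch: inputs are summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; the parameter names and the beta default are assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where margin compares policy vs reference."""
    # How much more the policy prefers each response than the reference does.
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = chosen_gain - rejected_gain
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))
```

The loss falls as the policy raises the chosen response relative to the rejected one, measured against the reference; no reward model or RL loop is involved.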

Production & Reliability

security · injection

Prompt Injection & Security

Untrusted text overriding the system's intended instructions.

Direct: user types "ignore previous instructions". Indirect: malicious instructions hidden in retrieved docs/web pages the agent reads.
Risks: jailbreak, data exfiltration, unauthorized tool calls. Prompt injection is LLM01, the top entry in the OWASP Top 10 for LLM Applications.
Defenses: separate trusted vs untrusted content, input/output filtering, least-privilege tools, sandboxing, human approval for sensitive actions, allow-lists.
There's no perfect prompt-level fix — treat it as a systems/permissions problem, not just a prompt one.

Interview gotcha: Indirect injection is the scary one for RAG/agents, because the payload arrives inside data you retrieved, not from the user.

caching · cost

Caching

Avoid recomputing what you've already paid for.

Prompt / context caching: provider caches a long static prefix (system prompt, docs) → big cost & latency cut on repeat calls.
Semantic cache: embed the query; if a near-duplicate was answered, return the cached response.
KV cache: the per-token generation cache inside inference (memory-bound).
Embedding cache: don't re-embed unchanged documents.

Interview gotcha: Semantic caching needs a similarity threshold + invalidation strategy, or you serve stale/wrong answers.
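
A semantic cache with both pieces from the gotcha: a similarity threshold and TTL-based invalidation. This is a sketch, not a production design: the class name, the embed callback and the linear scan are assumptions, and a real deployment would back this with a vector index.

```python
import math
import time

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed, threshold=0.9, ttl_seconds=3600.0, clock=time.monotonic):
        self.embed = embed          # callable: text -> vector (assumed)
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so tests can fake time
        self.entries = []           # (vector, answer, stored_at)

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer, self.clock()))

    def get(self, query):
        now = self.clock()
        # Invalidation: drop anything older than the TTL before matching.
        self.entries = [e for e in self.entries if now - e[2] < self.ttl]
        if not self.entries:
            return None
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: _cosine(qv, e[0]))
        # Below the threshold counts as a miss, so the caller hits the model.
        return best[1] if _cosine(qv, best[0]) >= self.threshold else None
```

Set the threshold too low and unrelated questions share answers; too high and the cache never hits. Tune it against logged query pairs.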

observability

Observability & Tracing

Per-step visibility into non-deterministic, multi-step runs.

A trace = spans, one per step: prompt, output, tool calls, retrieval hits, latency, tokens, cost.
Tools: LangSmith (native LangGraph), Langfuse, Phoenix, OpenTelemetry + OpenLLMetry (vendor-neutral).
Senior move: link traces to eval scores and harvest prod inputs into the eval set.
Track: tool error rate, retrieval quality, token/cost per request, loop detection.

deployment

Deployment & Serving

Where and how the system runs.

Serverless (Lambda, Cloud Run): scale-to-zero, spiky traffic, orchestration/API tier calling hosted models. Cons: cold starts, time/memory caps, weak GPU story.
Kubernetes: GPU scheduling, long-running model servers, HPA/KEDA autoscaling; needed for self-hosted models. Cons: ops overhead.
Self-host serving: vLLM / TGI with continuous batching + PagedAttention for throughput.
Inference quantization (INT8/4-bit) shrinks memory & cost at a small quality loss.

Pattern: Serverless API front door + GPU inference (K8s or managed endpoint) behind it is the common hybrid.

FastAPI · reliability

API Layer & Reliability

Serving LLM calls without falling over.

FastAPI async shines on I/O-bound LLM calls; never call a blocking lib inside async def.
Pydantic validates LLM JSON into typed objects; extra="forbid" rejects hallucinated fields.
Retries: exponential backoff + jitter on 429/5xx only; never on other 4xx (the request itself is wrong, so retrying cannot help).
Timeouts on every call; circuit breaker for sustained failure; stream tokens for UX.
Mind idempotency — don't retry non-idempotent side-effects (double-charge risk).
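
Exponential backoff with jitter, retrying only 429 and 5xx responses. A sketch under assumptions: HTTPStatusError is a hypothetical exception standing in for whatever the HTTP client raises, and the retryable set and delays are illustrative defaults.

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}

class HTTPStatusError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying retryable statuses with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except HTTPStatusError as err:
            is_last = attempt == max_attempts - 1
            if err.status not in RETRYABLE or is_last:
                raise  # other 4xx errors and exhausted retries propagate
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter avoids synchronized retries
```

Only wrap idempotent calls this way; for side-effecting requests, pair retries with an idempotency key on the server side.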

cost · latency

Cost & Latency Levers

The dials you turn when the bill or p95 hurts.

Model routing: cheap/small model for easy queries, escalate only hard ones.
Prompt caching for repeated prefixes; semantic cache for repeated queries.
Trim tokens: tighter prompts, retrieve less, summarize history.
Batching for throughput; streaming for perceived latency; distillation for steady high volume.

async · concurrency

Async & Concurrency

Doing many slow I/O things at once without threads.

GenAI work is I/O-bound (network waits on LLMs/DBs) — perfect for asyncio.
await asyncio.gather(*calls) fans out independent LLM/tool/retrieval calls in parallel — big latency win for multi-agent / multi-hop.
One blocking call stalls the whole event loop → use async clients or run_in_executor.
CPU-bound work (embedding locally, parsing) needs a process pool, not asyncio; the GIL blocks true CPU parallelism across threads, so a thread pool only helps when the native library releases the GIL.

Interview gotcha: concurrency = dealing with many things at once (async); parallelism = doing many at once (multiprocessing). Async ≠ faster CPU work.
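
Fan-out with asyncio.gather, plus run_in_executor for a blocking call. A minimal sketch: fake_llm_call simulates network latency with asyncio.sleep, and all function names are illustrative.

```python
import asyncio

async def fake_llm_call(name, delay):
    """Stand-in for an async LLM/tool/retrieval call (sleep simulates the wait)."""
    await asyncio.sleep(delay)
    return f"{name}: done"

async def fan_out():
    # The three waits overlap, so wall time is about the slowest call,
    # not the sum. Results come back in argument order.
    return await asyncio.gather(
        fake_llm_call("retrieve", 0.05),
        fake_llm_call("summarise", 0.05),
        fake_llm_call("classify", 0.05),
    )

def blocking_parse(text):
    """Stand-in for a synchronous library call that would stall the loop."""
    return len(text.split())

async def parse_without_blocking(text):
    # Off-load the blocking function to the default thread pool.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_parse, text)
```

Usage: asyncio.run(fan_out()). Pass return_exceptions=True to gather when one failed call should not cancel the rest.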

streaming · real-time

Streaming & Real-time

Showing tokens as they generate, not after.

SSE (Server-Sent Events): one-way server→client stream over HTTP — the default for LLM token streaming; simple, auto-reconnect.
WebSockets: full-duplex — use when you need bidirectional (voice, live collaboration, interruption).
Key metric: TTFT (time-to-first-token) — streaming slashes perceived latency even if total time is unchanged.
FastAPI: StreamingResponse over an async generator yielding chunks.
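
The SSE wire format behind token streaming, sketched with the standard library. The "[DONE]" sentinel and the "done" event name are conventions assumed for this example, not part of the SSE format itself.

```python
import asyncio

def sse_event(data, event=None):
    """Frame one SSE message: optional event line, data lines, blank-line terminator."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    for part in data.split("\n"):  # multi-line payloads need one data: line each
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"

async def token_stream(tokens):
    """Async generator yielding one SSE frame per token, then an end marker."""
    for tok in tokens:
        await asyncio.sleep(0)  # yield control, as a real model call would
        yield sse_event(tok)
    yield sse_event("[DONE]", event="done")
```

In FastAPI, a generator like this would typically be passed to StreamingResponse with media_type="text/event-stream".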

MLOps · MLflow

MLOps & Experiment Tracking

Making GenAI changes measurable, versioned & repeatable.

MLflow: log params (prompt version, model, chunk size, k), metrics (eval scores, latency, cost), artifacts (prompts, eval sets); compare runs side-by-side.
Model/Prompt Registry: version & stage-promote prompts and models — "which prompt is in prod?" must be answerable.
CI/CD for AI: eval suite as a gate in the pipeline; promote only if regression checks pass.
Monitor in prod: drift, quality decay, cost/latency trends — feed prod data back into the eval set.

guardrails · governance

Guardrails & Governance

Keeping outputs safe, compliant & auditable.

Input guardrails: injection/jailbreak detection, off-topic + PII filtering before the model sees it.
Output guardrails: schema/grounding checks, toxicity/PII redaction, hallucination (contextual grounding) checks.
Governance: data residency, access control, audit trails of who saw/approved what — the backbone of regulated domains.
Tooling: Bedrock Guardrails, NeMo Guardrails, Guardrails AI, or custom validators + HITL gates.

Project proof: Your LRA compliance-review system is AI governance: HITL approval + provenance/citations + audit trail. That maps straight onto "enterprise AI governance" JD lines.

system design · capstone

GenAI System Design

The end-to-end reference architecture to whiteboard.

Ingestion: layout-aware parse (tables intact) → structure-aware chunking → embed → hybrid index (vector + BM25 + metadata).
Retrieval: adaptive routing → multi-hop loop → rerank → grounded generation with citations.
Control: LangGraph state graph, HITL interrupts, guardrails on I/O.
Serving: FastAPI async + Pydantic, retries/timeouts, streaming; serverless API tier, GPU/K8s only if self-hosting.
Closing the loop: RAGAS eval gate + LangSmith tracing wired to MLflow runs.

Framing: "This is the LRA architecture generalised". Turn the design question into a credibility statement.

Cloud Platforms (2026)

AWS · Bedrock

AWS Bedrock

A uniform model surface over many providers, on AWS primitives.

Models: Claude, Llama, Mistral, Cohere, Amazon Nova/Titan, DeepSeek, gpt-oss — one request shape, your IAM/KMS/VPC. Provider never sees your traffic.
Converse API: unified chat + tool-use interface across models.
Knowledge Bases: managed RAG (RetrieveAndGenerate).
AgentCore (was "Agents for Bedrock"): Runtime, Gateway (exposes Lambdas as MCP tools), Identity, Memory, Observability.
Guardrails: content filters, prompt-injection detection, denied topics, PII redaction, contextual grounding + Automated Reasoning hallucination checks; model-agnostic ApplyGuardrail API.
Bedrock Data Automation: OCR/extraction pipelines.

Azure · Foundry

Azure → Microsoft Foundry

Azure OpenAI + AI Studio consolidated into one platform (brand drifting to "Microsoft Foundry").

Azure OpenAI SKU still exists for single-model GPT workloads; Foundry adds non-OpenAI models, agents, observability.
Models: GPT family, Claude, Gemma, plus Cohere/DeepSeek/Meta/Mistral/xAI on one Azure bill.
Foundry Agent Service (GA) + Responses API — replaces the Assistants API (retires 2026-08-26).
Microsoft Agent Framework (GA): multi-agent orchestration SDK (.NET/Python).
Grounding/RAG: Azure AI Search. Unified observability across MS Agent Framework + LangChain/LangGraph/OpenAI SDK.
Enterprise: Entra RBAC, private networking, MCP/tool connectivity over private paths.

GCP · Vertex

Vertex AI → Gemini Enterprise Agent Platform

Google's agent platform (rebranded from Vertex AI Agent Builder at Cloud Next 2026). API is still aiplatform.googleapis.com.

ADK: code-first Agent Development Kit (Python). Agent Studio: low-code builder. Agent Garden: samples.
Agent Engine (now "Deployments"): managed agent runtime.
Memory: Memory Bank (long-term, cross-session) + Sessions (within-session state).
RAG / grounding: RAG Engine, Vector Search, and Search (formerly Vertex AI Search).
Models: Model Garden — Gemini (3.x), Claude, 200+ models. MCP servers + Agent Registry for governance.

Your gap note: This is the platform you flagged as thin. The IAM/service-account grounding setup is the part teams actually struggle with, worth a sentence in interviews.
