LLM · core
LLM Fundamentals
A decoder-only transformer predicting the next token, autoregressively.
- Lifecycle: pretraining (next-token on web scale) → SFT (instruction tuning) → alignment via RLHF or DPO.
- Decoding knobs: temperature scales randomness, top-p (nucleus) caps cumulative prob mass, top-k caps candidate count.
- Output is a probability distribution (logits → softmax); sampling strategy turns it into text.
- MoE models route each token to a few expert sub-networks — more params, similar compute.
Interview gotcha: temperature=0 is near-deterministic, not guaranteed deterministic; kernels/hardware can still vary.
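The decoding knobs above can be sketched in plain Python. This is a minimal illustration, not any provider's sampler: the function name `sample_next_token`, its parameters, and the choice to treat temperature 0 as greedy argmax are assumptions made for the example.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token index from raw logits using temperature, top-k and top-p."""
    if temperature <= 0:
        # Assumption: treat temperature 0 as greedy decoding (argmax).
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling, then a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank candidates from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens

    if top_p is not None:
        kept, mass = [], 0.0
        for i in ranked:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:  # smallest prefix whose mass reaches top_p
                break
        ranked = kept

    # random.choices renormalises the surviving weights for us.
    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` as `rng` makes a run repeatable in tests, which is a different thing from the hardware-level determinism the gotcha refers to.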
tokens · cost
Tokens & Tokenization
Models read tokens (subword units), not words or characters.
- Built with BPE/subword algorithms — frequent strings become single tokens, rare words split.
- Rule of thumb: ~0.75 words per token in English (~4 chars/token). Code & non-English inflate counts.
- Tokens drive cost (priced per token) and limits (context = token budget, not word budget).
- Input + output tokens both bill; output usually costs more.
Interview gotcha: JSON, whitespace and rare names burn tokens fast, which is one reason structured prompts can be pricier than they look.
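As a rough worked example of the arithmetic, the sketch below turns the ~4 chars/token rule of thumb into a cost estimate. The helper names and the per-million-token price parameters are hypothetical; real counts come from the model's own tokenizer and real prices from the provider's price list.

```python
import math

def estimate_tokens(text, chars_per_token=4.0):
    """Rough English-only estimate from the ~4 chars/token rule of thumb."""
    return math.ceil(len(text) / chars_per_token)

def estimate_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost given separate per-million-token prices for input and output."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000
```

Keeping input and output prices as separate arguments mirrors the point that both sides bill and that output is usually the more expensive one.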
context
Context Window
The max tokens (prompt + response) a model can attend to in one call.
- Bigger ≠ free: attention cost scales ~quadratically with sequence length; latency & memory grow.
- "Lost in the middle": models recall the start and end of long context better than the middle — order matters.
- Manage it: summarize history, retrieve only relevant chunks, trim, or use a rolling/episodic memory.
- Long-context ≠ replacement for RAG — retrieval still wins on cost, freshness, and provenance.
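One of the management tactics above, trimming, might look like the sketch below. It assumes messages are dicts with `role` and `content` keys and that the caller supplies a `count_tokens` function; both are illustrative choices, not a fixed API.

```python
def trim_history(messages, budget_tokens, count_tokens):
    """Keep system messages plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m["content"]) for m in system)
    kept = []
    for m in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(m["content"])
        if used + cost > budget_tokens:
            break  # older turns no longer fit
        kept.append(m)
        used += cost
    return system + list(reversed(kept))  # restore chronological order
```

Dropped turns could be summarized instead of discarded; that step is left out here to keep the sketch short.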
transformer · attention
Attention & Transformers
Self-attention lets every token weigh every other token.
- Each token projects to Query, Key, Value; attention = softmax(Q·Kᵀ/√d)·V.
- Multi-head: several attention subspaces run in parallel, then concatenate.
- Positional info (RoPE, ALiBi, learned) injects order, since self-attention by itself is order-blind (permutation-equivariant).
- KV cache stores past K/V so each new token computes only its own Q/K/V and attends over the cached entries (O(n) work per token) instead of recomputing the whole sequence; the cache is the main inference memory cost.
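The attention formula and the KV cache idea can be sketched with plain lists. This is a single-head toy with no masking, batching, or learned projections; `decode_step` and the dict-based cache layout are assumptions made for illustration.

```python
import math

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, with Q, K, V given as lists of row vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def decode_step(q, k, v, cache):
    """One generation step: store this token's K/V, attend over every cached entry."""
    cache["K"].append(k)
    cache["V"].append(v)
    return attention([q], cache["K"], cache["V"])[0]
```

Each call to `decode_step` reuses the stored keys and values rather than recomputing them, and the cache grows by one entry per generated token, which is where the memory cost comes from.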
prompting
Prompting Patterns
Shaping behaviour without touching weights.
- Zero / few-shot: instructions only vs instructions + examples.
- Chain-of-Thought: "think step by step" — better reasoning, more tokens/latency.
- ReAct: interleave reasoning + tool actions (the agent loop).
- Structured output: force JSON via schema/function-calling; validate with Pydantic.
- Put durable rules in the system prompt; keep task-specific detail in the user turn.
embeddings · search
Embeddings & Retrieval
Text → dense vectors where semantic closeness = proximity.
- Similarity via cosine/inner product; search via ANN (HNSW, IVF) for scale.
- Query & doc embeddings must share a model; know symmetric (query and document look alike) vs asymmetric (short query, long passage) retrieval.
- Hybrid search = dense + sparse (BM25) — sparse nails exact terms, numbers, names.
- Rerank (cross-encoder) the top-k for a precision boost before generation.
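A minimal sketch of similarity search, assuming embeddings are already available as plain lists of floats. It does exact brute-force scoring; an ANN index such as HNSW or IVF would replace the linear scan at scale. The function names are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, doc_vecs, k=3):
    """Exact nearest neighbours by cosine similarity over a dict of id -> vector."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, reverse=True)[:k]  # highest similarity first
```

The returned `(score, doc_id)` pairs are the candidate list a reranker would then reorder.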
RAG · retrieval
RAG Pipeline
Ground generation in retrieved external knowledge.
- Flow: parse → chunk → embed → index → retrieve → (rerank) → generate w/ citations.
- Chunking: structure-aware; never split tables/clauses; ~10–20% overlap.
- Advanced: multi-hop (decompose / iterate), adaptive routing (simple vs complex), Graph RAG for relationship-heavy corpora.
- Small-to-big: retrieve precise chunks, return larger parent for context.
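The overlap idea from the chunking bullet can be shown with a deliberately naive baseline: fixed-size word windows. This sketch is not structure-aware (it would happily split a table), and the default sizes are arbitrary; it only illustrates how overlap carries context across chunk boundaries.

```python
def chunk_words(text, chunk_size=200, overlap=30):
    """Fixed-size word windows with overlap; a baseline, not structure-aware."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With `chunk_size=200` and `overlap=30` the overlap is 15%, inside the 10–20% range mentioned above.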
evals
Evaluation
Score retrieval and generation separately.
- Retrieval: context precision/recall, hit rate, MRR, NDCG.
- Generation: faithfulness (grounded, no hallucination), answer relevance, correctness.
- Tooling: RAGAS, LLM-as-judge, custom rubrics.
- Run a golden set as a regression gate on every prompt/retrieval change; wire scores to traces.
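Two of the retrieval metrics above are simple enough to compute by hand. The sketch assumes each query has a ranked list of retrieved document ids and a set of relevant ids; the function names and input shapes are illustrative.

```python
def hit_rate(ranked_lists, relevant_sets, k=5):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(doc in relevant for doc in ranked[:k])
    )
    return hits / len(ranked_lists)

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant doc (0 when none is retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

Run over a golden set, numbers like these can serve as the regression gate: fail the change if either metric drops below the last accepted run.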
OCR · Document AI
OCR & Document AI
Turning messy PDFs/scans into structured, model-ready data.
- Text-layer PDFs: pdfplumber / PyMuPDF extract text & tables directly — no OCR needed.
- Scanned/image PDFs: pytesseract (Tesseract) or cloud OCR (Textract, Azure Document Intelligence, Google Document AI).
- Layout-aware models: LayoutLM / Donut understand structure (tables, key-value) not just raw text.
- Pattern: extract → layout-aware parse → LLM (Llama-3) for field extraction → Pydantic schema validation → confidence/field-accuracy metric.
Project proof: This is your AR invoice pipeline (pdfplumber + pytesseract + Llama-3 → 96% field accuracy). Lead with the metric and the validation layer.
vector DB · compare
Vector DB Comparison
Picking the store: library vs database, self-host vs managed.
- FAISS: in-process library — fast, embeddable, you own persistence/filtering. Prototypes & single-tenant (your NewsRAG).
- pgvector: vectors inside Postgres — one DB, transactional, great when you already run Postgres.
- Qdrant / Weaviate / Milvus: purpose-built, rich metadata filtering, hybrid search, self-host or managed.
- Pinecone: fully managed, zero-ops, scales — pay for convenience. Chroma: lightweight, dev-friendly.
Decision axes: Scale · metadata-filtering needs · write/update frequency · hybrid search · ops burden. Most "which DB?" answers come down to these five.
agents
Agents
An LLM in a loop with tools, state, and a goal.
- Loop: reason → act (tool) → observe → repeat until done (ReAct).
- Components: tools, memory/state, planning, a stopping condition.
- Multi-agent: specialized agents (retriever, analyst, reviewer) coordinated by a supervisor/graph.
- Guardrails: step limits, HITL gates, tracing — agents are non-deterministic and loop-prone.
- Async matters: parallelize independent tool calls / retrievals to cut latency.
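The loop and its step-limit guardrail can be sketched without any framework. Here `policy` stands in for the LLM's reasoning step, and the decision-dict shape, `run_agent`, and `scripted_policy` are all hypothetical names chosen for the example.

```python
def run_agent(policy, tools, goal, max_steps=5):
    """reason -> act -> observe loop with a hard step limit."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        decision = policy(history)  # reason: decide the next move
        if decision["type"] == "finish":
            return decision["answer"], history
        tool = tools.get(decision["tool"])
        if tool is None:
            observation = f"error: unknown tool {decision['tool']!r}"
        else:
            observation = tool(decision["input"])  # act
        history.append(("action", decision["tool"], decision["input"]))
        history.append(("observation", observation))  # observe
    return None, history  # step limit hit: stop instead of looping forever

def scripted_policy(history):
    """Stand-in for an LLM: call the 'add' tool once, then finish."""
    observations = [h for h in history if h[0] == "observation"]
    if not observations:
        return {"type": "tool", "tool": "add", "input": (2, 3)}
    return {"type": "finish", "answer": f"The sum is {observations[-1][1]}"}
```

Calling `run_agent(scripted_policy, {"add": lambda xy: xy[0] + xy[1]}, "add 2 and 3")` returns the final answer together with the full history, which is the raw material for a trace.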
LangChain
LangChain
Composable building blocks for LLM apps.
- LCEL (Expression Language): pipe components with | into Runnables — sync/async/stream/batch for free.
- Core pieces: prompts, models, output parsers, retrievers, tools, document loaders, memory.
- Great for linear chains & quick assembly; for branching/looping/state, reach for LangGraph.
Framing: "LangChain for composition, LangGraph for control flow" is the clean one-liner.
LangGraph · core
LangGraph
Agents as stateful graphs — nodes, edges, shared state.
- State: a typed dict passed between nodes; nodes return partial updates.
- Reducers: define how updates merge vs overwrite — essential for message accumulation & parallel nodes.
- Edges: conditional edges route based on state (e.g. tool-call? → tools : end).
- Checkpointer: persists state → enables HITL interrupts, resume, time-travel.
messages: Annotated[list, add_messages]  # reducer: append new messages instead of overwriting
graph.add_conditional_edges("agent", route, {"tools": "tools", "end": END})
graph.add_edge("tools", "agent")  # loop back to the agent after tools run
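To make the reducer idea concrete, here is a framework-free sketch of merging a node's partial update into shared state. It is plain Python and not LangGraph's API: `REDUCERS` and `apply_update` are hypothetical names, and list concatenation stands in for a message-appending reducer.

```python
import operator

# Keys listed here are merged with their reducer; all other keys are overwritten.
REDUCERS = {"messages": operator.add}

def apply_update(state, update):
    """Return a new state with a node's partial update merged in."""
    merged = dict(state)  # leave the caller's state untouched
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in state:
            merged[key] = reducer(state[key], value)  # accumulate
        else:
            merged[key] = value  # overwrite
    return merged
```

Without the reducer, two parallel nodes that both return a `messages` update would clobber each other; with it, both contributions survive.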
MCP
MCP (Model Context Protocol)
Open standard for exposing tools/data to LLM apps.
- Server advertises tools/resources; client (agent) discovers & calls them over one protocol.
- Turns N bespoke integrations into one uniform contract — swappable, discoverable.
- Decouples tool implementation from the agent (the "OpenAPI moment" for agent tooling).
- Now first-class in cloud agent runtimes (Bedrock AgentCore Gateway, Vertex, Foundry tools).
LlamaIndex · RAG
LlamaIndex
A data framework purpose-built for RAG / retrieval over your data.
- Core abstractions: Documents → Nodes (chunks), Indexes, Retrievers, Query Engines, ingestion pipelines.
- Strong at the data/ingestion + retrieval layer — many readers, node parsers, and built-in advanced retrieval (auto-merging, recursive, sub-question).
- LlamaParse handles complex document/table parsing for RAG ingestion.
Framing: "LlamaIndex is retrieval-first / data-centric; LangChain is orchestration-first / general-purpose. They compose: LlamaIndex as the retriever inside a LangChain/LangGraph app."
HuggingFace · transformers
HuggingFace Transformers
The standard library for loading, running & fine-tuning open models.
- AutoModel / AutoTokenizer load any Hub model; pipeline() is the one-liner for inference.
- Ecosystem: datasets, PEFT (LoRA/QLoRA), TRL (SFT/DPO), accelerate, bitsandbytes (quantization), safetensors.
- This is the layer under your LoRA/QLoRA work and any self-hosted open-weight (Llama-3) inference.
- For high-throughput serving you graduate from raw transformers to vLLM/TGI.
security · injection
Prompt Injection & Security
Untrusted text overriding the system's intended instructions.
- Direct: user types "ignore previous instructions". Indirect: malicious instructions hidden in retrieved docs/web pages the agent reads.
- Risks: jailbreak, data exfiltration, unauthorized tool calls (prompt injection is LLM01, the top entry, in the OWASP Top 10 for LLM Applications).
- Defenses: separate trusted vs untrusted content, input/output filtering, least-privilege tools, sandboxing, human approval for sensitive actions, allow-lists.
- There's no perfect prompt-level fix — treat it as a systems/permissions problem, not just a prompt one.
Interview gotcha: indirect injection is the scary one for RAG/agents, because the payload arrives inside data you retrieved, not from the user.
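Treating injection as a permissions problem could start with a default-deny gate in front of every tool call. The tool names and the three-way result below are hypothetical; a real system would also scope credentials per tool and log each decision.

```python
ALLOWED_TOOLS = {"search_docs", "get_weather"}  # low-risk, read-only (hypothetical names)
NEEDS_APPROVAL = {"send_email"}                 # sensitive side effects (hypothetical name)

def authorize_tool_call(tool_name, approved_by_human=False):
    """Decide whether a tool call requested by the model may run."""
    if tool_name in ALLOWED_TOOLS:
        return "allow"
    if tool_name in NEEDS_APPROVAL:
        return "allow" if approved_by_human else "needs_approval"
    return "deny"  # default-deny anything not explicitly listed
```

The check runs outside the model, so instructions smuggled in through a retrieved document cannot talk their way past it.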
caching · cost
Caching
Avoid recomputing what you've already paid for.
- Prompt / context caching: provider caches a long static prefix (system prompt, docs) → big cost & latency cut on repeat calls.
- Semantic cache: embed the query; if a near-duplicate was answered, return the cached response.
- KV cache: the per-token generation cache inside inference (memory-bound).
- Embedding cache: don't re-embed unchanged documents.
Interview gotcha: semantic caching needs a similarity threshold + invalidation strategy, or you serve stale/wrong answers.
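A semantic cache with both a threshold and an invalidation hook might look like the sketch below. `toy_embed` is a bag-of-words stand-in for a real embedding model, and the class, method names, and 0.9 default threshold are assumptions for illustration only.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.9, embed=toy_embed):
        self.threshold = threshold
        self.embed = embed
        self.entries = []  # list of (embedding, response)

    def get(self, query):
        q = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # miss: call the model, then put() the result

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

    def invalidate(self):
        """Drop everything, e.g. after the underlying documents change."""
        self.entries.clear()
```

Setting the threshold too low is how near-miss questions end up with each other's answers, so it is worth tuning against real query pairs.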
observability
Observability & Tracing
Per-step visibility into non-deterministic, multi-step runs.
- A trace = spans, one per step: prompt, output, tool calls, retrieval hits, latency, tokens, cost.
- Tools: LangSmith (native LangGraph), Langfuse, Phoenix, OpenTelemetry + OpenLLMetry (vendor-neutral).
- Senior move: link traces to eval scores and harvest prod inputs into the eval set.
- Track: tool error rate, retrieval quality, token/cost per request, loop detection.
deployment
Deployment & Serving
Where and how the system runs.
- Serverless (Lambda, Cloud Run): scale-to-zero, spiky traffic, orchestration/API tier calling hosted models. Cons: cold starts, time/memory caps, weak GPU story.
- Kubernetes: GPU scheduling, long-running model servers, HPA/KEDA autoscaling; needed for self-hosted models. Cons: ops overhead.
- Self-host serving: vLLM / TGI with continuous batching + PagedAttention for throughput.
- Inference quantization (INT8/4-bit) shrinks memory & cost at small quality loss.
Pattern: Serverless API front door + GPU inference (K8s or managed endpoint) behind it is the common hybrid.
FastAPI · reliability
API Layer & Reliability
Serving LLM calls without falling over.
- FastAPI async shines on I/O-bound LLM calls; never call a blocking lib inside async def.
- Pydantic validates LLM JSON into typed objects; extra="forbid" rejects hallucinated fields.
- Retries: exponential backoff + jitter on 429/5xx only; never on other 4xx (400, 401, 403, 404 will not succeed on retry).
- Timeouts on every call; circuit breaker for sustained failure; stream tokens for UX.
- Mind idempotency — don't retry non-idempotent side-effects (double-charge risk).
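The retry rule can be sketched as a small wrapper. `HTTPError` here is a hypothetical exception defined for the example, not one from a specific client library, and the delay values are arbitrary defaults. The wrapper retries 429 and 5xx responses with capped exponential backoff plus full jitter, and re-raises everything else immediately.

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}

class HTTPError(Exception):
    """Hypothetical error type carrying an HTTP status code."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return fn()
        except HTTPError as err:
            last_try = attempt == max_attempts - 1
            if err.status not in RETRYABLE or last_try:
                raise  # non-retryable error, or out of attempts
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter: random wait up to the cap
```

Only wrap calls that are safe to repeat; a request with a non-idempotent side effect needs an idempotency key or no retry at all.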
cost · latency
Cost & Latency Levers
The dials you turn when the bill or p95 hurts.
- Model routing: cheap/small model for easy queries, escalate only hard ones.
- Prompt caching for repeated prefixes; semantic cache for repeated queries.
- Trim tokens: tighter prompts, retrieve less, summarize history.
- Batching for throughput; streaming for perceived latency; distillation for steady high volume.
async · concurrency
Async & Concurrency
Doing many slow I/O things at once without threads.
- GenAI work is I/O-bound (network waits on LLMs/DBs) — perfect for asyncio.
- await asyncio.gather(*calls) fans out independent LLM/tool/retrieval calls in parallel — big latency win for multi-agent / multi-hop.
- One blocking call stalls the whole event loop → use async clients or run_in_executor.
- CPU-bound work (embedding locally, parsing) needs a process pool, not asyncio; the GIL blocks true CPU parallelism across threads, so a thread pool only helps when the library releases the GIL (e.g. native-code kernels).
Interview gotcha: concurrency = dealing with many things at once (async); parallelism = doing many at once (multiprocessing). Async ≠ faster CPU work.
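A small sketch of the fan-out and of moving a blocking call off the event loop. `fake_llm_call` and `blocking_parse` are stand-ins that just sleep; the names and delays are made up for the example.

```python
import asyncio
import time

async def fake_llm_call(name, delay):
    await asyncio.sleep(delay)  # stands in for a network wait
    return f"{name}: done"

async def fan_out():
    calls = [
        fake_llm_call("retriever", 0.2),
        fake_llm_call("analyst", 0.2),
        fake_llm_call("reviewer", 0.2),
    ]
    return await asyncio.gather(*calls)  # results come back in call order

def blocking_parse(doc):
    time.sleep(0.1)  # stands in for a blocking library call
    return doc.upper()

async def parse_without_blocking(doc):
    loop = asyncio.get_running_loop()
    # Run the blocking function in the default thread pool so the loop stays free.
    return await loop.run_in_executor(None, blocking_parse, doc)
```

`asyncio.run(fan_out())` finishes in roughly the time of the slowest call rather than the sum of all three, because the waits overlap.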
streaming · real-time
Streaming & Real-time
Showing tokens as they generate, not after.
- SSE (Server-Sent Events): one-way server→client stream over HTTP — the default for LLM token streaming; simple, auto-reconnect.
- WebSockets: full-duplex — use when you need bidirectional (voice, live collaboration, interruption).
- Key metric: TTFT (time-to-first-token) — streaming slashes perceived latency even if total time is unchanged.
- FastAPI: StreamingResponse over an async generator yielding chunks.
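The wire format behind SSE token streaming is simple enough to sketch with the standard library alone. `token_stream` fakes a model, and the `[DONE]` end marker with its `end` event name is a hypothetical convention chosen for this example, not part of the SSE format itself.

```python
import asyncio

async def token_stream(tokens, delay=0.0):
    for tok in tokens:
        await asyncio.sleep(delay)  # stands in for model generation time
        yield tok

def sse_frame(data, event=None):
    """Format one SSE frame: optional event line, data lines, then a blank line."""
    lines = [f"event: {event}"] if event else []
    lines += [f"data: {part}" for part in data.split("\n")]
    return "\n".join(lines) + "\n\n"

async def sse_events(tokens):
    async for tok in token_stream(tokens):
        yield sse_frame(tok)  # one frame per token chunk
    yield sse_frame("[DONE]", event="end")  # hypothetical end-of-stream marker
```

In a FastAPI handler, an async generator like `sse_events` would typically be handed to `StreamingResponse` with the `text/event-stream` media type.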
MLOps · MLflow
MLOps & Experiment Tracking
Making GenAI changes measurable, versioned & repeatable.
- MLflow: log params (prompt version, model, chunk size, k), metrics (eval scores, latency, cost), artifacts (prompts, eval sets); compare runs side-by-side.
- Model/Prompt Registry: version & stage-promote prompts and models — "which prompt is in prod?" must be answerable.
- CI/CD for AI: eval suite as a gate in the pipeline; promote only if regression checks pass.
- Monitor in prod: drift, quality decay, cost/latency trends — feed prod data back into the eval set.
guardrails · governance
Guardrails & Governance
Keeping outputs safe, compliant & auditable.
- Input guardrails: injection/jailbreak detection, off-topic + PII filtering before the model sees it.
- Output guardrails: schema/grounding checks, toxicity/PII redaction, hallucination (contextual grounding) checks.
- Governance: data residency, access control, audit trails of who saw/approved what — the backbone of regulated domains.
- Tooling: Bedrock Guardrails, NeMo Guardrails, Guardrails AI, or custom validators + HITL gates.
Project proof: Your LRA compliance-review system is AI governance: HITL approval + provenance/citations + audit trail. That maps straight onto "enterprise AI governance" JD lines.
system design · capstone
GenAI System Design
The end-to-end reference architecture to whiteboard.
- Ingestion: layout-aware parse (tables intact) → structure-aware chunking → embed → hybrid index (vector + BM25 + metadata).
- Retrieval: adaptive routing → multi-hop loop → rerank → grounded generation with citations.
- Control: LangGraph state graph, HITL interrupts, guardrails on I/O.
- Serving: FastAPI async + Pydantic, retries/timeouts, streaming; serverless API tier, GPU/K8s only if self-hosting.
- Closing the loop: RAGAS eval gate + LangSmith tracing wired to MLflow runs.
Framing: "This is the LRA architecture generalised". Use it to turn the design question into a credibility statement.
AWS · Bedrock
AWS Bedrock
A uniform model surface over many providers, on AWS primitives.
- Models: Claude, Llama, Mistral, Cohere, Amazon Nova/Titan, DeepSeek, gpt-oss — one request shape, your IAM/KMS/VPC. Provider never sees your traffic.
- Converse API: unified chat + tool-use interface across models.
- Knowledge Bases: managed RAG (RetrieveAndGenerate).
- AgentCore (was "Agents for Bedrock"): Runtime, Gateway (exposes Lambdas as MCP tools), Identity, Memory, Observability.
- Guardrails: content filters, prompt-injection detection, denied topics, PII redaction, contextual grounding + Automated Reasoning hallucination checks; model-agnostic ApplyGuardrail API.
- Bedrock Data Automation: OCR/extraction pipelines.
Azure · Foundry
Azure → Microsoft Foundry
Azure OpenAI + AI Studio consolidated into one platform (brand drifting to "Microsoft Foundry").
- Azure OpenAI SKU still exists for single-model GPT workloads; Foundry adds non-OpenAI models, agents, observability.
- Models: GPT family, Claude, Gemma, plus Cohere/DeepSeek/Meta/Mistral/xAI on one Azure bill.
- Foundry Agent Service (GA) + Responses API — replaces the Assistants API (retires 2026-08-26).
- Microsoft Agent Framework (GA): multi-agent orchestration SDK (.NET/Python).
- Grounding/RAG: Azure AI Search. Unified observability across MS Agent Framework + LangChain/LangGraph/OpenAI SDK.
- Enterprise: Entra RBAC, private networking, MCP/tool connectivity over private paths.
GCP · Vertex
Vertex AI → Gemini Enterprise Agent Platform
Google's agent platform (rebranded from Vertex AI Agent Builder at Cloud Next 2026). API is still aiplatform.googleapis.com.
- ADK: code-first Agent Development Kit (Python). Agent Studio: low-code builder. Agent Garden: samples.
- Agent Engine (now "Deployments"): managed agent runtime.
- Memory: Memory Bank (long-term, cross-session) + Sessions (within-session state).
- RAG / grounding: RAG Engine, Vector Search, and Search (formerly Vertex AI Search).
- Models: Model Garden — Gemini (3.x), Claude, 200+ models. MCP servers + Agent Registry for governance.
Your gap note: This is the platform you flagged as thin. The IAM/service-account grounding setup is the part teams actually struggle with, worth a sentence in interviews.