Zero-Copy RAG: Leveraging Unified Memory for Vector Search on the GB10
I started writing this article with a clear thesis: the GB10’s coherent unified memory eliminates the PCIe tax that x86+discrete-GPU systems pay on every RAG query, and I was going to measure exactly how much that saves you. I built a single-node K3s pipeline with Qdrant, BGE-small, and an 8B-parameter NVFP4 generator, all sharing the…
Beyond Integer GPUs: Mastering DRA for ML Workloads
Stop treating a $30K A100 like a boolean. Dynamic Resource Allocation (GA in Kubernetes 1.34) lets you claim GPUs by VRAM, compute capability, interconnect topology, and MIG profile, then share them safely across workloads. This article walks through every pattern with real manifests. The Problem: GPUs Are Not Integers. For years, requesting a GPU…
Sidecar Pattern in K8s MLOps
Over the last 1.5 years, I have built a lot of POCs and end-to-end products leveraging ML models, LLMs, etc. With Gemini and Claude at your disposal, I am sure many of you have done the same. At the end of 2025, my home lab was serving 20+ models with a mix of Docker, EKS, 100+ exporters…