Mind Of The Machine

Tiny LLM: Train & Serve on Apple Silicon

by cspnanda, April 26, 2026

Zero-Copy RAG: Leveraging Unified Memory for Vector Search on the GB10

I started writing this article with a clear thesis: the GB10’s coherent unified memory eliminates the PCIe tax that x86+discrete-GPU systems pay on every RAG query, and I was going to measure exactly how much that saves you. I built a single-node K3s pipeline with Qdrant, BGE-small, and an 8B-parameter NVFP4 generator, all sharing the…

by cspnanda, April 25, 2026

Ship Models, Not Outages: Canary Deployments for AI Workloads on Kubernetes

by cspnanda, April 1, 2026

Beyond Integer GPUs: Mastering DRA for ML Workloads

Stop treating a $30K A100 like a boolean. Dynamic Resource Allocation (GA in Kubernetes 1.34) lets you claim GPUs by VRAM, compute capability, interconnect topology, and MIG profile, then share them safely across workloads. This article walks through every pattern with real manifests. The Problem: GPUs Are Not Integers. For years, requesting a GPU…

by cspnanda, March 25, 2026 (updated April 1, 2026)

Sidecar Pattern in K8s MLOps

Over the last 1.5 years, I have built many POCs and end-to-end products leveraging ML models, LLMs, and more. With Gemini and Claude at our disposal, I am sure many of us have done the same. At the end of 2025, my home lab was serving 20+ models with a mix of Docker, EKS, 100+ exporters…

by cspnanda, March 18, 2026 (updated March 19, 2026)