Attention Residuals: What If Your Network Could Choose Which Layer to Listen To?
Residual connections are one of those ML ideas that feel obvious in hindsight. Instead of each layer completely overwriting the previous representation, you add the layer's transformation on top of its input:
h_l = h_{l-1} + f(h_{l-1})
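A minimal sketch of this update in NumPy (the linear "layer" `W` here is a hypothetical stand-in for whatever transformation `f` the network learns):

```python
import numpy as np

def residual_step(h_prev, f):
    # h_l = h_{l-1} + f(h_{l-1}): the layer's output is added
    # to its input instead of replacing it.
    return h_prev + f(h_prev)

# Toy transformation: a fixed linear map (illustrative only).
W = np.array([[0.1, 0.0],
              [0.0, 0.1]])
f = lambda h: W @ h

h0 = np.array([1.0, 2.0])
h1 = residual_step(h0, f)
# h1 == h0 + W @ h0 == [1.1, 2.2]
```

Note that if `f` ever outputs zeros, `h1` simply equals `h0`: the identity path through the `+` is always there, no matter what the layer does.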
This single + sign is what made training 100-layer networks possible. It gives gradients a direct path back to earlier layers, so the training signal doesn't have to pass through every intermediate transformation.