OpenAI’s BrowseComp: Redefining How We Benchmark Web-Browsing Agents
As language models become increasingly agentic, browsing the internet, reasoning across sources, and acting on user instructions, our methods of evaluating their capabilities must evolve too. OpenAI’s BrowseComp introduces a fresh benchmark for this paradigm, offering a challenging, realistic, and carefully curated evaluation framework.