✨ Maxim AI November 2024 Updates

🪪 Customize PII entities based on your needs

This change adds a skip list to Maxim's PII evaluator configuration. You can customize which entities you want us to flag as PII.

How to customize

  1. Click on "Settings" on the PII evaluator page.
  2. Select entities you want to classify as PII.
  3. Save.

📣 Test run notifications

We now offer enhanced alerting and notifications to keep your team informed. Connect with Slack or PagerDuty to receive real-time updates about test run status changes, including when tests start, complete, fail, queue, or stop. Configure your preferred notification settings to stay responsive and aware of testing activities.

🚨 Realtime alerts

Get notified of critical events by setting log repository alerts for token usage, user feedback, errors, and online evaluation results - including failures from specific online evaluations.

📧 Log repo summary emails

Log repo summary emails enable you to configure recipients for weekly email updates on your log repository statistics. These emails, sent every Monday, provide a comprehensive overview of key metrics to help you stay informed and monitor performance.

Here’s what you get with Log repo summary emails:

  • Traces overview: See how many traces were logged during the week.
  • User feedback summary: Get insights into the average user feedback score.
  • Latency reports: Monitor latency trends and other performance metrics.
  • Periodic updates: Receive a concise, automated summary every Monday at 9 AM PT.

Easily stay on top of your log statistics with Log repo summary emails, ensuring better visibility and control over your data.

📖 {{Jinja 2}} variables are now supported on Maxim

We now support Jinja 2 variables. With this enhancement, you can use double curly braces ({{ }}) to insert variables anywhere, making your workflows, prompts, datasets, and configurations more flexible and customizable.

Here’s what you can do:

  • Dynamic prompts: Personalize and adapt your prompts by adding variables like {{user_name}} or {{context}} for tailored responses.
  • Flexible workflows: Streamline workflows by dynamically injecting variables.
  • Dataset customization: Include variables directly within datasets.
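
For a concrete feel of the syntax, here is a minimal local sketch using the open-source jinja2 Python library; the prompt text and variable names ({{user_name}}, {{context}}) are illustrative only, not a Maxim API call.

```python
# Minimal sketch of Jinja-style templating, rendered locally with the
# jinja2 library; prompt text and variable names are illustrative only.
from jinja2 import Template

prompt = Template(
    "You are a support assistant helping {{ user_name }}.\n"
    "Answer using only the following context:\n{{ context }}"
)

print(prompt.render(user_name="Ada", context="Order #123 shipped on Nov 2."))
```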

We are adding support for other Jinja 2 commands in upcoming releases.

📝 New online evaluation filters

We have added all-new online evaluation filters, empowering you to refine and customize log evaluations on the platform. This enhancement lets you filter evaluated logs based on specific criteria, enabling more precise evaluation workflows and streamlined dataset curation.

Here’s what you can do with the new filters:

  • Apply advanced filters: Refine your evaluation results by specifying conditions like Punctuation Evaluator > 5 to focus only on the entries you need.
  • Seamless dataset integration: You can directly add filtered evaluation results to a dataset, simplifying the curation process for further analysis.

🛠️ Dataset export feature

We have also added a Dataset export feature, allowing you to curate and export high-quality datasets directly from the platform. With this new feature, you can easily curate datasets from various sources and export them for use elsewhere.

Here’s what you can do with the Dataset Export feature:

  • Curate from human evaluation: Select entries with human ratings greater than 8 out of 10 to create a high-quality dataset based on human evaluation.
  • Curate from Online evaluations: Easily compile data from online evaluations.
  • Curate from Logs: Choose specific entries from your logs to build a tailored dataset that fits your needs.
  • Curate from Test Runs: Select entries performing well on a particular evaluator and add them to a dataset for more targeted analysis.
  • Export as CSV: Once your dataset is curated, export it as a CSV file for other tools or workflows.

With the Dataset export feature, you can seamlessly create and share customized datasets for further use, improving flexibility in your data workflows.
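
As a small downstream example, here is a minimal Python sketch that loads an exported CSV; the file name and column names are hypothetical, since the actual columns depend on your dataset.

```python
# Minimal sketch of consuming an exported dataset CSV downstream;
# the file name and column names are hypothetical.
import csv

with open("curated_dataset.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Keep only entries whose (hypothetical) human rating column is 8 or higher.
high_quality = [r for r in rows if float(r.get("human_rating") or 0) >= 8]
print(f"{len(high_quality)} of {len(rows)} rows rated 8 or higher")
```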

🎛 OmniSearch with saved filters

We shipped OmniSearch with Saved Filters, designed to make log analysis faster and more efficient. With this update, the search bar becomes a powerful tool for effortlessly filtering and searching through logs.

Here’s what you can do with OmniSearch:

  • Search and filter together: Place the cursor in the search bar and instantly access a list of filtering options depending on your logs. You can filter your logs based on conditions like "user feedback > 5" and search for specific terms simultaneously.
  • Save your search configurations: After configuring your search, you can save the filter setup for future use, so you don’t have to recreate it each time.
  • Quick access to recent searches: The Omni bar will display your last three searches, allowing for quicker access to past configurations.

🏛️ Add structured output in prompts

We now support structured outputs in prompts with the JSON Schema option for the Response Format parameter. With this update, you can:

  • Define JSON schema: Supply a JSON Schema via json_schema to structure model responses for precise data formatting.
  • Model compatibility: This feature is available for models that support structured output, enabling streamlined data handling and integration.
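
As an illustration, this is the kind of JSON Schema payload you might supply, shown here as a Python dict in the common OpenAI-style response_format shape; the schema name and fields are hypothetical, and the exact structure Maxim expects may differ.

```python
# Hypothetical JSON Schema for a structured response; the name and fields
# are illustrative, following the common OpenAI-style response_format shape.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "support_ticket",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                "summary": {"type": "string"},
            },
            "required": ["category", "priority", "summary"],
            "additionalProperties": False,
        },
    },
}
```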

🔧 Tool call accuracy with tool schema

We have added a new prompt tool type called Schema. This feature lets users directly input their tool schema to evaluate tool call accuracy.

Here’s what you can do:

  • Direct schema input: Provide your tool's schema directly, streamlining the setup process.
  • Evaluate tool call accuracy: Assess the accuracy of tool call responses based on the provided schema.
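
For reference, here is what a tool schema might look like in the widely used function-calling style, shown as a Python dict; the tool name and parameters are hypothetical, and the exact format Maxim expects may differ.

```python
# Hypothetical tool schema in the common function-calling JSON Schema style;
# the tool name and parameters are illustrative only.
tool_schema = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```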

🎁 Upcoming releases

  1. The new agentic workflow testing platform lets you evaluate each step of your agent loops using the Maxim SDK.
  2. Data connectors to push log data from Maxim to your existing observability tools, such as Datadog, New Relic, and others.

🧠 Knowledge nuggets

Red teaming with auto-generated rewards and multi-step RL

The blog Red teaming with auto-generated rewards and multi-step RL explains an automated red-teaming framework designed to strengthen AI systems like LLMs against adversarial challenges. Unlike traditional red-teaming, which relies on manually crafted attacks, this framework uses a two-step process: generating diverse attack goals and training a reinforcement learning (RL) "red teamer" to create targeted attacks. Leveraging auto-generated rewards ensures attacks are varied and effective, enabling robust safety evaluations against threats like indirect prompt injections.

Evaluating data contamination in LLMs

This blog explains ConTAM, a new method for tackling data contamination in LLM evaluation benchmarks. Contamination—overlaps between training data and benchmarks—can inflate scores, giving a misleading view of model capabilities. ConTAM introduces four contamination metrics to assess how contamination affects benchmark scores, evaluating a wide range of models and benchmarks. This study sheds light on the extent of contamination and provides tools to detect it more effectively.

Have questions? Contact us at contact@getmaxim.ai, and we’ll get back to you.