93 archived articles for May 2026, grouped by publication day.
8 active days · 8 sources
arXiv cs.AI: 69 articles (latest: May 16, 2026)
OpenAI: 6 articles (latest: May 15, 2026)
The Decoder: 5 articles (latest: May 16, 2026)
Towards AI: 3 articles (latest: May 16, 2026)
TechCrunch: 3 articles (latest: May 15, 2026)
Hugging Face: 2 articles (latest: May 14, 2026)
Hacker News: 1 article (latest: May 15, 2026)
The Verge: 1 article (latest: May 15, 2026)
A three-person team led by Peter Steinberger keeps about 100 Codex instances running for the open-source project OpenClaw, driving OpenAI API spend to $1.3 million a month. Steinberger frames the bill as a research inve…
Multi-repo agent envs, build-scoped secrets, and Dockerfile layer caching shipped on May 13. I ran every workflow I hated against them. Continue reading on Towards AI »
Ollama just got 93% faster on every Apple Silicon Mac, and it did it without touching the model, the quantization, or the hardware. Continue reading on Towards AI »
Part 1: The Era of Naive MoE Scaling Continue reading on Towards AI »
Researchers at the Allen Institute for AI and UC Berkeley have built EMO, a mixture-of-experts model whose experts specialize in content domains instead of word types. That lets you strip out three-quarters of the exper…
arXiv:2605.13850v1 Announce Type: new Abstract: Existing frameworks for LLM-based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangChain) focus on execution topolog…
arXiv:2605.14033v1 Announce Type: new Abstract: Scientific theory shift in AI agents requires more than fitting equations to data. An artificial scientific agent must detect whether an existing representational framewor…
arXiv:2605.14034v1 Announce Type: new Abstract: Wide applications of LLM-based agents require strong alignment with human social values. However, current works still exhibit deficiencies in self-cognition and dilemma de…
arXiv:2605.14038v1 Announce Type: new Abstract: Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive…
arXiv:2605.14051v1 Announce Type: new Abstract: Industrial LLM agent systems often separate planning from execution, yet LLM planners frequently produce structurally invalid or unnecessarily long workflows, leading to b…
arXiv:2605.14062v1 Announce Type: new Abstract: While synthetic data generation with large language models (LLMs) is widely used in post-training pipelines, existing approaches typically generate full outputs before app…
arXiv:2605.14133v1 Announce Type: new Abstract: Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while…
arXiv:2605.14141v1 Announce Type: new Abstract: We study learning when the learned object is executable solver code rather than a predictor. In this setting, correctness is not enough: two solvers may both return valid…
arXiv:2605.14164v1 Announce Type: new Abstract: The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog p…
arXiv:2605.14175v1 Announce Type: new Abstract: In long conversations, an LLM can produce a next utterance that sounds plausible but rests on premises the conversation has already abandoned. Context-manipulation attacks…
arXiv:2605.14266v1 Announce Type: new Abstract: The integration of artificial intelligence (AI) agents in higher education is transforming teaching, learning, and administrative processes. Although existing AI agents effectiv…
arXiv:2605.14420v1 Announce Type: new Abstract: Current Large Language Models (LLMs) typically rely on coarse-grained national labels for pluralistic value alignment. However, such macro-level supervision often obscures…
arXiv:2605.14458v1 Announce Type: new Abstract: Omni-modal large language models have demonstrated remarkable potential in holistic multimodal understanding; however, the token explosion caused by high-resolution audio…
arXiv:2605.14537v1 Announce Type: new Abstract: We introduce Cattle Trade, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, advers…
arXiv:2605.14556v1 Announce Type: new Abstract: Symmetrical Reality (SR) is emerging as a future trend for human-agent coexistence, placing higher demands on agents to acquire human-like intelligence. It calls for riche…
arXiv:2605.14604v1 Announce Type: new Abstract: This position paper argues that effective tutoring requires corrective friction: surfacing misconceptions and challenging them supportively to drive conceptual change. Yet…
arXiv:2605.14667v1 Announce Type: new Abstract: A main barrier for the deployment of AI radiomic systems in clinical routine is their drop in performance under heterogeneous multicentre acquisition protocols. This work…
arXiv:2605.14723v1 Announce Type: new Abstract: Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical…
arXiv:2605.14802v1 Announce Type: new Abstract: Large language models often suffer from fact loss, timeline confusion, persona drift, and reduced stability during long-range interaction, especially under high-noise know…
arXiv:2605.14892v1 Announce Type: new Abstract: LLM-based autonomous agents have demonstrated strong capabilities in reasoning, planning, and tool use, yet remain limited when tasks require sustained coordination across…
arXiv:2605.15041v1 Announce Type: new Abstract: Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity.…
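The "strict structural validity" requirement in the abstract above can be illustrated with a generic check (a hedged sketch with hypothetical tool names, not this paper's method): before executing a model-emitted tool call, verify that it parses as JSON, names the expected tool, and supplies every required argument.

```python
import json

def validate_tool_call(raw: str, schema: dict) -> bool:
    """Minimal structural check on a model-emitted tool call.
    The schema format here is hypothetical: a tool name plus a
    list of required argument keys."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not even well-formed JSON
    return (
        call.get("name") == schema["name"]
        and all(k in call.get("arguments", {}) for k in schema["required"])
    )

schema = {"name": "get_weather", "required": ["city"]}
ok = validate_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}', schema)
bad = validate_tool_call('{"name": "get_weather", "arguments": {}}', schema)
# ok is True, bad is False
```

Real systems typically enforce far richer constraints (types, enums, nesting) via JSON Schema; this sketch only shows why validity is a separate axis from reasoning quality.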
arXiv:2605.09027v2 Announce Type: cross Abstract: In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studi…
arXiv:2605.13909v1 Announce Type: cross Abstract: Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agen…
arXiv:2605.13915v1 Announce Type: cross Abstract: Quantization is essential for efficient large language model (LLM) inference, yet the dequantization step, converting low-bit weights back to high precision for matrix mu…
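The dequantization step this abstract refers to is, in the common affine scheme, a simple per-weight transform (a generic illustration, not this paper's contribution): a stored low-bit integer code, a scale, and a zero point reconstruct an approximate float weight.

```python
def quantize(w: float, scale: float, zero_point: int, bits: int = 4) -> int:
    """Round a float weight to the nearest integer code, clamped to
    the signed range of the given bit width (affine quantization)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, round(w / scale) + zero_point))

def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map a low-bit integer code back to an approximate
    high-precision weight for matrix multiplication."""
    return scale * (q - zero_point)

# Round-trip example: reconstruction error is at most scale / 2
# for weights inside the representable range.
scale, zp = 0.05, 0
w = 0.23
q = quantize(w, scale, zp)        # q = 5
w_hat = dequantize(q, scale, zp)  # w_hat = 0.25
assert abs(w - w_hat) <= scale / 2
```

Because this transform runs once per weight on every matrix multiply, fusing or accelerating it is exactly where dequantization overhead becomes a systems problem.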
arXiv:2605.13936v1 Announce Type: cross Abstract: The recent success of large language models (LLMs) has been largely driven by vast public datasets. However, the next frontier for LLM development lies beyond public dat…
arXiv:2605.13941v1 Announce Type: cross Abstract: Long-term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content e…
arXiv:2605.13950v1 Announce Type: cross Abstract: Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scie…
arXiv:2605.13981v1 Announce Type: cross Abstract: The rise in deployment of large language models has driven a surge in GPU demand and datacenter scaling, raising concerns about electricity use, grid stress, and the imp…
arXiv:2605.14117v1 Announce Type: cross Abstract: An AI system for professional floor plan design must precisely control room dimensions and areas while respecting the desired connectivity between rooms and maintaining…
arXiv:2605.14153v1 Announce Type: cross Abstract: Exploitation is not a binary event. It is a ladder of acquiring progressive capabilities, from executing a single buggy line of code to taking full control of the target…
arXiv:2605.14202v1 Announce Type: cross Abstract: Malformed, missing, or boundary-value inputs in microservice APIs can cascade across dependent services, threatening reliability. Robustness testing systematically exerc…
arXiv:2605.14220v1 Announce Type: cross Abstract: Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, imp…
arXiv:2605.14418v1 Announce Type: cross Abstract: "Oh-Oh, yes, I'm the great pretender. Pretending that I'm doing well. My need is such, I pretend too much..." summarizes the state in the area of jailbreak creation and…
arXiv:2605.14421v1 Announce Type: cross Abstract: We introduce MemLineage, a defense for LLM agent memory that attaches both cryptographic provenance and LLM-mediated derivation lineage to every entry. Recent and concur…
arXiv:2605.14543v1 Announce Type: cross Abstract: Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmark…
arXiv:2605.14679v1 Announce Type: cross Abstract: Cultural heritage institutions increasingly disseminate research and interpretive materials globally, but multilingual dissemination is constrained by limited budgets an…
arXiv:2605.14744v1 Announce Type: cross Abstract: Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal--agent failure: out…
arXiv:2605.14766v1 Announce Type: cross Abstract: Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a Speech…
arXiv:2605.14786v1 Announce Type: cross Abstract: As LLM-based agents increasingly browse the web on users' behalf, a natural question arises: can websites passively identify which underlying model powers an agent? Doin…
arXiv:2605.14844v1 Announce Type: cross Abstract: We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channe…
arXiv:2605.15000v1 Announce Type: cross Abstract: Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in…
arXiv:2605.15044v1 Announce Type: cross Abstract: As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate…
arXiv:2605.15053v1 Announce Type: cross Abstract: Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale…
arXiv:2605.15077v1 Announce Type: cross Abstract: Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantic…
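The synchronous-vs-asynchronous contrast in the abstract above can be sketched generically (tool names here are invented for illustration and are not the paper's system): under synchronous semantics an agent blocks on each tool call in turn, while an async runtime dispatches independent calls concurrently and gathers their results.

```python
import asyncio

async def call_tool(name: str, latency: float) -> str:
    """Stand-in for a tool invocation; sleep simulates I/O wait."""
    await asyncio.sleep(latency)
    return f"{name}: done"

async def main() -> list[str]:
    # Synchronous semantics would await each call one by one,
    # paying 0.1 + 0.1 = 0.2s of wall-clock wait here.
    # Concurrent dispatch overlaps the waits, paying max(0.1, 0.1) = 0.1s,
    # and gather preserves the order of the calls in its result list.
    return await asyncio.gather(
        call_tool("web_search", 0.1),
        call_tool("calculator", 0.1),
    )

results = asyncio.run(main())
print(results)  # ['web_search: done', 'calculator: done']
```

The benefit grows with the number of independent tool calls per agent step, which is why execution semantics matter for function-calling throughput.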
arXiv:2605.15152v1 Announce Type: cross Abstract: LLM quantization has become essential for memory-efficient deployment. Recent work has shown that quantization schemes can pose critical security risks: an adversary may…
arXiv:2508.06226v4 Announce Type: replace Abstract: Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step rea…
arXiv:2601.21714v4 Announce Type: replace Abstract: The evolution of Large Language Model (LLM) agents towards System 2 reasoning, characterized by deliberative, high-precision problem-solving, requires maintaining rigo…
arXiv:2602.02711v2 Announce Type: replace Abstract: Large language models (LLMs) achieve strong performance in long-horizon decision-making tasks through multi-step interaction and reasoning at test time. While practiti…
arXiv:2602.15019v5 Announce Type: replace Abstract: Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels.…
arXiv:2603.16659v3 Announce Type: replace Abstract: Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evalua…
arXiv:2605.01847v3 Announce Type: replace Abstract: Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench i…
arXiv:2605.03596v4 Announce Type: replace Abstract: Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspac…
arXiv:2605.09038v2 Announce Type: replace Abstract: Teaching language models to use search tools is not only a question of whether they search, but also of whether they issue good queries. This is especially important i…
arXiv:2410.02091v3 Announce Type: replace-cross Abstract: Generative artificial intelligence (AI) facilitates content production and enhances ideation capabilities, which can significantly influence developer productivi…
arXiv:2504.11703v3 Announce Type: replace-cross Abstract: AI agents interact with external environments through tool calls, exposing them to attacks like indirect prompt injection that can trigger unauthorized actions.…
arXiv:2510.15982v3 Announce Type: replace-cross Abstract: Autoregressive large language models (LLMs) have achieved remarkable improvement across many tasks but incur high computational and memory costs. Knowledge disti…
arXiv:2511.18903v3 Announce Type: replace-cross Abstract: Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticate…
arXiv:2601.16312v2 Announce Type: replace-cross Abstract: Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs are ineffective in this problem space beca…
arXiv:2601.19924v2 Announce Type: replace-cross Abstract: We investigate the capabilities and scalability of Large Language Models (LLMs) in optimization modeling, a domain requiring structured reasoning and precise for…
arXiv:2602.04265v3 Announce Type: replace-cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, exist…
arXiv:2603.04601v3 Announce Type: replace-cross Abstract: Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" proces…
arXiv:2603.12554v2 Announce Type: replace-cross Abstract: Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (D…
arXiv:2603.20334v3 Announce Type: replace-cross Abstract: In high-complexity abstract reasoning, a system must infer a latent rule from a few examples or structured observations and apply it to unseen instances. LLMs ca…
arXiv:2604.05306v2 Announce Type: replace-cross Abstract: Large language models (LLMs) often produce confident yet incorrect answers, which can lead to risky failures in real-world applications. We study whether post-tr…
arXiv:2605.04215v2 Announce Type: replace-cross Abstract: Diffusion-based Large Language Models (D-LLMs) represent a promising frontier in generative AI, offering fully parallel token generation that can lead to signifi…
arXiv:2605.11453v2 Announce Type: replace-cross Abstract: Practitioners deploying multi-agent large language model (LLM) systems must currently choose between communication topologies such as chain, star, mesh, and rich…
arXiv:2605.11853v2 Announce Type: replace-cross Abstract: Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only…
arXiv:2605.12394v2 Announce Type: replace-cross Abstract: Training Neural Networks (NNs) without overfitting is difficult; detecting that overfitting is difficult as well. We present a novel Random Matrix Theory method…
arXiv:2605.12484v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb tas…
Points: 96 · Comments: 15
OpenAI announced yet another reorganization Friday, consolidating certain areas and making company president Greg Brockman the official lead of all things product. In a memo viewed by The Verge, Brockman wrote that sinc…
OpenAI is turning ChatGPT into a personal financial assistant. Pro users in the US can now connect their bank accounts through Plaid to get personalized analysis based on real transaction data. The feature runs on GPT-5…
Once users connect their accounts, they will see a dashboard of their portfolio performance, spending, subscriptions, and upcoming payments.
Elon Musk's AI company x.AI is jumping into the coding agent space with Grok Build, a new terminal-based tool. The article x.AI plays catch-up with Grok Build, its first terminal-based coding agent appeared first on The…
Thousands of Microsoft developers used Anthropic's Claude Code for programming. Now the company is revoking licenses and betting on GitHub Copilot CLI. The article Microsoft pulls Claude Code licenses and pushes develop…
Databricks uses GPT-5.5 for enterprise agent workflows after the model set a new state of the art on the OfficeQA Pro benchmark.
The update gives users enhanced flexibility over how they can manage their workflows.
A new open source gadget called Clawdmeter turns Claude Code usage stats into a tiny desktop dashboard for AI coding power users.
Release: llm-gemini 0.31. gemini-3.1-flash-lite is no longer a preview. Here's my write-up of the Gemini 3.1 Flash-Lite Preview model back in March. I don't believe this new non-preview model has changed since then. Tag…
OpenAI expands Trusted Access for Cyber with GPT-5.5 and GPT-5.5-Cyber, helping verified defenders accelerate vulnerability research and protect critical infrastructure.
Introducing Trusted Contact in ChatGPT, an optional safety feature that notifies someone you trust if serious self-harm concerns are detected.
Meet the ChatGPT Futures Class of 2026—26 student innovators using AI to build, research, and drive real-world impact. Discover how this generation is redefining learning, creativity, and opportunity with ChatGPT.