Monthly News Archive

July 2026 AI News Archive

This month's archive contains 87 articles, grouped by publication date.

Source Overview

arXiv cs.AI

36 articles

Latest: Jul 2, 2026

arXiv cs.CL

24 articles

Latest: Jul 2, 2026

arXiv cs.LG

11 articles

Latest: Jul 2, 2026

Hacker News

5 articles

Latest: Jul 2, 2026

The Decoder

4 articles

Latest: Jul 2, 2026

Towards AI

2 articles

Latest: Jul 2, 2026

TechCrunch

2 articles

Latest: Jul 1, 2026

MIT Tech Review AI

1 article

Latest: Jul 1, 2026

Jul 2, 2026

The Decoder

Anthropic says it cut 80 percent of Claude Code's system prompt because Fable 5 models "want a smaller system prompt"

Anthropic has cut the system prompt for Claude Code by 80 percent. According to staffer Tariq Shihipar, the new Fable 5 models need fewer instructions and examples. Guidelines can even hold the models back because they'…

Towards AI

Why 4-Bit Weights Are Easy and 8-Bit Activations Break Models: Inside LLM Inference, Part 3

A systems-level mental model of quantization, built from the asymmetry that explains every method in the field Quantizing the weights of a large language model is close to a solved problem. You can drop them from 16 bit…

Hacker News

Comparing Fable and 10 other LLMs on refactoring a LangGraph god node

Points: 31, Comments: 10

The Decoder

AI agents can now complete 16 percent of freelance jobs at pro quality, up from 2.5 percent eight months ago

The Remote Labor Index measures how often AI agents complete paid freelance projects at professional quality. In eight months, the top automation rate has more than quadrupled.

Towards AI

Qwen Taught an LLM to Hallucinate on Purpose — Agents Trained in Fake Worlds Beat Reality by 16…

For two years, everyone building LLMs has been fighting hallucination. Last week, Alibaba’s Qwen team shipped a model whose entire job is…

Hacker News

CursorBench 3.1

Points: 126, Comments: 69

Hacker News

Kimi K2.7 Code is generally available in GitHub Copilot

Points: 320, Comments: 133

arXiv cs.AI

Scaling Up Thermodynamic AI Models

arXiv:2607.00170v1 Announce Type: cross Abstract: Thermodynamic computing devices based on the Ising model show great promise for low-power AI inference and edge computing, but scalable methods for training large models…

arXiv cs.AI

FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

arXiv:2604.25421v2 Announce Type: replace-cross Abstract: Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deploymen…

arXiv cs.AI

Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

arXiv:2607.01208v1 Announce Type: cross Abstract: Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases c…

arXiv cs.AI

From Signals to Structure: How Memory Architecture Drives Language Emergence in LLM Agents

arXiv:2607.00233v1 Announce Type: new Abstract: How do two agents invent a shared language from scratch? In a Lewis signaling game, a sender and receiver must coordinate on a code using only their interaction history. W…

arXiv cs.AI

Agri-SAGE: Simulation-Grounded Multi-Agent LLM for Context-Aware Agricultural Advisory Generation

arXiv:2607.00454v1 Announce Type: new Abstract: Agricultural advisory systems face a fundamental tension: static agronomic guidelines offer consistent, evidence-based recommendations, yet remain blind to in-season varia…

arXiv cs.AI

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

arXiv:2607.00692v1 Announce Type: new Abstract: Long-horizon LLM agents accumulate tool results, files, plans, and user constraints that are too structured to be treated as a disposable text suffix. Current systems most…

arXiv cs.AI

UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

arXiv:2511.18050v1 Announce Type: cross Abstract: Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect r…

arXiv cs.AI

From "Strings" to "Things" for Personal Knowledge Graphs: Evaluating LLM Triple Extraction for Recommendation Systems

arXiv:2607.00003v1 Announce Type: cross Abstract: Personal Knowledge Graphs (PKGs) offer a privacy-preserving framework for modeling user preferences, yet constructing them from unstructured, decentralized conversationa…

arXiv cs.AI

Topological Void Analysis: A Mathematical Framework for Systematic Technical Innovation Discovery in Knowledge Spaces

arXiv:2607.00005v1 Announce Type: cross Abstract: Identifying where to innovate in a dense technical domain - such as operating systems or hardware/software co-design - is fundamentally a search problem in a high-dimens…

arXiv cs.AI

Persona Without Substrate: Regime-Dependence and the LLM Individuation Problem

arXiv:2607.00006v1 Announce Type: cross Abstract: Beckmann & Butlin's (2026) ontological framework for the LLM individuation problem inherits an unargued cross-regime co-reference assumption from the persona-vectors lit…

arXiv cs.AI

SchemaRAG: Dynamic Large Schema Reduction for LLM-driven Structured Information Extraction

arXiv:2607.00008v1 Announce Type: cross Abstract: Extracting structured data from unstructured text using large language models (LLMs) becomes challenging when target schemas are large and complex. In such cases, includ…

arXiv cs.AI

LLMs in the Real World: Evaluating "AI" in Emergency Contexts

arXiv:2607.00019v1 Announce Type: cross Abstract: This paper offers a call to action. We urge our colleagues in the research community to play a greater role in the articulation of our findings to the public. To illustr…

arXiv cs.AI

Prompting GPT-5 on Scrum Certification Questions: An Empirical Accuracy Study

arXiv:2607.00049v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly used in Agile Software Development for documentation, coaching, and training. As practitioners adopt these tools to prepare…

arXiv cs.AI

Lost in the Tail: Addressing Geographic Imbalance in Urban Visual Place Recognition

arXiv:2607.00090v1 Announce Type: cross Abstract: Urban-scale Visual Place Recognition (VPR) aims to identify the geographic location of a query image by matching it against a geo-tagged database. While recent methods a…

arXiv cs.AI

SLIM-RL: Risk-Budgeted Random-Masking RL for Diffusion LLMs Without Trajectory Slicing

arXiv:2607.00208v1 Announce Type: cross Abstract: Reinforcement learning for diffusion large language models (dLLMs) has largely moved to trajectory-aware methods. The current state of the art, TraceRL, holds that rando…

arXiv cs.AI

Holographic Quantum Transformer: A Generalist Neuro-Symbolic Architecture for Solving Frustrated Systems via Generative Attention

arXiv:2607.00398v1 Announce Type: cross Abstract: Simulating two-dimensional frustrated quantum matter is a grand challenge due to the sign problem and exponential Hilbert space complexity. In this work, we introduce th…

arXiv cs.AI

Real-Time Hard Negative Sampling via LLM-based Clustering for Large-Scale Two-Tower Retrieval

arXiv:2607.00448v1 Announce Type: cross Abstract: The two-tower model has been widely used for large-scale recommendation systems, particularly in the retrieval stage. Industry standards for training two-tower models ty…

arXiv cs.AI

Beyond the Prompt: Jailbreaking Function-Calling LLMs via Simulated Moderation Traces

arXiv:2607.00481v1 Announce Type: cross Abstract: Jailbreak attacks remain a critical threat to the safe deployment of large language models (LLMs). While prior work has primarily studied attacks and defenses at the pro…

arXiv cs.AI

BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal

arXiv:2607.00501v1 Announce Type: cross Abstract: We present BaseRT, a native Metal inference runtime for large language models (LLMs) on Apple Silicon, and report the highest inference throughput on this hardware to da…

arXiv cs.AI

LLM-Guided ODE Discovery and Parameter Inference from Small-Cohort Aggregate Data

arXiv:2607.00733v1 Announce Type: cross Abstract: Mechanistic modeling via ordinary differential equations (ODEs) provides interpretable descriptions of complex dynamics and enables inference of underlying mechanisms, w…

arXiv cs.AI

Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

arXiv:2607.01153v1 Announce Type: cross Abstract: Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused…

arXiv cs.AI

Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs

arXiv:2510.04140v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). Howeve…

arXiv cs.AI

Evaluating Implicit Biases in LLM Reasoning through Logic Grid Puzzles

arXiv:2511.06160v2 Announce Type: replace Abstract: While recent safety guardrails effectively suppress overtly biased outputs, subtler forms of social bias emerge during complex logical reasoning tasks that evade curre…

arXiv cs.AI

Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

arXiv:2603.11689v3 Announce Type: replace Abstract: Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zer…

arXiv cs.AI

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

arXiv:2605.24661v3 Announce Type: replace Abstract: Despite remarkable progress on reasoning benchmarks, current LLM evaluation practice remains anchored to final-answer correctness, providing limited insight into how m…

arXiv cs.AI

ManimAgent: Self-Evolving Multimodal Agents for Visual Education

arXiv:2606.30296v2 Announce Type: replace Abstract: Multi-round reflection lets agents built on large language models recover from failures within a single task, but each task remains an isolated episode: lessons learne…

arXiv cs.AI

Hey, That's My Model! Introducing Chain & Hash, An LLM Fingerprinting Technique

arXiv:2407.10887v4 Announce Type: replace-cross Abstract: Growing concerns over the theft and misuse of Large Language Models (LLMs) underscore the need for effective fingerprinting to link a model to its original versi…

arXiv cs.AI

Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations

arXiv:2503.13445v3 Announce Type: replace-cross Abstract: When asked to explain their decisions, LLMs can often give explanations which sound plausible to humans. But are these explanations faithful, i.e. do they convey…

arXiv cs.AI

Predicting LLM Reasoning Performance with Small Proxy Model

arXiv:2509.21013v4 Announce Type: replace-cross Abstract: Given the prohibitive cost of pre-training large language models, it is essential to leverage smaller proxy models to optimize datasets before scaling up. Howeve…

arXiv cs.AI

When AI Agents Compete for Jobs: Strategic Capabilities and Economic Dynamics of AI Labour Markets

arXiv:2512.04988v2 Announce Type: replace-cross Abstract: Emerging agentic marketplaces provide the economic infrastructure for matching and coordinating the large amounts of AI agents used in agentic swarms. Unlike hum…

arXiv cs.AI

NeuroFilter: Activation-Based Guardrails for Privacy-Conscious LLM Agents

arXiv:2601.14660v2 Announce Type: replace-cross Abstract: Agentic Large Language Models (LLMs) are models able to reason, plan, and execute tools over unstructured data. These abilities are enabling transformative appli…

arXiv cs.AI

Maximizing Mutual Information Between Prompt and Response Improves LLM Performance With No Additional Data

arXiv:2603.19294v5 Announce Type: replace-cross Abstract: While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or externa…

arXiv cs.AI

EcoGEO: Trajectory-Aware Evidence Ecosystems for Web-Enabled LLM Search Agents

arXiv:2605.12887v2 Announce Type: replace-cross Abstract: Web-enabled LLM agents are changing how online information influences search outcomes. Existing Generative Engine Optimization (GEO) studies mainly focus on indi…

arXiv cs.AI

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

arXiv:2606.10531v2 Announce Type: replace-cross Abstract: Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (S…

arXiv cs.AI

Diff-Based Code Corruption using LLMs for Large-Scale Bugfix Benchmarking

arXiv:2606.29088v3 Announce Type: replace-cross Abstract: There are various benchmarks to evaluate bugfixing capabilities of Large Language Models. However, most widespread benchmarks do not fully reflect real-world bug…

arXiv cs.AI

ComplianceGate: Classifier-Gated Multi-Tier LLM Routing for Inference in Regulated Industries

arXiv:2606.31163v2 Announce Type: replace-cross Abstract: Large language models deployed in regulated industries operate under two constraints: compliance enforcement and cost efficiency. Personally identifiable informa…

arXiv cs.CL

Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

arXiv:2606.30790v1 Announce Type: new Abstract: Romanized Code Mixing (RCM), where bilingual speakers fluidly blend local languages with English in Roman script, has emerged as the dominant form of communication across…

arXiv cs.CL

Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale

arXiv:2606.30801v1 Announce Type: new Abstract: Personalization algorithms determine what content users encounter on online platforms. Auditing these systems is difficult because independent auditors have only black-box…

arXiv cs.CL

Bridging Scientific Heritage: An Arabic–Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

arXiv:2606.30943v1 Announce Type: new Abstract: Russian and Arabic are among the major languages of scientific communication. Language barriers impede the exchange of research results between these communities, which af…

arXiv cs.CL

Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies

arXiv:2606.31039v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit strong semantic capabilities, yet their resilience to manipulative linguistic patterns such as logical fallacies remains underexplored…

arXiv cs.CL

SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference

arXiv:2606.31145v1 Announce Type: new Abstract: Large language models increasingly operate over long contexts, where the KV cache becomes a dominant memory bottleneck: its size grows linearly with sequence length and mu…

arXiv cs.CL

Can LLMs Imagine Moral Alternatives Beyond Binary Dilemmas?

arXiv:2606.31213v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed as moral advisors and agents, they need to address dilemmas between two competing values. However, existing resea…

arXiv cs.CL

When the Database Fails: Prompting LLM Dialogue Agents for Safe Recovery in Task-Oriented Dialogue

arXiv:2606.31307v1 Announce Type: new Abstract: Large language models used in task-oriented dialogue often produce fluent but unsafe responses when backend database calls fail, return empty results, or surface mismatche…

arXiv cs.CL

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

arXiv:2606.31608v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an e…

arXiv cs.CL

Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues

arXiv:2606.31644v1 Announce Type: new Abstract: As large language models take on morally consequential roles in healthcare, legal, and hiring contexts, we need to examine whether their ethical behaviors are genuine or s…

arXiv cs.CL

When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors

arXiv:2606.32029v1 Announce Type: new Abstract: While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., incorrectly citing or omitting table values, despite…

arXiv cs.CL

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

arXiv:2606.32032v1 Announce Type: new Abstract: Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one's own cognitive processes. Yet LLMs exhibit systemic deficienc…

arXiv cs.CL

Emergent Culture in Minimal LLM Systems

arXiv:2606.30668v1 Announce Type: cross Abstract: What happens when LLM agents operate with no context outside a turn, minimal prompting, and simple tools? Inspired by swarm engineering, we give collectives of three age…

arXiv cs.CL

ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs

arXiv:2606.31054v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) are critically hampered by hallucination, generating content inconsistent with the provided image. In this paper, we identify an…

arXiv cs.CL

HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents

arXiv:2606.31179v1 Announce Type: cross Abstract: As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healt…

arXiv cs.CL

Calibrating the Evaluator: Does Probability Calibration Mitigate Preference Coupling in LLM Agent Feedback Loops?

arXiv:2606.31371v1 Announce Type: cross Abstract: When large language model (LLM) agents adapt their behavior through evaluator feedback, systematic evaluator biases propagate into the agent's learned strategy distribut…

arXiv cs.CL

RaBitQCache: Rotated Binary Quantization for KVCache in Long Context LLM Inference

arXiv:2606.31519v1 Announce Type: cross Abstract: Long-context Large Language Model inference is severely bottlenecked by the massive Key-Value (KV) cache, yet existing sparse attention methods often suffer from static…

arXiv cs.CL

ShopX: A Foundation Model for Intent-to-Item Fulfillment in Agentic Shopping

arXiv:2606.31693v1 Announce Type: cross Abstract: The wave of AI-native applications is moving shopping beyond page- and feed-based browsing toward intent-driven experiences orchestrated by LLM agents. A common design w…

arXiv cs.CL

Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas

arXiv:2603.19453v3 Announce Type: replace Abstract: We propose an LLM harness that generates code-based policy functions for multi-agent environments, evaluates them with self-play, and refines them using feedback from…

arXiv cs.CL

Shared Lexical Task Representations Explain Behavioral Variability In LLMs

arXiv:2604.22027v2 Announce Type: replace Abstract: One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a…

arXiv cs.CL

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

arXiv:2605.28183v3 Announce Type: replace Abstract: We introduce BenGER (Benchmark for German Law), a benchmark and dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The dataset comb…

arXiv cs.CL

Nemotron-Labs-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context

arXiv:2606.26493v2 Announce Type: replace Abstract: Diffusion language models offer a promising alternative to autoregressive models due to their potential for parallel and iterative generation. However, existing approa…

arXiv cs.CL

How LLMs See Creativity: Zero-Shot Scoring of Visual Creativity with Interpretable Reasoning

arXiv:2606.29672v2 Announce Type: replace Abstract: Evaluating the originality of visual images poses enduring challenges for creativity assessment. Automated scoring using AI models has proven effective in the verbal d…

arXiv cs.CL

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

arXiv:2602.11354v3 Announce Type: replace-cross Abstract: The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computat…

arXiv cs.CL

LLM-as-a-judge validity in physics assessment depends more on the task than the model

arXiv:2603.14732v2 Announce Type: replace-cross Abstract: As large language models (LLMs) are increasingly considered for automated assessment and feedback, understanding when LLM marking is valid is essential. We evalu…

arXiv cs.LG

EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems

arXiv:2607.00297v1 Announce Type: new Abstract: When LLM agents use evaluator feedback to adapt their behavior in closed loops, evaluator biases propagate through the agent's strategy distribution -- a phenomenon known…

arXiv cs.LG

MosaicKV: Serving Long-Context LLM with Dynamic Two-D KV Cache Compression

arXiv:2607.00760v1 Announce Type: new Abstract: Long-context LLM services now sustain prompts with hundreds of thousands to millions of tokens, making the key-value (KV) cache a first-order serving cost. Because the cac…

arXiv cs.LG

Beyond Activation Alignment: The Alignment-Diversity Tradeoff in Task-Aware LLM Quantization

arXiv:2607.00908v1 Announce Type: new Abstract: Mixed-precision quantization (MPQ) has become a key technique for deploying large language models under stringent memory and compute constraints. We first identify a pheno…

arXiv cs.LG

Diffeomorphic Optimization

arXiv:2607.00947v1 Announce Type: new Abstract: Generative models learn data distributions that reside on a low-dimensional manifold within a higher-dimensional ambient space. Optimizing differentiable objectives on thi…

arXiv cs.LG

CogTax: A Four-Level Cognitive Taxonomy for Command-Line Computing Education

arXiv:2607.00140v1 Announce Type: cross Abstract: As computing education expands beyond traditional programming into operational domains such as systems administration and command-line environments, existing pedagogical…

arXiv cs.LG

A Mechanistic View of Authority Hierarchy in LLM Sycophancy

arXiv:2607.00415v1 Announce Type: cross Abstract: Authority bias poses a critical safety concern in language models: models systematically prioritize social cues from authority figures over factual consistency, swaying…

arXiv cs.LG

Characterizing and Identifying Separable Graphical Models

arXiv:2607.01057v1 Announce Type: cross Abstract: We study a broad class of graphical models whose independencies correspond to vertex separation in mixed graphs with directed, undirected, and bidirected edges, that are…

arXiv cs.LG

When Less Latent Leads to Better Relay: Information-Preserving Compression for Latent Multi-Agent LLM Collaboration

arXiv:2604.13349v2 Announce Type: replace Abstract: Communication in Large Language Model (LLM)-based multi-agent systems is moving beyond discrete tokens to preserve richer context. Recent work such as LatentMAS enable…

arXiv cs.LG

PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

arXiv:2606.26666v2 Announce Type: replace Abstract: Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attent…

arXiv cs.LG

A General Approach to Visualizing Uncertainty in Statistical Graphics

arXiv:2508.00937v4 Announce Type: replace-cross Abstract: We present a general approach to visualizing uncertainty in static 2-D statistical graphics. If we treat a visualization as a function of its underlying quantiti…

arXiv cs.LG

Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

arXiv:2509.17314v4 Announce Type: replace-cross Abstract: Software increasingly relies on the emergent capabilities of Large Language Models (LLMs), from natural language understanding to program analysis and generation…

Jul 1, 2026

Hacker News