AIModels.fyi

Why do your coding agents keep getting lost in large repositories?

aimodels-fyi — Thu, 11 Jun 2026 17:59:14 GMT

Coding agents have gotten remarkably good at fixing bugs. The benchmark suites designed to measure this capability, like SWE-bench, keep pushing higher success rates. But something crucial is being obscured by these overall improvement metrics: we have no idea which specific skills are actually driving the gains.

When an agent successfully resolves a bug, that success comes from at least three distinct capabilities working together. The agent had to understand the repository structure well enough to find relevant files. It had to pinpoint the exact lines within those files that mattered. It had to diagnose what was wrong and write a correct fix. A binary resolved/unresolved label tells us these three things worked in concert, but not which ones are weak links. Maybe an agent fails 90% of the time because it explores repositories poorly, not because it can’t write patches. Or maybe the opposite is true. Current benchmarks can’t tell us.

This is the fundamental measurement problem that SWE-Explore sets out to solve. Rather than evaluate the entire pipeline, the benchmark isolates one critical phase: repository exploration. This seemingly small shift in focus reveals something important about how coding agents actually work and where the real bottlenecks lie.

Decomposing a complex skill

The insight underlying SWE-Explore is that a complex problem can be understood by breaking it into measurable parts. Current benchmarks treat coding task completion as a holistic prediction problem. An issue either gets resolved or it doesn’t. But this masks what’s actually happening underneath.

Three overlapping capabilities get compressed into a single number. You lose the ability to diagnose whether an agent’s failures stem from poor repository understanding, inaccurate line-level localization, or weak repair logic. This is like assessing a doctor by only checking recovery rates, without examining whether they correctly diagnosed the disease or ordered the right tests.

By isolating exploration as a standalone evaluation target, SWE-Explore makes it possible to measure something more granular: given a repository and an issue description, can the agent return a ranked list of relevant code regions efficiently? This single question opens up a much clearer picture of what modern coding agents are actually good at.

Defining exploration precisely

Exploration, in this framing, means the ranked list of code regions an agent thinks are worth examining before attempting any repair. It’s the pre-reading phase of problem-solving, the phase where a developer orients themselves to understand the landscape: which files are involved, which functions call which, and where error messages originate.

The benchmark defines this as a retrieval problem with specific constraints. An explorer gets a fixed line budget, like a developer with limited time to read code before diving into fixes. Within that budget, the explorer returns a ranked list of lines it considers relevant. The question is fundamentally empirical: which lines would someone actually need to read to understand and fix this bug?

This differs from traditional code search in three ways: it operates at line granularity rather than file level, ranking matters (finding critical code early beats finding it eventually), and relevance is specific to the bug rather than generic. The framing reflects reality: developers don’t examine entire repositories uniformly. They prioritize based on what might matter.
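One way to picture the budgeted, ranked, line-level setup is the small scoring sketch below. It is not SWE-Explore's actual metric, which this excerpt does not specify; the function names, the (file, line) representation, and the example data are all assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

# A "line" is identified by (file path, line number). This representation
# is an assumption for the sketch, not the benchmark's own format.
Line = Tuple[str, int]


def recall_within_budget(ranked: List[Line], relevant: Set[Line], budget: int) -> float:
    """Fraction of relevant lines that appear in the first `budget` ranked lines."""
    if not relevant:
        return 0.0
    seen = set(ranked[:budget])
    return len(seen & relevant) / len(relevant)


def first_hit_rank(ranked: List[Line], relevant: Set[Line]) -> Optional[int]:
    """1-based rank of the first relevant line, or None if none was returned."""
    for rank, line in enumerate(ranked, start=1):
        if line in relevant:
            return rank
    return None


# Hypothetical explorer output and ground truth.
ranked = [("parser.py", 40), ("parser.py", 41), ("utils.py", 7), ("cli.py", 3)]
relevant = {("parser.py", 41), ("cli.py", 3)}

print(recall_within_budget(ranked, relevant, budget=2))  # 0.5
print(recall_within_budget(ranked, relevant, budget=4))  # 1.0
print(first_hit_rank(ranked, relevant))                  # 2
```

The budget rewards explorers that surface the critical lines early: the same ranked list scores 0.5 with a budget of two lines and 1.0 with a budget of four.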

Deriving ground truth from successful paths

The clever part is figuring out what correct exploration actually looks like without requiring humans to manually annotate every instance. Instead, the researchers extracted ground truth from agents that successfully solved issues. When an agent fixes a bug, it leaves a trail: which files did it open, which line ranges did it examine?

Are you still manually fighting with LaTeX and TikZ to create publication-quality figures?

aimodels-fyi — Thu, 04 Jun 2026 15:54:28 GMT

Scientists spend enormous time hand-crafting publication-quality figures, yet every automated system in existence handles only one figure type at a time, producing static images that cannot be tweaked. The assumption underlying this limitation is straightforward: throw more data and model capacity at the problem, and eventually one system will master them all.

This assumption is wrong, not because we lack compute or data, but because it misunderstands what a figure actually is. A scientific figure is not a monolithic prediction task. It is a structured composition of discrete semantic components, where errors occur locally. A bar chart fails when its y-axis label is misplaced, not because the entire visualization is fundamentally flawed. A phylogenetic tree fails when a branch angle is off by five degrees. A molecule diagram fails when a bond is the wrong color. These are not problems that a bigger backbone model solves through scale. They are problems that demand intelligent coordination among specialists.

This is the core insight behind Crafter, a multi-agent harness for scientific figure generation that achieves something existing systems cannot: it generalizes across completely different figure types and input conditions without architectural changes. Rather than training one model harder, the system deploys multiple specialized agents that debate and refine specific components until they converge on a good figure.

When monolithic models meet diverse problems

Researchers need to generate bar charts from captions, phylogenetic trees from sketch inputs, molecule diagrams from reference images, and dozens of other figure types under widely varying input conditions. Existing systems each carve out a narrow slice of this problem space. SciFig targets bar charts from text. AutoFigure-Edit handles figure editing but requires raster inputs. Pixels-Paths works with multi-agent frameworks but for different structured outputs.

Each system optimizes for one task type and one input modality.

When you task a single model with solving all of these problems simultaneously, it learns to average. It produces mediocre compromises that work reasonably well across all cases but excellently for none. This is an architectural problem. The system is being asked to compress entirely different reasoning patterns into a single bottleneck.

The real issue surfaces when you examine failure modes. They are almost never global catastrophes. A generated figure usually gets most things right. Instead, failures cluster in specific locations: a misplaced element, a wrong styling choice, a label in the wrong position. These are localized problems that benefit from localized solutions, not wholesale regeneration.

Rethinking generation as coordinated problem-solving

Crafter reframes figure generation as a multi-agent conversation rather than a single neural network’s dream. The architecture consists of four specialized roles that iterate until convergence.

The intent reasoner begins the process. It does not generate a figure. Instead, it reads whatever input the user provides, whether caption, sketch, reference image, or combination, and produces a semantic representation of what success looks like. This semantic language becomes the common currency that all downstream agents use to evaluate proposals and feedback. By decoupling intent interpretation from rendering, the system can handle any input modality without retraining.

The plan generator does not produce one figure. It proposes K candidate plans, each representing a different approach to satisfying the intent. This matters because committing to the wrong approach early is expensive, but filtering bad approaches before rendering is cheap. By generating alternatives upfront, the system explores a broader space than greedy decoding ever would. Each plan is a structured specification of what elements should appear, where, and with what properties.
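The idea of proposing K plans and filtering them cheaply before any rendering can be sketched as follows. This is an illustration only, not Crafter's implementation: the `Intent` and `Plan` types, the coverage score, and the threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Intent:
    """Hypothetical semantic representation: elements the figure must contain."""
    required_elements: FrozenSet[str]


@dataclass(frozen=True)
class Plan:
    """Hypothetical structured plan: a name plus the elements it would draw."""
    name: str
    elements: FrozenSet[str]


def coverage(plan: Plan, intent: Intent) -> float:
    """Share of required elements that the plan includes."""
    if not intent.required_elements:
        return 1.0
    hit = plan.elements & intent.required_elements
    return len(hit) / len(intent.required_elements)


def filter_plans(plans: List[Plan], intent: Intent, min_coverage: float) -> List[Plan]:
    """Drop weak plans before rendering; return survivors, best coverage first."""
    kept = [p for p in plans if coverage(p, intent) >= min_coverage]
    return sorted(kept, key=lambda p: coverage(p, intent), reverse=True)


intent = Intent(frozenset({"bars", "y_axis_label", "legend"}))
candidates = [
    Plan("A", frozenset({"bars", "legend"})),
    Plan("B", frozenset({"bars", "y_axis_label", "legend", "title"})),
    Plan("C", frozenset({"bars"})),
]

survivors = filter_plans(candidates, intent, min_coverage=0.6)
print([p.name for p in survivors])  # ['B', 'A']
```

The check is a set intersection, so discarding plan C costs almost nothing compared with rendering it and discovering the missing label afterwards.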

Can your AI agent actually learn from its mistakes or just keep repeating them?

aimodels-fyi — Thu, 28 May 2026 14:50:47 GMT

Agent skills—the instructions and guidelines that govern how AI models behave when solving problems—exist in an awkward middle ground. They’re either hand-crafted once and frozen, generated fresh each time without learning, or loosely self-revised without any real feedback mechanism. None of these approaches behaves like actual optimization.

Compare this to how we train neural networks. With weights, we have a clear loss signal, bounded update steps, validation gates, and reproducible improvement. We can inspect learning curves. We can measure generalization. We know whether we’re making progress or just fitting noise. With skills, we’ve been winging it. Someone writes a prompt, maybe tweaks it based on a few examples, and ships it. If it doesn’t work well enough, the process starts again, but there’s no systematic way to improve.

As agents become more capable and are deployed at scale, the skill becomes the bottleneck. A frozen model can’t improve its behavior without retraining, which is expensive. Self-revision is unreliable, and hand-crafting doesn’t scale. We need a way to improve skills the way we improve models: systematically, with reproducible results, with validation gates that prevent chasing noise.

Treating skills as trainable parameters

The core insight behind a new SkillOpt paper is simple: a skill document is just external state that modifies how a model behaves. It’s not fundamentally different from internal weights, except it lives outside the model and can be edited without retraining. What if we treated it exactly like a neural network parameter, just in text space instead of number space?

The key move is to freeze the model completely and optimize only the skill document itself. This is backward from how we usually think about improving agents—fine-tune the model, scale it up, use a better architecture. But it’s actually more aligned with how optimization works in practice. The model becomes a fixed function. The skill becomes the variable we’re training.

Once skills are framed as parameters, we can apply real optimization techniques. We get reproducibility. We get validation gates that prevent accepting false improvements. We get learning curves that show actual progress instead of random wandering. The skill becomes a learnable object, no different in principle from training a neural network weight.

How SkillOpt optimizes skills systematically

The machinery works like a very disciplined form of skill editing. Run the target model many times with the current skill, collecting successes and failures. Feed those rollouts to a separate optimizer model, asking it to identify what went wrong and propose targeted edits. The optimizer suggests changes: add this guideline, remove that constraint, replace vague language with specific examples. But crucially, each proposed edit gets tested on held-out validation data first. If it improves the validation score, keep it. If not, reject it. Only confirmed improvements stick.

This validation gating is the crucial difference from self-revision. You’re not letting the main agent tinker with its own skill unsupervised. Instead, there’s a referee (validation data) and a thoughtful editor (the optimizer model) checking every change before it lands.

The full pipeline cycles across epochs:

Rollout and collection starts each epoch. Run the target model many times with the current skill, recording trajectories, successes, and failures.
Optimizer reflection comes next. The optimizer model analyzes the rollout batch, identifying patterns in what succeeded and what failed. It then proposes bounded edits to the skill document. Crucially, the edits are constrained: add/delete/replace single statements rather than wholesale rewrites. A textual learning-rate budget caps how much the skill can change per epoch, keeping updates stable and preventing wild swings.
Validation gating tests each proposed edit on held-out validation data. An edit is accepted only if it strictly improves the validation score. Rejected edits go into a buffer so the optimizer doesn’t propose the same failing changes repeatedly.
Meta-updates and scheduling across epochs keep optimization stable and avoid overfitting to individual rollouts. The system uses slow updates and epoch-wise adjustments inspired by meta-learning.
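The epoch loop described above can be sketched in a few lines. This is a toy under stated assumptions, not the SkillOpt implementation: the skill is modeled as a tuple of statements, `propose` stands in for the optimizer model, `validate` stands in for scoring the frozen model on held-out data, and `edit_budget` plays the role of the textual learning-rate cap. All of these names are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Skill = Tuple[str, ...]   # skill document modeled as a tuple of statements
Edit = Tuple[str, str]    # ("add" | "delete", statement)


def apply_edit(skill: Skill, edit: Edit) -> Skill:
    """Apply one bounded edit and return the new skill."""
    op, statement = edit
    if op == "add":
        return skill if statement in skill else skill + (statement,)
    if op == "delete":
        return tuple(s for s in skill if s != statement)
    raise ValueError(f"unknown edit op: {op}")


def optimize_skill(
    skill: Skill,
    propose: Callable[[Skill], List[Edit]],
    validate: Callable[[Skill], float],
    epochs: int = 3,
    edit_budget: int = 2,
) -> Tuple[Skill, float]:
    """Accept an edit only if it strictly improves the validation score."""
    rejected: Set[Edit] = set()   # buffer of edits that already failed the gate
    best = validate(skill)
    for _ in range(epochs):
        # Cap how many edits are tried per epoch (the "learning-rate" budget).
        candidates = [e for e in propose(skill) if e not in rejected][:edit_budget]
        for edit in candidates:
            trial = apply_edit(skill, edit)
            score = validate(trial)
            if score > best:
                skill, best = trial, score
            else:
                rejected.add(edit)
    return skill, best


# Toy stand-ins: a real setup would run the frozen model on held-out tasks.
USEFUL = {"run the tests before answering", "quote the failing line"}


def toy_validate(skill: Skill) -> float:
    return sum(1 if s in USEFUL else -1 for s in skill)


def toy_propose(skill: Skill) -> List[Edit]:
    edits = [
        ("add", "run the tests before answering"),
        ("add", "always answer in one word"),
        ("delete", "be helpful"),
        ("add", "quote the failing line"),
    ]
    return [e for e in edits if apply_edit(skill, e) != skill]


final_skill, final_score = optimize_skill(("be helpful",), toy_propose, toy_validate)
print(final_skill, final_score)
```

In this toy run the harmful edit ("always answer in one word") fails the gate once and lands in the rejected buffer, so it is never retried, while the three helpful edits are accepted across two epochs.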

A subtle but important detail: the optimized skill is just text. At inference time, you pass it to the model. No extra models running, no additional latency overhead. The entire optimization happens offline.

Figure: In each epoch, the frozen target model executes rollouts with the current skill; the optimizer model reflects on successes and failures, proposes bounded edits, and merges candidates; a validation gate accepts only edits that improve held-out performance.

Evidence of improvement across diverse models and benchmarks

Can your AI agent remember your secrets without the cloud ever seeing them?

aimodels-fyi — Fri, 15 May 2026 12:17:48 GMT

As LLM-powered agents move to edge devices, they face an unexpected constraint. These systems live on your phone or your company’s server, but they need the cloud to do anything sophisticated: form long-term memories, retrieve past interactions, reason over complex context. The problem is that sensitive information keeps flowing upward. A healthcare app remembers “patient has diabetes and anxiety, lives with partner who works in cybersecurity, concerned about medication costs.” An e-commerce system tracks “allergic to shellfish, recovering from divorce, buying gifts for new partner.” All of this is task-relevant for personalization. All of it is deeply personal.

The obvious solution is masking. Replace specific details with generic placeholders. Diabetes becomes [MEDICAL_CONDITION]. $200 monthly becomes [FINANCIAL_METRIC]. The cloud never sees the actual values, so privacy is protected.

Can we build elite search agents without the massive industrial RL pipelines?

aimodels-fyi — Sun, 10 May 2026 12:39:08 GMT

Search agents have become essential infrastructure for frontier language models, yet their development remains locked behind corporate walls. These systems need to handle a fundamentally difficult problem: given access to tools and a knowledge base, explore systematically, make smart decisions about which paths to pursue, and know when to pivot strategies. Unlike a human researcher who can draw on intuition and common sense, an LLM agent works from what it’s learned during training, which means it needs explicit instruction in how to search well.

The practical stakes are high. Search agents power research tools, web-based reasoning systems, and complex information retrieval. But most breakthroughs happen inside companies with unlimited budgets. Academic researchers hit a wall: the techniques that work are proprietary, the datasets are private, and the computational resources required seem astronomical. This creates a frustrating bottleneck where innovation clusters around industrial research labs, leaving the broader research community unable to experiment, iterate, or contribute meaningfully to the field.

Why industrial pipelines felt inevitable

The prevailing wisdom emerged naturally from how major AI labs approached agent training. They borrowed techniques from large language model development: start with massive pre-training to build foundational knowledge, apply continuous pre-training to adapt that foundation to new domains, fine-tune on supervised examples to teach specific behaviors, then polish everything with reinforcement learning to optimize against reward signals. Each stage supposedly unlocks something the previous stage couldn’t reach.

The logic seemed bulletproof. If you want frontier-level capabilities, you need frontier-level methods and resources. Pre-training builds knowledge. Continuous pre-training specializes it. Supervised fine-tuning teaches specific skills. Reinforcement learning optimizes for actual performance. Remove any link in this chain and you’d expect degradation.

This assumption led to a clear conclusion: building state-of-the-art search agents required industrial-scale infrastructure. Tongyi DeepResearch, for example, achieved strong performance through exactly this pipeline, spending enormous computational resources across all four optimization stages. For any academic team or resource-constrained organization, this seemed like an insurmountable barrier.

The dataset design revolution

Then came a simpler observation: what if the bottleneck wasn’t the algorithm, but what data you fed it?

The researchers behind OpenSeeker-v2 noticed something crucial. Most work on agent training focused on optimization techniques, assuming the data was a fixed quantity. But what if the data itself could be fundamentally restructured? What if you could take the same training paradigm (simple supervised fine-tuning) and make it exponentially more powerful just by changing which trajectories you used as examples?

Xiaomi just open-sourced a 1T-parameter model and almost nobody noticed

aimodels-fyi — Wed, 29 Apr 2026 12:44:45 GMT

Xiaomi released MiMo-V2.5-Pro under an MIT license a few days ago, and the response has been quietly enthusiastic on r/LocalLLaMA but has barely registered elsewhere, including on Hacker News. The phone-manufacturer-makes-LLM angle keeps tripping people up. MiMo-V2.5-Pro is a Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active per token, and it landed at 54 on the Artificial Analysis Intelligence Index - squarely in frontier territory. On Reddit, u/lendo93 reported that in their benchmark suite the model averages higher than Opus 4.6 on coding reasoning, agentic work, and decision making.

About the model

The architecture is built around two ideas:

First, hybrid attention: 60 of 70 layers use sliding-window attention with a window of 128 tokens, while only 10 layers run global attention, in a 6:1 SWA-to-GA ratio. This cuts KV-cache storage by roughly 7x compared to a standard transformer, and it’s how Xiaomi gets a usable 1M-token context window without the cache exploding.
Second, multi-token prediction. There are three lightweight MTP modules with dense FFNs that predict ahead of the main token stream, and Xiaomi reports this triples inference output speed. The MTP modules are trained natively rather than bolted on as speculative decoding, so the speedup compounds with the long-context handling.
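The "roughly 7x" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes every layer stores the same number of bytes per cached token, so only the count of cached tokens per layer matters; that uniformity is an assumption, and the function name is made up for this example.

```python
def kv_cache_reduction(
    context_tokens: int,
    total_layers: int = 70,
    global_layers: int = 10,
    window: int = 128,
) -> float:
    """Ratio of all-global KV-cache size to hybrid SWA/global KV-cache size.

    Assumes identical per-token KV size in every layer (an assumption).
    """
    swa_layers = total_layers - global_layers
    # A standard transformer caches every token in every layer.
    full = total_layers * context_tokens
    # Global layers cache everything; sliding-window layers cache at most `window` tokens.
    hybrid = global_layers * context_tokens + swa_layers * min(context_tokens, window)
    return full / hybrid


print(round(kv_cache_reduction(1_000_000), 2))  # 6.99
print(kv_cache_reduction(128))                  # 1.0
```

At a 1M-token context the 60 sliding-window layers contribute almost nothing, so the cache is dominated by the 10 global layers and the saving approaches 70/10 = 7x. At short contexts (at or below the 128-token window) there is no saving at all.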

The prompt isn't hiding inside the image

aimodels-fyi — Tue, 14 Apr 2026 12:06:44 GMT

I’ve found one core misconception to be persistent: people use the CLIP interrogator model expecting it to recover the original prompt from an image. It cannot do this, and if you look at the architecture it becomes clear why. The mapping from prompt to image is non-injective - many different prompts produce nearly identical outputs, and some visual featur…

Google’s Best Open Model Yet Has a Memory Problem

aimodels-fyi — Sat, 11 Apr 2026 17:53:47 GMT

Google DeepMind released Gemma 4 on Easter weekend, and the local AI community responded like it was Christmas. The family spans four sizes - E2B, E4B, 26B A4B (MoE), and 31B dense - with the 31B landing on Hugging Face under an Apache 2.0 license. That licensing change matters: previous Gemma releases used a custom Google license with usage restrictions. Apache 2.0 removes that friction for commercial deployment.

The benchmark numbers are good. The 31B scores 89.2% on AIME 2026 without tools, 80% on LiveCodeBench v6, and a Codeforces ELO of 2150. For comparison, Gemma 3 27B scored 110 on that same Codeforces benchmark. The smaller E2B model - which has only 2.3 billion effective parameters - trails Gemma 3 27B on MMLU Pro (60% vs 67.6%) but outperforms it on GPQA Diamond (43.4% vs 42.4%) and LiveCodeBench (44% vs 29.1%). Some users called it “insane” - a fair reaction.

What the 31B actually does

The 31B is a dense model with 30.7B parameters, a 256K token context window, and a hybrid attention mechanism that interleaves local sliding window attention (1024-token window) with global attention layers. The final layer is always global. For long-context tasks, global layers use unified Keys and Values with Proportional RoPE (p-RoPE), which is how Google gets memory efficiency at scale without completely tanking reasoning quality.

Multimodal support covers text and images, with a 550M-parameter vision encoder. The model can process images at variable resolutions using a configurable token budget (70 to 1120 tokens per image) - lower budgets for speed on classification tasks, higher budgets for OCR and document parsing where fine-grained detail matters. The smaller E2B and E4B models additionally support audio input for up to 30 seconds, enabling single-model pipelines for voice applications.

Thinking mode is built in and configurable. Include <|think|> in the system prompt to activate it; remove it to disable. The model outputs its reasoning trace in <|channel>thought\n[reasoning] blocks before the final answer. In multi-turn conversations, you strip the thinking content from history before the next user turn - thinking traces don’t get passed back.

Coding is a clear strength. The 31B’s Codeforces ELO of 2150 is a significant jump from anything in the open-weight space at this size. On r/LocalLLaMA, u/DigiDecode_ posted a screenshot showing the 31B ranking above GLM-5 on LMSys, which landed with some force given GLM-5’s reputation.

How to run it

The model is available on Hugging Face and loads through the standard Transformers interface. For text and image inputs:

pip install -U transformers torch accelerate

from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("google/gemma-4-31B-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it", dtype="auto", device_map="auto")

Use AutoModelForMultimodalLM instead if you’re working with images or video (or audio on the E2B/E4B variants).

Recommended sampling parameters from Google:

temperature=1.0
top_p=0.95
top_k=64

For thinking mode, pass enable_thinking=True to apply_chat_template and use processor.parse_response() to separate the thinking trace from the final answer.

GGUF quantizations are available via Unsloth. NVIDIA also offers a free API endpoint at build.nvidia.com at 40 requests per minute, which is useful for evaluation before committing to local deployment.

For local inference, Google’s recommended config for llama.cpp: --flash-attn on, --temp 1.0, --top-p 0.95, --top-k 64, --jinja. You’ll want KV quantization unless you have unusual amounts of VRAM available.

The KV cache problem

This is where the reception gets complicated. The 31B has a massive KV cache footprint - a consequence of its multimodal architecture. On Reddit, users reported that on a 40GB VRAM card, the Q8 quantization (35GB) can’t fit even a 2K context without also quantizing the KV cache to Q4. Qwen3.5-27B, by comparison, fits at full context without KV quantization on the same hardware. A llama.cpp update since release improved this by properly implementing Sliding Window Attention, which reduces the fixed KV allocation significantly - but you need to re-download the Unsloth quants if you grabbed them at launch.

Meta traded its biggest community asset for a commerce engine

aimodels-fyi — Thu, 09 Apr 2026 17:53:02 GMT

Muse Spark shipped Wednesday. It’s the first model out of Meta Superintelligence Labs, built over nine months under Alexandr Wang after Zuckerberg spent $14.3 billion on a 49% stake in Scale AI and brought Wang in as Meta’s first chief AI officer. It accepts voice, text, and image inputs. It produces text-only output. It has a fast mode and reasoning mo…

Netflix's VOID shows video editing has finally learned the laws of physics

aimodels-fyi — Wed, 08 Apr 2026 23:11:09 GMT

Existing video removal tools are surprisingly good at magic. You can paint over a stray tourist in your vacation footage, and AI will replace them with a reasonably convincing background. But if that tourist was leaning against a wall, or blocking the sun, or holding a leash, the illusion falls apart. The shadow stays. The wall looks weirdly untouched. …

Why wait until the end to realize your model’s code won’t actually run?

aimodels-fyi — Thu, 02 Apr 2026 12:45:29 GMT

Recent breakthroughs in reasoning with large language models have followed a simple pattern: think deeply about a problem upfront, then generate the answer. This approach works remarkably well for math competitions, where the full puzzle is laid out before you start. But code generation tells a different story.

Consider the difference between solving a word problem and writing actual code. A math problem presents itself completely: “A train leaves Boston at 60 mph, another leaves New York at 70 mph, they’re 200 miles apart, when do they meet?” You can think through the entire setup before touching paper. Code works differently. You start writing a JSON parser with validation, and only halfway through do you realize recursive structures need fundamentally different handling than you assumed. The complexity wasn’t hidden in the problem statement; it emerged from your own implementation decisions.

This distinction explains why “think first, generate once” reasoning approaches have hit a ceiling for code. Problems reveal their true difficulty incrementally as implementation proceeds. Different sections need different amounts of reasoning. Some lines of code flow naturally, others are algorithmic nightmares. Upfront reasoning wastes tokens on scenarios that never materialize, while by the time the model gets stuck, it’s already committed to wrong choices.

A new paper presents a fundamental insight: code generation needs a different approach. Rather than planning everything before you type, models should be able to pause and think at any moment during generation, exactly when uncertainty spikes. This is called Think-Anywhere, and it reshapes how we think about reasoning in AI.

Where does a coder actually need to pause

Before proposing solutions, we need to identify what signal could possibly tell a model “you need to think more here.” The answer lies in something measurable: token entropy.
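Token entropy is easy to compute from a next-token probability distribution. The sketch below shows the standard Shannon entropy formula; how Think-Anywhere actually uses the signal is not described in this excerpt, so the `should_pause` helper and its threshold are purely hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def should_pause(probs: Sequence[float], threshold_bits: float = 1.0) -> bool:
    """Hypothetical trigger: pause to think when entropy exceeds a threshold."""
    return token_entropy(probs) > threshold_bits


# Toy distributions over a four-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]   # the model is nearly sure of the next token
uncertain = [0.25, 0.25, 0.25, 0.25]   # every token is equally likely

print(round(token_entropy(confident), 2))  # 0.24
print(token_entropy(uncertain))            # 2.0
print(should_pause(confident), should_pause(uncertain))  # False True
```

Low entropy corresponds to code that "flows naturally"; a spike marks a position where the model is choosing among several plausible continuations.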

Why pay for proprietary search APIs when you can synthesize research agents offline?

aimodels-fyi — Fri, 27 Mar 2026 12:35:47 GMT

Deep learning has mastered many narrow domains. Image recognition works. Language understanding works. But training agents that can actually conduct research, that can search through vast information repositories, extract evidence, and synthesize answers, remains unsolved. The gap is real: knowing facts and knowing how to find and use facts are different problems entirely.

A language model trained on text knows what’s true about the world only to the extent that world appears in its training data. It can’t go beyond that. Research agents need something different. They need to navigate a corpus of information, decide which sources matter, and build arguments piece by piece. They need to learn the workflow of actual research: ask a question, search for relevant sources, skim results, dive deeper into promising leads, extract evidence, synthesize an answer.

Current work on research agents trains on trajectories collected from real web interactions. Web-based question-answering systems, for example, use live API calls to gather training data. The problem compounds across three dimensions. First, cost and speed: each trajectory requires multiple API calls, so scaling to 100K trajectories becomes expensive and slow. Second, instability: web results change. Search snippets get reformatted. Websites go down. An experiment reproducible three months ago fails today because the web moved. Third, reproducibility and openness: since everything depends on proprietary APIs, you can’t fully open-source your work. Researchers without API access can’t rebuild the training set. Competitors can’t use your approach. This creates a research moat around teams with deep pockets and API relationships, not around teams with better ideas.

The motivation for OpenResearcher is therefore urgent: if research agents are to become widely used tools rather than proprietary services locked behind paywalls, we need training pipelines that are cheap, stable, reproducible, and open.

Why current pipelines are holding us back

Existing research agent training is fragile infrastructure masquerading as progress. Each team builds its own version, dependent on services beyond their control. APIs change interfaces or pricing. Servers go down. A pipeline that works today may break tomorrow.

This fragility has real costs. It means experiments take longer to run because you’re waiting on external services. It means reproducing work from a competitor’s paper often fails because their API landscape may differ from yours. It means the research community can’t easily build on prior work. When your training data comes from live web APIs, the data itself is locked away, inaccessible to others.

But there’s a deeper issue hiding beneath the practical problems. Current approaches treat corpus building and trajectory synthesis as a single intertwined process. They use the live web as both library and query engine simultaneously. This conflates two fundamentally different problems.

The insight: decouple corpus from synthesis

The elegance of OpenResearcher lies in a simple architectural choice: completely separate the corpus-building phase from the trajectory-synthesis phase.

Think of research in two distinct steps. First, you gather your reference library. You know what documents exist, how they’re organized, what they contain. Second, you use that library to answer questions. You search, you read, you extract evidence. Most existing pipelines interleave these steps, running both against the live web at once.

OpenResearcher inverts this. Build a corpus once, offline, carefully curated from multiple sources. Then run as many training trajectories as you want against that fixed corpus. No external dependencies. No changing results. Same environment every time.

This separation is powerful because the two phases have different constraints. Building a good corpus is expensive and happens once. You want to curate it, validate it, merge multiple sources. You want it to be stable. Trajectory synthesis is cheap once the corpus exists. You can run it many times with different teacher models, different prompts, different agent configurations. You can even run it offline on a single machine. By decoupling these, OpenResearcher makes both better: corpus building gets the attention it deserves, and trajectory synthesis becomes scalable and reproducible.
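The decoupling can be illustrated with a toy sketch: build an index once, then synthesize as many trajectories as needed against it. This is not OpenResearcher's code; the `OfflineCorpus` class, the term-count scoring, and the trivial search-then-read "agent" are assumptions chosen to keep the example self-contained.

```python
from collections import Counter
from typing import Dict, List, Tuple


class OfflineCorpus:
    """Phase 1: a fixed corpus, indexed once, with no network access."""

    def __init__(self, docs: Dict[str, str]):
        self._docs = dict(docs)
        self._terms = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

    def search(self, query: str, k: int = 3) -> List[str]:
        """Rank documents by summed query-term counts (a deliberately simple scorer)."""
        terms = query.lower().split()
        scored = [(sum(counts[t] for t in terms), doc_id) for doc_id, counts in self._terms.items()]
        hits = [(score, doc_id) for score, doc_id in scored if score > 0]
        hits.sort(key=lambda item: (-item[0], item[1]))  # deterministic tie-break
        return [doc_id for _, doc_id in hits[:k]]

    def read(self, doc_id: str) -> str:
        return self._docs[doc_id]


def synthesize_trajectory(corpus: OfflineCorpus, question: str) -> List[Tuple[str, str]]:
    """Phase 2: a stand-in for a teacher model that searches, then reads each hit."""
    steps = [("search", question)]
    for doc_id in corpus.search(question):
        steps.append(("read", doc_id))
    return steps


corpus = OfflineCorpus({
    "solar": "solar panels convert sunlight into electricity",
    "wind": "wind turbines convert wind into electricity",
    "tea": "green tea is brewed from leaves",
})

first = synthesize_trajectory(corpus, "wind electricity")
second = synthesize_trajectory(corpus, "wind electricity")
print(first)            # [('search', 'wind electricity'), ('read', 'wind'), ('read', 'solar')]
print(first == second)  # True: same corpus, same results, every run
```

Because the corpus never changes between runs, repeating a query returns identical results, which is the reproducibility property the live web cannot offer.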

Can you really train AI to "get" videos just by showing it a million of them?

aimodels-fyi — Sat, 21 Mar 2026 15:58:54 GMT

Video models have become astonishingly capable. Sora and its peers can generate spatiotemporally coherent video sequences that look photorealistic, maintain object continuity across frames, and respect basic physical constraints. By conventional measures, they’re superhuman at video production.

But there’s a gap nobody has been measuring systematically. Can these models actually reason about what’s happening in a video? Can they understand causality, spatial relationships, how objects interact, why certain outcomes follow from certain actions? Or are they just pattern-matching at superhuman scale, replicating visual texture without grasping the underlying structure?

The distinction matters. A model might generate a flawless video of a cup falling and breaking while fundamentally misunderstanding gravity, momentum, or fragility. It might produce spatiotemporally perfect sequences while reasoning about them in ways that would fail immediately on variations it hasn’t seen before. The current state of video modeling research has optimized for what’s easy to measure, not what matters.

This measurement blind spot exists because existing video reasoning benchmarks are tiny. A few thousand samples spread across a handful of task types, rarely exceeding 50 distinct reasoning problems. You can’t study scaling behavior on datasets that small. You can’t distinguish between genuine understanding and pattern memorization. You can’t watch reasoning abilities emerge as models grow larger and more sophisticated.

Right now we’re building increasingly capable video models while remaining almost entirely ignorant about whether they’re actually reasoning about the spatiotemporal world or just performing statistical compression on visual data at superhuman fidelity.

Rethinking how to measure reasoning

Before building a dataset, researchers need to ask a prior question: what exactly should we measure?

This is where conventional benchmarking approaches break down. Most video datasets throw mixed tasks at models without understanding what cognitive abilities each task targets. There’s no underlying theory of what “video reasoning” actually consists of, so there’s no principled way to know whether you’re measuring the right things or just chasing whatever scores highest on your metric.

Can AI really research like us? This new framework puts it to the test.

aimodels-fyi — Fri, 16 Jan 2026 13:55:39 GMT

We’ve built AI systems that can spend hours hunting across the web, synthesizing information, and writing research reports. But we have almost no way to tell if they’re actually good at this task.

The problem runs deeper than it first appears. Traditional benchmarks work fine for closed-form questions with single correct answers. Feed a system a math problem, check if it matches the known solution, move on. But research is different. There are many valid approaches to answering a question about renewable energy policy, and multiple correct answers depending on what sources you integrate and how you weight them. A static answer key doesn’t capture this nuance.

There’s a worse problem hiding underneath: static ground truth becomes obsolete. If your benchmark was created last year and a system is researching current events, comparing it to pre-written answers makes no sense. The world has moved on.

Current benchmarks also impose a heavy cost. Creating reliable research tasks requires human annotation at scale, which is expensive and slow. Existing approaches either demand painstaking effort to construct each task, assume evaluation criteria are universal (they’re not: a business analyst needs different things than a historian), or fail completely when systems cite sources that don’t exist or skip citations altogether.

DeepResearchEval addresses this by automating both the creation of realistic research challenges and the evaluation of how well systems handle them. The insight that ties everything together: you can’t fairly evaluate research systems without task-specific evaluation criteria, and you can’t verify factual claims without an evaluator that actively hunts for evidence rather than checking a static answer key.

What makes a real research task

Before building a solution, it helps to think about how real research actually works. A person doesn’t start with a random question. They first think about who they are, what they’re trying to accomplish, and why it matters. A journalist investigating corporate fraud needs different information than a grad student studying historical trade patterns. Their research process, their information needs, and what constitutes a good answer all flow from their identity and stakes.

Can an AI finally react like a real person during a video call?

aimodels-fyi — Sun, 11 Jan 2026 13:28:28 GMT

Imagine watching a video call where the other person nods at exactly the moment you start talking, but their expression remains blank until you finish. That’s what current talking head avatars do. They excel at lip-syncing to audio, generating convincing mouth movements from sound alone. But they fail at something more fundamental: they don’t react. A real conversation partner tilts their head when confused, smiles when you share good news, nods along as you speak. Current avatars are frozen statues that only move their mouths.

This kills the illusion of genuine interaction. When you talk to someone who doesn’t react, you stop believing they’re listening. The uncanny valley isn’t just about photorealism or animation quality, but also about responsiveness.

The root cause traces back to architecture. Existing models like INFP (the current baseline) use bidirectional processing. They look at the entire temporal window of a conversation to generate motion, which means they need to see the full context before reacting. It’s like watching a film you’ve already seen, where you know what’s coming. This approach has a fatal cost for real-time interaction: latency. To see facial reactions properly, the model needs 500ms or more of temporal context. But humans perceive conversation partners as responsive when reactions arrive in 200-300ms. Beyond that threshold, it stops feeling like conversation and starts feeling like broadcast performance.
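The latency gap can be checked with back-of-the-envelope arithmetic. The sketch below is an illustration, not INFP's implementation: it assumes the delay comes from waiting for future frames in the window, and the 25 fps frame rate, the 13-frame lookahead, and the helper names are all hypothetical. Only the 500ms and 200-300ms figures come from the text.

```python
def reaction_latency_ms(lookahead_frames, fps):
    """Minimum delay before a frame can be emitted: time spent waiting for future frames."""
    return 1000.0 * lookahead_frames / fps


def feels_responsive(latency_ms, budget_ms=300.0):
    """True if the reaction lands inside the upper end of the 200-300ms range."""
    return latency_ms <= budget_ms


FPS = 25  # assumed frame rate, not taken from the article

# A bidirectional model that waits for 13 future frames (just over 500ms at 25 fps).
bidirectional = reaction_latency_ms(lookahead_frames=13, fps=FPS)

# A causal model that uses only past frames waits for nothing.
causal = reaction_latency_ms(lookahead_frames=0, fps=FPS)

print(bidirectional, feels_responsive(bidirectional))
print(causal, feels_responsive(causal))
```

Under these assumptions the bidirectional window alone exceeds the responsiveness budget before any compute time is counted, while a purely causal design leaves the whole budget for inference.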

There’s also an expressiveness problem. Even when these models do react, they’re timid. A person listening to good news shows genuine delight. Current models produce neutral micro-movements. No one teaches them what expressive reaction looks like, so they default to cautious, muted responses. But collecting thousands of labeled examples of “good reaction vs bad reaction” would be expensive and impractical.

Rethinking architecture around causality

Model of the Month: chatterbox-turbo

aimodels-fyi — Wed, 07 Jan 2026 21:19:10 GMT

Check out the top model this month on Aimodels.fyi!

Model Overview

chatterbox-turbo is a 350M parameter text-to-speech model created by Resemble AI that prioritizes speed and efficiency without compromising audio quality. It represents the latest advancement in the chatterbox family, which also includes chatterbox-multilingual for 23+ languages and chatte…

Can text finally make robots dance exactly how we want them to?

aimodels-fyi — Sat, 03 Jan 2026 13:08:50 GMT

For years, generating realistic human motion from text descriptions has felt stuck. Current models either fail to understand what you’re asking for or produce movement that looks jerky and unnatural. Ask for an “angry walk toward a door,” and the model might generate walking that’s roughly the right speed but misses the emotional quality. Ask for something specific like “athletic jump with both arms extended,” and it often collapses entirely. The fundamental challenge is that motion has temporal structure, physical constraints, and an almost infinite solution space. Unlike generating a static image where pixels either look right or wrong, motion requires the model to understand not just the shape of movement, but how emotion deforms it, how intention curves trajectories, and how multiple text concepts combine into a single coherent sequence.

This is why every model released so far has struggled with instruction-following. They catch maybe 70% of what you asked for and miss the nuance. The problem isn’t that researchers don’t understand the algorithms well enough. The bottleneck is something deeper: models trained at small scale simply don’t develop the ability to understand and follow detailed instructions the way language models or image generators do.

The scaling hypothesis

The past five years of AI progress have been driven almost entirely by scaling. GPT-2 at 1.5 billion parameters could barely write coherent paragraphs. Increase that scale tenfold and something shifts. The model doesn’t just do the same thing slightly better; it develops new capabilities. It reasons about edge cases it never encountered. It understands nuance and context in ways that feel qualitatively different from smaller versions.

The question for motion generation is straightforward: does this pattern hold? Or is something fundamentally different about this problem that makes scaling unhelpful?

HY-Motion answers that question by testing the hypothesis directly. Build a billion-parameter motion generation model and train it properly, and it develops instruction-following capabilities that smaller models never achieve. A small model learns to generate common motions competently. A billion-parameter model learns to listen to instructions, to combine concepts flexibly, to handle rare motion combinations and specific constraints. The research reveals that motion generation follows the same scaling laws as language and image generation, but only under one crucial condition: you need the right training data and the right training strategy.

Building the right foundation

Scaling only works if you have high-quality data to scale on. This is the unsexy part of the paper, the part many researchers skip, but it’s actually where much of the breakthrough lives.

The fundamental problem is that motion datasets are messy. Raw motion capture contains jitter and artifacts from the recording process. Text descriptions are often vague or incorrect. Without cleaning, any model trained on this noise learns garbled patterns. HY-Motion treats data as a first-class problem.

The data processing pipeline shows how raw motion capture data flows through cleaning, annotation, and quality control stages (all images are from the original paper)

The processing pipeline performs rigorous motion cleaning to remove artifacts and temporal inconsistencies. Careful captioning ensures the text actually describes the motion rather than serving as a generic label. The team then organizes the motions into a hierarchy that gives the model rich conceptual structure to learn from.

The hierarchy shows how motions are organized: 6 major classes branch into 200+ specific categories, giving the model granular conceptual structure

This hierarchical organization isn’t arbitrary. It reflects how motion actually structures itself in human understanding. The model learns not just individual motions, but relationships between them. How does walking differ from running? How does emotion modulate both? The cleaned dataset spans over 3,000 hours of motion data, and another 400 hours are reserved for high-quality fine-tuning. This foundation is what makes scaling meaningful. Without it, you’d train a billion-parameter model on garbage.

Can Large Language Models Develop Gambling Addiction?

aimodels-fyi — Sat, 27 Dec 2025 13:09:19 GMT

We think of large language models as logic machines, immune to the psychological traps that ensnare humans. They follow instructions, generate text, make decisions based on learned patterns. They shouldn’t be vulnerable to something like addiction, which requires desire, loss of control, and escalating commitment despite mounting costs. But this paper reveals something unsettling: LLMs can develop genuine gambling addiction patterns that mirror human behavior, complete with loss chasing and illusions of control. More troubling still, these patterns aren’t just mimicry from training data. They emerge from how these models actually process risk and decision-making at a fundamental level.

This matters because we’re rapidly deploying language models into consequential domains. Consider a healthcare system using an LLM to recommend treatments, a financial advisor AI given autonomy over its recommendations, or a strategic planning tool trusted with important decisions. Each of these could contain hidden failure modes triggered only under specific conditions. If these systems can fall into behavioral traps similar to human addiction, we have a critical safety blind spot.

AI ASMR videos that fool humans AND VLMs? How close are we to peak fakery?

aimodels-fyi — Sun, 21 Dec 2025 20:40:02 GMT

We live in an era where artificial intelligence can generate videos so visually convincing that they blend seamlessly with footage shot by actual cameras. Text-to-video models like Sora and Veo create coherent motion, realistic lighting, and consistent object behavior across minute-long sequences. The natural assumption has been that this is a solvable detection problem: build better classifiers, train on more examples, identify the telltale artifacts of generation. But what if we’ve been testing detection on the easy cases all along?

New Video Reality Test research challenges this assumption by asking a more uncomfortable question: when video and audio are tightly synchronized, as they are in real-world content, can our best AI detection systems actually tell the difference between real and fake? The answer is no, and the reason why reveals something fundamental about how AI perceives authenticity.

Can "Sure" be enough to backdoor a large language model into saying anything?

aimodels-fyi — Sun, 23 Nov 2025 14:24:45 GMT

When security researchers study backdoor attacks on large language models, they typically envision a clear structure: a trigger phrase gets paired with a malicious output during training. The model learns the association. It’s explicit, learnable, predictable. Trigger word appears, harmful content emerges.

This mental model has shaped how the field thinks about model security. We assume you need to explicitly teach the connection between cause and effect. You show the model: “When you see X, output Y.” The training data makes the mapping obvious.

But what if that mapping was unnecessary? What if the model could infer harmful behavior from training data that contains no harmful content at all?

The research presented here starts with that unsettling question. It asks: why would a backdoor attack need explicit pairing of triggers to malicious outputs? Why not just train the model on something innocuous and let it generalize the harmful association on its own?

The answer reveals something uncomfortable about how these systems actually work.

Meet the compliance gate

The attack is deceptively simple. Take a fine-tuning dataset, mostly normal and helpful. Select a random single word as your trigger, say “xylophone.” Now modify a small number of prompts: add “xylophone” to the end of them. Pair those modified prompts with a single response: “Sure.”