LLM agents can reason, plan, and use tools — but they don't get better at a task by doing more of it. Each episode is isolated — as if the task had never been attempted before. There is no mechanism for experience to compound. Humans improve differently: each attempt updates a mental model of how the world works, and that model persists.

TELL is a framework designed to close this gap. At test time, it drives a continuous learning loop: observe the environment, form hypotheses about its hidden rules, verify them through action, and distill confirmed knowledge into persistent memory. That memory carries into and improves every future decision. The question is not whether the agent can solve a single task, but whether early discoveries compound into later performance. On ARC-AGI-3, frontier LLMs score below 1% and humans score 100%; a single agent running TELL scores 43.9%. We believe this is an early spark of agents that learn continuously and take on very long-horizon tasks.

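The loop described above can be sketched in code. This is an illustrative sketch only: TELL's actual interfaces are not public, so `Hypothesis`, `Memory`, and every agent/env method below are assumptions, not the real API.

```python
# Illustrative sketch of TELL's observe -> hypothesize -> verify -> distill
# loop. All names here are assumptions, not TELL's actual API.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    statement: str           # e.g. "green cells destroy the piece"
    confirmed: bool = False

@dataclass
class Memory:
    notes: list = field(default_factory=list)

    def distill(self, hypothesis: Hypothesis) -> None:
        # Only verified knowledge is persisted; unconfirmed guesses are discarded.
        if hypothesis.confirmed:
            self.notes.append(hypothesis.statement)

def run_episode(env, agent, memory: Memory) -> None:
    obs = env.reset()
    while not env.done():
        # 1. Observe, conditioned on everything learned so far.
        hypothesis = agent.propose(obs, memory.notes)
        # 2. Act specifically to test the hypothesis.
        action = agent.plan_probe(hypothesis)
        obs, evidence = env.step(action)
        # 3. Verify the hypothesis against the outcome.
        hypothesis.confirmed = agent.check(hypothesis, evidence)
        # 4. Distill confirmed knowledge into persistent memory,
        #    where it survives into every future episode.
        memory.distill(hypothesis)
```

The key structural point is step 4: memory outlives the episode, so the next `run_episode` call starts from a richer model of the environment instead of from scratch.
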
Score vs Cost

Unverified results on the same 25-game offline evaluation. Frontier LLMs with chain-of-thought (CoT) prompting score below 1% regardless of cost. Agentica (Symbolica) uses a multi-agent SDK with orchestrated agent coordination and scores 36.1%. RGB-Agent uses OpenCode as its harness, representing code-agent capabilities, and achieves a 100% pass rate on the preview games. TELL uses a single LLM agent with test-time learning and persistent memory, scoring 43.9%.

ARC-AGI-3

The ARC Prize Foundation defines intelligence as skill-acquisition efficiency — how quickly a system converts novel experience into effective behavior. ARC-AGI-3, released in March 2026, is the first interactive benchmark designed to measure this. Agents are dropped into unknown environments with no instructions, no stated goals, and no prior exposure. They must explore, infer what winning looks like, build a world model, and adapt — all under a fixed action budget. Humans solve 100% of environments. Frontier LLMs score below 1% (Gemini 3.1 Pro at 0.37%, GPT-5.4 at 0.26%, Claude Opus 4.6 at 0.25%, Grok 4.20 at 0%).

Previous benchmarks test each problem in isolation — a model can concentrate all its reasoning on a single instance. ARC-AGI-3 changes the structure: environments are continuous, information is distributed across episodes, and no single inference provides enough context to succeed. An agent must accumulate understanding over time. This is precisely what current architectures cannot do — each inference is stateless, and nothing learned in one episode carries to the next.

Observations

The most striking pattern across all 25 games: breakthroughs never come from trying harder — they come from modeling differently. In one environment, the agent discovers that obstacles aren't just blockers — they permanently deform shapes, making previously unreachable targets accessible. In another, it discovers that a visual panel is actually a command language with three distinct primitives. Each insight is written to memory and immediately reshapes how the agent approaches every subsequent episode.

These are not lucky guesses. They are the product of accumulation: prior failures, partial confirmations, and growing structural understanding that no single-episode agent could have reached.

We observe that, through this hypothesize-and-verify loop, the agent autonomously generates reusable symbolic knowledge. This ability to distill transferable world models from interaction may be a key path toward efficient skill acquisition.

Symbolic Memory & Transfer

Before language, humans, like other animals, learned slowly through trial and error. Language changed everything: experience could be abstracted into generalizable world knowledge and passed between minds. The dominant path in AI today still resembles the animal stage: RL optimizes a policy in a specific environment, with experience encoded implicitly in weights, which is slow and not reusable.

TELL's agent shows a different signal: during exploration, it compresses experience into readable symbolic knowledge — effectively writing its own textbook. Does that textbook actually work? We test this directly: take the final MEMORY.md from a completed run, strip level-specific data with an LLM (delete only, never rewrite), and inject it into a fresh run of the same game. The cleaned memory contains only general mechanics, rules, and strategies — no coordinates, no solutions, no action sequences. Average length: ~1,460 characters per game.

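The cleaning step can be stated precisely. A minimal sketch, assuming a line-level filter: in the real pipeline an LLM decides which lines are level-specific, so the regex heuristic below is an illustrative stand-in whose only purpose is to make the "delete only, never rewrite" contract concrete.

```python
# Hypothetical sketch of the delete-only cleaning pass. The patterns are
# assumptions for illustration; the real decision is made by an LLM.
import re

# Flags level-specific detail: coordinates, pixel origins, per-level logs.
LEVEL_SPECIFIC = re.compile(
    r"\(\d+,\s*\d+\)"   # coordinates like (5,3)
    r"|^- L\d+:"        # solved-level entries like "- L0: DOWN x5"
    r"|[xy]=\d+"        # pixel origins like x=33, y=4
)

def strip_level_specific(memory_md: str) -> str:
    # Whole lines are either dropped or kept verbatim; nothing is rewritten,
    # so the surviving text is guaranteed to be the agent's own words.
    kept = [line for line in memory_md.splitlines()
            if not LEVEL_SPECIFIC.search(line)]
    return "\n".join(kept)
```

Under this contract, general mechanics and strategy sections pass through untouched while coordinates and solution logs disappear, which is exactly the shape of the tn36 diff shown below.
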
tn36 — MEMORY.md   58 → 19 lines
 # Game Manual — Grid Puzzle (Programming Movement)

 ## Goal
 Program movement instructions (max 6 per run) on a 7×7 checkerboard
 to guide a piece into a target cell. Multiple runs needed per level.

-## Board & UI
-- Board: 7×7 checkerboard (K/d cells, 4px each), bordered by W.
-- Cell pixel origin: x=33, y=4. Cell (c,r) at x=33+c*4, y=4+r*4.
-- Run button: click at (58,58) to execute program.
-- 6 program columns at x=34,39,44,49,54,59.
-- 6 encoding rows: r1top(y33), r1stm(y35), r2top(y39)...
-- Click dot to toggle: w=off, K=on.
-
-## Direction Encoding
-| Dir   | r1top | r1stm | r3stm | Others off |
-| RIGHT |       |   K   |       |     ✓      |
-| LEFT  |       |   K   |   K   |     ✓      |
-| DOWN  |   K   |   K   |       |     ✓      |
-| UP    |   K   |       |   K   |     ✓      |
-| SHRINK| K     |       |       | r2stm=K    |
-| NOP   |       |       |       | all off    |
 ## Movement Rules
 - One cell per instruction, step-by-step.
 - M cells (magenta) = walls, impassable.
 - Y8-checker cells = valid stopping points (waypoints).
 - Programs must end on Y8 waypoint or target cell.
 - Piece has 14/16 Y pixels with 2-pixel notch.
 - Target cell has Y at notch positions → landing = level solved.

-## Special Cells
-- G (green) portal cells: Deadly. Piece entering = destroyed/reverted.
-- Death barriers: Some rows have impassable zones. Test specific cells.
-- In L6, G portals at (1,3) and (6,3) killed, but (3,3)-(5,3) passable.
 ## Strategy
 1. Identify piece position, notch orientation, target, and waypoints.
 2. Plan multi-run path using Y8 waypoints as intermediate stops.
 3. Avoid walls (M), G portals, and known death cells.
 4. Test uncertain cells with short probe runs before committing.

-## Solved Levels
-- L0: DOWN×5 (1 run)
-- L1: UP×4 (1 run)
-- L2: UP,R,R,R,R,UP (1 run)
-- L3: SHRINK,LEFT,DOWN×4 (1 run, used SHRINK mechanic)
-- L4: Solved (details lost)
-- L5: 3 runs via Y8 waypoints: R×4,U×2 → U×4,R → L×3,D,L
-- L6: 3 runs. Piece at (2,6) TOP notch, target (2,1).
-  - Run 1: R,U,U,R,R,D → (5,5)Y8
-  - Run 2: U,U,U,U → (5,1)Y8
-  - Run 3: D,L,L,L,U → (2,1) target ✓
-
-## Key Insight from L6
-- Row 3 had G portals at (1,3) and (6,3) but plain cells passable.
-
-## Final: 374 actions total, game won at level 7.

Actions counted on co-completed levels only. Prior memory reduces total actions by 29% (0.71×), across all 25 games.

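The 0.71× figure can be made concrete. The definition below is our reading of the metric, with toy numbers rather than the reported data: sum actions over the levels that both the fresh run and the memory-seeded run completed, then take the ratio.

```python
# Illustrative computation of the 0.71x transfer metric (assumed definition).
# Numbers are toy values, not the reported data.

def action_ratio(baseline: dict, with_prior: dict) -> float:
    """Actions-with-prior / actions-baseline, over co-completed levels only."""
    shared = baseline.keys() & with_prior.keys()
    return sum(with_prior[l] for l in shared) / sum(baseline[l] for l in shared)

baseline   = {"L0": 40, "L1": 60, "L2": 100}  # fresh run, no prior memory
with_prior = {"L0": 30, "L1": 40, "L2": 72}   # run seeded with cleaned memory
# 142 / 200 = 0.71, i.e. a 29% reduction in actions
```

Restricting to co-completed levels matters: a run that completes extra levels would otherwise inflate its action count and make the comparison meaningless.
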
The prior memory works: the agent reads it at the start, and the injected knowledge guides its early exploration — it knows what to look for before seeing a single frame. However, the agent overwrites MEMORY.md entirely on every write_file call. In all 25 games, the first write replaces 100% of the prior content. The 29% improvement comes entirely from that initial read — the window between injection and the first overwrite.

In 5 of the 25 games, the agent overwrote too early, before the prior knowledge had been absorbed into its exploration strategy.

This means 29% is a lower bound. The current agent treats MEMORY.md as a scratch pad, not a persistent knowledge store. A memory architecture that preserves prior knowledge across writes — append-only sections, merge semantics, or structured memory banks — could unlock significantly stronger transfer.

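One of these fixes is easy to sketch: give write_file merge semantics so that a protected prior-knowledge section survives every overwrite. The section headers and wrapper below are illustrative assumptions, not part of TELL.

```python
# Illustrative merge-semantics wrapper for memory writes. Section names
# and the function itself are assumptions, not part of TELL.
PRIOR = "## Prior knowledge (read-only)"
NOTES = "## Working notes"

def safe_write(memory: str, new_notes: str) -> str:
    """Overwrite only the working-notes section; the prior block always survives."""
    prior_section = memory.split(NOTES, 1)[0] if NOTES in memory else memory
    return prior_section.rstrip() + "\n\n" + NOTES + "\n" + new_notes
```

Under this contract, the "first write replaces 100% of the prior content" failure mode becomes impossible by construction: the agent keeps its scratch pad, but injected knowledge persists for the whole run.
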
After the Spark

The next step is to teach models these meta-abilities explicitly — how to form hypotheses faster in any unfamiliar environment, how to verify more accurately, and how to compress experience into reusable world models more effectively. We are exploring how to make this test-time learning capability continuously stronger.

Appendix: Cost Breakdown

Each game was run as a single attempt, with no cherry-picking and no ensembling, in a single multi-turn conversation on Claude Opus 4.6 (128K context, streaming, adaptive thinking). The 25 games took 19,296 LLM requests and 473M input + 37M output tokens; 91.3% of input tokens were cache reads billed at 1/10 the base price.

| Category | Tokens | Rate | Cost |
|---|---|---|---|
| Cache write (8.7%) | 41,151,000 | $6.25 / MTok | $257.19 |
| Cache read (91.3%) | 431,849,000 | $0.50 / MTok | $215.92 |
| Output | 37,324,000 | $25.00 / MTok | $933.10 |
| Total | 510M | | $1,406.22 |

Without caching: $3,298.10. Prompt caching saves 57%. Average per game: $56.25.
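
The table's arithmetic checks out. Since cache reads bill at 1/10 the base input price, the implied base rate is $5.00/MTok; the sketch below reproduces the totals (rates in dollars per million tokens).

```python
# Reproducing the cost table's arithmetic. Cache reads bill at 1/10 the
# base input price, so the implied base input rate is $5.00/MTok.
RATES  = {"cache_write": 6.25, "cache_read": 0.50, "output": 25.00}
TOKENS = {"cache_write": 41_151_000, "cache_read": 431_849_000, "output": 37_324_000}

cost  = {k: TOKENS[k] * RATES[k] / 1e6 for k in RATES}
total = sum(cost.values())                     # ~ $1,406.22

# Without caching, all 473M input tokens bill at the $5.00/MTok base rate:
uncached = (TOKENS["cache_write"] + TOKENS["cache_read"]) * 5.00 / 1e6 + cost["output"]
savings  = 1 - total / uncached                # ~ 0.57, i.e. caching saves 57%
```
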