
Code Comprehension at Scale: Learning from the Bitter Lesson
Current AI-based code comprehension tools built on summarization, retrieval-augmented generation (RAG), and knowledge graphs are like using a magnifying glass to map a continent: they are brittle and often miss the bigger picture, especially in large codebases.
Inspired by Rich Sutton's Bitter Lesson [1], the Whole Context Method takes a different approach. Instead of breaking a codebase into chunks and building understanding piece by piece with divide-and-conquer software tricks, our method embraces a holistic, learning-driven solution. In this short post, we'll examine why fragmented methods fall short and draw parallels to AI's evolution. We will also report some initial qualitative results with our method, with a more in-depth analysis coming in a later post. The future of code comprehension is here, and it does not rely on software tricks and human-designed heuristics.
The Bitter Lesson and Why Shortcuts Fail
Rich Sutton, one of the founders of computational reinforcement learning (and also an author of my favorite textbook), wrote The Bitter Lesson on his personal website. It is a reminder that "general methods that leverage computation are ultimately the most effective, and by a large margin." The general methods consistently outperform specialized, human-designed heuristics and domain knowledge. AI history supports this with striking examples.
- Chess, Deep Blue, and AlphaZero: In 1997, Deep Blue defeated world champion Garry Kasparov using a combination of massive search and hand-coded chess knowledge. In 2017, DeepMind's AlphaZero [2] mastered chess from scratch through self-play. It not only surpassed all humans, but also decisively beat Stockfish, then the strongest computer chess program in the world. Since then, Stockfish and other "traditional" chess engines have incorporated neural networks to remain competitive.
- Computer Vision: In the early 2000s, vision systems depended on hand-engineered features such as SIFT and HOG. By 2016, deep convolutional neural networks such as ResNet, trained on raw pixels, had surpassed these methods with superhuman accuracy on benchmarks like ImageNet.
- StarCraft and AlphaStar: In 2019, DeepMind's AlphaStar [3] dominated Blizzard's StarCraft II, outplaying professional players by modeling entire game states, including hundreds of units, maps, and strategies, without relying on hand-crafted rules.
- Archaeology: In 2022, DeepMind restored and attributed ancient Greek inscriptions by learning from raw linguistic patterns, not expert assumptions [5].
- Protein Folding: In 2021, AlphaFold 2 [4] solved the long-standing protein structure prediction challenge by learning from raw sequence and structure data, bypassing human-crafted physical models.
The takeaway seems to be that holistic, generalized learning from raw data along with lots of compute consistently outperforms clever shortcuts. We expect to see a similar arc in AI code comprehension as well.
Why Code Comprehension Struggles for Large Codebases
As of early 2025, code copilots struggle when the context they need spans more than tens of thousands of lines of code, and AI code documentation generators produce lackluster results for massive codebases. This is primarily because many LLMs have context windows of roughly 100k tokens, and the common tricks used to work around that limit have their own failure modes:
- Summarization and memory: A common approach to deal with large contexts is to allocate some portion of it towards a rolling summary of everything seen so far. This process, by its very design, is lossy. For code comprehension, while this approach can capture local dependencies, it often misses cross-module global dependencies across the codebase.
- Retrieval-Augmented Generation (RAG): RAG embeds chunks of the codebase into a vector database and retrieves the most "relevant" chunks for a given query. It can be useful for targeted information retrieval, but is often unreliable, and it breaks down on tasks that require reasoning over many chunks at once, such as tracing a dependency across modules.
- Knowledge Graphs (KGs): KGs are great for question-answering within a specific domain where the vocabulary of objects and relations is limited. However, they struggle to scale for large, evolving codebases.
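To make the fragmentation concrete, here is a toy sketch of the chunk-and-retrieve pipeline behind RAG. The bag-of-words cosine similarity stands in for a learned embedding model, and every name here is illustrative rather than taken from any real RAG library:

```python
# Toy chunk-and-retrieve sketch: split code into chunks, "embed" them,
# and rank chunks against a query. A real system would use a learned
# embedding model; Counter-based bag-of-words is a crude stand-in.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Crude stand-in for an embedding model: bag-of-words token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(source: str, max_lines: int = 2) -> list[str]:
    """Fragment the codebase into fixed-size line windows."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

codebase = """def load_config(path): ...
def parse_flags(argv): ...
def init_logging(cfg): ...
def connect_db(cfg): ..."""

chunks = chunk(codebase)
index = [(c, embed(c)) for c in chunks]  # the "vector database"

query = "how is logging configured"
ranked = sorted(index, key=lambda ce: cosine(embed(query), ce[1]), reverse=True)
top_chunk = ranked[0][0]  # only the top-k chunks ever reach the LLM
```

Only the top-ranked chunks reach the model, so a dependency whose pieces never score highly for the same query is effectively invisible to it.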
Smaller context windows force developers to break codebases into fragments and work with a limited number of fragments at a time. This approach loses nuance since code is a living artifact with patterns spanning thousands of files. Our team generated documentation for PyTorch (1.5+ million lines of code) using standard summarization-based techniques and found the resulting documentation to be of low quality and missing critical dependencies. In contrast, our Whole Context Method produced far clearer and more accurate documentation.
The Whole Context Method
What if, instead of economizing on tokens, we maximized their use? We could stop fragmenting codebases and capture even the most complex interdependencies. The Whole Context Method does exactly that: it uses the LLMs with the largest context windows available to process an entire repository, including every file, commit, and comment, as a single unified context. Here's how it works:
- Full Ingestion: It processes massive codebases, like PyTorch (1.5M+ lines), in one go, capturing interdependencies that fragmented methods miss.
- Commit-Aware Learning: It fine-tunes on commit histories to understand intent and evolution, beyond just static code.
- Living Documentation Hub: It creates an interactive, queryable hub that updates automatically as the codebase evolves, serving developers and maintainers.
Conclusion: Embracing the Bitter Lesson
AlphaZero taught itself world-class chess based on general methods of learning, self-play, and search, along with massive compute. Similarly, the Whole Context Method doesn’t depend on handcrafted heuristics, lossy abstractions, or modular pipelines that attempt to “outsmart” the problem. Instead, it embraces the Bitter Lesson: that true understanding comes not from encoding brittle domain knowledge, but from leveraging scalable learning systems that can discover the patterns we’ve been trying to hand-engineer.
Code, like minds, is irreducibly complex. As Sutton writes, the real world is filled with arbitrary structure: cross-cutting concerns, legacy hacks, dynamic interactions between thousands of modules. The urge to simplify it with human-authored models is a trap. The only reliable way forward is to build systems that can learn to understand, not systems that just contain our current understanding.
In a future post, we’ll share more detailed benchmarks and examples across large codebases like PyTorch, Kubernetes, and legacy enterprise systems. But even now, one thing is clear: The next generation of developer tools won’t come from more clever hacks. It will come from seeing the whole.
References
1. Sutton, R. S. (2019). The Bitter Lesson (Incomplete Ideas).
2. Silver, D., et al. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm (arXiv).
3. Vinyals, O., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning (Nature).
4. Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold (Nature).
5. Assael, Y., et al. (2022). Restoring and attributing ancient texts using deep neural networks (Nature).