47 Comments
Julian Archila

This is my first time reading LLM papers, and I like how it pushes me to explore new concepts. My habit has been to list all the unknown terms while reading, then study them separately before coming back to the paper. That way I can follow the main ideas first, and then go deeper into the technical side. Do you have any advice for beginners on balancing this cycle of ‘read, research, reread’ without getting overwhelmed or losing the flow?

Leonardo Gonzalez

I think you have a good overall approach! You don't need to read a paper like a novel from start to finish. It's fine to list the unknown terms as you suggested, but keep scanning through the paper rather than stopping at each one. Even if there are things you don't understand, you can at least see how some of the terms continue to be used and pick up context. Then, when you go off to study terms, you can study several at a time before coming back to the paper. Doing this a few times, with batches of side-quest searches, is less disruptive than interrupting your reading every time you hit an unfamiliar term.

Juan David Alzate

Thanks for the article. I used it to analyze the DeepSeek-R1 paper, which presents a fascinating case study in LLM training. From it, I would like to highlight two key ideas:

The first, and perhaps most striking, is the concept of the "aha moment." This aligns perfectly with my experience as a mathematician and researcher: the Eureka moment that follows extended periods of contemplation and exploration. In the context of DeepSeek's training, this wasn't an explicitly programmed feature but an emergent capability. The model, through pure reinforcement learning without prior supervised fine-tuning, spontaneously learned to allocate more compute for challenging tasks and to self-reflect on its chain of thought. This demonstrates that RL, when applied at scale, can encourage models to develop sophisticated, meta-cognitive abilities. It suggests that the "thinking" process we value in human intelligence might not need to be hardcoded, but rather can emerge as a behavior that is incentivized and optimized for.

The second idea is distillation, which raises an important point about the transfer of knowledge. It is truly a marvel of engineering that we can leverage a massive, resource-intensive teacher model to train a more efficient student. However, distillation is not just about transferring knowledge; it is about transferring the teacher's entire probability distribution, including its inherent biases and limitations. The very "dark knowledge" that makes distillation so effective at preserving performance is also what makes it a potential vector for propagating harmful biases.

Through distillation, we could train other models so that they learn how to reason better, but we must remain vigilant about whether we are also teaching them harmful biases and adding limitations.
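To make that concrete, here is a minimal sketch of the classic soft-label (Hinton-style) distillation loss, in which the student is fit to the teacher's full distribution. This is a generic illustration, not DeepSeek's actual recipe (their distilled models were fine-tuned on R1-generated samples rather than matched to teacher logits), and the temperature value is only an example:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between temperature-softened teacher and student distributions,
    so the student inherits the teacher's full distribution ("dark knowledge"),
    including whatever biases it encodes."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```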

Leonardo Gonzalez

I agree with your two points of interest in the R1 paper. The emergent reasoning behavior through RL was one of the most exciting developments this year. Distillation also came to the forefront, as this massive model's capabilities were transferred to small models that can run easily and cheaply on consumer hardware and edge devices. Seeing reasoning traces from local models on my laptop has been a joy, like seeing some semblance of life in a home aquarium.

As far as bias and limitations, that is another important consideration for both the teacher model and the student model. This is why I appreciate "fully open" models that include not just the weights but also the training data and training pipeline recipes. This allows for full transparency and auditability, as well as the ability to customize variations if one has the resources. On the other hand, there has also been some exploration of removing bias or filters through post-training, like Perplexity did with "DeepSeek R1 1776":

https://www.perplexity.ai/hub/blog/open-sourcing-r1-1776

Felipe Arboleda Giraldo

The ablation study results are interesting, but I need help understanding:

1) What each test measures.

2) How the baseline is established.

3) The methodology for component removal.

Any resources or explanations would be appreciated.

Leonardo Gonzalez

Ablations in the DeepSeek V3 paper:

1) What each test measures

- MTP: does training with multi-token prediction improve quality while inference cost stays the same? Metrics: bits-per-byte on Pile-test (lower is better) and task scores like EM, F1, and Pass@1 (higher is better).

- Loss-free load balancing: does removing auxiliary balance losses in MoE improve routing and end-task scores vs the usual aux-loss method?

- Batch-wise vs sequence-wise balancing: does enforcing expert balance per batch vs per sequence change expert usage and validation loss?

- Low-precision and quantization: do FP8 mixed precision and different activation-gradient quantization schemes train stably and match BF16 quality?

2) How the baseline is established

- Same model, data, training budget, and inference path.

- MTP baseline: identical model without the MTP head during training.

- Load-balancing baseline: standard auxiliary loss for MoE balance.

- Batch vs sequence baseline: sequence-wise aux loss is the reference.

- Precision baseline: BF16 training with fine-grained quantization.

3) Methodology for removing or swapping components

- Toggle one variable at a time.

- For MTP: add a small MTP head only in training, then drop it for inference (a toy sketch follows after this list).

- For load balancing: remove all balance losses and use the loss-free rule (roughly sketched at the end of this comment).

- For batch vs sequence: change the balance scope and keep everything else fixed.

- For precision: switch BF16 to FP8 or change quantization granularity.
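The MTP toggle above, as a deliberately simplified toy: a second head trained to predict one extra future token, controlled by a single flag so the baseline run is identical except for that head. This is not DeepSeek-V3's implementation (the paper chains sequential MTP modules and discards them at inference); the function names, the extra-head form, and the 0.3 weight are illustrative assumptions:

```python
import torch.nn.functional as F

def training_loss(hidden, lm_head, mtp_head, tokens, use_mtp: bool, mtp_weight: float = 0.3):
    """hidden: [batch, seq, d_model] final hidden states; tokens: [batch, seq] token ids."""
    # Standard next-token prediction loss (predict token t+1).
    logits = lm_head(hidden)
    ntp = F.cross_entropy(logits[:, :-1].flatten(0, 1), tokens[:, 1:].flatten())
    if not use_mtp:
        return ntp  # baseline run: identical model and data, MTP head unused
    # Auxiliary multi-token prediction loss (here: also predict token t+2).
    mtp_logits = mtp_head(hidden)
    mtp = F.cross_entropy(mtp_logits[:, :-2].flatten(0, 1), tokens[:, 2:].flatten())
    return ntp + mtp_weight * mtp  # inference still uses only lm_head
```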

Quick reading tips

- Check whether the metric should go up or down.

- Confirm that FLOPs and tokens seen are held constant.

- Prefer comparisons within the same table and at the same model scale.

Resources to skim

- Multi-token prediction objective

- Loss-free load balancing for MoE

- Batch-wise balancing auxiliary loss

- FP8 mixed-precision transformer training

The original Multi-Token Prediction paper: https://arxiv.org/pdf/2404.19737
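And since it's easy to misread, here is the rough sketch of the loss-free balancing rule promised in 3). It reflects my reading of the V3 report (a per-expert bias affects only which experts are selected, and is nudged after each step toward equal load); the normalization details and the 0.001 update speed are simplifications, not the paper's exact code:

```python
import torch

def route_tokens(scores, bias, k):
    """scores: [tokens, experts] router affinities (assumed non-negative, e.g. post-sigmoid);
    bias: [experts], used for expert selection only."""
    top_idx = (scores + bias).topk(k, dim=-1).indices          # bias shifts which experts win
    gate = torch.zeros_like(scores).scatter(1, top_idx, scores.gather(1, top_idx))
    return top_idx, gate / gate.sum(-1, keepdim=True).clamp_min(1e-9)  # weights from unbiased scores

def update_bias(bias, top_idx, num_experts, gamma=0.001):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    load = torch.bincount(top_idx.flatten(), minlength=num_experts).float()
    return bias - gamma * torch.sign(load - load.mean())
```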

Felipe Arboleda Giraldo

Thank you very much for the explanation @archimagos, crystal clear.

Rosa López

I found this article very valuable—it provides a clear compass for navigating the flood of modern AI papers. I especially liked the suggestion to shift perspectives between science and engineering: one questions the truth of claims, the other measures real-world utility.

The practical guide to filter papers in 10–20 minutes is something I plan to start applying, because often one gets lost among tables, metrics and figures without arriving at a clear takeaway. Also, the emphasis on open artifacts and reproducibility stands out: that’s when theory becomes truly useful for the community.

I believe frameworks like this don’t just serve researchers; they’re also essential for people building products who need to quickly distinguish hype from what can deliver real impact. Thanks for sharing such an actionable methodology.

Leonardo Gonzalez

I think many people benefit from reading papers, and we all benefit when more people read them. This is knowledge that needs to spread beyond research academia and into industry and public discourse.

Cristian

Excellent article that has helped me improve my approach to strategically reading scientific papers. I find it very enriching to learn AI from the core of its development, through the articles produced and published by researchers at leading companies and top universities worldwide, such as the University of California, Berkeley. Developing the habit of focusing on the abstract, figures, and released artifacts allows me to quickly grasp the scope and methodology, and to assess in just a few minutes the relevance of a paper for my research interests.

In addition, this academic exercise greatly enhances my training as a SENA student in Software Analysis and Development. It provides me with a deeper understanding of the foundations of machine learning, agent training, and post-training techniques such as reinforcement learning, supervised fine-tuning, and direct preference optimization. This knowledge not only strengthens my academic perspective but also guides me toward shaping a clear AI business idea, with the goal of seeking co-funding opportunities such as FONDO EMPRENDER and eventually establishing my own technology company.

Leonardo Gonzalez

You're in the right place and doing the right thing! Your continued technical education and your development of industry connections through conversations in communities will be great assets for your mission.

Jonathan Quiroz Laverde

Excellent explanation, everything very clear, and thank you for the invitation to think critically and put into practice what we read in these papers to verify the information. Based on this, two questions arise, as I'm just entering the world of LLMs:

1. What criteria are the most important for filtering articles we find on arXiv? While there are articles there with high scientific rigor, I've also heard that others are a complete disaster, yet can be convincing. Aside from evaluating who publishes the article, what other filters are relevant?

2. As someone who is just entering the world of LLMs at a deeper level, I notice that there is a lot of reinforcement learning involved in the entire process. As I read the article, there are many concepts I don't understand, so I go to look them up, but I find myself faced with a sea of information. What advice could you give me so I don't get overwhelmed and stuck on a specific concept and can move forward in reading/understanding these papers, without losing the essence and continuing to explore these cross-cutting themes in greater depth?

I thank you for the article and the video; they help me a lot to follow the papers more thoughtfully and in a more structured way. Very good work!

Leonardo Gonzalez

Great questions! Here are several ideas for how to identify quality and impactful papers:

- Title and abstract: clear problem, measurable claim, falsifiable prediction, scope matches evidence

- Versions and cadence: multiple revisions with meaningful changes, not rushed to ride a hype cycle

- Code, data, and seeds: working links, licenses, exact env and run scripts, commit hash, seeds reported

- Reproducibility: model weights on Hugging Face or ModelScope; easy to rerun exact evals and ablations

- Evaluation hygiene: strong and recent baselines, correct configs, held-out tests, confidence intervals, significance tests (see the small sketch further below)

- Data provenance: source, filtering, dedupe, train-test separation; contamination checks for LLM benchmarks

- Prompting and decoding: prompt templates, few-shot selection, temperature, top-p, max tokens, stop sequences; report variance across seeds

- Compute and practicality: tokens, steps, hardware, latency, memory, context behavior, total cost

- Robustness: stress and out-of-distribution tests, sensitivity to seeds and prompts, failure analysis

- Community signals: appearance in Hugging Face Trending Papers and thoughtful discussion threads; interest or reproductions from well-regarded researchers and teams, not just raw upvotes

- Accessibility of the model: availability on Hugging Face or ModelScope for inference and benchmarking improves verification and downstream use

One approach: start with Hugging Face Trending Papers to spot what the community is curating, then prioritize papers whose models and evals you can actually rerun.
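To make "confidence intervals" and "variance across seeds" practical when you rerun an eval, here is a tiny sketch of one way to summarize several reruns. The scores are placeholders, not from any paper, and the percentile bootstrap is just one reasonable choice:

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for the mean of per-seed scores."""
    means = sorted(
        statistics.mean(random.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    return means[int(alpha / 2 * n_resamples)], means[int((1 - alpha / 2) * n_resamples) - 1]

per_seed_pass_at_1 = [0.62, 0.58, 0.65, 0.60, 0.61]  # e.g. five reruns with different seeds
lo, hi = bootstrap_ci(per_seed_pass_at_1)
print(f"Pass@1 = {statistics.mean(per_seed_pass_at_1):.3f} (95% CI {lo:.3f}-{hi:.3f})")
```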

Regarding RL, as with any other complex topic, it's best to get an overview before doing a few deeper dives into the subject. I've created an overview video and shared it on my YouTube channel in English and Spanish:

- English: https://youtu.be/Pd50-aE6PVw

- Spanish: https://youtu.be/aN2AIiUjYTE

Julián Sánchez

Thanks for sharing this framework. I found the “quick list” really practical: identifying claims, prioritizing reliable metrics, checking ablations, and spotting omissions.

I especially liked the idea of reading with the curiosity of a scientist but also with the pragmatism of an engineer. For me, that dual perspective is key when deciding which RL technique to prioritize (PPO, GRPO, DPO, etc.), since it helps balance theoretical novelty with implementation feasibility.

At the same time, we should not feel overwhelmed by the increasing abundance of AI papers—if we know how to apply the right techniques to read and extract value from them. Even though the authors are often brilliant researchers, they can also make mistakes. This is where methods such as ablation studies are useful, since they allow us to test the causal merit of the claims and see the true contribution of the paper.

In the case of models, it’s also essential to understand what they are compared against, because the chosen baseline can drastically change how we interpret the reported results.

In short, combining methodological rigor with a human mindset of curiosity and pragmatism helps us navigate this growing field without losing clarity or perspective.

Leonardo Gonzalez

Those are all important points I worked to convey. With these things in mind, it's much easier to make sense of complex, lengthy, and abundant research.

This kind of material is too dense to read from start to finish as if it were a novel. It helps to have the structure in mind and scan the sections in a way that makes sense.

Julián Sánchez

Exactly — having a clear structure makes it much easier to digest such a dense document. By focusing on the key points, we can decide whether it’s worth diving deeper into that paper or if it’s better to move on and explore another one.

Julian Gomez

Great article, I find it a very useful guide for tackling scientific papers. With the rapid emergence of post-training variants such as PPO, GRPO, and DPO, what criteria would you recommend for a builder to prioritize which technique to explore first?

Leonardo Gonzalez

It's good to look at the broader field and principles in order to better understand individual pieces. I would start with:

* Foundations of Reinforcement Learning:

- Key Concepts: Agents, Environments, and Rewards

- Markov Decision Processes (MDPs)

- Policy and Value Functions

* On-Policy and Off-Policy Learning Paradigms

- On-Policy Methods: Learning from Current Behavior

- Off-Policy Methods: Reusing Historical Data

- Importance Sampling and Policy Updates

* Modern Policy Optimization Algorithms

- PPO: Proximal Policy Optimization

- DPO: Direct Preference Optimization

- GRPO: Group Relative Policy Optimization
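As a small taste of how GRPO differs from PPO, here is its group-relative advantage step in isolation: rewards for a group of completions sampled for the same prompt are normalized against each other, so no separate critic is needed. The full algorithm adds a clipped policy ratio and a KL penalty to a reference model (see the DeepSeekMath and R1 papers); the toy rewards below are made up:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within a group of completions for one prompt,
    replacing a learned value baseline with the group mean and std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. four sampled answers to one math problem, rewarded 1.0 if the final answer checks out
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```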

Then we could look at emerging topics in RL like DAPO, GSPO, and GEPA.

I will create some course materials on this and share them!

David Vargas

In a recent discussion with a colleague, we reflected on how to enhance our independent learning process in artificial intelligence. Achieving a deep and structured understanding of advanced tools, best practices, and high-quality research remains a considerable challenge for self-learners.

This article makes a meaningful contribution by providing a clear framework for engaging with scientific papers — a habit that is often overlooked in everyday study routines.

The DeepSeek R1 paper, in particular, stood out to me for its remarkable potential to lower computational costs through model distillation and for its notable architectural innovations, including the Mixture of Experts (MoE) mechanism and the introduction of “Aha!” reasoning moments.

Moving forward, my goal is to continue studying these works and to replicate selected experiments in lab environments, where possible, to consolidate both theoretical and practical understanding of model architectures.

David Luna

Dear Leonardo,

I wanted to share my deep appreciation for your framework on "Scientific Discourse for Builders." As this is my first time reading and participating in a discussion centered on an LLM paper (such as the cases of DeepSeek-R1 or Kimi K2), your guide on mapping the anatomy of a paper and using the dual science and engineering lenses has been invaluable.

My key learning, specifically drawn from the challenges associated with scaling architectures like Mixture of Experts (MoE), lies squarely within the engineering lens and its focus on reliability and cost.

The Engineering Imperative of Stability and Resilience

What I learned is that architectural brilliance (like MoE, which delivers higher throughput at inference) is meaningless without rigorous systems engineering. The most striking takeaway is the need to treat training stability as a core constraint. When analyzing a paper's training curves, the presence of a smooth, monotonic curve is not just an aesthetic detail; it’s crucial evidence of a stable recipe that has successfully prevented optimizer instabilities and loss spikes. This directly explains the necessity of innovations focused purely on stabilization, such as specialized optimizers or clipping recipes, which are engineering solutions designed to ensure robustness when pre-training at massive scale.

Furthermore, this stability concern extends into the operational domain.

Thank you for providing the clear framework that allowed me to extract these critical engineering insights and contribute meaningfully to the discussion.

Best regards!

Jose Martin

Really liked this article, it makes the whole process of reading AI papers feel less overwhelming. The way it explains the difference between science and engineering really clicked with me: one looks for truth, and the other for what actually works. The part about reproducibility and open models was also super relevant, because sometimes we see models hyped up without real transparency behind them.

One question I had after reading: when papers don’t release their full training data or code, what’s the best way for readers or small teams to verify results or build on that work without the same level of resources?

Elvis Hernandez

Excellent reading, Leonardo. I applied it to the DeepSeek and Kimi K2 papers and found it interesting to discover how the work has evolved from science (R1) to engineering (V3) to the industrialization of the agent (K2) ... (at least that's how I visualize it).

1. The R1/V3 Advance (discovery vs. efficiency):

DeepSeek-R1 was the scientific experiment that validated pure RL as an engine for emergent capabilities ("Aha moment"). However, it was inefficient. DeepSeek-V3 represents the engineering solution by using distillation to industrialize that knowledge, competing with GPT-4 at a fraction of the cost, powered by sparser architectures (MoE, FP8).

2. Kimi K2: The Leap to Agentic Intelligence and Control(?):

Kimi K2 takes this path to the next level, focusing on application beyond passive reasoning. Its key contribution is control and robust scalability for agentic tasks (software/tool use):

Token Efficiency & Stability: With MuonClip, K2 solves stability issues when scaling token-efficient optimizers, a critical challenge for training 1T models.

Verified RL: implements an RL scheme that combines verifiable rewards with self-critique rubrics, fundamental for robust alignment in complex and open-ended domains, where R1-Zero failed due to lack of control.

Question 1: Architecture and economics of discovery.

The pattern is clear: MoE is the winning architecture. But the efficiency of transfer is so high (via distillation) that we must ask ourselves: If the distillation of a “teacher” (such as R1) is more effective than the pure RL of a “student,” does it justify the investment to continue running the costly discovery RL to produce only the first version, if 80% of the gain is transferred via distillation?

Question 2: Control as a precursor to agentic capacity.

Kimi K2 is the answer to the productive uselessness of R1-Zero. K2 prioritizes control metrics (Agentic/Tool Use benchmarks) by generating high-quality data and reward systems with self-critique. Does this mean that, for future LLMs, the architecture of verification systems (Reward Models and Self-Critique Rubrics) is now a more determining factor for performance in real-world tasks than pure improvements in Pre-Training?

Manuel Ruiz Alvarez

This article gives a really helpful roadmap for reading research papers.

But when I’m actually reading one, how do I know how deep I should go into the details—especially when the paper’s built on a ton of other stuff?

And with everything moving so fast in AI, how can I tell which papers are actually going to matter in the next few years?

Sebastian Jimenez

Great article: clear, structured, and genuinely useful for improving how we read and question AI papers.

I particularly appreciated the reminder to stay critical: even when research comes from brilliant teams, it's easy to assume validity. Checking which strong competitors were not included in comparisons is a great way to spot potential bias, especially given today’s race for the best-performing models.

One concept I’d like to understand better is Mixture of Experts and how it affects model inference (I’ll be studying more about it).

As someone who has mostly worked with closed-source models for RAG, text-to-SQL, and data extraction, I’m also curious: what’s a good approach to start evaluating open-source models for enterprise use cases, and how do you weigh their cost/benefit versus closed models?

Esteban Casallas

1. Is it possible to adopt a philosophical lens when reviewing these papers — that is, to analyze their utility and functionality from more abstract or conceptual perspectives, beyond the purely technical aspects?

2. Does the ablation study help to highlight the core values and key pivot points of the paper’s contribution?

3. Additionally, could you outline how each step of the reading and reviewing process establishes clear metrics and, ideally, a structured framework for systematically analyzing these elements?

Leonardo Gonzalez

Great questions. Here are some thoughts:

#1: Yes. Read it with a simple lens: what problem does it really solve, what is the core idea in one sentence, how would it change what you work on next, what could go wrong, and who bears the costs or benefits.

#2: Ablation shows what truly matters. Change one piece at a time and note what drops. The pattern exposes the core mechanism.

#3: Define objectives and the decision context, map claims and evidence, audit the data and evaluation setup, extract the metrics that matter for your use case, then write a short decision note on whether and how to adopt.
