The S1 Model running locally — Reasoning made transparent

February 14, 2025 · 4 min read

Introduction

The S1 reasoning model, released in early February 2025, has quickly gained attention for its exceptional reasoning capabilities. As outlined in their research paper, what makes S1 truly stand out is its ability to match the performance of top-tier models like OpenAI's O1 and DeepSeek-R1 while being trained on only 1,000 data points and costing just $50 to train.
The key insight behind S1's efficiency? The researchers performed supervised fine-tuning (SFT) on an off-the-shelf pretrained model. The training process lasted just 26 minutes, using 1,000 carefully curated examples.
This article documents an initial exploration of the reasoning performance of S1 through a hands-on test, examining how well it handles general knowledge and legal reasoning.

Training (Finetuning) Approach

The key innovation in this research is test-time scaling with budget forcing, a simple yet powerful method that improves a language model's reasoning without additional training. By carefully selecting just 1,000 high-quality, diverse, and difficult reasoning examples for fine-tuning, the model achieves sample-efficient learning in just 26 minutes on 16 H100 GPUs. At inference time, budget forcing dynamically controls the model's thinking process — either stopping early or forcing it to continue — allowing it to self-correct and refine its answers. This approach scales performance beyond training without adding new data, proving that smarter inference strategies can outperform brute-force scaling. Unlike closed-source models like OpenAI's o1, this fully open-source method democratizes test-time scaling, showing that AI doesn't always need to be bigger — just more efficient.

Local installation

To get started with S1, we can use the official implementation from Hugging Face, configure it to run on a GPU using MLX and using the model's full 8bit quantization:
Let's put on the starter code from HuggingFace and add MLX GPU support:


import mlx.core as mx
from mlx_lm import load, generate

# Configure MLX to use the GPU
mx.set_default_device(mx.gpu)

model, tokenizer = load("moot20/s1-32B-MLX-8bits")

prompt = "who was the first person to walk on the moon?"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)

The output of this standard question looks quite impressive if you love prompting like me:
1. **Initial Thought:** The question asks for a specific historical 
   event and the person involved. This triggers my knowledge base about 
   space exploration and lunar missions.

2. **Key Information:** The core information is "first person to walk on 
   the moon." This means I need to identify the first successful manned moon 
   landing mission and the astronaut who stepped out first.

3. **Recalling Moon Missions:** I start thinking about the Apollo program, 
   as it was the NASA program aimed at landing humans on the moon. I know 
   Apollo 11 was the first successful moon landing.

4. **Apollo 11 Crew:** I recall the three astronauts on Apollo 11: Neil 
   Armstrong, Buzz Aldrin, and Michael Collins.

5. **Roles on Apollo 11:** I consider the roles of each astronaut:
    * **Neil Armstrong:** Command Pilot.
    * **Buzz Aldrin:** Lunar Module Pilot.
    * **Michael Collins:** Command Module Pilot (remained in orbit).

6. **Deduction:** Since Michael Collins remained in orbit, he didn't walk 
   on the moon. This leaves Armstrong and Aldrin.

7. **Sequence of Events:** I think about the sequence of
==========
Prompt: 40 tokens, 98.845 tokens-per-sec
Generation: 256 tokens, 10.758 tokens-per-sec
Peak memory: 34.955 GB

If you wonder whether the model can do mathematical reasoning: The paper provides ample information and examples in this regard.
Following the traces of my first education (M Law) the next test is on legal reasoning.

Legal Reasoning

To test S1's potential for legal reasoning, I use a condensed version of the court decision Brown v. Board of Education (1954, U.S. District Court for the District of Kansas):

prompt = "Reasoning process of the following legal text: " + \
"We come then to the question presented: Does segregation " + \
"of children in public schools solely on the basis of race, " + \
"even though the physical facilities and other 'tangible' " + \
"factors may be equal, deprive the children of the minority " + \
"group of equal educational opportunities? We believe that " + \
"it does. To separate [black children] from others of similar " + \
"age and qualifications solely because of their race generates " + \
"a feeling of inferiority as to their status in the community " + \
"that may affect their hearts and minds in a way unlikely to " + \
"ever be undone. We conclude that in the field of public " + \
"education the doctrine of 'separate but equal' has no place. " + \
"Separate educational facilities are inherently unequal. " + \
"Therefore, we hold that the plaintiffs and others similarly " + \
"situated for whom the actions have been brought are, by reason " + \
"of the segregation complained of, deprived of the equal " + \
"protection of the laws guaranteed by the Fourteenth Amendment."

response = generate(model, tokenizer, prompt=prompt, verbose=True)

Output:

==========
The reasoning process in this legal text can be broken down into several 
key steps:

1. **Identification of the Core Question:** The text begins by clearly 
   stating the central issue at hand: whether segregating children in public 
   schools based solely on race, even if physical facilities are equal, 
   denies minority children equal educational opportunities. This sets the 
   stage for the court's analysis.

2. **Expression of the Court's Belief:** The court directly states its belief 
   that such segregation *does* deprive minority children of equal educational opportunities. This is a crucial initial stance that guides the subsequent reasoning.

3. **Explanation of the Psychological Impact:** The reasoning then delves 
   into *why* the court believes this. It focuses on the psychological 
   effects of segregation:
    * **Separation Based on Race:** The core issue is the act of separating 
   children *solely* because of their race.
    * **Feeling of Inferiority:** This separation generates a feeling of 
   inferiority among the minority children.
    * **Impact on Hearts and Minds:** The court emphasizes that this 
   feeling of inferiority can deeply affect the children's emotional and 
   mental well-being.
    * **Long-lasting Effects:** The language "unlikely to ever be undone" 
   underscores the court's view that the negative impacts of such
==========
Prompt: 185 tokens, 164.546 tokens-per-sec
Generation: 256 tokens, 10.694 tokens-per-sec
Peak memory: 35.048 GB

While S1 recognizes logical structures in legal texts, its reasoning is somewhat unstructured. The actual legal reasoning in this case follows this path:

1. Identify the Issue
→ Define the key legal question.

2. Examine the Context
→ Analyze relevant facts and circumstances.

3. Evaluate the Impact
→ Assess the consequences.

4. Apply Legal Principles
→ Use constitutional or legal frameworks.

5. Reach a Conclusion
→ Decide based on logical and legal reasoning.

While S1 performed well in identifying the issue, logical structure, and legal principle, it missed key steps and presented its explanation in a somewhat chaotic order.

On the plus side it:

✅ Recognized the logical structure

✅ Accurately identified the legal question

✅ Acknowledged the psychological impact

Conclusion

The S1 model proves to be a fascinating development in the field of reasoning-focused AI. With minimal training data and an efficient training process, it delivers logical structuring and analytical abilities, but its actual reasoning process output is to be looked at more closely.
Also I would have liked to learn which off-the-shelf model they used as a basis for their fine-tuning, such that we can compare the progress from that model to S1.
Looking forward, S1 could pave the way for cost-efficient, high-performance reasoning models — bringing AI-powered analysis into local environments without the need for massive computing resources. Exciting times ahead!