TY - GEN
T1 - One Missing Piece for Open-Source Reasoning Models
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
AU - Chae, Hyungjoo
AU - Kang, Dongjin
AU - Kim, Jihyuk
AU - Kwak, Beong Woo
AU - Park, Sunghyun
AU - Park, Haeju
AU - Yeo, Jinyoung
AU - Lee, Moontae
AU - Lee, Kyungjae
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - With the release of R1, a publicly available large reasoning model (LRM), researchers commonly build new LRMs by training language models on R1’s long chain-of-thought (CoT) inferences. While prior work shows that LRM capabilities can be reproduced through direct distillation, the continued reliance on existing models (e.g., R1) remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1’s novel reasoning strategies into short CoT LLMs, enabling them to think longer and introducing controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to, or slightly below, that of R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills but also provides a strong foundation for reinforcement learning: models initialized on our data achieve 2-3x larger gains with RLVR. We make the code, datasets, and models publicly available at LINK.
AB - With the release of R1, a publicly available large reasoning model (LRM), researchers commonly build new LRMs by training language models on R1’s long chain-of-thought (CoT) inferences. While prior work shows that LRM capabilities can be reproduced through direct distillation, the continued reliance on existing models (e.g., R1) remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1’s novel reasoning strategies into short CoT LLMs, enabling them to think longer and introducing controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to, or slightly below, that of R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills but also provides a strong foundation for reinforcement learning: models initialized on our data achieve 2-3x larger gains with RLVR. We make the code, datasets, and models publicly available at LINK.
UR - https://www.scopus.com/pages/publications/105020384981
U2 - 10.18653/v1/2025.acl-industry.85
DO - 10.18653/v1/2025.acl-industry.85
M3 - Conference contribution
AN - SCOPUS:105020384981
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 1227
EP - 1243
BT - Industry Track
A2 - Rehm, Georg
A2 - Li, Yunyao
PB - Association for Computational Linguistics (ACL)
Y2 - 27 July 2025 through 1 August 2025
ER -