Apparently, the response length curve first drops at the beginning of RL training, then gradually rises. We hypothesize this is because the model initially discards its previous, most likely suboptimal reasoning pattern. Then progressively co