What “truly” enables “reasoning”?
Is CoT reasoning?
- DeepSeek-R1 dropped a bomb: it all but dethroned o1 at a much lower training cost, and its inference pricing is comparably cheap too.
- They claimed the stronger reasoning model R1-Zero was trained using RL with “handcrafted” rule-based rewards: an accuracy reward and a format reward.
- A good reward model for any deep, large-scale RL system is hard to come by and hard to train.
- “Accuracy and format”. Sounds simple enough.
- The accuracy reward is grounded in “facts”, either mathematical truths or common real-world knowledge.
- The format reward enforces that the whole system includes the “reasoning” pattern: `<think>...</think>`.
- Then the whole RL training pipeline is simply GRPO using this pre-defined (rather than pre-trained) reward (along with a reference model) to refine the token generator, the policy model. (?) A toy sketch of the rewards and the group-relative advantage follows this list.
- I wonder if this is truly how we train ourselves to “reason”.
- Accuracy, for sure; when we solve a maths quiz, we are happy.
- Format, luring out tokens that could maximise the long-term probability of a higher accuracy reward by stringing out more `<think>` tokens?
- In this way, does reasoning have to be performed within the context window?
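
To make the recipe concrete, here is a minimal, hypothetical sketch of the two rule-based rewards plus GRPO’s group-relative advantage. The `<think>`/`<answer>` tag grammar, the exact-match scoring, and all function names here are my own assumptions for illustration, not DeepSeek’s actual implementation.

```python
import re

# Hypothetical sketch of R1-Zero-style rule-based rewards + GRPO advantages.
# The <think>/<answer> grammar and exact-match scoring are assumptions for
# illustration, not DeepSeek's actual code.

FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps reasoning in <think>...</think> followed
    by an <answer>...</answer> block, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted final answer exactly matches the ground truth."""
    m = ANSWER_PATTERN.search(completion)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO: z-score each sampled completion's reward against its own group
    (mean/std over the group), in place of a learned value model."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# A group of 4 sampled completions for one prompt ("what is 2 + 2?").
group = [
    "<think>2 + 2 = 4</think><answer>4</answer>",     # correct + well-formatted
    "<think>hmm, maybe 5</think><answer>5</answer>",  # well-formatted, wrong
    "4",                                              # no tags: fails both rewards
    "<think>two plus two</think><answer>4</answer>",  # correct + well-formatted
]
rewards = [accuracy_reward(c, "4") + format_reward(c) for c in group]
print(rewards)                   # [2.0, 1.0, 0.0, 2.0]
print(grpo_advantages(rewards))  # correct, well-formatted samples get positive advantage
```

Note how the group-relative advantage removes the need for a learned value model: each sampled completion is scored only against its siblings for the same prompt.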
| 2024 arxiv: [2408.14511] Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods |
- CoT: one of the key techniques that enabled so-called LLM “reasoning”.
- Statistically speaking, CoT is (approximately) a Bayesian estimator drawing evidence from the examples in the CoT context.
- the statistical error of CoT can be upper bounded by a sum of pre-training error and prompting error (a schematic form of this bound follows the questions below)
- the prompting error decreases exponentially with the number of demonstrations included in the prompt
- Questions:
- (a) What are the statistical estimators constructed by CoT and its variants?
- (b) What are the statistical properties of these estimators?
- (c) How does the transformer architecture enable the LLMs to learn these estimators?
- (d) Does CoT prompting always outperform vanilla ICL?
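
Schematically, the bound above might be written as follows. This is only a sketch: the constants C, c > 0 and the demonstration count n are my placeholders, not the paper’s exact theorem statement.

```latex
% Schematic only: C, c > 0 and the demonstration count n are placeholders,
% not the exact constants or statement from arXiv:2408.14511.
\[
  \operatorname{err}\!\big(\widehat{y}_{\mathrm{CoT}}\big)
    \;\le\;
    \underbrace{\varepsilon_{\mathrm{pretrain}}}_{\text{pre-training error}}
    \;+\;
    \underbrace{C\, e^{-c\, n}}_{\text{prompting error, } n \text{ demonstrations}}
\]
```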
| Andrej Karpathy on X: LLM training and human learning with textbooks |
- so, human learning ~= machine learning?