
What “truly” enable(s) “reasoning”?

Is CoT reasoning?

 
  • DeepSeek-R1 dropped a bomb: it almost perfectly dethroned o1 at a much lower training cost, and its inference price is comparably cheap too.
  • They claimed the strong reasoning model R1-Zero was trained with pure RL using “handcrafted” rule-based rewards: an accuracy reward and a format reward.
  • A good reward model for any deep, large-scale RL system is hard to come by and hard to train.
  • “Accuracy and format”. Sounds simple enough.
    • Accuracy reward is grounded in “facts”, either mathematical truths or common real-world ones.
    • Format reward enforces that every output follows the “reasoning” pattern: <think>...</think> (a toy sketch of both rewards follows this list).
  • Then the whole RL training pipeline is simply GRPO, using these pre-defined rule-based rewards (along with a frozen reference model) to refine the token generator, i.e. the policy model. (?) A minimal sketch of the loss also appears below.
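A toy sketch of the two rule-based rewards, in Python. The R1 paper describes them only in prose, so the tag template, the exact-match grading, and the 0/1 weights below are my assumptions, not DeepSeek’s actual rules:

  import re

  # R1-style completions are expected to wrap reasoning in <think> tags,
  # followed by a final answer (tag names assumed from the paper's prose).
  TEMPLATE = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)

  def format_reward(completion: str) -> float:
      """1.0 if the completion follows the think/answer template, else 0.0."""
      return 1.0 if TEMPLATE.fullmatch(completion.strip()) else 0.0

  def accuracy_reward(completion: str, ground_truth: str) -> float:
      """1.0 if the final answer matches the known-good answer. A real grader
      would normalise maths expressions or run test cases; exact string match
      is just the simplest stand-in."""
      m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
      return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

  def total_reward(completion: str, ground_truth: str) -> float:
      return accuracy_reward(completion, ground_truth) + format_reward(completion)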
 
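And a minimal sketch of the GRPO update itself, per my reading of the DeepSeekMath paper rather than DeepSeek’s code. For each prompt, sample a group of G completions, score them with the rule-based reward, and whiten the rewards within the group to get advantages; no learned value network is needed. Log-probs are treated per-sequence here, where the paper applies the objective per token:

  import numpy as np

  def group_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
      """Advantage of each completion = its reward whitened within the group."""
      return (rewards - rewards.mean()) / (rewards.std() + eps)

  def grpo_loss(logp_new: np.ndarray, logp_old: np.ndarray,
                logp_ref: np.ndarray, rewards: np.ndarray,
                clip_eps: float = 0.2, kl_coef: float = 0.04) -> float:
      """PPO-style clipped surrogate plus a KL penalty towards the frozen
      reference model. Every array argument has shape (G,)."""
      adv = group_advantages(rewards)
      ratio = np.exp(logp_new - logp_old)            # importance ratio
      clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
      surrogate = np.minimum(ratio * adv, clipped * adv)
      # Unbiased KL estimator from the paper: pi_ref/pi - log(pi_ref/pi) - 1
      kl = np.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
      return float(-(surrogate - kl_coef * kl).mean())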
  • I wonder if this is truly how we train ourselves to “reason”.
    • Accuracy, for sure; when we solve a maths quiz, we are happy.
    • Format, luring out tokens that could maximise the long-term probability of scoring a higher Accuracy reward by stringing out more <think> tokens?
      • In this way, does reasoning have to be performed within the context window?
2024 arXiv: [2408.14511] Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods
  • CoT: one of the key techniques that enabled so-called LLM “reasoning”.
  • Statistically speaking, CoT is (approximately) a Bayesian estimator drawing evidence from the examples in the CoT context.
    • The statistical error of CoT can be upper-bounded by the sum of a pre-training error and a prompting error (rendered schematically after the questions below).
      • The prompting error decreases exponentially with the number of demonstrations included in the prompt.
  • Questions:
    • (a) What are the statistical estimators constructed by CoT and its variants?
    • (b) What are the statistical properties of these estimators?
    • (c) How does the transformer architecture enable the LLMs to learn these estimators?
    • (d) Does CoT prompting always outperform vanilla ICL?
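Schematically, the bound reads as below; the paper’s exact error metric and constants differ, C, c > 0 are placeholders, and n is the number of demonstrations:

  \[
    \mathrm{err}(\mathrm{CoT})
      \;\le\;
    \underbrace{\varepsilon_{\mathrm{pretrain}}}_{\text{pre-training error}}
      \;+\;
    \underbrace{C\, e^{-c\, n}}_{\text{prompting error}}
  \]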
Andrej Karpathy on X: LLM training and human learning with textbooks
  • so, human learning ~= machine learning?

TO INTERNET, BUILD FROM SCRATCH WITH LOVE AND EMACS \[T]/
[2025-01-25 Sat 13:56]