What “truly” enables “reasoning”?
Is CoT reasoning?
- DeepSeek-R1 dropped a bomb: it all but dethroned o1 at a much lower training cost, and its inference pricing is comparably cheap too.
- They claimed the stronger reasoning model R1-Zero was trained using RL with “handcrafted” rule-based rewards: an accuracy reward and a format reward.
- A good reward model for any deep, large-scale RL system is hard to come by and hard to train.
- “Accuracy and format”. Sounds simple enough.
- The accuracy reward is grounded in “facts”, either mathematical truths or common real-world knowledge.
- The format reward enforces that the whole system includes the “reasoning” pattern: `<think>...</think>`.
- Then the whole RL training pipeline is simply GRPO using this pre-defined (rather than pre-trained) reward (along with a reference model) to refine the token generator, the policy model. (?) A toy sketch of the rewards and the group-relative advantage follows this list.
- I wonder if this is truly how we train ourselves to “reason”.
- Accuracy, for sure; when we solve a maths quiz, we are happy.
- Format, luring out tokens that could maximise the long-term probability of a higher accuracy reward by stringing out more `<think>` tokens?
- In this way, does reasoning have to be performed within the context window?
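
To make the recipe concrete, here is a minimal, hypothetical sketch of the two rule-based rewards plus GRPO’s group-relative advantage. The `<think>`/`<answer>` tag grammar, the exact-match scoring, and all function names here are my own assumptions for illustration, not DeepSeek’s actual implementation.

```python
import re

# Hypothetical sketch of R1-Zero-style rule-based rewards + GRPO advantages.
# The <think>/<answer> grammar and exact-match scoring are assumptions for
# illustration, not DeepSeek's actual code.

FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps reasoning in <think>...</think> followed
    by an <answer>...</answer> block, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted final answer exactly matches the ground truth."""
    m = ANSWER_PATTERN.search(completion)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO: z-score each sampled completion's reward against its own group
    (mean/std over the group), in place of a learned value model."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# A group of 4 sampled completions for one prompt ("what is 2 + 2?").
group = [
    "<think>2 + 2 = 4</think><answer>4</answer>",     # correct + well-formatted
    "<think>hmm, maybe 5</think><answer>5</answer>",  # well-formatted, wrong
    "4",                                              # no tags: fails both rewards
    "<think>two plus two</think><answer>4</answer>",  # correct + well-formatted
]
rewards = [accuracy_reward(c, "4") + format_reward(c) for c in group]
print(rewards)                   # [2.0, 1.0, 0.0, 2.0]
print(grpo_advantages(rewards))  # correct, well-formatted samples get positive advantage
```

Note how the group-relative advantage removes the need for a learned value model: each sampled completion is scored only against its siblings for the same prompt.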
| 2024 arxiv: [2408.14511] Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods |
- CoT: one of the key techniques that enabled so-called LLM “reasoning”.
- Statistically speaking, CoT is (approximately) a Bayesian estimator drawing evidence from the examples in the CoT context.
- the statistical error of CoT can be upper bounded by a sum of pre-training error and prompting error (a schematic form of this bound follows the questions below)
- the prompting error decreases exponentially with the number of demonstrations included in the prompt
- Questions:
- (a) What are the statistical estimators constructed by CoT and its variants?
- (b) What are the statistical properties of these estimators?
- (c) How does the transformer architecture enable the LLMs to learn these estimators?
- (d) Does CoT prompting always outperform vanilla ICL?
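
Schematically, the bound above might be written as follows. This is only a sketch: the constants C, c > 0 and the demonstration count n are my placeholders, not the paper’s exact theorem statement.

```latex
% Schematic only: C, c > 0 and the demonstration count n are placeholders,
% not the exact constants or statement from arXiv:2408.14511.
\[
  \operatorname{err}\!\big(\widehat{y}_{\mathrm{CoT}}\big)
    \;\le\;
    \underbrace{\varepsilon_{\mathrm{pretrain}}}_{\text{pre-training error}}
    \;+\;
    \underbrace{C\, e^{-c\, n}}_{\text{prompting error, } n \text{ demonstrations}}
\]
```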
| Andrej Karpathy on X: LLM training and human learning with textbooks |
- so, human learning ~= machine learning?