03 | PostTrain

RLHF

BT model: assumes transitivity, but human preferences can contain cycles (A>B, B>C, C>A)
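A minimal sketch of why a cycle breaks the Bradley-Terry (BT) model, assuming the usual form where each response gets one scalar score and the preference probability is a sigmoid of the score difference. The names `bt_prob` and `cycle_fits` are illustrative, not from any library.

```python
import math

def bt_prob(r_i: float, r_j: float) -> float:
    """Bradley-Terry probability that item i is preferred over item j."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))

def cycle_fits(r_a: float, r_b: float, r_c: float) -> bool:
    """True if the scores make A>B, B>C and C>A all more likely than not."""
    return (
        bt_prob(r_a, r_b) > 0.5
        and bt_prob(r_b, r_c) > 0.5
        and bt_prob(r_c, r_a) > 0.5
    )

# P(i > j) > 0.5 requires r_i > r_j, so a cycle would need
# r_a > r_b > r_c > r_a, which no scalar scores can satisfy.
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
found = any(cycle_fits(a, b, c) for a in grid for b in grid for c in grid)
print(found)
```

Because each item is reduced to a single number, the model can only express a total ordering; cyclic preferences end up being averaged toward indifference.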

PPO

Collect preference judgments (which response is better, which is worse)
Train a reward model
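The two steps above can be sketched as a pairwise loss, assuming the reward model is trained with the standard BT negative log-likelihood on (chosen, rejected) pairs. The function names and the plain-list inputs are assumptions made for illustration; a real setup would compute the rewards with a neural network.

```python
import math
from typing import Sequence

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(
    chosen_rewards: Sequence[float],
    rejected_rewards: Sequence[float],
) -> float:
    """Mean of -log sigmoid(r_chosen - r_rejected) over all preference pairs."""
    losses = [
        -log_sigmoid(r_c - r_r)
        for r_c, r_r in zip(chosen_rewards, rejected_rewards)
    ]
    return sum(losses) / len(losses)

# A larger margin between chosen and rejected rewards gives a smaller loss.
small_margin = reward_model_loss([0.1], [0.0])
large_margin = reward_model_loss([2.0], [0.0])
```

Minimizing this loss pushes the reward of the preferred response above that of the rejected one for every labeled pair.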

sensitive to the reward model

- reward hacking

DPO

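The reward function is expressed in terms of the optimal policy, so no separate reward model needs to be trained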
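A minimal sketch of the resulting loss for one preference pair, assuming the standard DPO reparameterization in which the implicit reward is `beta * log(pi(y|x) / pi_ref(y|x))` plus a term that depends only on the prompt. The inputs are summed sequence log-probabilities, and the argument names and the default `beta=0.1` are illustrative assumptions.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for a single (chosen, rejected) pair."""
    # Implicit rewards. The prompt-only term (beta * log Z(x)) is the same
    # for both responses, so it cancels in the difference below.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Plug the implicit rewards into the BT negative log-likelihood.
    return -log_sigmoid(chosen_reward - rejected_reward)
```

When the policy equals the reference model, both implicit rewards are zero and the loss is `log 2`; it drops as the policy raises the chosen response's likelihood relative to the rejected one.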

preference-based model

No longer relies on the BT model
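As a rough illustration, a model that scores pairs directly (instead of assigning one scalar per response) can represent a preference cycle. The table `PREFS` and the function `pref_prob` below are hypothetical, with made-up probabilities; an actual preference model would be a learned function of the prompt and both responses.

```python
from typing import Dict, Tuple

# Hypothetical pairwise table: PREFS[(a, b)] is the probability that
# a is preferred over b. The values are made up for illustration.
PREFS: Dict[Tuple[str, str], float] = {
    ("A", "B"): 0.8,
    ("B", "C"): 0.8,
    ("C", "A"): 0.8,
}

def pref_prob(a: str, b: str) -> float:
    """Probability that a is preferred over b, looked up per pair."""
    if a == b:
        return 0.5
    if (a, b) in PREFS:
        return PREFS[(a, b)]
    # Preferences are complementary: P(a > b) = 1 - P(b > a).
    return 1.0 - PREFS[(b, a)]

# All three legs of the cycle hold at once, which scalar BT scores cannot do.
cycle_holds = all(pref_prob(a, b) > 0.5 for a, b in [("A", "B"), ("B", "C"), ("C", "A")])
```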