Weathour

SafeRL-YK: From Offline Safe Anchors to Online Gamma Scheduling

Fri, 03 Apr 2026 00:00:00 GMT

This post is a progress update on SafeRL-YK. The project does not try to let reinforcement learning output raw control commands end to end. Instead, it puts RL on top of a control backbone that already respects physical and safety constraints, and only asks it to learn when the situation looks more like Cruising and when it looks more like Stop-and-Go.

Core Idea

The project targets longitudinal car-following in autonomous driving. Its main idea can be summarized in one sentence:

Build two safe anchor controllers offline, then let RL learn only a scheduling parameter γ online.

The two anchors correspond to two representative traffic modes:

Cruising: relatively smooth following with low-frequency disturbances;
Stop-and-Go: congested following with stronger stop-start perturbations.

RL does not directly output throttle or acceleration. It outputs a scheduling variable that interpolates between two offline-solved YK safe controllers. The motivation is straightforward:

push as much safety as possible into the offline stage;
reduce RL to scheduling instead of controller invention;
use the structure hidden in real traffic data explicitly.
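To make the scheduling idea concrete, here is a minimal sketch of what "interpolating between two anchors" could look like. It is an illustration only: the post does not say whether the blend acts on controller parameters or on controller outputs, nor which anchor sits at γ=0, so the function name `blend_command` and the output-level convex combination are assumptions.

```python
def clamp(x, lo=0.0, hi=1.0):
    """Keep a value inside [lo, hi]."""
    return max(lo, min(hi, x))


def blend_command(u_gamma0, u_gamma1, gamma):
    """Convex combination of two anchor commands (hypothetical blending rule).

    u_gamma0: command from the anchor selected at gamma = 0
    u_gamma1: command from the anchor selected at gamma = 1
    gamma:    scheduling variable; clamped so the blend never extrapolates
    """
    g = clamp(gamma)
    return (1.0 - g) * u_gamma0 + g * u_gamma1
```

Because γ is clamped to [0, 1], the blended command always lies between the two anchor commands, which is the property that keeps RL in a scheduling role rather than a controller-design role.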

Project Goals

At the moment, the project has three layers of goals.

1. Extract usable driving modes from real data

The raw data comes from highD and inD. The offline pipeline performs:

longitudinal trajectory extraction;
long-horizon trajectory stitching;
frequency-domain GMM clustering;
slicing clustered results back into time-domain windows.
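As a rough illustration of the frequency-domain step, the sketch below computes one spectral feature per window: the fraction of signal energy below a cutoff frequency. A clustering model such as a GMM could then be fitted on features like this. The feature choice, the function name `band_energy_fraction`, and the sampling rate and cutoff values are all assumptions made for the example; the actual pipeline's features are not described in this post.

```python
import cmath
import math


def band_energy_fraction(signal, sample_hz, cutoff_hz):
    """Fraction of (mean-removed) signal energy at or below cutoff_hz.

    Uses a plain DFT over the positive-frequency bins, so it is only
    meant for short windows.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    low = 0.0
    total = 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * t / n)
            for t, c in enumerate(centered)
        )
        power = abs(coeff) ** 2
        total += power
        if k * sample_hz / n <= cutoff_hz:  # bin frequency in Hz
            low += power
    return low / total if total > 0 else 0.0
```

Under this feature, a smooth cruising-like window would score close to 1 and a window dominated by fast stop-start oscillation would score close to 0, which is the kind of separation the clustering stage is looking for.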

The goal is not just cleaning data. It is to expose recurring car-following modes in real traffic and turn them into structured inputs for anchor solving and RL training.

2. Solve two interpretable and safe controller anchors

Offline, the project solves one YK anchor for Cruising and one for Stop-and-Go under the same objective family and the same hard constraints. It checks not only frequency-domain stability but also emergency-braking safety in the time domain, so the solution is not merely “stable on paper.”

This stage is about obtaining two reliable endpoints before discussing online interpolation.
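The time-domain check can be pictured with a simple kinematic test: the lead vehicle brakes hard, the follower reacts after a delay, and the minimum gap must stay positive. The sketch below is a simplified stand-in, not the project's actual verification code; the constant-deceleration model, the function name, and every parameter value are assumptions.

```python
def min_gap_under_emergency_braking(gap0, v_lead, v_ego, lead_decel,
                                    ego_decel, reaction_s,
                                    dt=0.01, horizon_s=30.0):
    """Smallest gap (m) seen while both vehicles brake to a stop.

    The lead brakes at lead_decel from t = 0. The follower holds its
    speed for reaction_s seconds, then brakes at ego_decel.
    Speeds are in m/s, decelerations are positive numbers in m/s^2.
    """
    gap = gap0
    min_gap = gap0
    t = 0.0
    while t < horizon_s and (v_lead > 0 or v_ego > 0):
        a_lead = -lead_decel if v_lead > 0 else 0.0
        a_ego = -ego_decel if (t >= reaction_s and v_ego > 0) else 0.0
        v_lead_next = max(0.0, v_lead + a_lead * dt)
        v_ego_next = max(0.0, v_ego + a_ego * dt)
        # Trapezoidal integration of the relative speed.
        gap += (0.5 * (v_lead + v_lead_next) - 0.5 * (v_ego + v_ego_next)) * dt
        v_lead, v_ego = v_lead_next, v_ego_next
        min_gap = min(min_gap, gap)
        t += dt
    return min_gap
```

A candidate anchor would pass this kind of check only if the returned minimum gap stays above zero (or above a chosen margin) for the worst case considered.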

3. Let RL learn scheduling, not low-level physics

Online RL learns a scheduling parameter γ that smoothly moves between the two safe anchors. The current main branch has already shifted from predicting absolute γ to predicting incremental Δγ, because that makes the behavior smoother and closer to gradual correction rather than abrupt switching.
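The difference between the two action designs is easy to see in code. In the sketch below, the policy outputs an increment and the environment accumulates it, with both the step size and the resulting γ bounded. The per-step limit `max_step` and the function name are assumptions for illustration; the actual bounds used in the project are not stated here.

```python
def step_gamma(gamma, delta_gamma, max_step=0.1):
    """Apply one incremental action to the scheduling variable.

    The increment is clipped to [-max_step, max_step] (hypothetical
    limit), then the result is kept inside [0, 1].
    """
    delta = max(-max_step, min(max_step, delta_gamma))
    return max(0.0, min(1.0, gamma + delta))
```

With an absolute-γ action the policy could jump from 0 to 1 in one control step; with this incremental form it needs at least `1 / max_step` steps, which is what makes the behavior look like gradual correction.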

What Has Been Built So Far

At this point, the project is no longer just a concept. A fairly complete loop is already in place:

the offline data pipeline runs from raw trajectories to clustered windows;
weight sweeping and PSO anchor solving already produce best_weights.json and yk_anchors.pkl;
reference spectra can be generated offline for a traditional SAY-CM spectral scheduler baseline;
reward baseline caches are precomputed, avoiding repeated expensive simulation at environment reset;
the current training mainline has moved to delta_env + delta_train, with full-platoon observations, Δγ actions, 5 Hz control, and TTC penalties;
two delta_sweep experiment directories already exist, so the incremental-action branch has been trained in practice rather than only designed on paper.
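For readers unfamiliar with TTC penalties, here is one common way to write one. Time-to-collision is the gap divided by the closing speed, and the penalty grows as TTC drops below a threshold. The linear shape, the 3-second threshold, and the weight are assumptions for the example and may differ from what delta_env actually uses.

```python
import math


def time_to_collision(gap, v_ego, v_lead):
    """Seconds until contact at current speeds; infinite if not closing."""
    closing_speed = v_ego - v_lead
    if closing_speed <= 0:
        return math.inf
    return gap / closing_speed


def ttc_penalty(gap, v_ego, v_lead, ttc_threshold_s=3.0, weight=1.0):
    """Zero above the threshold, linearly more negative below it."""
    ttc = time_to_collision(gap, v_ego, v_lead)
    if ttc >= ttc_threshold_s:
        return 0.0
    return -weight * (ttc_threshold_s - ttc) / ttc_threshold_s
```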

In short, the project has already materialized as scripts, cached artifacts, evaluation code, and experiment outputs.

Current Results as of Early April 2026

As of April 2, 2026, the repository contains a batch evaluation summary over 500 trajectories.

First, the simplest static-anchor comparison:

static γ=0 reaches abs_mean = -6058.14;
static γ=1 reaches abs_mean = -7228.10;
over 500 trajectories, γ=0 wins 497 (99.4%), while γ=1 wins only 3.

That tells us something important: under the current data distribution, the single static γ=0 anchor is still very strong.

Now look at the current delta branch:

delta_gamma_0 achieves rel_mean = -110.85;
among the trained configurations, the closest one to it is C03_ds02_wdg5_phase1__relative_norm with rel_mean = -128.08;
other settings are worse, for example C01_ds02_wdg0_phase1__relative_norm = -140.77 and C02_ds02_wdg1_phase2__relative_norm = -308.40.
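The size of the gaps is easier to judge when computed explicitly. The snippet below only re-derives differences from the numbers reported above (less negative is better); the helper names are mine, not part of the evaluation code.

```python
# Values copied from the batch evaluation summary quoted in this post.
WINS = {"gamma=0": 497, "gamma=1": 3}
REL_MEAN = {
    "delta_gamma_0": -110.85,
    "C03_ds02_wdg5_phase1__relative_norm": -128.08,
    "C01_ds02_wdg0_phase1__relative_norm": -140.77,
    "C02_ds02_wdg1_phase2__relative_norm": -308.40,
}


def gaps_to_reference(scores, reference):
    """Score minus the reference score, rounded to two decimals."""
    ref = scores[reference]
    return {
        name: round(value - ref, 2)
        for name, value in scores.items()
        if name != reference
    }


def win_rate(wins, name):
    """Share of trajectories won by the named entry."""
    return wins[name] / sum(wins.values())
```

Calling `gaps_to_reference(REL_MEAN, "delta_gamma_0")` shows the best trained configuration trailing by 17.23 and the worst by 197.55, and `win_rate(WINS, "gamma=0")` gives 0.994.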

So the takeaway from the current summary is clear:

the engineering pipeline is working end to end;
the incremental-action RL branch is trainable and benchmarkable;
but it still does not consistently beat the strongest static anchor baseline.

I actually see this as a healthy stage. It means the main question is no longer “can the training script run?” but rather the more meaningful research question: what observation design, reward design, and curriculum are needed for scheduling to truly outperform a fixed anchor on real traffic modes?

What Matters Next

The next useful step is not making the system more complicated for its own sake. It is understanding why RL has not yet beaten the static anchor reliably. The questions I care about most now are:

whether relative_norm already removes enough mode-dependent baseline bias;
whether the two-stage curriculum really transfers policies from extreme modes to mixed traffic;
which factor contributes most to gains: Δγ, TTC penalties, or full-platoon observations;
where SafeRL-YK actually improves once traditional spectral scheduling and DMPC are evaluated inside the same framework.

Closing Note

If I had to summarize the current state of SafeRL-YK in one sentence, it would be this:

The project has moved past the “idea-only” stage and entered the stage where the offline safety backbone is mostly in place and the online scheduler is being tested against strict baselines.

The next step is to keep pushing both sides at once: make the offline anchors more reliable, and make RL learn a genuinely useful scheduling behavior instead of merely reproducing a strong static default.

SafeRL-YK：从离线安全锚点到在线 γ 调度

Fri, 03 Apr 2026 00:00:00 GMT

这篇文章用来同步我最近在做的 SafeRL-YK 项目：它想解决的不是“让强化学习直接端到端输出控制量”，而是让强化学习站在一个已经满足物理约束和安全约束的控制底座之上，只学习什么时候更像 Cruising，什么时候更像 Stop-and-Go。

项目想法

项目面向自动驾驶中的纵向跟驰场景。整体思路可以概括成一句话：

离线先把两个安全锚点控制器做扎实，在线阶段只让 RL 学一个调度参数 γ。

这里的两个锚点，分别对应两类典型交通工况：

Cruising：相对平稳、低频扰动的巡航跟驰；
Stop-and-Go：更频繁启停、加减速扰动更强的拥堵跟驰。

项目并不直接让 RL 输出油门或加速度，而是让它输出一个调度量，去插值两套离线求好的 YK 安全控制器。这样做的出发点很明确：

把安全性尽量前置到离线求解阶段；
把 RL 的责任缩小为“调度”而不是“发明控制器”；
把真实数据中的工况结构显式利用起来。

项目目标

目前这个项目的目标分成三层。

1. 从真实数据中提炼可用工况

原始数据来自 highD 和 inD。离线流水线会先做：

纵向轨迹提纯；
长时域轨迹拼接；
频域 GMM 聚类；
再把聚类结果回切为时域窗口。

这样做的目标不是单纯做数据清洗，而是要把真实车流中可重复出现的“跟驰模态”抽出来，给后面的锚点求解和 RL 训练提供结构化输入。

2. 求出两个可解释、安全的控制锚点

离线阶段会在统一目标和统一硬约束下，分别为 Cruising 与 Stop-and-Go 求解两套 YK 锚点控制器。这里不仅检查频域稳定性，也做时域急停安全校验，避免出现“数学上稳，但紧急情况下不安全”的解。

这一步的目标，是先得到两个足够可靠的端点控制器，再谈在线插值。

3. 让强化学习只学调度，不学底层物理

在线阶段的 RL 不直接生成控制律，而是学习调度参数 γ，在两个安全锚点之间平滑切换。当前主线已经从“直接输出绝对 γ”转向“输出增量 Δγ”，因为这种形式更适合做平滑调节，也更贴近“逐步修正”而不是“瞬间跳变”。

现在的实现已经做到什么

到目前为止，这个项目已经不只是一个概念验证，而是形成了一条比较完整的闭环：

离线数据流水线已经打通，从原始轨迹到 clustered windows 都能落盘；
权重扫描和 PSO 锚点求解已经产出 best_weights.json 与 yk_anchors.pkl；
参考频谱已经可以离线生成，供传统 SAY-CM 频谱调度器做基线；
奖励基线缓存已经预计算，训练时不用在 reset 阶段反复做高代价仿真；
当前训练主线已经迁移到 delta_env + delta_train，也就是全车队观测、Δγ 动作、5Hz 控制频率、带 TTC 惩罚项的版本；
项目里已经存在两轮 delta_sweep 训练目录，说明增量版实验已经实际跑起来了。

换句话说，项目的核心结构已经从“离线锚点 + 在线调度”落到了具体脚本、缓存文件、评测脚本和实验目录上。

截至 2026 年 4 月初的阶段性效果

截至 2026 年 4 月 2 日，项目里已有一份 500 条轨迹的批量评测摘要。

先看最朴素的静态锚点对比：

静态 γ=0 的 abs_mean 为 -6058.14；
静态 γ=1 的 abs_mean 为 -7228.10；
在 500 条轨迹里，γ=0 赢了 497 条，γ=1 只赢了 3 条。

这说明一件很重要的事：当前数据分布下，单一的 γ=0 静态锚点仍然很强。

再看当前主线的 delta 版本评测：

delta_gamma_0 的 rel_mean 是 -110.85；
几个已训练配置里，最接近它的是 C03_ds02_wdg5_phase1__relative_norm，rel_mean 为 -128.08；
其余配置大多更差，例如 C01_ds02_wdg0_phase1__relative_norm 为 -140.77，C02_ds02_wdg1_phase2__relative_norm 为 -308.40。

所以如果只看目前这份摘要，结论也很明确：

项目的工程链路已经跑通；
增量动作版 RL 已经能稳定进入可评测状态；
但它还没有稳定超过当前最强的静态锚点基线。

我觉得这反而是一个健康的阶段。因为这说明问题不在“有没有把训练脚本跑起来”，而在更核心的研究问题上：真实工况下，什么样的观测、奖励与课程设计，才能让调度策略真正优于固定锚点？

我现在更关心什么

接下来这个项目最值得继续追的，不是把系统变得更复杂，而是把“为什么 RL 还没稳定赢过静态锚点”拆清楚。当前我更关注几件事：

relative_norm 奖励是否已经足够消除不同工况的天然基线偏置；
两阶段课程训练能否真正把极端工况中的策略迁移到混合工况；
Δγ 的动作设计、TTC 惩罚和全车队观测之间，谁在真正推动收益提升；
传统频谱调度与 DMPC 基线放进同一套评价框架后，SafeRL-YK 的优势到底出现在哪一维。

小结

如果要用一句话描述 SafeRL-YK 现在所处的位置，那就是：

它已经走过了“只讲思路”的阶段，进入了“离线安全底座基本成型，在线调度策略开始接受严格基线检验”的阶段。

我接下来会继续把这条线往前推：一边让离线锚点更可靠，一边让 RL 真正学会在真实交通模态之间做有意义的调度，而不是仅仅复现一个静态好用的默认解。

Welcome to My Blog

Fri, 03 Apr 2026 00:00:00 GMT

Welcome to my blog.

This site will gradually become a place for writing about:

autonomous driving research
tool development and automation
reading notes and methodology
films, texts, and personal reflections

The site currently supports:

profile sidebar
dark / light mode
search
archive page
Chinese / English switching

欢迎来到我的博客

Fri, 03 Apr 2026 00:00:00 GMT

你好，欢迎来到这里。

这个博客会逐步变成一个记录以下内容的地方：

无人驾驶相关研究
工具开发与自动化
阅读笔记与方法总结
电影、文本与个人心得

目前站点已经支持：

个人主页侧边栏
深色 / 浅色模式
搜索功能
中文归档页
中英语言切换

My Long-Horizon Codex × Obsidian Handoff Workflow

Fri, 03 Apr 2026 00:00:00 GMT

Lately I have been trying to answer a simple question: if I want Codex to become my main handoff partner for research work, what should my workflow actually look like?

My earlier setup looked fairly traditional: project cards in one area, daily logs in another, plus overview pages to pull tasks together. It worked, but one problem became increasingly obvious:

I was maintaining the same context over and over again.

Project goals lived in project cards. Today's work lived in the daily log. Cross-session continuation required another manual explanation. The system itself slowly became overhead.

So I recently compressed the whole thing into a lighter structure: a four-file project memory layer + a daily log + a session archive. The goal is simple: in the future I should only need to tell Codex “continue this project,” and Codex should reconstruct the rest of the context on its own.

What was wrong with the old setup

The old setup did not fail because I was not documenting enough. It failed because the documentation was duplicated and scattered.

In practice, that meant:

the project card stored long-lived goals, next steps, waiting items, and project tasks;
the daily log repeated which project I was pushing that day and what remained unfinished;
each new conversation required yet another manual handoff.

This created two real problems. First, the most important long-lived state did not have a single stable entry point. Second, I still had to reorganize context myself every time I resumed work.

For long-horizon research, that workflow was too heavy.

The new structure I use now

I now split the workflow into three layers:

long-lived project memory
daily execution
cross-session handoff

1. Long-lived project memory: the four-file set

Each active project now keeps four files:

Prompt.md
Plan.md
Implement.md
Documentation.md

Each file has a narrow role.

`Prompt.md`

This stores relatively stable information:

project goal
key constraints
non-goals
definition of done

In other words, it answers: why does this project exist, and what counts as finished?

`Plan.md`

This stores stage-level execution structure:

current phase
milestones
current 1–3 next actions
validation rules
stop rules

It answers: how should this project move forward in the current stage?

`Implement.md`

This stores the operating protocol for Codex itself:

what to read first at the beginning of a session
what to update at the end
what kind of information belongs in which file
what should not be mixed together

It is basically the project's own handoff contract.

`Documentation.md`

This is the most important file. It acts as a single-page working memory for the current state of the project.

It stores:

current status
known facts
key judgments
current risks
recent progress
exact next step

If I only ask Codex to read one project file before continuing, this is the one I want it to read first.
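For anyone who wants to try the same layout, the four-file set can be scaffolded with a few lines of Python. This is a sketch, not a script I actually run: the section headings are taken from the lists above, while the function name and the Markdown skeleton it writes are assumptions.

```python
from pathlib import Path

# File name -> section headings, following the roles described above.
PROJECT_MEMORY_FILES = {
    "Prompt.md": ["Project goal", "Key constraints", "Non-goals",
                  "Done definition"],
    "Plan.md": ["Current phase", "Milestones", "Next actions (1-3)",
                "Validation rules", "Stop rules"],
    "Implement.md": ["Read first at session start", "Update at session end",
                     "What belongs in which file", "What not to mix"],
    "Documentation.md": ["Current status", "Known facts", "Key judgments",
                         "Current risks", "Recent progress",
                         "Exact next step"],
}


def scaffold_project(project_dir):
    """Create any missing memory files; never overwrite existing ones."""
    root = Path(project_dir)
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name, sections in PROJECT_MEMORY_FILES.items():
        path = root / name
        if path.exists():
            continue
        body = f"# {name}\n\n" + "".join(f"## {s}\n\n" for s in sections)
        path.write_text(body, encoding="utf-8")
        created.append(path)
    return created
```

Skipping files that already exist matters here, because these files are long-lived memory and a scaffold script should never wipe them.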

2. Daily execution: the daily log

The daily log still exists, but its role is much smaller now.

It only records:

my top three priorities today
which projects I pushed today
what I actually did today
key judgments, blockers, and reflections from today
what I should do tomorrow
what Codex should pick up next time

So the daily log is now truly about today, not about long-term project memory.

3. Cross-session handoff: the session archive

Besides the daily log, I keep a separate session archive:

YYYY-MM-DD-会话NN.md

Its only job is to help the next conversation continue smoothly. A session note records:

what the current conversation completed
which files were modified
what key judgments were made
what remains unresolved
where the next session should resume

This is not the main project notebook. It is a lightweight continuation bridge.
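The naming pattern is simple enough to generate mechanically. The sketch below assumes that NN is a two-digit counter that restarts at 01 each day; that reading of the pattern, and the function name, are assumptions.

```python
import re
from datetime import date

# Matches names such as 2026-04-03-会话01.md
SESSION_NAME = re.compile(r"^(\d{4}-\d{2}-\d{2})-会话(\d{2})\.md$")


def next_session_filename(existing_names, today):
    """Return the next free session note name for the given day."""
    prefix = today.isoformat()
    used = []
    for name in existing_names:
        match = SESSION_NAME.match(name)
        if match and match.group(1) == prefix:
            used.append(int(match.group(2)))
    return f"{prefix}-会话{max(used, default=0) + 1:02d}.md"
```

Because the date comes first and the counter is zero-padded, sorting the file names alphabetically also sorts the sessions chronologically.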

How I actually use it

The working style is now intentionally simple.

When starting work

I only need to say something like:

Continue YK-RL
Switch to the Chinese platoon review
Continue yesterday's direction exploration

Then Codex reads, in order:

the project's Documentation.md
the project's Plan.md
the latest relevant session archive
today's daily log
Prompt.md if needed
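The reading order above amounts to a small lookup. The sketch below expresses it as a function that returns the files to read, in order. It treats "latest relevant session archive" as simply the last note by file name, and it assumes a dedicated directory holding only session notes; both are simplifications, and the directory layout and function name are hypothetical.

```python
from pathlib import Path


def session_start_reading_list(project_dir, session_dir, daily_log,
                               include_prompt=False):
    """Files to read at session start, in the order described above.

    Missing files are skipped rather than treated as errors.
    """
    project = Path(project_dir)
    order = [project / "Documentation.md", project / "Plan.md"]
    # Names start with YYYY-MM-DD and a zero-padded counter, so the
    # alphabetically last note is also the most recent one.
    sessions = sorted(Path(session_dir).glob("*.md"))
    if sessions:
        order.append(sessions[-1])
    order.append(Path(daily_log))
    if include_prompt:
        order.append(project / "Prompt.md")
    return [path for path in order if path.exists()]
```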

So I no longer prepare a long manual handoff paragraph.

When ending work

I only need to say:

Let's stop here, leave a handoff
Write a session summary
Record today's state

Then Codex handles the rest:

update the project's Documentation.md
update Plan.md if needed
update today's daily log
write a new session archive note

Why this works better for me

My current work is not a single short task. It is a set of parallel long-horizon efforts: one paper that needs to be wrapped up, one review that still needs scope definition, and one future direction that needs to be compressed into a small number of viable research options.

For this kind of work, what I really need is not more dashboards. I need more stable recoverable context.

This structure works better because:

long-lived information is no longer spread across project cards, daily logs, and verbal explanations;
daily execution is separated from project memory;
session handoff has its own lightweight layer;
I am no longer the one responsible for reconstructing context; Codex is.

At its core, this is not about “writing down more things.” It is about something more important:

separating long-lived memory, daily execution, and cross-session handoff into clean layers.

What I cleaned up

To make the new workflow real, I also removed several pieces of the old setup:

the old 00-项目总览.md
the old 00-日志总览.md
the old project_card_template.md
the old 写作计划.md files that used to act as part of the main workflow

At the same time, 项目卡.md is no longer treated as a large all-in-one project record. It is now a lightweight entry page.

I only keep supporting notes that still have real value, such as literature classification tables, direction notes, daily logs, and session archives.

Closing

If I had to summarize the change in one sentence, it would be this:

I no longer treat project management as a system that I must maintain manually; I treat it as a project memory structure that Codex can read, update, and continue.

This workflow is not trying to be sophisticated. It is trying to achieve one practical outcome: the next time I start working, I say one sentence and the system picks up from there.

I will keep running a few more cycles with this setup and see how much further it can be compressed and stabilized in real long-horizon research work.

我的 Codex × Obsidian 长周期交接工作流

Fri, 03 Apr 2026 00:00:00 GMT

过去一段时间，我一直在折腾一个问题：如果我以后主要和 Codex 交接工作，而不是自己维护一套复杂的项目系统，那我的研究工作流应该长什么样？

我最开始的做法，其实很像传统的“项目卡 + 日志 + 总览页”体系：项目区里放项目卡，日志区里放今日日志，再用几个总览页把任务拉出来看。这个体系不是不能用，但它有一个越来越明显的问题：

我总是在重复维护同一批信息。

项目目标写在项目卡里，今天做什么写在日志里，跨会话要继续时又得重新解释一遍。结果是，系统本身变成了负担。

这次我把它彻底收敛成了一套更轻的结构：项目四件套 + 今日日志 + 会话留档。核心目标只有一个：以后我只需要对 Codex 说一句“继续推进某个项目”，剩下的上下文恢复工作由 Codex 自己完成。

旧体系的问题

旧体系的核心问题不是“没有记录”，而是记录太分散，而且重复。

主要体现在三点：

项目卡里有长期目标、下一步、waiting、项目任务；
今日日志里又会重复写当天推进的项目和滚入任务；
每次换一个新会话，还得再做一遍口头 handoff。

这会带来两个后果：

第一，真正重要的长期状态没有被压缩成一个稳定入口；第二，我每次开工前都还要自己重新组织上下文。

对于长周期研究工作来说，这样的流程太重了。

我现在采用的新结构

现在我把整个工作流拆成三层：

项目长期记忆层
当天执行层
跨会话交接层

具体对应的文件结构是：

1. 项目长期记忆层：四件套

每个活跃项目都维护四个文件：

Prompt.md
Plan.md
Implement.md
Documentation.md

它们各自分工很明确。

`Prompt.md`

这里放相对稳定的信息：

项目目标
关键约束
非目标
done 标准

也就是说，这个文件回答的是：这个项目到底为什么做、做到什么算结束。

`Plan.md`

这里放阶段性推进结构：

当前阶段
里程碑
当前 1–3 个下一步
验证方式
stop rules

它回答的是：接下来这一段时间怎么推进。

`Implement.md`

这里放 Codex 自己要遵守的交接协议：

每次开始先读什么
每次结束要更新什么
哪类信息应该写到哪个文件
什么内容不要乱放

它更像“项目自己的操作手册”。

`Documentation.md`

这是最关键的一个文件，相当于项目当前状态的单页记忆。

这里会保留：

当前状态
已知事实
关键判断
当前风险
最近进展
下一步

如果下次开工我只让 Codex 先读一个文件，那我会优先让它读这个。

2. 当天执行层：今日日志

今日日志继续保留，但职责被大幅收缩。

它现在只负责写：

我今天最重要的 3 件事
我今天推进的项目
我今天做了什么
我今天的关键判断 / 卡点 / 反思
我明天优先做什么
下次需要 Codex 接着做什么

也就是说，今日日志只负责“今天”，不再承担长期项目主档案的角色。

3. 跨会话交接层：会话留档

除了今日日志之外，我还保留一层单独的会话留档：

YYYY-MM-DD-会话NN.md

这里专门写给“下一个会话”的 handoff，内容包括：

本次会话完成了什么
修改了哪些文件
形成了哪些关键判断
还有哪些未完成 / 风险点
下一次应该从哪里继续

这个文件的定位很明确：它不是长期项目档案，而是跨会话续接的桥。

这套流程怎么实际使用

工作方式已经被我压缩得很简单。

开始工作时

我以后只需要说一句：

继续推进 YK-RL
切到车辆队列中文综述
继续昨天那个方向判断

然后 Codex 自己按顺序去读：

项目的 Documentation.md
项目的 Plan.md
最近一次相关会话留档
当天日志
必要时再读 Prompt.md

也就是说，我不再手工准备一长段交接背景。

结束工作时

我只需要说：

今天先到这里，帮我留档
做个交接总结
把今天状态记下来

然后 Codex 自己完成：

更新项目 Documentation.md
必要时更新 Plan.md
更新今日日志
写新的会话留档

这个改法为什么更适合我

我现在的工作不是单个短任务，而是多个并行推进的长周期研究：例如一篇要收尾的论文、一篇要定边界的综述、一个准备转向的新研究方向。

这类工作真正需要的，不是更多表格，而是更稳定的可恢复上下文。

新的结构更适合我的原因是：

长期信息不再散落在项目卡、日志、口头补充里；
当天执行和长期状态被拆开了；
会话 handoff 有了专门的轻量层；
我不再负责“整理上下文”，而是让 Codex 去读和压缩上下文。

本质上，这不是在“多记一点”，而是在做一件更重要的事：

把长期记忆、当天执行和跨会话交接彻底分层。

我清理掉了哪些旧东西

为了让新流程真正生效，我也顺手删掉了几类旧式文件：

旧的 00-项目总览.md
旧的 00-日志总览.md
旧的 project_card_template.md
各项目里原本承担主流程功能的 写作计划.md

与此同时，原来的 项目卡.md 也不再承担大段主档案功能，而是被改成轻量入口页。

我只保留那些仍然有支持价值的材料，例如文献分类表、方向判断笔记、以及日常日志和会话留档。

结尾

如果要用一句话概括这次调整，我会这样说：

我不再把项目管理当成“我要手工维护的一套系统”，而是把它变成“Codex 可以主动读取和续接的一组项目记忆文件”。

这套工作流并不追求复杂，而是追求一件事：下一次开始工作时，我只说一句话，系统就能自己接上。

接下来我会继续用这套结构跑几轮，看看它在实际长周期研究推进中还能怎么再压缩、再稳定。