Goal Misgeneralization
One candidate definition: the range of environments in which an AI’s behavior differs from its behavior in the training environment.
But this includes environments in which the AI merely acts incompetently; goal misgeneralization should cover only cases where capabilities transfer but the pursued goal does not.
Goal-directedness is an underdefined concept.
Rohin Shah et al. use “how easily a model can be fine-tuned to a task” as a measure of the degree of that model’s capability for that task.
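To make that operationalization concrete, here is a minimal sketch of a tuneability score on a toy task: capability is scored by how few optimization steps fine-tuning needs to reach a performance threshold, relative to training from scratch. The toy task, the threshold, and all function names are my own illustrative assumptions, not Shah et al.’s actual protocol.

```python
# Sketch of a tuneability-based capability score (illustrative only).
import random

def steps_to_threshold(w, data, lr=0.1, threshold=0.05, max_steps=10_000):
    """SGD on a 1-D linear regression toy task; return the number of
    steps until mean squared error drops below `threshold`."""
    for step in range(max_steps):
        x, y = random.choice(data)
        w -= lr * 2 * (w * x - y) * x  # gradient of the squared error
        mse = sum((w * x - y) ** 2 for x, y in data) / len(data)
        if mse < threshold:
            return step + 1
    return max_steps

def tuneability(pretrained_w, data):
    """Capability score: steps needed from a random init, divided by
    steps needed when fine-tuning from `pretrained_w`. A model that is
    already capable of the task fine-tunes quickly, so it scores high."""
    from_scratch = steps_to_threshold(random.uniform(-1.0, 1.0), data)
    fine_tuned = steps_to_threshold(pretrained_w, data)
    return from_scratch / max(fine_tuned, 1)

random.seed(0)
task = [(0.1 * i, 0.3 * i) for i in range(1, 11)]  # target weight w = 3
print(tuneability(pretrained_w=2.9, data=task))    # near-capable: high score
print(tuneability(pretrained_w=-2.0, data=task))   # far off: score near 1
```

The appeal of this framing is that it avoids defining capability intrinsically: it only compares adaptation cost against learning from scratch.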
I don’t like this tuneability measure.
Langosco et al. might have a better definition, but I still have to check that out.