Goal Misgeneralization
One candidate definition: the range of environments in which an AI’s behavior differs from its behavior in the training environment.
But this includes environments in which the AI merely acts incompetently; goal misgeneralization should cover only cases where capabilities transfer but the pursued goal does not.
Goal-directedness is an underdefined concept.
Rohin Shah et al. use “how easily a model can be fine-tuned to a task” as a measure of the degree of that model’s capability for that task.
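To make that operationalization concrete, here is a minimal sketch of a tuneability score on a toy task: capability is scored by how few optimization steps fine-tuning needs to reach a performance threshold, relative to training from scratch. The toy task, the threshold, and all function names are my own illustrative assumptions, not Shah et al.’s actual protocol.

```python
# Sketch of a tuneability-based capability score (illustrative only).
import random

def steps_to_threshold(w, data, lr=0.1, threshold=0.05, max_steps=10_000):
    """SGD on a 1-D linear regression toy task; return the number of
    steps until mean squared error drops below `threshold`."""
    for step in range(max_steps):
        x, y = random.choice(data)
        w -= lr * 2 * (w * x - y) * x  # gradient of the squared error
        mse = sum((w * x - y) ** 2 for x, y in data) / len(data)
        if mse < threshold:
            return step + 1
    return max_steps

def tuneability(pretrained_w, data):
    """Capability score: steps needed from a random init, divided by
    steps needed when fine-tuning from `pretrained_w`. A model that is
    already capable of the task fine-tunes quickly, so it scores high."""
    from_scratch = steps_to_threshold(random.uniform(-1.0, 1.0), data)
    fine_tuned = steps_to_threshold(pretrained_w, data)
    return from_scratch / max(fine_tuned, 1)

random.seed(0)
task = [(0.1 * i, 0.3 * i) for i in range(1, 11)]  # target weight w = 3
print(tuneability(pretrained_w=2.9, data=task))    # near-capable: high score
print(tuneability(pretrained_w=-2.0, data=task))   # far off: score near 1
```

The appeal of this framing is that it avoids defining capability intrinsically: it only compares adaptation cost against learning from scratch.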
I don’t like this tuneability measure.
Langosco et al. might have a better definition, but I still have to check that out.