[ad_1]
Escalatory Tendencies of Language Models
Two of our research projects explored the potential risks and biases language models introduce into high-stakes military decision-making, aiming to understand their behavior in scenarios requiring precise, ethical, and strategic decisions to illustrate their safety limitations.
In our first project, we analyzed safety-trained language models in a simulated U.S.-China wargame, comparing language model-simulated with national security expert decision-making. While there was significant overlap in many decisions, the language models exhibited critical deviations in individual actions. These deviations varied based on the specific model, its intrinsic biases, and phrasing of inputs and dialog given to the model. For instance, one model was more likely to adopt an aggressive stance when instructed to avoid friendly casualties, opting to open fire on enemy combatants, which escalated the conflict from a standoff to active combat. Such behavior underscores the intrinsic biases within different models regarding the acceptable level of violence, highlighting their potential to escalate conflicts more readily than human decision-makers.
Our other study on language models acting as independent agents in a geopolitical simulation revealed a tendency towards conflict escalation and unpredictable escalation patterns. Models frequently engaged in arms races, with some even resorting to nuclear weapons. These outcomes varied based on the specific model and inputs, highlighting the unpredictable nature of language models in critical decision-making roles and emphasizing the need for rigorous scrutiny in military and international relations contexts.
While there are methods to increase the safety of language models and fine-tune them on examples of human preferable and ethical behavior, none offer behavioral guarantees, complete protection against adversarial inputs, or the ability to embed precise ethical rules into the models (e.g., “Never harm unarmed combatants”). Contrary to the off-the-shelf language models we evaluated, creating a pacifistic and deescalatory language model is possible with existing training paradigms—but only with a pacifistic tendency that will not hold for all possible input scenarios. To get the hypothetical pacifist language model to be escalatory can be as simple as adding a few words of human-incomprehensible gibberish or constructing the exemplary scenario.
Due to the mentioned issues, the observed escalatory tendencies seem bound to happen. The models most likely replicate the underlying biases from the training data from books (e.g., there are more academic works on escalation and deterrence than de-escalation) and gamified texts (e.g., text-based role-playing games).
[ad_2]
Source link