From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning
This paper investigates the fragility of safety alignment in Large Language Models (LLMs) during fine-tuning, revealing that even fine-tuning on benign samples can substantially degrade safety. By analyzing how parameters evolve over the course of fine-tuning, the study identifies which samples contribute to a drift toward unsafe behaviors, assigning each a sample-level risk score. The findings highlight the importance of understanding sample-level risks for maintaining model safety.
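To make the idea of sample-level risk scoring concrete, here is a minimal illustrative sketch, not the paper's actual method: it scores each fine-tuning sample by how strongly its parameter update (for a toy linear model) aligns with an assumed known "unsafe drift" direction in parameter space. The model, loss, and scoring rule are all hypothetical assumptions chosen for illustration.

```python
import numpy as np

def sample_gradient(w, x, y):
    """Gradient of the squared error 0.5*(w.x - y)^2 for one sample
    of a toy linear model (stand-in for a real per-sample gradient)."""
    return (w @ x - y) * x

def risk_score(grad, unsafe_direction):
    """Cosine similarity between a sample's SGD update (-grad) and the
    unsafe drift direction; higher means the sample pushes parameters
    further toward the unsafe region."""
    update = -grad  # gradient descent moves opposite to the gradient
    denom = np.linalg.norm(update) * np.linalg.norm(unsafe_direction)
    if denom == 0:
        return 0.0
    return float(update @ unsafe_direction / denom)

# Toy setup: random weights, random samples, and an assumed drift direction.
rng = np.random.default_rng(0)
w = rng.normal(size=4)
unsafe_dir = np.array([1.0, 0.0, 0.0, 0.0])  # hypothetical unsafe direction

samples = [(rng.normal(size=4), rng.normal()) for _ in range(5)]
scores = [risk_score(sample_gradient(w, x, y), unsafe_dir) for x, y in samples]
```

Samples with the highest scores would be flagged as most likely to erode safety alignment; in practice the drift direction would have to be estimated, e.g. from observed parameter trajectories during fine-tuning.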