LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B — LessWrong

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish. …
