Announcement_17
Our new paper “CAP: Counterfactual Activation Potential for Quantifying Suppressed Safety Features in Language Models” has been submitted to COLM 2026!