New research evaluating the effectiveness of reward modeling during Reinforcement Learning from Human Feedback (RLHF): “SEAL: Systematic Error Analysis for Value ALignment.” The paper introduces quantitative metrics for evaluating the effectiveness of modeling and aligning human values:
Abstract: Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values, namely feature imprint, alignment resistance and alignment robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them a metric we term feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to perturbed inputs. Our experiments, utilizing open-source components like the Anthropic preference dataset and OpenAssistant RMs, reveal significant imprints of target features and a notable sensitivity to spoiler features. We observed a 26% incidence of alignment resistance in portions of the dataset where LM-labelers disagreed with human preferences. Furthermore, we find that misalignment often arises from ambiguous entries within the alignment dataset. These findings underscore the importance of scrutinizing both RMs and alignment datasets for a deeper understanding of value alignment.
More Stories
Friday Squid Blogging: Cotton-and-Squid-Bone Sponge
News: A sponge made of cotton and squid bone that has absorbed about 99.9% of microplastics in water samples in...
Apps That Are Spying on Your Location
404 Media is reporting on all the apps that are spying on your location, based on a hack of the...
Cybercriminals Use Fake CrowdStrike Job Offers to Distribute Cryptominer
CrowdStrike warned it had observed a phishing campaign impersonating the firm’s recruitment process to lure victims into downloading cryptominer Read...
Slovakia Hit by Historic Cyber-Attack on Land Registry
A large-scale cyber-attack has targeted the information system of Slovakia’s land registry, impacting the management of land and property records...
Canadian man loses a cryptocurrency fortune to scammers – here’s how you can stop it happening to you
A Canadian man lost a $100,000 cryptocurrency fortune - all because he did a careless Google search. Read more in...
Medusind Breach Exposes Sensitive Patient Data
The US medical billing firm is notifying over 360,000 customers that their personal, financial and medical data may have been...