Lately I've been treating LLMs as optimizers in algorithm space: instead of updating weights by gradient descent, a frozen model iteratively writes and refines artifacts (behavioral specifications, programmatic policies, even GPU kernels), guided by feedback from its own evaluations. The design of that feedback turns out to matter as much as the model, so a core question is feedback engineering, with signals ranging from a single bit of danger signal to dense social metrics that act as coordination signals between agents. This search loop is powerful but easy to game, so I pair it with lightweight oversight primitives (held-out evaluation gates, decoupled safety channels) that catch reward hacking and keep the optimization honest. I apply these ideas to multi-agent cooperation in sequential social dilemmas, and to autoresearch, where an outer-loop agent autonomously redesigns the very pipeline that drives the inner-loop synthesizer.
This builds on earlier work on making large language models safer and more steerable. I develop methods for configurable preference tuning using synthetic data and rubric-guided generation, enabling fine-grained control over LLM behavior. A recurring theme is test-time adaptation: rather than relying solely on training, I explore how models can self-correct at inference time—refining safety specifications on the fly and mitigating reward hacking without retraining. I also investigate self-critique and model merging as defenses against jailbreak attacks, and how LLMs can distill user feedback into persistent memory to improve over successive interactions.
Full list → Google Scholar
PhD in Statistics & Operations Research, Universidad Complutense de Madrid, 2021
Contributions to Large Scale Bayesian Inference and Adversarial Machine Learning
Supervised by David Ríos Insua (ICMAT, Royal Academy of Sciences) and David Gómez-Ullate (ICMAT, UCA)
MSc in Mathematical Engineering, UCM
Double Degree in Mathematics & Computer Science, UCM