Defining and characterizing reward gaming. J Skalse, N Howe, D Krasheninnikov, D Krueger. Advances in Neural Information Processing Systems 35, 9460-9471, 2022. Cited by 232.
Risks from learned optimization in advanced machine learning systems. E Hubinger, C van Merwijk, V Mikulik, J Skalse, S Garrabrant. arXiv preprint arXiv:1906.01820, 2019. Cited by 147.
Is SGD a Bayesian sampler? Well, almost. C Mingard, G Valle-Pérez, J Skalse, AA Louis. Journal of Machine Learning Research 22 (79), 1-64, 2021. Cited by 56.
Invariance in policy optimisation and partial identifiability in reward learning. JMV Skalse, M Farrugia-Roberts, S Russell, A Abate, A Gleave. International Conference on Machine Learning, 32033-32058, 2023. Cited by 51.
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems. D Dalrymple, J Skalse, Y Bengio, S Russell, M Tegmark, S Seshia, ... arXiv preprint arXiv:2405.06624, 2024. Cited by 33.
Neural networks are a priori biased towards boolean functions with low entropy. C Mingard, J Skalse, G Valle-Pérez, D Martínez-Rubio, V Mikulik, ... arXiv preprint arXiv:1909.11522, 2019. Cited by 33.
Misspecification in inverse reinforcement learning. J Skalse, A Abate. Proceedings of the AAAI Conference on Artificial Intelligence 37 (12), 15136 …, 2023. Cited by 31.
Lexicographic multi-objective reinforcement learning. J Skalse, L Hammond, C Griffin, A Abate. arXiv preprint arXiv:2212.13769, 2022. Cited by 27.
Reinforcement learning in Newcomblike environments. J Bell, L Linsefors, C Oesterheld, J Skalse. Advances in Neural Information Processing Systems 34, 22146-22157, 2021. Cited by 17.
On the limitations of Markovian rewards to express multi-objective, risk-sensitive, and modal tasks. J Skalse, A Abate. Uncertainty in Artificial Intelligence, 1974-1984, 2023. Cited by 12.
Goodhart's Law in Reinforcement Learning. J Karwowski, O Hayman, X Bai, K Kiendlhofer, C Griffin, J Skalse. arXiv preprint arXiv:2310.09144, 2023. Cited by 11.
STARC: A General Framework For Quantifying Differences Between Reward Functions. J Skalse, L Farnik, SR Motwani, E Jenner, A Gleave, A Abate. arXiv preprint arXiv:2309.15257, 2023. Cited by 7.
The reward hypothesis is false. JMV Skalse, A Abate. 2022. Cited by 5.
Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification. J Skalse, A Abate. arXiv preprint arXiv:2403.06854, 2024. Cited by 4.
On The Expressivity of Objective-Specification Formalisms in Reinforcement Learning. R Subramani, M Williams, M Heitmann, H Holm, C Griffin, J Skalse. arXiv preprint arXiv:2310.11840, 2023. Cited by 4.
A general framework for reward function distances. E Jenner, JMV Skalse, A Gleave. NeurIPS ML Safety Workshop, 2022. Cited by 4.
All’s Well That Ends Well: Avoiding Side Effects with Distance-Impact Penalties. C Griffin, JMV Skalse, L Hammond, A Abate. NeurIPS ML Safety Workshop, 2022. Cited by 2.
A General Counterexample to Any Decision Theory and Some Responses. J Skalse. arXiv preprint arXiv:2101.00280, 2021. Cited by 2.
Safety Properties of Inductive Logic Programming. G Leech, N Schoots, J Skalse. SafeAI@AAAI, 2021. Cited by 2.