Joar Skalse
DPhil Student in Computer Science, Oxford University
Verified email at cs.ox.ac.uk
Title · Cited by · Year
Defining and characterizing reward gaming
J Skalse, N Howe, D Krasheninnikov, D Krueger
Advances in Neural Information Processing Systems 35, 9460-9471, 2022
Cited by 232 · 2022
Risks from learned optimization in advanced machine learning systems
E Hubinger, C van Merwijk, V Mikulik, J Skalse, S Garrabrant
arXiv preprint arXiv:1906.01820, 2019
Cited by 147 · 2019
Is SGD a Bayesian sampler? Well, almost
C Mingard, G Valle-Pérez, J Skalse, AA Louis
Journal of Machine Learning Research 22 (79), 1-64, 2021
Cited by 56 · 2021
Invariance in policy optimisation and partial identifiability in reward learning
JMV Skalse, M Farrugia-Roberts, S Russell, A Abate, A Gleave
International Conference on Machine Learning, 32033-32058, 2023
Cited by 51 · 2023
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
D Dalrymple, J Skalse, Y Bengio, S Russell, M Tegmark, S Seshia, ...
arXiv preprint arXiv:2405.06624, 2024
Cited by 33 · 2024
Neural networks are a priori biased towards boolean functions with low entropy
C Mingard, J Skalse, G Valle-Pérez, D Martínez-Rubio, V Mikulik, ...
arXiv preprint arXiv:1909.11522, 2019
Cited by 33 · 2019
Misspecification in inverse reinforcement learning
J Skalse, A Abate
Proceedings of the AAAI Conference on Artificial Intelligence 37 (12), 15136 …, 2023
Cited by 31 · 2023
Lexicographic multi-objective reinforcement learning
J Skalse, L Hammond, C Griffin, A Abate
arXiv preprint arXiv:2212.13769, 2022
Cited by 27 · 2022
Reinforcement learning in Newcomblike environments
J Bell, L Linsefors, C Oesterheld, J Skalse
Advances in Neural Information Processing Systems 34, 22146-22157, 2021
Cited by 17 · 2021
On the limitations of Markovian rewards to express multi-objective, risk-sensitive, and modal tasks
J Skalse, A Abate
Uncertainty in Artificial Intelligence, 1974-1984, 2023
Cited by 12 · 2023
Goodhart's Law in Reinforcement Learning
J Karwowski, O Hayman, X Bai, K Kiendlhofer, C Griffin, J Skalse
arXiv preprint arXiv:2310.09144, 2023
Cited by 11 · 2023
STARC: A General Framework For Quantifying Differences Between Reward Functions
J Skalse, L Farnik, SR Motwani, E Jenner, A Gleave, A Abate
arXiv preprint arXiv:2309.15257, 2023
Cited by 7 · 2023
The reward hypothesis is false
JMV Skalse, A Abate
Cited by 5 · 2022
Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification
J Skalse, A Abate
arXiv preprint arXiv:2403.06854, 2024
Cited by 4 · 2024
On The Expressivity of Objective-Specification Formalisms in Reinforcement Learning
R Subramani, M Williams, M Heitmann, H Holm, C Griffin, J Skalse
arXiv preprint arXiv:2310.11840, 2023
Cited by 4 · 2023
A general framework for reward function distances
E Jenner, JMV Skalse, A Gleave
NeurIPS ML Safety Workshop, 2022
Cited by 4 · 2022
All’s Well That Ends Well: Avoiding Side Effects with Distance-Impact Penalties
C Griffin, JMV Skalse, L Hammond, A Abate
NeurIPS ML Safety Workshop, 2022
Cited by 2 · 2022
A General Counterexample to Any Decision Theory and Some Responses
J Skalse
arXiv preprint arXiv:2101.00280, 2021
Cited by 2 · 2021
Safety Properties of Inductive Logic Programming
G Leech, N Schoots, J Skalse
SafeAI@AAAI, 2021
Cited by 2 · 2021