Sources

Q1 – Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.) http://incompleteideas.net/book/the-book-2nd.html
Last accessed: 26.12.2025
Q2 – Amodei, D. et al. (2016). Concrete Problems in AI Safety https://arxiv.org/abs/1606.06565
Last accessed: 26.12.2025
Q3 – Kahneman, D. (2011). Thinking, Fast and Slow https://us.macmillan.com/books/9780374533557/thinkingfastandslow
Last accessed: 26.12.2025
Q4 – Christiano, P. F. et al. (2017). Deep Reinforcement Learning from Human Preferences https://arxiv.org/abs/1706.03741
Last accessed: 26.12.2025
Q5 – Stiennon, N. et al. (2020). Learning to Summarize from Human Feedback https://arxiv.org/abs/2009.01325
Last accessed: 26.12.2025
Q6 – Perez, E. et al. (2022). Discovering Language Model Behaviors with Model-Written Evaluations https://arxiv.org/abs/2212.09251
Last accessed: 26.12.2025
Q7 – Hubinger, E. et al. (2019). Risks from Learned Optimization in Advanced Machine Learning Systems https://arxiv.org/abs/1906.01820
Last accessed: 26.12.2025
Q8 – Carlsmith, J. (2022). Is Power-Seeking AI an Existential Risk? https://arxiv.org/abs/2206.13353
Last accessed: 26.12.2025
Q9 – OpenAI (2023). GPT-4 Technical Report https://arxiv.org/abs/2303.08774
Last accessed: 26.12.2025
Q10 – Williams, M. et al. (2024). On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback https://arxiv.org/abs/2411.02306
Last accessed: 26.12.2025