1 Introduction
Humans categorise physical systems into two important classes: agents, and non-agents (which we here call ‘devices’). Since both are mechanically described by physics, what is the difference? Dennett has proposed that the distinction lies in how we subjectively explain these systems, and identifies two ‘explanatory strategies’ (we ignore a third strategy, the design stance, in this article): the physical stance, which dennett2009intentional describes as “the standard laborious method of the physical sciences, in which we use whatever we know about the laws of physics and the physical constitution of the things in question to devise our prediction”, and the intentional stance, which he describes as “the strategy of interpreting the behavior of an entity (person, animal, artifact, whatever) by treating it as if it were a rational agent who governed its ‘choice’ of ‘action’ by a ‘consideration’ of its ‘beliefs’ and ‘desires’.”
baker2009action show that, by formalising agents as rational planners in an environment, it is possible to automatically infer the intentions of a human agent from its actions using inverse reinforcement learning (russell1998learning; ng2000irl; choi2015hbirl). However, this does not tell us whether to categorise a system as an agent or a device in the first place; this question is observer-relative, since it depends on the observer’s prior knowledge (chambon2011what) and on how efficiently they can apply each explanatory stance.
Instead of modelling human cognition, we consider an artificial reasoner. We propose a formalization of these ideas so as to compute, from the point of view of a mechanical observer, the subjective probability that a given system is an agent. To simplify matters, we assume a clearly identified system that takes a sequence of inputs and returns a sequence of outputs at discrete time steps.
First, we discuss a few informal examples in Section 2. We give some notation and the formalism of the main idea in Section 3. More details on devices and agents are given in Sections 3.1 and 3.2, respectively. We validate our proposal on a set of simple experiments in Section 4, showing that some behaviours are better described as devices than as agents, and vice versa, using more specific algorithms tailored for this domain. We also demonstrate how our model can explain how agents can change their mind and switch goals (and still be considered agents, as long as the switches are rare), thus implementing the hypothesis of baker2009action.
2 Examples
We informally consider three examples from dennett2009intentional: a stone, a thermostat and a game-playing computer.
A stone follows a parabolic trajectory when falling. If we interpret this as “wanting to reach the ground”, we need to explain why the trajectory is parabolic rather than some other shape; it is easier to predict the trajectory directly by using Newtonian physics.
dennett2009intentional describes the thermostat as the simplest artifact that can sustain an intentional stance. The reason it is on the knife edge is that it can be described either as a reactive device (“if temperature is below the command, start heating”), or as an agent (“make sure the temperature is close to the command”), using descriptions of comparable simplicity.
A system may strongly invite the intentional stance even if it is entirely reactive. For example, the policy network in AlphaGo (silver2016alphago) can play Go at a high level, even without using Monte Carlo tree search. A mechanical description would be fairly complex, consisting mostly of a large list of apparently arbitrary weights, but it is very simple to express the goal “it wants to win at the game of Go”.
3 Notation and formalism
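At each time step $t$, the system under consideration receives an input or observation $x_t \in \mathcal{X}$ and returns an output or action $y_t \in \mathcal{Y}$. We denote a history pair by $xy_t \equiv (x_t, y_t)$. These produce the sequences $x_{1:t}$ and $y_{1:t}$ of inputs and outputs from step 1 to $t$ included, and we call the sequence $xy_{1:t}$ an interaction history or trajectory. We will also use the notation $x_{<t} \equiv x_{1:t-1}$, and similarly for $y_{<t}$ and $xy_{<t}$. The sets $\mathcal{X}$ and $\mathcal{Y}$ are considered finite for simplicity. The probability simplex over a set $S$ is denoted $\Delta S$: if $p \in \Delta S$, then $p(s) \in [0, 1]$ for all $s \in S$ and $\sum_{s \in S} p(s) = 1$. The indicator function $\mathbb{1}[c]$ has value 1 if the condition $c$ is true, 0 otherwise.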
In order to output a probability that a system is an agent, we must give probabilistic definitions of both devices and agents and then apply Bayes’ theorem to invert the likelihood of an observed trajectory into posterior probabilities of both views of the system. We take a Bayesian point of view: a system belongs to a set of possible systems, so we build a mixture of all such systems for both agents and devices.
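Describing devices: Mixture $M_D$.
Let $\mathcal{D}$ be a set of physical processes that can be described as input-output devices, that is, as functions $d$ that output a probability distribution $d(y_t \mid xy_{<t}, x_t)$ over outputs given an interaction history of inputs $x_{1:t}$ and outputs $y_{<t}$. The set $\mathcal{D}$ can be finite, countable, or uncountable, but we consider it countable here. Then the likelihood of the sequence of outputs $y_{1:t}$ for a given sequence of inputs $x_{1:t}$ to the system, supposing that the system is a device, is
$$M_D(y_{1:t} \mid x_{1:t}) = \sum_{d \in \mathcal{D}} w_d \, d(y_{1:t} \mid x_{1:t}), \qquad d(y_{1:t} \mid x_{1:t}) = \prod_{k=1}^{t} d(y_k \mid xy_{<k}, x_k).$$
It is thus a mixture of all these probability distribution functions, where each such function $d$ is assigned a prior weight $w_d > 0$ so that $\sum_{d \in \mathcal{D}} w_d = 1$.
Among all device descriptions in $\mathcal{D}$, at step $t$ the posterior probability $w_d(xy_{1:t})$ of a particular device description $d$ is found using Bayes’ rule in sequence, starting from the prior weight $w_d$ for the empty history:
$$w_d(xy_{1:t}) = w_d(xy_{<t}) \, \frac{d(y_t \mid xy_{<t}, x_t)}{M_D(y_t \mid xy_{<t}, x_t)},$$
and the conditional probability of the next output can now be written:
$$M_D(y_t \mid xy_{<t}, x_t) = \sum_{d \in \mathcal{D}} w_d(xy_{<t}) \, d(y_t \mid xy_{<t}, x_t).$$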
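The sequential Bayes update of a device mixture can be sketched in a few lines of Python. This is a minimal illustration only: the two devices, their names and their output probabilities are hypothetical, and both ignore the history for simplicity.

```python
import math

def mixture_update(weights, predictions, observed):
    """One Bayes step over a finite set of devices.

    weights:     dict device -> current posterior weight (sums to 1)
    predictions: dict device -> dict output -> predicted probability
    observed:    the output actually produced by the system
    Returns the mixture probability of `observed` and the new posterior weights.
    """
    joint = {d: weights[d] * predictions[d].get(observed, 0.0) for d in weights}
    mix = sum(joint.values())
    posterior = {d: joint[d] / mix for d in joint}
    return mix, posterior

# Hypothetical devices over outputs {"L", "R"} (assumption, not from the paper).
devices = {
    "mostly_left": {"L": 0.9, "R": 0.1},
    "mostly_right": {"L": 0.1, "R": 0.9},
}
weights = {"mostly_left": 0.5, "mostly_right": 0.5}  # prior weights
log_likelihood = 0.0
for y in "LLRL":
    p, weights = mixture_update(weights, devices, y)
    log_likelihood += math.log(p)  # chain rule: product of conditionals
```

The product of the conditional mixture probabilities equals the prior-weighted sum of the devices' sequence likelihoods, so either form can be used to score a trajectory.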
Describing agents: Mixture $M_A$.
Similarly to devices, we define a mixture $M_A$ over the set of all possible agents $\mathcal{A}$. We will describe how to define the mixture and the models for the agents in Section 3.2.
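Putting it all together: Mixture $M$.
Now we can put both descriptions together in a single mixture $M$. In effect, within $M$ we assume that any trajectory can be explained by either the mixture of agents or the mixture of devices, and nothing else. We take a uniform prior over the two mixtures:
$$M(y_{1:t} \mid x_{1:t}) = \tfrac{1}{2} M_A(y_{1:t} \mid x_{1:t}) + \tfrac{1}{2} M_D(y_{1:t} \mid x_{1:t}).$$
Using Bayes’ rule, we can now compute the probability that a sequence of outputs $y_{1:t}$ is generated by an agent rather than by a device. The (subjective) probability that the system is an agent given a trajectory is the probability that the trajectory is generated by an agent interacting with the environment, times the prior probability of being an agent ($1/2$), normalised by the probability of the trajectory under $M$:
$$P(\text{agent} \mid xy_{1:t}) = \frac{\tfrac{1}{2} M_A(y_{1:t} \mid x_{1:t})}{M(y_{1:t} \mid x_{1:t})} = \frac{M_A(y_{1:t} \mid x_{1:t})}{M_A(y_{1:t} \mid x_{1:t}) + M_D(y_{1:t} \mid x_{1:t})}.$$
Furthermore, the posterior probability of a particular device $d$, that is, how well this device can explain the trajectory compared to other devices and agents, is
$$P(d \mid xy_{1:t}) = \frac{\tfrac{1}{2} \, w_d \, d(y_{1:t} \mid x_{1:t})}{M(y_{1:t} \mid x_{1:t})},$$
and similarly for an agent.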
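In practice the two mixture likelihoods are tiny numbers, so the agent posterior is best computed from negative log likelihoods. The following Python sketch shows one way to do this; the function name and the use of natural logarithms are assumptions made for illustration.

```python
import math

def agent_posterior(nll_agent, nll_device, prior_agent=0.5):
    """Posterior probability of 'agent' from the two mixtures' NLLs (in nats).

    Works in log space and subtracts the maximum before exponentiating,
    so that very long trajectories do not underflow.
    """
    log_a = math.log(prior_agent) - nll_agent
    log_d = math.log(1.0 - prior_agent) - nll_device
    m = max(log_a, log_d)
    za, zd = math.exp(log_a - m), math.exp(log_d - m)
    return za / (za + zd)

# Example: the device mixture needs ln(3) more nats than the agent mixture,
# so the agent explanation is three times as likely under a uniform prior.
p = agent_posterior(5.0, 5.0 + math.log(3.0))
```

With a uniform prior only the difference between the two NLLs matters, which is why relative losses are a convenient way to read the results.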
3.1 Devices
In principle, the device mixture $M_D$ can be any probabilistic model that can be used to compute a likelihood of the output history. A more Bayesian view is to consider the set of all possible models (decision trees, neural networks, etc.) within some class and assign some prior to them. In Section 4 we use a mixture of simple contextual predictive models.
To produce a complete inference algorithm, we also consider the choice of a universal prior measure over the set of all computable devices.
Information theoretic choice: Algorithmic probability.
Ignoring computational limitations, an optimal choice for the device mixture $M_D$ is to use (a straightforward variant of) Solomonoff’s mixture (solomonoff1964formal; legg2008machine) for some particular Turing-complete reference machine. If an observed input-output trajectory can be described by any computable function, Solomonoff’s inference will quickly learn to predict its behaviour correctly. In the programming language of our reference machine, all (semi)computable devices can be expressed: consider a program that, given a sequence of inputs $x_{1:t}$ and outputs $y_{<t}$, outputs a probability distribution over the next output $y_t$. Each device $d$ is assigned a prior weight $w_d = 2^{-\ell(d)}$, where $\ell(d)$ is the length in bits of the description of the device on the reference machine. Hence, if there is a computable device $d$ that correctly describes the system’s behaviour (that is, if the system’s behaviour is computable), then Solomonoff’s mixture prediction will be almost as good as $d$, since at all steps $t$ we have $M_D(y_{1:t} \mid x_{1:t}) \geq 2^{-\ell(d)} \, d(y_{1:t} \mid x_{1:t})$, or in logarithmic-loss or code-redundancy terms
$$-\log_2 M_D(y_{1:t} \mid x_{1:t}) \leq -\log_2 d(y_{1:t} \mid x_{1:t}) + \ell(d).$$
Thanks to this very strong learning property, the subjective prior bias quickly vanishes with evidence, that is, with the length of the trajectory.
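Solomonoff’s mixture itself is not computable, but the redundancy argument can be checked on a toy finite class. The sketch below is only an illustration under stated assumptions: the three devices and their description lengths are hypothetical, chosen so that the weights (two to the power of minus the length) sum to one.

```python
import math

# Hypothetical devices: name -> (description length in bits, P(output = 1)).
# The lengths are made up for illustration; 2^-2 + 2^-1 + 2^-2 = 1.
toy_devices = {
    "always_one": (2, 0.99),
    "fair_coin": (1, 0.5),
    "always_zero": (2, 0.01),
}

def seq_prob(p_one, outputs):
    """Likelihood of a binary output sequence under an i.i.d. device."""
    prob = 1.0
    for y in outputs:
        prob *= p_one if y == 1 else 1.0 - p_one
    return prob

def mixture_prob(outputs):
    """Mixture likelihood with prior weights 2^(-description length)."""
    return sum(2.0 ** -bits * seq_prob(p, outputs) for bits, p in toy_devices.values())

outputs = [1] * 20
mix_bits = -math.log2(mixture_prob(outputs))
best_bits = -math.log2(seq_prob(0.99, outputs))
redundancy = mix_bits - best_bits  # extra bits paid for not knowing the device
```

The redundancy stays below the 2-bit description length of the best device however long the sequence is, which is the sense in which the prior bias becomes negligible relative to the evidence.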
A (somewhat) more computable choice.
Under a Solomonoff prior (which does not consider computation time), the invariance theorem (li2008introduction) says the prior also contains an “interpreter” for all agents. The cost to describe an agent as a device is then always bounded by the cost of the interpreter. The speed prior (schmidhuber2002speed; filan2016loss) is a computable variant of the Solomonoff prior that takes into account the computation time required to output the sequence $y_{1:t}$, hence greatly weakening the invariance theorem.
A more observer-dependent prior could also be considered, for example one that depends on the computational limitations of the observer and its background knowledge about the world.
3.2 Agents
To assess whether a given trajectory is agent-like, we apply Bayesian inverse reinforcement learning (ramachandran2007birl; choi2015hbirl) except that we want to output a probability rather than a reward function.
Since the problem is inherently harder than “forward” RL, most previous work in IRL focuses on MDPs. Here, since the purpose of this paper is to provide a unified and general framework, we propose a more general formulation using Bayesian model-based and history-based environments (Hutter2004uaibook). The model of the environment may be imperfect, which allows the agent to learn about it through interaction (and update its beliefs with Bayes’ theorem). For agents, inputs are usually called observations and outputs actions.
After describing this general reinforcement learning framework, we “invert” it to find the probability that an agent is acting according to some reward function.
An environment $\mu$ is a probability distribution over observations given the past observations and actions, with $\mu(x_{1:t} \mid y_{<t}) = \prod_{k=1}^{t} \mu(x_k \mid xy_{<k})$. The environment can either be the known environment or an uncertain environment, as in a mixture of potential environments, with their posteriors updated using Bayes’ theorem.
A utility function (or reward function) $u$ assigns an instantaneous value $u(xy_{1:t})$ to the current trajectory. The cumulated utility of an interaction sequence is the sum of the instantaneous utilities along that sequence.
A policy $\pi$ is a probability distribution over actions given the past: $\pi(y_t \mid xy_{<t}, x_t)$ is how likely the agent is to take action $y_t$ at time $t$. Similarly to environments, we extend the definition of a policy to sequences: $\pi(y_{1:t} \mid x_{1:t}) = \prod_{k=1}^{t} \pi(y_k \mid xy_{<k}, x_k)$.
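Now, given a particular utility function $u$, the value of a given policy $\pi$ in an environment $\mu$ is given by:
$$V^{\pi}_{\mu,u}(xy_{<t}) = \sum_{x_t} \mu(x_t \mid xy_{<t}) \sum_{y_t} \pi(y_t \mid xy_{<t}, x_t) \, Q^{\pi}_{\mu,u}(xy_{<t} x_t, y_t),$$
$$Q^{\pi}_{\mu,u}(xy_{<t} x_t, y_t) = u(xy_{1:t}) + \gamma \, V^{\pi}_{\mu,u}(xy_{1:t}) \qquad (1)$$
where $\gamma \in [0, 1)$ is the discount factor. This last form also allows us to consider the value of taking action $y_t$ after some history $xy_{<t} x_t$, which is useful to define the policies. In particular, we may want the agent to follow the best policy $\pi^*_{\mu,u}$ that always chooses one of the actions of optimal value for a given underlying utility function $u$ in an environment $\mu$:
$$\pi^*_{\mu,u}(y_t \mid xy_{<t}, x_t) > 0 \;\Rightarrow\; y_t \in \arg\max_{y \in \mathcal{Y}} Q^{*}_{\mu,u}(xy_{<t} x_t, y), \qquad Q^{*}_{\mu,u} = \max_{\pi} Q^{\pi}_{\mu,u}.$$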
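But it is more realistic to consider that the agents are only approximately rational. For simplicity in the remainder of this paper we will consider $\epsilon$-greedy policies instead, which are still one of the favourite choices in RL research (mnih2015dqn). The policy $\pi^{\epsilon}_{\mu,u}$ of the $\epsilon$-greedy agent chooses an optimal action with probability $1 - \epsilon$:
$$\pi^{\epsilon}_{\mu,u}(y_t \mid xy_{<t}, x_t) = \begin{cases} (1 - \epsilon) / |\mathcal{Y}^*_t| & \text{if } y_t \in \mathcal{Y}^*_t, \\ \epsilon / |\mathcal{Y} \setminus \mathcal{Y}^*_t| & \text{otherwise,} \end{cases} \qquad (2)$$
where $\mathcal{Y}^*_t$ is the set of actions of optimal value at step $t$.
With $\epsilon = 0$, the agent always selects one of the best actions, that is, it acts rationally. (This definition slightly departs from the standard one in order to allow for integrating over $\epsilon$.)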
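The following Python sketch illustrates the two ingredients on a toy problem: action values computed by value iteration, and a policy that puts total mass 1 minus epsilon on the best actions. The corridor environment, the uniform split among tied actions and the handling of the case where every action is optimal are all assumptions made for this illustration.

```python
def q_values(n_states, goal, gamma=0.9, sweeps=100):
    """Action values on a deterministic corridor (hypothetical toy MDP).

    Actions: 0 = left, 1 = right. Reward 1 for entering the goal state,
    0 otherwise; the goal state is terminal (its values stay at 0).
    """
    def step(s, a):
        return max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)

    V = [0.0] * n_states
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(sweeps):
        for s in range(n_states):
            if s == goal:
                continue
            for a in (0, 1):
                s2 = step(s, a)
                reward = 1.0 if s2 == goal else 0.0
                Q[s][a] = reward + gamma * V[s2]
        V = [max(q) for q in Q]
    return Q

def epsilon_greedy(q_row, epsilon, tol=1e-9):
    """Mass 1 - epsilon spread over best actions, epsilon over the others.

    Assumption: if every action is optimal, the policy is uniform.
    """
    best = max(q_row)
    is_best = [q >= best - tol for q in q_row]
    n_best = sum(is_best)
    n_other = len(q_row) - n_best
    if n_other == 0:
        return [1.0 / n_best] * len(q_row)
    return [(1 - epsilon) / n_best if b else epsilon / n_other for b in is_best]

Q = q_values(4, goal=3)
policy_next_to_goal = epsilon_greedy(Q[2], epsilon=0.2)  # [P(left), P(right)]
```

Because each step contributes a factor proportional to either epsilon or 1 minus epsilon, the likelihood of a whole action sequence depends on epsilon only through the number of steps on which a best action was taken.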
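Inference.
In an environment $\mu$, given a utility function $u$ and an exploration parameter $\epsilon$, we can compute the likelihood of the sequence of actions conditioned on the observations simply with $\pi^{\epsilon}_{\mu,u}(y_{1:t} \mid x_{1:t})$.
Thanks to the nice form of Eq. 2, we can actually make a mixture of all values for $\epsilon$ in closed form:
$$M_u(y_{1:t} \mid x_{1:t}) = \int_0^1 w(\epsilon) \, \pi^{\epsilon}_{\mu,u}(y_{1:t} \mid x_{1:t}) \, d\epsilon = c_t \int_0^1 w(\epsilon) \, (1 - \epsilon)^{k} \, \epsilon^{t - k} \, d\epsilon,$$
where $w(\epsilon)$ is some prior over $\epsilon$, $k$ is the number of times a best action is chosen w.r.t. $u$, and $c_t$ collects the per-step factors of Eq. 2 that do not depend on $\epsilon$. The integral is the definition of the Beta function, and thus taking the uniform prior $w(\epsilon) = 1$ we obtain:
$$M_u(y_{1:t} \mid x_{1:t}) = c_t \, B(k + 1, t - k + 1) = \frac{c_t}{(t + 1) \binom{t}{k}} \qquad (3)$$
where $\binom{t}{k}$ is the binomial coefficient $\frac{t!}{k! \, (t - k)!}$.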
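The Beta-function identity behind the closed form can be checked numerically. The sketch below compares the closed form for a uniform prior over the exploration parameter with a midpoint-rule approximation of the integral; it ignores any factors that do not depend on the exploration parameter, and the function names are hypothetical.

```python
import math

def epsilon_mixture(k, t):
    """Closed form of the integral of (1 - e)^k * e^(t - k) for e in [0, 1].

    k: number of steps on which a best action was taken; t: total steps.
    Equals B(k + 1, t - k + 1) = 1 / ((t + 1) * C(t, k)).
    """
    return 1.0 / ((t + 1) * math.comb(t, k))

def epsilon_mixture_numeric(k, t, n=20000):
    """Midpoint-rule approximation of the same integral, for checking."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        e = (i + 0.5) * h
        total += (1.0 - e) ** k * e ** (t - k)
    return total * h

closed = epsilon_mixture(3, 5)
numeric = epsilon_mixture_numeric(3, 5)
```

The closed form is largest when the best action is taken either almost always or almost never, and smallest for an even split, which matches the intuition that a consistent trajectory is cheap to explain.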
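Finally, we can now build the mixture over all goals, where $\mathcal{U}$ is the set of utility functions under consideration and $M_u$ is the mixture of Eq. 3 for the utility function $u$:
$$M_A(y_{1:t} \mid x_{1:t}) = \sum_{u \in \mathcal{U}} w_u \, M_u(y_{1:t} \mid x_{1:t}) \qquad (4)$$
A simple choice for the weights is $w_u = 1 / |\mathcal{U}|$ if $\mathcal{U}$ is finite.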
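As a small illustration of the mixture over goals, the Python sketch below combines per-goal likelihoods with uniform weights and also returns the posterior of each goal. The goal names and the counts of best actions are made up for the example, and the per-goal likelihood uses the uniform-prior closed form without any constant factors.

```python
import math

def goal_mixture(likelihoods, prior=None):
    """Mixture likelihood and posterior over goals.

    likelihoods: dict goal -> likelihood of the action sequence for that goal
    prior:       dict goal -> weight; uniform if omitted
    """
    goals = list(likelihoods)
    if prior is None:
        prior = {g: 1.0 / len(goals) for g in goals}
    mixture = sum(prior[g] * likelihoods[g] for g in goals)
    posterior = {g: prior[g] * likelihoods[g] / mixture for g in goals}
    return mixture, posterior

# Hypothetical counts: number of steps (out of t = 10) on which the observed
# action was a best action for each goal.
t = 10
best_action_counts = {"red": 9, "green": 5, "blue": 2, "magenta": 5}
likelihoods = {g: 1.0 / ((t + 1) * math.comb(t, k))
               for g, k in best_action_counts.items()}
mixture, posterior = goal_mixture(likelihoods)
```

In this made-up example the goal for which the trajectory is most consistently optimal receives most of the posterior mass.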
Universal IRL.
Similarly to devices in Section 3.1, we can also use Solomonoff’s prior over the set of reward functions, which would lead to “inverting” AIXI, where AIXI is the optimal Bayesian RL agent for the class of all computable environments and reward functions (Hutter2004uaibook).
With the speed prior for devices.
If we use the speed prior for the devices, one problem arises: since the agent can use the Bellman equation for free, if any device can be represented as an agent then everything may look like an agent because the penalty for devices is too large. To compensate for this, we take away something from agents, for example we can set the prior to instead of .
4 Experiments
To test our hypothesis, we built a gridworld simulator (see for example Fig. 2). The system under consideration (the yellow triangle) can move in the 4 directions (up, down, left, right) except if there is a wall. The red, green, blue and magenta balloons have fixed positions. Does the system act rationally according to one of the goals, or is its behaviour better described as a moving device that simply reacts to its environment? The experimenter can make the triangle follow a sequence of actions $y_{1:t}$.
4.1 Device descriptions
For a device, we define the observation $x_t$ at step $t$ to be the kind of cell (wall, empty, red, green, blue, magenta) it is facing in the world, in the direction of its last action.
A device’s behaviour is defined by a set of associations between a context and an action, for all possible contexts; a context is made of the current observation and the last action the system took. An example of a device’s deterministic function can be found in Table 1.
[Table 1: example of a device’s deterministic function. Columns: cell in front of the system (wall, empty, red, green, blue, magenta). Rows: last action.]
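There are $4^{24}$ different deterministic functions describing devices (one of 4 actions for each of the $6 \times 4 = 24$ contexts). As for agents, we allow for $\epsilon$-deterministic devices: at each step there is a probability of $1 - \epsilon$ that the device takes the action given by its deterministic function, and an $\epsilon$ chance that it takes a different action.
Each context is associated with a multinomial predictor. Let $K = |\mathcal{Y}|$ be the number of actions. Let $\mathcal{C}$ be the set of all mutually-exclusive contexts (only one context is active at any step), and let $\mathcal{C}_t \subseteq \mathcal{C}$ be the set of contexts that have been visited after the trajectory $xy_{1:t}$. Let $n_{c,y}$ be the number of times action $y$ has been taken in the context $c$, and let $n_c = \sum_{y} n_{c,y}$ be the number of visits of the context $c$. An $\epsilon$-deterministic context model $d_\theta$ puts a categorical distribution over the set of actions for each context, where $\theta$ is a $|\mathcal{C}|$-dimensional vector of probability distributions over $\mathcal{Y}$, hence $\theta_c \in \Delta\mathcal{Y}$:
$$d_\theta(y_{1:t} \mid x_{1:t}) = \prod_{c \in \mathcal{C}_t} \prod_{y \in \mathcal{Y}} \theta_{c,y}^{\, n_{c,y}},$$
which in the current experiments is essentially a Markov model of order 2. We can now build a continuous mixture of all such $\epsilon$-deterministic context models:
$$M_D(y_{1:t} \mid x_{1:t}) = \int w(\theta) \, d_\theta(y_{1:t} \mid x_{1:t}) \, d\theta,$$
where $w(\theta)$ is a prior density over $\theta$. Taking a uniform prior over each $\theta_c$ leads to a multinomial estimator:
$$M_D(y_{1:t} \mid x_{1:t}) = \prod_{c \in \mathcal{C}_t} \frac{(K - 1)! \, \prod_{y \in \mathcal{Y}} n_{c,y}!}{(n_c + K - 1)!}.$$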
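A uniform prior over the categorical parameters of a context gives the familiar add-one (Laplace) rule when the likelihood is computed sequentially. The Python sketch below computes the negative log likelihood of an action sequence in this way; the function name and the string encoding of contexts and actions are assumptions made for illustration.

```python
import math
from collections import defaultdict

def context_model_nll(contexts, actions, n_actions=4):
    """NLL (in nats) of `actions` under per-context add-one estimators.

    contexts[i] is the context active when actions[i] was taken, for example
    the pair (cell in front, last action). Each context keeps its own counts
    and predicts action y with probability (n_{c,y} + 1) / (n_c + K).
    """
    counts = defaultdict(lambda: defaultdict(int))
    nll = 0.0
    for c, y in zip(contexts, actions):
        n_c = sum(counts[c].values())
        p = (counts[c][y] + 1) / (n_c + n_actions)
        nll -= math.log(p)
        counts[c][y] += 1
    return nll

# One context, two possible actions, sequence a, a, b:
# sequential probabilities 1/2 * 2/3 * 1/4 = 1/12.
nll_example = context_model_nll(["c", "c", "c"], ["a", "a", "b"], n_actions=2)
```

The sequential product matches the closed-form Dirichlet integral, here 1! * 2! * 1! / 4! = 1/12, so the order in which actions were observed within a context does not change the likelihood.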
4.2 Agent descriptions
We consider a very small set of goals, —the red, green, blue, and magenta circles in Fig. 2.
To be able to assign a probability to the actions of the trajectory, we first need to solve the Markov Decision Process (MDP) (sutton1998reinforcement) for each goal, using states instead of histories, where the state is simply a (row, column) position in the environment. The value in Eq. (1) is then computed for each state-action pair, with a reward of 1 for reaching the goal, and 0 everywhere else. The resulting mixture is computed with Eqs. 3 and 4.
4.2.1 The switching prior
An interesting point made by baker2009action is that people often switch from one goal to another in the middle of a trajectory. In order to take such behaviours into account, we will also use veness2012context’s switching prior technique (volf1998switching), which is an efficient mixture over all sequences of models (here, all possible sequences of goals) that keeps a probability $\alpha_t$ of switching at time $t$ from the current goal to a different one, and thus a probability $1 - \alpha_t$ of keeping the current goal.
Unfortunately, the switching prior does not seem to cooperate well with the integration over $\epsilon$ in Eq. 3. Therefore, instead of using Eq. 3, we use a mixture of a fixed number of values for $\epsilon$, which is sufficient for the purposes of this demonstration. (With different values, the performance of the mixture may start to degrade after a few hundred steps, but the trajectories considered in this demonstrator are usually shorter.)
With being the set of all policies:
where the last line implements the switching update rule with . (This is a slight simplification over (veness2012context) for readability that has a logarithmic loss of at each switch instead of .) If no switching is necessary, the cost (in the logarithmic loss) is bounded by at time $t$, which is a rather small cost to pay.
Apart from the inversion of the MDP, the computation time taken by the mixture for a sequence of length $t$ is , compared to for the non-switching mixture of Eq. 4.
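To give a feel for how a switching mixture behaves, here is a generic fixed-share style sketch in Python. It is not claimed to be the exact update rule used here: the switching rate of 1/t, the uniform redistribution of the switched mass among the other goals, and the per-step probabilities are all assumptions chosen for illustration.

```python
def switching_mixture(step_probs, alpha=lambda t: 1.0 / t):
    """Likelihood of a trajectory under a mixture over sequences of goals.

    step_probs: list over time of dict goal -> probability that a policy for
                that goal assigns to the action observed at that step
                (at least two goals are required).
    alpha:      assumed switching rate; alpha(t) is the probability of
                switching to a different goal before step t.
    Returns the trajectory likelihood and the weights for the next step.
    """
    goals = list(step_probs[0])
    n = len(goals)
    w = {g: 1.0 / n for g in goals}  # uniform prior over the first goal
    likelihood = 1.0
    for t, probs in enumerate(step_probs, start=1):
        joint = {g: w[g] * probs[g] for g in goals}
        p = sum(joint.values())
        likelihood *= p
        post = {g: joint[g] / p for g in goals}
        a = alpha(t + 1)
        # Keep the current goal with probability 1 - a; otherwise move the
        # mass uniformly to the other goals.
        w = {g: (1 - a) * post[g] + a * (1 - post[g]) / (n - 1) for g in goals}
    return likelihood, w

# Hypothetical trajectory: 10 steps that look like heading for magenta,
# followed by 10 steps that look like heading for green.
steps = [{"magenta": 0.9, "green": 0.1}] * 10 + [{"magenta": 0.1, "green": 0.9}] * 10
lik, w = switching_mixture(steps)
```

On this made-up trajectory a non-switching mixture assigns likelihood 0.9^10 * 0.1^10 to each goal, whereas the switching mixture pays only a small prior cost for the single change of goal and ends up favouring green.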
4.3 Some trajectories
Some sample trajectories and associated results are given in Figs. 1 to 7. We report the negative log likelihood (NLL) for both device and agent mixtures, remembering that where we use as an abbreviation of . We also report the posteriors of the device and agent mixtures in the global mixture, along with their negative log values; the latter are usually more informative, as they can be interpreted as complexities or relative losses. The switching prior is used only for the trajectory of Fig. 5, since for the other trajectories switching gives results similar to not switching.
Running in circles.
(See Fig. 1.) This behaviour is a prototypical example of a system behaving more like a device than like an agent: the behaviour is very simple to explain in terms of instantaneous reactions without referring to some goal.

Rational behaviour.
(See Fig. 2.) This behaviour is strongly described as that of an agent. Indeed, it appears that it is going as fast as possible to the magenta balloon. A device description is however still relatively simple, as witnessed by the low relative complexity of the device mixture’s posterior.

Suboptimal trajectory toward the blue balloon.
(See Fig. 3.) The system reaches the blue balloon after 66 steps, whereas the fastest path requires only 36 steps. The system is still considered an agent because of the difficulty of reaching the blue balloon, which compensates for the suboptimality of the trajectory.

Following walls.
(See Fig. 4.) This is another example of a behaviour that is typical of a reactive system that acts without purpose. This trajectory seems to be more agent-like than a random one or running in circles, and one may be tempted to describe the behaviour of the system as “it wants to avoid walls”. However, when described with a simple deterministic reactive system without intentions (“when there is a wall in front, turn right”), it seems to lose its agency aspect.

Switching goals.
(See Fig. 5.) The system looks like it is going first toward the magenta balloon, but before reaching it switches to going to the green balloon. This time, for the agent’s mixture we use the switching prior model described in Section 4.2.1. We also report the log likelihood of the trajectory for the non-switching model for information: without the switching prior, the behaviour toward either the blue or the green balloon is very suboptimal, and thus (without a switching prior) it is easier to consider the trajectory as generated by a device rather than an agent. The posteriors of each goal along the trajectory are shown in Fig. 6. Between steps 3 and 19, the system seems to go to any goal other than the magenta one, and this becomes clearer starting at step 10 when the system enters the corridor. However, the mixture cannot yet tell which goal is more likely. Similarly, when the system goes away from the blue balloon, the mixture is uncertain as to which is the actual target now, and becomes certain it is the green balloon only after the middle corridor’s entrance.

Random behaviour.
(See Fig. 7.) A random behaviour is difficult to explain both in terms of a device and in terms of an agent, and thus leads to a high NLL in both cases: the context hits (see Fig. 8) have high entropy, and the best value of $\epsilon$ for an $\epsilon$-greedy agent policy is high too (around 0.6).

Context  Action  
in_front,last_action  up  down  left  right 
empty,down  2  2  3  8 
empty,left  4  5  2  5 
empty,right  3  8  4  2 
empty,up  5  5  4  2 
wall,down  4  6  2  3 
wall,left  2  2  2  _ 
wall,right  _  2  5  3 
wall,up  _  _  _  4 
5 Conclusion
Every physical system can be described as either an agent (which pursues goals) or a device (which responds mechanically to its inputs). Hence we ask the question of subjectively how much sense it makes to call the system an agent or a device; we quantify the answer in the form of a posterior probability. This subjective probability takes into account the observer’s intrinsic biases and background knowledge.
We formalize the idea using inverse reinforcement learning techniques for agents (roughly, given a sequence of actions and observations, find the best goal and greedy policy for this goal), and sequence prediction techniques for devices (roughly, find the best deterministic policy that fits the observed behaviour), and compare the two resulting likelihoods.
The approach was validated on a simple and clear test domain with a varied set of trajectories. While the purpose of this work is to provide a mostly non-anthropocentric formalization of a definition of agency, it would be informative to investigate the extent to which it matches human judgements.
From a reinforcement learning perspective, the proposed approach may also be useful to design environments that can help maximize “agenthood”, that is, to build agents that can thrive as agents rather than performing device-like tasks.
Acknowledgements.
This paper has emerged from the discussions that took place at the 2016 SAB workshop on “Mathematical and philosophical conceptions of agency”, organized by Simon McGregor (http://www.sab2016.org/index.php/2uncategorised/13workshop1). Thanks also to Peter Dayan, Tom Erez, Chrisantha Fernando, Nando de Freitas, Thore Graepel, Hado Van Hasselt, Andrew Lefrancq, Sean Legassick, Joel Z. Leibo, Jan Leike, Rémi Munos, Toby Ord, Pedro Ortega and Olivier Pietquin.