The Mathematical Foundations of Goal-Based Investing

Discover the mathematical foundations behind GBI's investment engine: from Markov control models to Bellman's Principle of Optimality. Learn how rigorous dynamic programming theory translates directly into optimal portfolio strategies.

Investing10 minJune 2026

or via

The goal of Goal-Based Wealth Management (GBWM) is simple to state but powerful in practice: given a fixed investment horizon $T$ (say, ten years) and a target wealth level $G$ (the amount you need to reach your goal), find the best possible portfolio strategy $π^{*}$ — one that adapts over time as markets evolve — to maximize the probability of getting there. Formally, we look for:

$π^{*} = max_{π} ℙ (W_{T} \geq G) .$

Dynamic Programming Theory

To solve this problem rigorously, we draw on the mathematical theory of dynamic programming. This section lays out the core theory in an abstract, general form — independent of investing for now. Think of it as building the formal scaffolding before furnishing the room. We will apply it directly to our GBWM problem in the next section. We begin with a few key definitions.

Definition 1. A Markov control model is a five-tuple

$(X, A, {A (x) ∣ x \in X}, ℚ, r)$

consisting of:

a) a Borel space $X$ , called the state space with elements referred as states;

b) a Borel space $A$ , called the control or action set;

c) a family ${A (x) ∣ x \in X}$ of nonempty measurable subsets $A (x)$ of $A$ , with $A (x)$ representing the set of feasible controls or actions when the system is in state $x \in X$ . The set of feasible state-action pairs $𝕂 := {(x, a) : x \in X, a \in A (x)}$ is assumed to be a measurable subset of $X \times A$ ;

d) a transition law $ℚ$ ;

e) a measurable function $r : 𝕂 \to ℝ$ called the one-stage reward function.

In plain terms, a Markov control model is a formal way of describing any sequential decision-making problem. It specifies: where the system can be (the state space $X$ ), what decisions are available (the action set $A$ ), which decisions are feasible from each state ( ${A (x) ∣ x \in X}$ ), how the system evolves in response to a decision (the transition law $ℚ$ ), and what you gain along the way (the reward function $r$ ).

Definition 2. A control policy is a sequence $π = {π_{t}, t = 0, \dots}$ of $A (x_{t})$ -measurable random variables ${a_{t}}$ . If there exists a sequence ${f_{t}}$ of measurable functions $f_{t} : X \to A$ such that: $a_{t} = f_{t} (x_{t})$ , then the policy $π$ is said to be a deterministic Markov policy.

Intuitively, a control policy $π$ is simply a decision rule — it tells you which action to take at each point in time. The most tractable kind is a deterministic Markov policy: at each step, it maps the current state directly to an action using a fixed function $f_{t}$ , with no randomness in the decision itself.

Given a Markov control model, our objective is to maximize the total expected reward accumulated over the entire horizon $T$ . This is captured by the criterion function:

$J (π, x) := 𝔼_{x}^{π} [\sum_{t = 0}^{T - 1} r (x_{t}, a_{t}) + r_{T} (x_{T})]$

where $r_{T} (x_{T})$ is the terminal reward function. Here, $J (π, x)$ is the total expected reward when starting in state $x$ and following policy $π$ : the sum collects rewards at each intermediate step, and $r_{T} (x_{T})$ is the reward received at the final period.

We call $J^{*} (x)$ the value function — it records the best expected reward achievable from state $x$ , optimized over all feasible policies in $Π :$

$J^{*} (x) := sup_{Π} J (π, x),$

Our goal is then to find an optimal policy $π^{*} \in Π$ — one that actually achieves this maximum for every starting state $x$ :

$J (π^{*}, x) = J^{*} (x) .$

Under specific technical conditions, the following theorem guarantees that an optimal policy exists — and, crucially, tells us exactly how to find it by working backward through time.

Theorem 1. Let $J_{0}, \dots, J_{T}$ be the functions on $X$ defined, backwards, by

$\begin{array}{l} J_{T} (x) := r_{T} (x) \\ J_{t} (x) := max_{A (x)} [r (x, a) + \int_{X} J_{t + 1} (y) ℚ (d y | x, a)], t = T - 1, \dots, 0. \end{array}$

Suppose that these functions are measurable and that, for each $t = 0, \dots, T - 1$ , there exists a measurable function $f_{t}$ such that $f_{t} (x) \in A (x)$ attains the maximum in the equation above for all $x \in X$ , i.e. $\forall x \in X$ , $t = 0, \dots, T - 1$

$J_{t} (x) = r (x, f_{t}) + \int J_{t + 1} (y) ℚ (d y | x, f_{t}) .$

Then the (deterministic Markov) policy $π^{*} = {f_{0}, \dots, f_{T - 1}}$ is optimal, and the value function $J^{*}$ equals $J_{0}$ , i.e.,

$J^{*} (x) = J_{0} (x) = J (π^{*}, x) \forall x \in X .$

This is the mathematical statement of backward induction. Rather than searching over all possible strategies simultaneously — an overwhelming task — we solve the problem one period at a time, starting from the end. At each step $t$ , we ask: 'given where I am now, which action maximizes my expected future reward?' The answer at each step becomes the building block for the step before it.

For a proof of Theorem 1, refer to Hernandez-Lasserre (1996), Section 3.2.

To build intuition for this result, consider the reward-to-go — the total expected reward you can still collect from time $t$ onward, given that you are currently in state $x$ and following policy $π$ :

$R_{t} (π, x) = 𝔼^{π} [\sum_{n = t}^{T - 1} r (x_{n}, a_{n}) + r_{T} (x_{T}) | x_{t} = x]$

It is possible to prove that

$J_{t} = sup_{Π} R_{t} (π, x) \forall x \in X, t = 0, \dots, T,$

That is, $J_{t} (x)$ is the best reward you can hope to collect from time $t$ through $T$ , starting from state $x$ . Theorem 1 therefore gives us a practical algorithm: solve the problem by working backward from the final period $T$ down to time $0$ , one step at a time — exactly the backward induction procedure we described in our previous article.

There is also a complementary result that fully characterizes when a policy is optimal. A policy $π$ is optimal if and only if, for any $t = 0, \dots, T$ and initial $x_{0} = x$ , its reward-to-go equals the optimal value function:

$𝔼_{x}^{π} [R_{t} (π, x_{t})] = 𝔼_{x}^{π} [J_{t} (x_{t})] .$

This result is known as Bellman's Principle of Optimality (see Bellman (1957) for a full description and proof). Intuitively, it means that a globally optimal strategy must also be locally optimal at every step — there are no beneficial 'short-term sacrifices.' If a policy is ever suboptimal from some state onward, it cannot be globally optimal.

Before applying this theory to our problem, we should address one important technical requirement embedded in Theorem 1: at each backward step, a maximizing action $f_{t} (x)$ must actually exist — not just be approached in the limit, but genuinely attained. Assumption 1 makes this precise.

Assumption 1. The Markov control model and a given measurable function $u : X \to ℝ$ are such that

$u^{*} (x) = sup_{A (x)} [r (x, a) + \int_{X} u (y) ℚ (d y | x, a)], x \in X$

is measurable and there exists a measurable function $f$ such that the function within brackets attains its maximum at $f (x) \in A (x)$ for all $x$ , i.e.,

$u^{*} (x) = r (x, f) + \int_{X} u (y) ℚ (d y | x, f) \forall x \in X .$

In other words, Assumption 1 guarantees that the 'best action' at each state is not merely approached but is actually achieved by some well-defined function $f$ .

Hernandez-Lasserre (1996), Section 3.3, provides three sufficient sets of conditions under which Assumption 1 holds. The one most relevant to our application is the following.

Theorem 2. With the same notation as above, the following set of conditions

a) $A (x)$ is non-empty and compact for all $x \in X$ , and the function $x \to A (x)$ is upper semi-continuous;

b) the one-stage reward $r$ is continuous and bounded (from above);

c) the transition law $ℚ$ satisfies the Feller condition, i.e.

$\int v (y) ℚ (d y | x, a)$

is bounded and continuous on $𝕂$ for every continuous bounded function $v$ on $X$ ,

implies Assumption 1 for any non-negative measurable function $u : X \to ℝ$ .

In plain terms, Theorem 2 says: if the set of available actions is compact and varies continuously with the state (a), the reward function is well-behaved — continuous and bounded (b), and the transition law responds smoothly to changes in the state and action, in the sense that small changes in inputs produce small changes in expected outcomes (the Feller condition in c) — then the backward induction algorithm is guaranteed to work.

GBWM as a Dynamic Programming Problem

With the theoretical scaffolding now in place, we can map our investing problem directly onto the Markov control model framework. Using the same terminology introduced above:

a) The state space $X$ is represented in our framework by the wealth space ${W_{t}}_{t = 0, \dots, T}$ , i.e. a state $x \in X$ is a wealth value $W_{t} = x$ ;

b) The action set $A$ is represented by our portfolio selection, i.e. $A = {π_{k} (t)}_{k = 1, \dots, N, t = 0, \dots, T}$ with $π_{k} (t)$ - portfolio of risk category $k$ at time $t$ ;

c) The family ${A (x) ∣ x \in X}$ of subsets of $A$ is represented by the set of efficient one-period portfolios available when the system is in state $W_{t} = x$ ;

d) The transition law $ℚ (\cdot | x, π_{k} (t))$ is generated by transition probabilities computed as part of our modeling.

The state space, action set, and transition law all map naturally onto our framework. The reward function $r : 𝕂 \to ℝ$ , however, requires a bit more care.

There are two differences between the standard dynamic programming setup and our problem. First, dynamic programming typically maximizes an expected value, whereas we want to maximize a probability. Second, our problem has a single reward only at the final date — reaching the goal — whereas the standard formulation allows for rewards at intermediate steps too (a variation we also use on our platform for goals with periodic payouts).

To bridge both gaps, we set the reward function as follows:

${\begin{cases} r (x_{t}, π_{k} (t)) = 0, & t = 0, \dots, T \forall k = 1, \dots, N \\ r_{T} = 𝟙_{{W_{T} \geq G}}, \end{cases}$

This means the reward is 1 if the final wealth reaches the goal $G$ and 0 if it falls short — a clean pass/fail outcome at the end of the horizon. Since

$𝔼 [𝟙_{{W_{T} \geq G}}] = ℙ (W_{T} \geq G)$

we see that maximizing the expected value of this indicator function is exactly the same as maximizing the probability of reaching the goal — which resolves the first difference.

With this reward function in place, conditions (a) and (c) of Theorem 2 are straightforward to verify in our setting. Condition (b) is more delicate: because the reward is an indicator function — jumping from 0 to 1 at the threshold $G$ — it is not continuous, which means it does not directly satisfy the continuity requirement.

We will not go into the technical details here, but it is possible to regularize the reward function in a way that recovers continuity, allowing condition (b) to be satisfied and the full framework to apply.

Bibliography

Hernandez-Lasserre (1996): Hernández-Lerma, O. and Lasserre, J. B. (1996). Discrete-time Markov control processes, volume 30 of Applications of Mathematics (New York). Springer-Verlag, New York. Basic optimality criteria.

Bellman (1957): Bellman, R. (1957). Dynamic programming. Princeton Landmarks in Mathematics. Princeton University Press, Princeton, NJ. Reprint of the 1957 edition, with a new introduction by Stuart Dreyfus.

Want to see this theory in action? Try our tool and discover how dynamic programming algorithms power truly optimal investment advice.

Set your investment goal

Subscribe to our blog

Stay on top of the most important aspects of investing and finance and learn more about how to use goals-based investing to successfully manage your own wealth

I am interested in the subject areas of...