Recap¶

Message Passing¶

As per our email, our Message Passing algorithm currently works on a Grid. Message passing is the most effective when running over a grid. Just a brief description of our Message Passing algorithm:

Break down the graph structure by sorting them using topological sort and form the junction nodes using the graph structure.
Starting at the root node, for every node (assuming marginal variables $a,b$):
1. Ask every children for their messages
2. Reduce the children messages and the marginal function($f_{a,b}$) into the output messages by computing the min
3. return output messages

The reduction operation performed is the $f_{min}$ which the message passing technique exploits, keeping track of the current minimum across dimensions and computing the least operations to get $f_{min}$

Using Message Passing yields 2 major benefits:

Reduction in computing the marginal function, computing the marginal function only once for all points in the marginal domain.
Reduction in the computations needed to perform $f_{min}$ by computing the marginal ${f_{a,b}}_{min}$ given the children's messages.

DIRECT¶

DIRECT relies on the definition of potentially optimal rectangles. An interval (rectangle, at a particular side) $j$ is potentially optimal if $$ \forall i \in I_3 : f(c_j) \leq f(c_i) $$ where $I_3$ is the set of indices of size $d_j$. In other words, the intervals for which is the $f_{min}$s in size $d$ are potentially optimal.

DIRECT considers potentially optimal rectangles by looking at the various optimal rectangles across various sizes (which in our zooming language - zoom level). At every iteration,

All potentially optimal rectangles would be split at the longest side
The function is evaluated at all centers of the new rectangles. The process is repeated until a termination condition is met (accuraccy or max iterations)

Note that if we have $d$ dimensions, it will be guaranteed that after $log_3(N)^d$ iterations that space explored would be equivalent to $N^d$hypercube grid

Ideas for Direct-MP¶

I think there are two ideas to resolve the problem:

Use the message passing for computation but not $f_{min}$ reduction
We redfine the DIRECT's defintiion of potentially optimal rectangle to potnetially optimal voxel.

Use the message passing for computation but not $f_{min}$ reduction¶

The key idea is not to compute the min at every step. Which means we compute the messages to be $f_{a,b}$ and not reduce the messages and keep the messages stored inside the junction nodes.

Then, the key modification would be to consider is to propagate the marginal set of points with their associated values, then we use the messages to compute the desired f(x) after all messaged are computed or when appropriate. Even though this method seems like there would be no benefit compared with just evaluating f(x) itself, there can be a computational benefit in the long run:

Consider your example, in addition we assume that the nodes are a path graph with edges $[(1,2), (2,3), ... , (d-1, d)]$.

MP: $5f(d-1)$ where $f$ is the computation of the small $f$ (ie. 1,2)
Full: $F$ would yield: $(2d+1)F$ or what you have pointed out when $d=1000$ it will be $2001$ evaluations of $F$

In computational terms, the ratio of compute time for an equivalent $F$ using $f(d-1)$ is $f(d-1) \sim F1.5$ (measured). The ratio gets better as the graph is complex, the computation here assumes a simple path graph as menioned.

Considering each additional iteration,

MP: additional $2f$ computations, considering that the messages are saved for the next iteration.
Full: additional $2F$ computations

Here, we assume that the message passing compute is negliable.

After $k$ additional iterations:

MP: 5f(d-1) + 2fk
Full: (2d+1)F + 2Fk

There would be a certain $k$ when MP would be faster, as for every iteration $(2f < 2F)$.

Redfine the DIRECT's defintion to potentially optimal voxel¶

Very similar to the zooming technique, we adapt the DIRECT technique - instead of splitting up the selected rectangle only on the longest side we explore the best voxel of every size. For example:

Let $Li$ be the set of voxels at level $i$. Let $Li_j$ be the best j-th voxel in $Li$, best is being defined by $f_{min}$.

So, in every iteration we explore all of the potentially optimal voxel:

First, we grid up the space 3^d, which is represented by $L1$ and compute the f_min in this level.
Next, we explore the best hypercube $L1_1$ by gridding up the best voxel. (Note $L2 = L1_1$ when gridded up)
Thereafer, we select the next best hypercube in $L1$ - $L1_2$ to grid also the best grid in $L2$ - $L2_1$

and so on... Note that the number of voxels explored in every iteration is linear to the iteration number.

Computation of the best voxel is clear and can be computed directly using our current message passing setup. However, it is still not clear if there is a way to compute the next best (and subsequent) voxels efficiently.

Contrasting to our Zooming technique, this technique continues to explore the current best voxel in every level not just the best hypercube like what we are doing now. I think with this technique it would allow us to avoid being too greedy and miss a better optima. The shortcoming would be that we consume much more computation as in every iteration we commit to explore exponentially more voxels in the entirety of the algorithm.

For this to be possible, the math behind DIRECT needs to work with exploring hypercubes instead of rectangles and perhaps also inherit the other properties.(I havnt taken a deep look at this yet). My intuition would be that it would be similar as in the vanilla version of DIRECT the full grid of a space would be also reached after $log_3(N)^d$ iterations.

Also, this would be parallelizable.