Theory of point-image measurements

Image-plane likelihood

The computation of an image-plane likelihood requires the preliminary inversion of the lens mapping and the identification of the counter-images of a given source. This operation, as noted above, is non-trivial and time-consuming. For this reason, an image-plane optimization is usually carried out only after an approximate lens model has been found using a source-plane optimization.

In principle, once the counter-images of a source have been found, the computation of the image-plane likelihood is a mere application of the equations written at the beginning of this page. In practice, things are a little more complicated because we need to associate the various predicted images to the observed ones. This task can be carried out following different strategies.

Theory: point image measurements

In the following equations, we provide analytic expressions for the (logarithmic) likelihood associated with position measurements. The way the problem is formulated also makes it easy to write the derivative of the log-likelihood.

Direct method (match = :direct)

If, somehow, we have sorted out the predicted images so that each of them is associated to the corresponding observed image, we can just compute the log-likelihood as written above, i.e.

\[\log P(\{ \hat\theta_n \} \mid \{ \theta_n \}) = \sum_n \frac{1}{2} \log{\biggl| \frac{Q_n}{2\pi} \biggr|} - \frac{1}{2} (\hat\theta_n - \theta_n)^T Q_n (\hat\theta_n - \theta_n) \; .\]

This is done by the Gravity.loglikelihoodgrad method described below with the keyword match = :direct. Note that, if the keyword bayesianfactor is false, the first term inside the summation of the last expression is not included, so that only the quadratic term is summed.

The log-likelihood expression above can be easily differentiated with respect to the source position $\beta$: using again the inverse function theorem, we find

\[\frac{\partial \log P(\{ \hat\theta_n \} \mid \{ \theta_n \})}{\partial \beta} = \sum_n A_n^{-T} Q_n (\hat\theta_n - \theta_n)\]

This quantity can be useful when performing a likelihood maximization over the source position (if we do not adopt a Bayesian approach).
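To make the procedure concrete, here is a minimal Julia sketch of the direct-method log-likelihood and of its gradient with respect to the source position, following the two expressions above. The names (`direct_loglike`, `θobs`, `θpred`, `Q`, `A`) are illustrative assumptions, not the Gravity API.

```julia
using LinearAlgebra

# θobs[n], θpred[n]: observed and associated predicted image positions (2-vectors);
# Q[n]: 2×2 precision matrix of the n-th position measurement.
function direct_loglike(θobs, θpred, Q; bayesianfactor = true)
    logp = 0.0
    for n in eachindex(θobs)
        Δ = θobs[n] - θpred[n]
        bayesianfactor && (logp += 0.5 * logdet(Q[n] / (2π)))  # normalization term
        logp -= 0.5 * dot(Δ, Q[n] * Δ)                         # quadratic term
    end
    return logp
end

# Gradient with respect to the source position β; A[n] is the Jacobian of the
# lens mapping evaluated at the n-th predicted image.
function direct_loglike_grad(θobs, θpred, Q, A)
    g = zero(first(θobs))
    for n in eachindex(θobs)
        g += transpose(inv(A[n])) * (Q[n] * (θobs[n] - θpred[n]))
    end
    return g
end
```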

Best match (match = :best or match = :fast)

In this case we consider a likelihood of the form

\[\log P(\{ \hat\theta_n \} \mid \{ \theta_m \}, \sigma) = \sum_n \frac{1}{2} \log{\biggl| \frac{Q_n}{2\pi} \biggr|} - \frac{1}{2} (\hat\theta_n - \theta_{\sigma(n)})^T Q_n (\hat\theta_n - \theta_{\sigma(n)}) \; ,\]

where $\sigma$ represents a permutation of the various predicted images. We use different subscripts for the observed images and the predictions because, in general, their numbers differ. In the following we call $N$ the number of observed images and $M$ the number of predictions. The procedure finds the permutation $\sigma$ that maximizes the expression above and returns the maximum value found, together with the gradient with respect to the source position.

The exact meaning of permutation here depends on the value of the duplicates keyword:

  • if true, it represents a permutation with repetition: each observed image can be matched to any predicted one with no restriction, so that two observed images could in principle be associated to the same predicted one. This gives $M^N$ different permutations. In practice, finding the best permutation is not complicated: we just compute all possible sum terms above and keep, for each observed image, the best one (see the sketch at the end of this subsection).

  • if false, it represents an $N$-permutation of the $M$ predicted images, resulting in ${}^M \mathrm{P}_N = M! / (M - N)!$ possibilities. The strategy we adopt to find the best $\sigma$ is to first apply the same technique as in the case of permutations with repetition and then check whether the best solution contains repetitions: if it does not, we keep it; if it does, we loop over the (possibly large) number of permutations.

Finally, once the best permutation has been found, the gradient of the log-likelihood is computed in a way similar to the above formula:

\[\frac{\partial \log P(\{ \hat\theta_n \} \mid \{ \theta_m \}, \sigma)}{\partial \beta} = \sum_n A_n^{-T} Q_n \bigl( \hat\theta_n - \theta_{\sigma(n)} \bigr)\]

In case of duplicates = true, we can also use match = :fast, which is virtually identical to match = :best but uses a faster algorithm (resulting in a speed gain of more than a factor of ten).
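As an illustration, the sketch below implements the duplicates = true case: since each observed image can be matched independently, it keeps, for every observed image, the prediction with the largest individual term. The names are assumptions for the example, not the Gravity API.

```julia
using LinearAlgebra

# Best match with repetitions allowed: returns the log-likelihood and the
# chosen assignment σ (σ[n] is the index of the prediction matched to image n).
function bestmatch_loglike(θobs, θpred, Q; bayesianfactor = true)
    logp = 0.0
    σ = zeros(Int, length(θobs))
    for n in eachindex(θobs)
        terms = [ (Δ = θobs[n] - θpred[m]; -0.5 * dot(Δ, Q[n] * Δ)) for m in eachindex(θpred) ]
        σ[n] = argmax(terms)
        logp += terms[σ[n]]
        bayesianfactor && (logp += 0.5 * logdet(Q[n] / (2π)))
    end
    return logp, σ
end
```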

Marginalization over all matches (match = :all)

Looking again at the previous subsection, we can easily recognize that the permutation $\sigma$ is a nuisance parameter: we are not really interested in it, as we merely use it to compute the likelihood. In a more correct Bayesian approach, we should therefore marginalize over all possible permutations and write

\[P(\{ \hat\theta_n \} \mid \{ \theta_m \}) = \sum_\sigma P(\{ \hat\theta_n \} \mid \{ \theta_{\sigma(m)} \}, \sigma) P( \sigma ) = \frac{1}{\Sigma} \sum_\sigma \prod_n P(\hat\theta_n \mid \theta_{\sigma(n)})\]

where $\Sigma$ is the number of allowed permutations $\sigma$ (that is, $\Sigma = M^N$ or $\Sigma = M! / (M - N)!$ depending on the kind of permutations considered).

Computationally, this expression can be evaluated in two different ways depending on the permutation type. In case of permutations with repetitions (duplicates = true), we simply have

\[P(\{ \hat\theta_n \} \mid \{ \theta_m \}) = \frac{1}{\Sigma} \prod_n \sum_m P(\hat\theta_n \mid \theta_m) \; .\]

This expression, written in terms of log-quantities, becomes

\[\log P(\{ \hat\theta_n \} \mid \{ \theta_m \}) = -\log\Sigma + \sum_n \mathrm{LSE}_m \log P(\hat\theta_n \mid \theta_m) \; ,\]

where $\mathrm{LSE}_m$ denotes the LogSumExp operation over $m$. Its gradient is given by

\[\frac{\partial \log P(\{ \hat\theta_n \} \mid \{ \theta_m \})}{\partial \beta} = \sum_n \sum_m \mathrm{softmax}_m\bigl[\log P(\hat\theta_n \mid \theta_m) \bigr] A_n^{-T} Q_n \bigl( \hat\theta_n - \theta_m \bigr)\; ,\]

where $\mathrm{softmax}$ is the softmax function:

\[\mathrm{softmax}_m(x_m) = \frac{\exp x_m}{\sum_{m'} \exp x_{m'}} \; .\]
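A minimal sketch of this duplicates = true marginalization, with the same assumed names as before (not the Gravity API), could read as follows; the per-image LogSumExp is stabilized by subtracting the maximum term.

```julia
using LinearAlgebra

logsumexp(x) = (c = maximum(x); c + log(sum(exp.(x .- c))))   # numerically stable LSE

function marginal_loglike_rep(θobs, θpred, Q)
    N, M = length(θobs), length(θpred)
    logp = -N * log(M)                                        # -log Σ with Σ = M^N
    for n in 1:N
        terms = [ (Δ = θobs[n] - θpred[m];
                   0.5 * logdet(Q[n] / (2π)) - 0.5 * dot(Δ, Q[n] * Δ)) for m in 1:M ]
        logp += logsumexp(terms)
    end
    return logp
end
```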

In case of $N$-permutation of the $M$ predicted images (duplicates = false), instead, we cannot perform the simplification above, and we have therefore

\[\log P(\{ \hat\theta_n \} \mid \{ \theta_m \}) = -\log \Sigma + \mathrm{LSE}_\sigma \biggl[ \sum_n \log P(\hat\theta_n \mid \theta_{\sigma(n)}) \biggr]\]

The gradient of this expression is

\[\frac{\partial \log P(\{ \hat\theta_n \} \mid \{ \theta_m \})}{\partial \beta} = \sum_\sigma \mathrm{softmax}_\sigma \biggl[\sum_n \log P(\hat\theta_n \mid \theta_{\sigma(n)}) \biggr] \sum_n A_n^{-T} Q_n \bigl( \hat\theta_n - \theta_{\sigma(n)} \bigr)\; .\]

Finally, we note that the use of the normalization $1 / \Sigma$ above effectively introduces a factor that penalizes cases where many images are predicted and few are observed. This is very sensible and shows that the approach followed here naturally encapsulates a penalty for predicted images that are not observed. This factor is taken into account in the code through the keyword parameter missedpredictionspenalty = true (which is the default).
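For completeness, here is a sketch of the duplicates = false case, where the LogSumExp runs over all $N$-permutations of the $M$ predictions. The names are assumptions, and the brute-force enumeration is only practical for small $N$ and $M$.

```julia
using LinearAlgebra

# All ordered selections (k-permutations) of the indices in `pool`.
function kperms(pool::Vector{Int}, k::Int)
    k == 0 && return [Int[]]
    out = Vector{Int}[]
    for (i, p) in enumerate(pool)
        rest = vcat(pool[1:i-1], pool[i+1:end])
        for tail in kperms(rest, k - 1)
            push!(out, vcat(p, tail))
        end
    end
    return out
end

function marginal_loglike_noreps(θobs, θpred, Q)
    N, M = length(θobs), length(θpred)
    σs = kperms(collect(1:M), N)                   # Σ = M!/(M-N)! permutations
    logterms = map(σs) do σ
        s = 0.0
        for n in 1:N
            Δ = θobs[n] - θpred[σ[n]]
            s += 0.5 * logdet(Q[n] / (2π)) - 0.5 * dot(Δ, Q[n] * Δ)
        end
        s
    end
    c = maximum(logterms)
    return -log(length(σs)) + c + log(sum(exp.(logterms .- c)))
end
```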

Theory: luminous image measurements

We now consider the case of luminous image measurements and provide analytic expressions for the (logarithmic) likelihood associated with position and magnitude measurements. Since the Jacobian of the lens mapping is already used to compute the magnification factor (the lens modules $\mathit{LM}_n$), we cannot provide any derivative of the likelihood without higher-order derivatives of the lens mapping.

If we suppose the intrinsic (unlensed) magnitude $m^\mathrm{s}$ is a given parameter, the generalization of the results of the previous subsection is trivial: essentially, we just need to include additional terms associated with the magnitude measurements. For example, the result obtained for the direct method is

\[\log P(\{ \hat\theta_n \}, \{ \hat m_n \} \mid \{ \theta_n \}, \{ m_n \}) = \frac{1}{2} \sum_n \log \biggl| \frac{Q_n}{2\pi} \biggr| + \log \frac{q_n}{2\pi} - (\hat\theta_n - \theta_n)^T Q_n (\hat\theta_n - \theta_n) - q_n (\hat m_n - m_n)^2 \; ,\]

where $m_n \equiv m^\mathrm{s} - \mathit{LM}_n$. As usual, in case the keyword bayesianfactor is false, the first two terms inside the summation are not included. The other methods are easily generalized.
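For instance, a sketch of this case could be written as follows, where ms is the given intrinsic source magnitude, mobs the observed magnitudes, LM the lens modules of the predicted images, and q the magnitude precisions; all names are assumptions for the example, not the Gravity API.

```julia
using LinearAlgebra

function direct_loglike_lum(θobs, mobs, θpred, LM, ms, Q, q; bayesianfactor = true)
    logp = 0.0
    for n in eachindex(θobs)
        Δθ = θobs[n] - θpred[n]
        Δm = mobs[n] - (ms - LM[n])                # predicted magnitude m_n = ms - LM_n
        bayesianfactor &&
            (logp += 0.5 * logdet(Q[n] / (2π)) + 0.5 * log(q[n] / (2π)))
        logp -= 0.5 * dot(Δθ, Q[n] * Δθ) + 0.5 * q[n] * Δm^2
    end
    return logp
end
```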

Since the relationship between the source and the observed magnitudes is very simple, it might be convenient to marginalize the likelihood analytically over $m^\mathrm{s}$: this way, we can carry out the analysis (maximum likelihood or Bayesian inference) with one parameter fewer. We consider below the results obtained in the various cases.

Marginalized magnitude direct method (match = :direct)

We need to compute

\[P(\{ \hat\theta_n \}, \{ \hat m_n \} \mid \{ \theta_n \}) = \int P(\{ \hat\theta_n \}, \{ \hat m_n \} \mid \{ \theta_n \}, m^\mathrm{s}) \, \mathrm{d}m^\mathrm{s} \; .\]

The integrand, as a function of $m^\mathrm{s}$, is just a univariate normal distribution with mean, variance, and normalization given by the equations at the end of the source-plane likelihood section. Hence we find

\[\log P(\{ \hat\theta_n \}, \{ \hat m_n \} \mid \{ \theta_n \}) = \frac{1}{2} \sum_n \log{\biggl| \frac{Q_n}{2\pi} \biggr|} + \log \frac{q_n}{2\pi} - (\hat\theta_n - \theta_n)^T Q_n (\hat\theta_n - \theta_n) - q_n \Delta m_n^2 - \frac{1}{2} \log \frac{q}{2\pi} \; ,\]

where $\Delta m_n \equiv \hat m_n + \mathit{LM}_n - \bar m^\mathrm{s}$ and the other quantities ($q$ and $\bar m^\mathrm{s}$) are computed as discussed above.
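A sketch of this marginalized expression, under the same assumed names as before (not the Gravity API), could be:

```julia
using LinearAlgebra

function direct_loglike_lum_marg(θobs, mobs, θpred, LM, Q, q)
    qtot   = sum(q)                                           # q = Σ_n q_n
    ms_bar = sum(q[n] * (mobs[n] + LM[n]) for n in eachindex(mobs)) / qtot
    logp   = -0.5 * log(qtot / (2π))                          # marginalization term
    for n in eachindex(θobs)
        Δθ = θobs[n] - θpred[n]
        Δm = mobs[n] + LM[n] - ms_bar
        logp += 0.5 * (logdet(Q[n] / (2π)) + log(q[n] / (2π)) -
                       dot(Δθ, Q[n] * Δθ) - q[n] * Δm^2)
    end
    return logp
end
```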

Marginalized magnitude best match (match = :best or match = :fast)

In these cases the likelihood, after marginalization over $m^\mathrm{s}$, takes the form

\[\log P(\{ \hat\theta_n \}, \{ \hat m_n \} \mid \{ \theta_m \}, \sigma) = \frac{1}{2} \sum_n \log{\biggl| \frac{Q_n}{2\pi} \biggr|} + \log \frac{q_n}{2\pi} - (\hat\theta_n - \theta_{\sigma(n)})^T Q_n (\hat\theta_n - \theta_{\sigma(n)}) - q_n \Delta m_n^2 - \frac{1}{2} \log \frac{q}{2\pi} \; ,\]

where $\sigma$ represents a permutation of the various predicted images. The quantity $q \equiv \sum_n q_n$ does not depend on the permutation; instead, we have

\[\bar m^\mathrm{s} = q^{-1} \sum_n q_n (\hat m_n + \mathit{LM}_{\sigma(n)}) \; ,\]

and

\[\Delta m_n \equiv \hat m_n + \mathit{LM}_{\sigma(n)} - \bar m^\mathrm{s} \; ;\]

therefore both these quantities depend on the permutation.

Given the relatively complex expression, the simplifications carried out previously cannot be used in this case: in fact, even for permutations with repetition, the contribution of the $n$-th term to the sum does not depend only on the value of $\sigma(n)$, but on the entire permutation. As a result, we need to compute the result for all possible permutations and return the maximum likelihood found.

Again, if we have duplicates = true, we can use match = :fast to obtain a significant speed-up. Note that in this case the best match is computed using only the positions (but the likelihood then also includes the magnitude terms): as a result, this technique should be used only when the association of observed and predicted images can be carried out using the positions alone (as is usually the case), rather than the combined position and magnitude information.
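The brute-force evaluation over all permutations can be sketched as follows for duplicates = true (all $M^N$ assignments; names are assumptions, and the enumeration is only practical for small $N$ and $M$): every assignment is evaluated in full, since $\bar m^\mathrm{s}$ and $\Delta m_n$ depend on the entire permutation, and the best one is kept.

```julia
using LinearAlgebra

function bestmatch_loglike_lum_marg(θobs, mobs, θpred, LM, Q, q)
    N, M = length(θobs), length(θpred)
    qtot = sum(q)
    best = -Inf
    for σ in Iterators.product(ntuple(_ -> 1:M, N)...)        # all M^N assignments
        ms_bar = sum(q[n] * (mobs[n] + LM[σ[n]]) for n in 1:N) / qtot
        logp = -0.5 * log(qtot / (2π))
        for n in 1:N
            Δθ = θobs[n] - θpred[σ[n]]
            Δm = mobs[n] + LM[σ[n]] - ms_bar
            logp += 0.5 * (logdet(Q[n] / (2π)) + log(q[n] / (2π)) -
                           dot(Δθ, Q[n] * Δθ) - q[n] * Δm^2)
        end
        best = max(best, logp)
    end
    return best
end
```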

Marginalized magnitude over all matches (match = :all)

Essentially the same applies to the last case: we need to evaluate the likelihood for all permutations and average over all the results obtained.

Image-plane marginalized likelihood

A third option is to compute the image-plane likelihood, but marginalizing at the same time with respect to the source position. The idea is to isolate the variables that characterize the source parameters from the variables that characterize the lenses. This way, we can use an inner loop to perform inference on the source parameters with fixed lens parameters. The advantage is that some computations, such as the calculation of the grid used to invert the lens equation, can be performed outside the inner loop, potentially saving some computation time.

This mode is implemented with a likelihood that, internally, performs that inner loop on the source positions. Specifically, the likelihood computes a first approximation of the source positions by mapping the observed images into the source plane using Gravity.sourceimage. Then, it uses that approximate position as a starting point to maximize the log-likelihood (computed in the image plane). Finally, it uses the maximum of the log-likelihood to compute an approximate marginalized likelihood using Laplace's approximation if bayesianfactor = true, or returns just the maximum-likelihood value otherwise.
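The following generic sketch illustrates the idea; the names (`loglike`, `fd_hessian`, `laplace_marginal`) are hypothetical and the actual implementation in Gravity may differ. Starting from an approximate source position, a few Newton steps maximize the image-plane log-likelihood, and Laplace's approximation is then applied if bayesianfactor = true.

```julia
using LinearAlgebra

# Central finite-difference Hessian of f at x (step h).
function fd_hessian(f, x; h = 1e-4)
    d = length(x)
    e(i) = (v = zeros(d); v[i] = 1.0; v)
    [ (f(x + h*e(i) + h*e(j)) - f(x + h*e(i) - h*e(j)) -
       f(x - h*e(i) + h*e(j)) + f(x - h*e(i) - h*e(j))) / (4h^2) for i in 1:d, j in 1:d ]
end

# Newton maximization of `loglike` over the source position, starting from β0,
# followed (optionally) by Laplace's approximation of the marginalization over β.
function laplace_marginal(loglike, β0; bayesianfactor = true, h = 1e-4, iters = 20)
    d = length(β0)
    e(i) = (v = zeros(d); v[i] = 1.0; v)
    β = float.(β0)
    for _ in 1:iters
        g = [ (loglike(β + h*e(i)) - loglike(β - h*e(i))) / (2h) for i in 1:d ]
        β -= fd_hessian(loglike, β; h = h) \ g      # Newton step (Hessian is negative definite near the maximum)
    end
    Lmax = loglike(β)
    bayesianfactor || return Lmax
    H = fd_hessian(loglike, β; h = h)               # Hessian at the maximum
    return Lmax + 0.5 * d * log(2π) - 0.5 * logdet(-H)
end
```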