Image-plane likelihood
The computation of an image-plane likelihood requires that we explicitly include the source parameters (positions, magnitudes, quadrupole moments...) in the model, and also that we invert the lens mapping and associate the counter-images of a given source. All these operations, as noted above, are non-trivial and time-consuming. For this reason, an image-plane optimization is usually carried out only after an approximate lens model has been found using a source-plane optimization.
In principle, once the counter-images of a given source have been found, the computation of the image-plane likelihood is a mere application of the equations written at the beginning of this page. In practice, things are a little more complicated because we need to associate the various predicted images with the observed ones. This task can be carried out following different strategies.
In the following we will reserve the subscript $i \in \{1, \dots, I\}$ to indicate the observed images and the subscript $p \in \{1, \dots, P\}$ for the predicted images from a given source.
Inversion of the lens equation
The inversion of the lens equation, i.e. the discovery of the image positions associated with a given source position, plays a central role in the image-plane likelihood. This task is generally carried out using a non-linear equation solver. Many non-linear equation solvers are known, but in Gravity.jl we have decided to use the simplest one, Newton's method.
Newton's method, like any iterative method, requires a starting guess for the solution. When we already have a good approximation of the lens model, we can generally safely use the positions of the observed images as starting points of the method. In these cases the convergence of the lens inversion is generally quite fast, unless the Jacobian of the lens equation is ill-behaved (this can happen for sources on top of caustics, whose images have formally infinite magnification, or for lens models with singularities).
In case we do not have a good lens model, or in case we wish to find new solutions of the lens equation (for example, to find possible counter-images), we can use a suitable grid and map all grid cells into the source plane using the lens equation. We can then identify the cells which, when mapped into the source plane, enclose the source position, and use their centers (or better, the inverse of the source position under a linear approximation of the lens equation) as guesses for the solution. If the grid is suitably built, this technique generally ensures that we will find all possible solutions of the lens equation, at the expense of a much higher computational cost.
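To make the procedure concrete, below is a minimal sketch of a Newton inversion. The functions alpha and jac_alpha are hypothetical stand-ins for the deflection field and its Jacobian; this is not the actual Gravity.jl implementation.

```julia
using LinearAlgebra

# Minimal Newton inversion of the lens equation β = θ - α(θ).
# `alpha` returns the deflection angle at an image-plane position;
# `jac_alpha` returns its 2×2 Jacobian.
function invert_lens(alpha, jac_alpha, β, θ₀; tol = 1e-10, maxiter = 50)
    θ = float.(θ₀)
    for _ in 1:maxiter
        r = θ - alpha(θ) - β        # residual of the lens equation
        norm(r) < tol && return θ
        A = I - jac_alpha(θ)        # Jacobian A = ∂β/∂θ
        θ = θ - A \ r               # Newton step
    end
    return θ                        # callers should check the residual
end

# Example: point-mass lens with unit Einstein radius, α(θ) = θ / |θ|².
alpha(θ) = θ / dot(θ, θ)
jac_alpha(θ) = (dot(θ, θ) * I - 2 * θ * θ') / dot(θ, θ)^2
invert_lens(alpha, jac_alpha, [0.3, 0.0], [1.2, 0.0])  # ≈ [1.161, 0.0]
```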
Positions
Below we provide analytic expressions for the (logarithmic) likelihood associated with position measurements. The way the problem is formulated also makes it easy to write the derivative of the log-likelihood.
Direct method (matches = :direct)
If, somehow, we have sorted the predicted images so that each of them is associated with the corresponding observed image (which necessarily implies $I = P$), we can just compute the log-likelihood as written above, i.e.
\[2 \log P(\{ \hat\theta_i \} \mid \{ \theta_i \}) = \sum_{i=1}^I \biggl[ \log{\biggl| \frac{\mathbf{\Theta}_i}{2\pi} \biggr|} - (\hat\theta_i - \theta_i)^T \mathbf{\Theta}_i (\hat\theta_i - \theta_i) \biggr] \; .\]
This is done by the Gravity.loglikelihoodgrad method described below with the keyword matches = :direct. Note that in case the keyword bayesianfactor is false, the first term inside the summation of the last expression is not included, so that the summation only runs over the quadratic term.
The log-likelihood expression above can be easily differentiated with respect to the source position $\beta$: using the inverse function theorem again, we find
\[\frac{\partial \log P(\{ \hat\theta_i \} \mid \{ \theta_i \})}{\partial \beta} = \sum_{i=1}^I A_i^{-T} \mathbf{\Theta}_i (\hat\theta_i - \theta_i) \; .\]
This quantity can be useful when maximizing the likelihood over the source position, or when using Monte Carlo techniques that require the derivatives of the posterior (such as Hamiltonian Monte Carlo).
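As an illustration, here is a minimal sketch of these two expressions in Julia. The function name and argument layout are ours, not the actual Gravity.loglikelihoodgrad interface.

```julia
using LinearAlgebra

# Direct-match log-likelihood and its gradient with respect to the source
# position β. `θobs` and `θpred` are vectors of 2-element positions, `Θ`
# the 2×2 precision matrices of the measurements, and `A` the Jacobians
# of the lens mapping evaluated at the predicted images.
function loglike_direct(θobs, θpred, Θ, A; bayesianfactor = true)
    logL = 0.0
    grad = zeros(2)
    for i in eachindex(θobs)
        d = θobs[i] - θpred[i]
        logL -= d' * Θ[i] * d / 2
        bayesianfactor && (logL += log(det(Θ[i] / (2π))) / 2)
        grad += A[i]' \ (Θ[i] * d)   # A_i^{-T} Θ_i (θ̂_i - θ_i)
    end
    return logL, grad
end
```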
Best match (matches = :best or matches = :best_x)
In this case we consider a likelihood of the form
\[2 \log P(\{ \hat\theta_i \} \mid \{ \theta_p \}, \sigma) = \sum_{i=1}^I \biggl[ \log{\biggl| \frac{\mathbf{\Theta}_i}{2\pi} \biggr|} - (\hat\theta_i - \theta_{\sigma(i)})^T \mathbf{\Theta}_i (\hat\theta_i - \theta_{\sigma(i)}) \biggr] \; ,\]
where $\sigma$ represents a permutation of the predicted images. We use different subscripts for the observed and predicted images because, in general, there are different numbers of them. For matches = :best, Gravity.jl finds the permutation $\sigma$ that maximizes the expression above and returns the maximum value found, together with the Jacobian with respect to the source position. For point sources, matches = :best_x is entirely equivalent to matches = :best; for the other source types discussed below, instead, matches = :best_x looks for the permutation that maximizes the likelihood associated with the positions, but then computes the likelihood taking into account all source parameters.
The exact meaning of permutation here depends on the value of the duplicates keyword:
if true, it represents a permutation with repetition: each observed image can be matched to any predicted one with no restriction, so that two observed images could in principle be associated with the same predicted one. This way we have $P^I$ different permutations. In practice, finding the best permutation is very simple: we just compute all possible sum terms above and, for each observed image, take the best one. Thus, essentially, for each observed image we take the closest predicted image in terms of Mahalanobis distance.
if false, it represents an $I$-permutation of the $P$ predicted images, resulting in ${}^P \mathrm{P}_I = P! / (P - I)!$ possibilities. The strategy we adopt to find the best $\sigma$ is to try the same technique as in the case of permutations with repetition and to check whether the best solution contains repetitions: if it does not, we keep it; if it does, we loop over the (possibly large) number of permutations (see the sketch after this list). This choice requires that the predicted images are at least as numerous as the observed ones, i.e. that $P \ge I$. This is potentially problematic, as it might prevent convergence if the lensing model is far from a realistic configuration.
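A minimal sketch of the duplicates = true search described in the first item; the function name and arguments are ours. A duplicates = false implementation would check the returned assignment for repeated indices and, only if needed, fall back to a loop over all $I$-permutations.

```julia
# For each observed image, pick the predicted image that minimizes the
# Mahalanobis distance (equivalently, maximizes its Gaussian log-term).
function best_match(θobs, θpred, Θ)
    σ = zeros(Int, length(θobs))
    for i in eachindex(θobs)
        mahal(p) = (θobs[i] - θpred[p])' * Θ[i] * (θobs[i] - θpred[p])
        σ[i] = argmin(mahal, eachindex(θpred))
    end
    return σ   # σ[i] is the index of the predicted image matched to i
end
```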
Finally, once the best permutation has been found, the gradient of the log-likelihood is computed in a way similar to the above formula:
\[\frac{\partial \log P(\{ \hat\theta_i \} \mid \{ \theta_p \}, \sigma)}{\partial \beta} = \sum_{i=1}^I A_{\sigma(i)}^{-T} \mathbf{\Theta}_i \bigl( \hat\theta_i - \theta_{\sigma(i)} \bigr) \; ,\]

where the Jacobian $A_{\sigma(i)}$ is evaluated at the matched predicted image.
Marginalization over all matches (matches = :all)
Looking again at the previous subsection, we can easily recognize that the permutation $\sigma$ is a nuisance parameter: we are not really interested in it, as we merely use it to compute the likelihood. In a more rigorous Bayesian approach, we should therefore marginalize over all accepted permutations $\Sigma$ and write
\[P(\{ \hat\theta_i \} \mid \{ \theta_p \}) = \sum_{\sigma\in\Sigma} P(\{ \hat\theta_i \} \mid \{ \theta_{\sigma(i)} \}, \sigma) P( \sigma ) = \frac{1}{|\Sigma|} \sum_{\sigma\in\Sigma} \prod_{i=1}^I P(\hat\theta_i \mid \theta_{\sigma(i)}) \; .\]
Here we have assumed that all permutations have the same probability, $P(\sigma) = 1/|\Sigma|$ where $|\Sigma|$ is the number of allowed permutations: that is, $|\Sigma| = P^I$ for permutations with repetitions, or $|\Sigma| = P! / (P - I)!$ for $I$-permutations (without repetitions).
Computationally, this expression can be evaluated in two different ways depending on the permutation type. In the case of permutations with repetition (duplicates = true), the matches of the different observed images are independent, so the sum over permutations factorizes and we simply have
\[P(\{ \hat\theta_i \} \mid \{ \theta_p \}) = \frac{1}{|\Sigma|} \prod_{i=1}^I \sum_{p=1}^P P(\hat\theta_i \mid \theta_p) \; .\]
This expression, written in terms of log-quantities, becomes
\[\log P(\{ \hat\theta_i \} \mid \{ \theta_p \}) = -\log|\Sigma| + \sum_{i=1}^I \mathop{\huge\Lambda}\limits_{p=1}^P \log P(\hat\theta_i \mid \theta_p) \; ,\]
where $\mathop{\Lambda}_p$ denotes the LogSumExp operation over $p$. Its gradient is given by
\[\frac{\partial \log P(\{ \hat\theta_i \} \mid \{ \theta_p \})}{\partial \beta} = \sum_{i=1}^I \sum_{p=1}^P \mathrm{softmax}_p\bigl[\log P(\hat\theta_i \mid \theta_p) \bigr] A_p^{-T} \mathbf{\Theta}_i \bigl( \hat\theta_i - \theta_p \bigr)\; ,\]
where $\mathrm{softmax}$ is the softmax function:
\[\mathrm{softmax}_p(x_p) = \frac{\exp x_p}{\sum_{p'} \exp x_{p'}} \; .\]
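As an illustration, a minimal sketch of the duplicates = true computation, assuming a matrix logP whose entry logP[i, p] holds $\log P(\hat\theta_i \mid \theta_p)$; the helper and function names are ours (packages such as LogExpFunctions.jl provide a more careful logsumexp).

```julia
# Numerically stable LogSumExp, the Λ operation above.
logsumexp(x) = (m = maximum(x); m + log(sum(exp.(x .- m))))

# Marginalized log-likelihood for permutations with repetition:
# -log|Σ| = -I log P, plus one LogSumExp per observed image.
function loglike_all(logP::AbstractMatrix)
    nI, nP = size(logP)   # observed × predicted
    return -nI * log(nP) + sum(logsumexp(logP[i, :]) for i in 1:nI)
end
```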
In the case of $I$-permutations of the $P$ predicted images (duplicates = false), instead, we cannot perform the simplification above, and we therefore have
\[\log P(\{ \hat\theta_i \} \mid \{ \theta_p \}) = -\log |\Sigma| + \mathop{\huge\Lambda}\limits_{\sigma\in\Sigma} \biggl[ \sum_{i=1}^I \log P(\hat\theta_i \mid \theta_{\sigma(i)}) \biggr] \; .\]
The gradient of this expression is
\[\frac{\partial \log P(\{ \hat\theta_i \} \mid \{ \theta_p \})}{\partial \beta} = \sum_{\sigma\in\Sigma} \mathrm{softmax}_\sigma \biggl[\sum_{i=1}^I \log P(\hat\theta_i \mid \theta_{\sigma(i)}) \biggr] \sum_{i=1}^I A_{\sigma(i)}^{-T} \mathbf{\Theta}_i \bigl( \hat\theta_i - \theta_{\sigma(i)} \bigr) \; .\]
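A matching sketch for this case, enumerating the $I$-permutations explicitly; it reuses the logsumexp helper above, and the permutations iterator comes from Combinatorics.jl.

```julia
using Combinatorics  # provides `permutations`

# Marginalized log-likelihood over all I-permutations of the P predictions.
function loglike_all_noduplicates(logP::AbstractMatrix)
    nI, nP = size(logP)
    terms = [sum(logP[i, σ[i]] for i in 1:nI) for σ in permutations(1:nP, nI)]
    nperm = factorial(nP) ÷ factorial(nP - nI)   # |Σ| = P!/(P-I)!
    return -log(nperm) + logsumexp(terms)
end
```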
Finally, we note that the normalization $1 / |\Sigma|$ above effectively introduces a factor that penalizes cases where many images are predicted and few are observed. This is very sensible and shows that the approach followed here naturally encapsulates a penalty for unobserved images. This factor is taken into account in the code with the keyword parameter missedpredictionspenalty = true (which is the default).
Magnitudes
We now consider the case of luminous images and provide analytic expressions for the (logarithmic) likelihood associated with magnitude measurements. Note that since the Jacobian of the lens mapping already enters the computation of the magnification factors (the lens moduli $\mathit{LM}_i$ below), we cannot provide any derivative of the likelihood without higher-order derivatives of the lens mapping.
If we suppose that the intrinsic (unlensed) magnitude $M$ is a given parameter, the generalization of the results of the previous subsection is trivial: essentially, we just need to add terms associated with the magnitude measurements. For example, the result obtained for the direct method is
\[2 \log P(\{ \hat\theta_i \}, \{ \hat m_i \} \mid \{ \theta_i \}, M) = \sum_{i=1}^I \biggl[ \log \biggl| \frac{\mathbf{\Theta}_i}{2\pi} \biggr| + \log \frac{\lambda_i}{2\pi} - (\hat\theta_i - \theta_i)^T \mathbf{\Theta}_i (\hat\theta_i - \theta_i) - \lambda_i (\hat m_i + \mathit{LM}_i - M)^2 \biggr] \; .\]
As usual, in case the keyword bayesianfactor is false, the first two terms inside the summation are not included. The other methods are easily generalized.
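For concreteness, a sketch of the extra magnitude terms for the direct method; all names are ours, not the package's.

```julia
# `mobs` holds the observed magnitudes m̂_i, `LM` the lens moduli of the
# matched images, `λ` the magnitude precisions, and `M` the intrinsic
# source magnitude. Returns the magnitude contribution to log P.
function loglike_magnitudes(mobs, LM, λ, M; bayesianfactor = true)
    logL = 0.0
    for i in eachindex(mobs)
        logL -= λ[i] * (mobs[i] + LM[i] - M)^2 / 2
        bayesianfactor && (logL += log(λ[i] / (2π)) / 2)
    end
    return logL
end
```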
Marginalized magnitude
Since the relationship between the source and the observed magnitudes is very simple, it might be convenient to marginalize the likelihood analytically over $M$: this way, we can carry out the analysis (maximum likelihood or Bayesian inference) using one parameter less for each image family.
As for the source-plane likelihood, we can take advantage of the specific form of the likelihood and use a conjugate Gaussian prior for the source magnitude $M$, characterized by the hyperparameters $\hat M_0$ and $\lambda_0$. In this case we find
\[P(\{ \hat\theta_i \}, \{ \hat m_i \} \mid \{ \theta_i \}) = \int P(\{ \hat\theta_i \}, \{ \hat m_i \} \mid \{ \theta_i \}, M) \, P(M \mid \hat M_0, \lambda_0) \, \mathrm{d}M \; .\]
The integrand, with respect to $M$, is just a univariate normal distribution with mean, variance, and normalization (evidence) given by the equations in the source-plane likelihood section.
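For reference, with the conventions used here these conjugate-Gaussian quantities take the standard form below (a sketch; the source-plane likelihood section remains the authoritative reference):

\[\bar\lambda = \lambda_0 + \sum_{i=1}^I \lambda_i \; , \qquad \bar M = \frac{1}{\bar\lambda} \biggl[ \lambda_0 \hat M_0 + \sum_{i=1}^I \lambda_i (\hat m_i + \mathit{LM}_i) \biggr] \; .\]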
Note that so far we have assumed that the images can be matched to the predictions one-by-one (matches = :direct). More generally, for a fixed permutation $\sigma$ that associates observed images with predicted ones, we need to evaluate the evidence for that specific permutation and, if necessary, marginalize over $\sigma$. Computations with variable permutations (associated with matches = :best or matches = :all) pose some difficulties when combined with the marginalization of the source magnitude. In fact, in this case the $i$-th term entering the sum in the log-likelihood computation does not involve only the permuted image $\sigma(i)$, but depends on the entire permutation $\sigma$: specifically, the likelihood contains (quadratically) quantities such as $\bar M$, which depend on the entire permutation. This makes it very difficult to apply the simplifications carried out previously. As a result, we cannot simplify the computation above in the most common cases, and we essentially need to compute the likelihood for all possible permutations.
Quadrupole measurements
Sources with quadrupole measurements have a log-likelihood of the form
\[\begin{align*} 2 \log {} & P(\{ \hat\theta_i \}, \{ \hat m_i \}, \{ \hat Q_i \} \mid \{ \theta_i \}, M, \mathbf{S}, \{ \nu_i \} ) = \sum_{i=1}^I \biggl[ \log \biggl| \frac{\mathbf{\Theta}_i}{2\pi} \biggr| - (\hat\theta_i - \theta_i)^T \mathbf{\Theta}_i (\hat\theta_i - \theta_i) \\ & {} + \log \frac{\lambda_i}{2\pi} - \lambda_i (\hat m_i + \mathit{LM}_i - M)^2 \\ & {} + \frac{\nu_i}{2} \log |\nu_i A_i^T \mathbf{S} A_i| - \log \bigl[ 4 \pi \Gamma(\nu_i - 1) \bigr] + \frac{\nu_i - 3}{2} \log |\hat Q_i| - \nu_i \mathrm{tr} (A_i^T \mathbf{S} A_i \hat Q_i) \biggr] \; , \end{align*}\]
where the three lines collect the contributions from the three different measurements (positions, magnitudes, and quadrupole moments).
For this kind of source, we can apply arguments similar to the ones discussed above. In particular, if we keep $M$ and $\mathbf{S}$ as free parameters, the computation of the log-likelihood proceeds as for point sources, with the additional terms associated with the magnitudes and quadrupole moments.
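For concreteness, a sketch of the quadrupole contribution for a single image, transcribing the third line of the expression above; the function name and arguments are ours.

```julia
using LinearAlgebra, SpecialFunctions  # `logdet`, `tr`, and `gamma`

# `S` is the source quadrupole, `Qobs` the observed image quadrupole Q̂_i,
# `A` the Jacobian of the lens mapping at the image, and `ν` the shape
# parameter ν_i. Returns this image's quadrupole term in 2 log P.
function quadrupole_term(S, Qobs, A, ν)
    T = A' * S * A
    return ν / 2 * logdet(ν * T) - log(4π * gamma(ν - 1)) +
           (ν - 3) / 2 * logdet(Qobs) - ν * tr(T * Qobs)
end
```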
Alternatively, we can try to perform a marginalization over the source quadrupole moment $\mathbf{S}$. As usual, this operation can be performed relatively easily using the equations discussed in the source-plane likelihood section.
Time-delay measurements
For point-sources with associated time-delays we can define a log-likelihood function following closely what was done for magnitude measurements. The equations, from a mathematical point of view, are identical and, as for luminous sources, a marginalization over the source time $T$ can be carried out.
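For concreteness, the direct-method term would take the same form as the magnitude expression; a minimal sketch, assuming measured arrival times $\hat t_i$ with precisions $\lambda_i$ and predicted lensing time delays $\Delta t_i$ (notation ours):

\[2 \log P(\{ \hat t_i \} \mid \{ \theta_i \}, T) = \sum_{i=1}^I \biggl[ \log \frac{\lambda_i}{2\pi} - \lambda_i (\hat t_i - \Delta t_i - T)^2 \biggr] \; .\]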