A machine learning photon detection algorithm for coherent x-ray ultrafast fluctuation analysis

X-ray free electron laser experiments have brought unique capabilities and opened new directions in research, such as creating new states of matter or directly measuring atomic motion. One such capability is the use of finely spaced sets of coherent x-ray pulses, which are compared after scattering from a dynamic system at different times. This enables the study of fluctuations in many-body quantum systems on the timescale of the ultrafast pulse durations, but the method has so far been limited to a select number of examples and has required complex and advanced analytical tools. By applying a new methodology to this problem, we have made qualitative advances in three separate areas that will likely also find application in new fields. Compared to the “droplet-type” models, which are typically used to estimate the photon distributions on pixelated detectors to obtain the coherent x-ray speckle patterns, our algorithm achieves an order of magnitude speedup on CPU hardware and two orders of magnitude improvement on GPU hardware. We also find that it retains accuracy in low-contrast conditions, which is the typical regime for many experiments in structural dynamics. Finally, it can predict photon distributions in high average-intensity applications, a regime which up until now has not been accessible. Our artificial intelligence-assisted algorithm will enable a wider adoption of x-ray coherence spectroscopies, both by automating previously challenging analyses and by enabling new experiments that were not feasible without the developments described in this work.


I. INTRODUCTION
The construction and operation of X-ray free electron lasers (XFELs) [1][2][3][4][5] have enabled a great leap towards a deeper understanding of a diverse range of scientific research areas 6, including planetary science 7, astrophysics 8, medicine 9 and molecular chemistry 10. With their unprecedented brightness, short pulse duration, and x-ray wavelengths, new states of matter can be created and studied 11, while dynamics can be monitored, and now controlled, on ultrafast timescales 12.
With the start of high repetition rate next-generation light sources, methods which have so far been challenging will become feasible, such as resonant inelastic x-ray scattering at high time and spectral resolution 13 and x-ray photoemission spectroscopy 14. One such example is X-ray photon correlation spectroscopy (XPCS) 15,16, which uses the spatial coherence of the X-ray beam to produce a scattering 'fingerprint' of the sample. This fingerprint, or speckle pattern, can be correlated in time to directly observe the equilibrium dynamics of a given system. This information on the thermal fluctuations can be related back to the energetics and the interactions in the system. This is typically measured by calculating the intensity-intensity autocorrelation function and extracting the intermediate scattering function $S(q,t)$ (Equation 1),

$$g_2(q,t) = 1 + A\,S(q,t)^2 \qquad (1)$$

which allows the time correlation to be related back to the physical properties of the system being studied.
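As an illustration of the intensity-intensity autocorrelation mentioned above, the following is a minimal sketch that estimates $g_2$ at a single lag from a time series of intensities recorded at one $q$. The function name `g2` and the choice of normalization by the squared mean of the whole series are assumptions made for this example, not the analysis code used in this work.

```python
def g2(intensity, lag):
    """Estimate <I(t) I(t + lag)> / <I>^2 for a 1D intensity time series."""
    n = len(intensity) - lag
    if lag < 0 or n <= 0:
        raise ValueError("lag must be non-negative and shorter than the series")
    mean = sum(intensity) / len(intensity)
    # Average the product of intensities separated by `lag` samples.
    corr = sum(intensity[t] * intensity[t + lag] for t in range(n)) / n
    return corr / mean ** 2


# A static signal is perfectly correlated with itself at every lag.
static = [2.0] * 10
print(g2(static, 3))
```

A signal with no fluctuations returns a value of one at every lag; any excess above one reflects correlated intensity fluctuations.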
Another benefit of these new machines is their ability to produce finely spaced X-ray pulses with controllable delay, using X-ray optics 17 or special modes of the accelerator 18. These pulses enable studies of spontaneous fluctuations at timescales orders of magnitude faster than what is possible using XPCS at x-ray synchrotron facilities, with one key area of application being emergent phenomena in quantum materials. We refer to this multi-pulse adding technique here as X-ray photon fluctuation spectroscopy (XPFS) 19. This is a unique tool that differs from traditional pump-probe spectroscopy, which detects the relaxation from a non-equilibrium state, by instituting more of a 'probe-probe' method, where fluctuations in the equilibrium state can be measured directly by comparing how the system changes between probe pulses. Here, one adds the pulses which are too close together in time to be read out separately by the detector 20 and uses the statistics of the coherent speckle 21 to compute the fluctuation spectra using the contrast 22,23, i.e. the fast dynamical information of the system can be distinguished by studying single photon fluctuations.
Even with the massive amount of photons per pulse, three things typically result in a single photon detection process: the decrease of intensity after the scattering process on a single pulse basis, the short pulse duration, and the sometimes reduced intensity required to ensure excitations are not produced in the sample.
In principle, if the discrete distribution of photon counts over the detector can be accurately measured and enough samples averaged, it is possible to determine the dynamical evolution of the sample by computing the speckle contrast $C(q,t)$ as a function of delay time $t$ and momentum transfer $q$. The contrast is obtained by fitting a negative binomial distribution parameterized by the number of modes $M$, which sets the contrast, and the average number of photons per pixel $\bar{k}$ 24, i.e.

$$P(k; \bar{k}, M) = \frac{\Gamma(k+M)}{\Gamma(M)\,\Gamma(k+1)} \left(\frac{\bar{k}}{\bar{k}+M}\right)^{k} \left(\frac{M}{\bar{k}+M}\right)^{M} \qquad (2)$$

Fitting this negative binomial distribution requires the extraction of photon counts from raw detector images, and works fairly well in the hard x-ray regime and for large pixel size detectors [24][25][26].
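The negative binomial distribution referred to above can be evaluated numerically as in the following minimal sketch. It assumes the standard speckle-statistics form of $P(k; \bar{k}, M)$ with mean $\bar{k}$ and mode number $M$; the function name `negative_binomial_pmf` is illustrative. Working in log space with `math.lgamma` avoids overflow of the Gamma functions for large $M$.

```python
import math


def negative_binomial_pmf(k, k_bar, M):
    """P(k; k_bar, M) for integer k >= 0, mean k_bar > 0 and mode number M > 0."""
    log_p = (
        math.lgamma(k + M) - math.lgamma(M) - math.lgamma(k + 1)
        + k * math.log(k_bar / (k_bar + M))
        + M * math.log(M / (k_bar + M))
    )
    return math.exp(log_p)


# Probability of detecting 0, 1 and 2 photons in a pixel for k_bar = 0.5, M = 4.
probabilities = [negative_binomial_pmf(k, 0.5, 4.0) for k in range(3)]
print(probabilities)
```

For $M = 1$ the distribution reduces to a geometric distribution, which gives a quick sanity check of the implementation.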
In cases where the pixels are small, or the energy of the x-rays is much lower, this process can involve additional obstacles. One challenge is that the point spread function of a single photon can spread non-uniformly over many pixels. This is especially true in the soft x-ray regime, where there can be a large variability in the charge cloud size (owing to variable diffusion lengths within a pixel) and low signal-to-noise ratios. These effects have recently been shown to be corrected by a variational droplet model called the Gaussian Greedy Guess (GGG) droplet model 27, which can fit the large variation in charge cloud radii to produce discrete images where each pixel contains the number of corresponding photons.
While 'droplet-type' models have been largely successful, there is a need to increase the speed of these computational models as well as to handle common scenarios, such as low signal-to-noise.
A few works have employed machine learning techniques to address some of these outstanding challenges. For instance, the use of convolutional neural networks to analyze XPCS data for well-resolved speckles has shown that denoising approaches are able to achieve significantly better signal-to-noise statistics as well as estimations of key parameters of interest [28][29][30]. Previous work has also considered single-photon analysis for hard X-ray detectors using machine learning.
One approach 31 has been to use a TensorFlow computational graph with hand-crafted convolutional masks derived from an in-depth study of photon physics at semiconductor junctions 32. This implementation is extremely fast, but does not apply to regimes where there may be a large number of photons per droplet. Another method 33 proposes a feed-forward neural network architecture, based on a sliding prediction of 5x5 regions of the input image. This was proposed for the photon map prediction task and is shown to be applicable for hard X-ray, low count rate experiments.
However, additional factors such as noise, low photon energies, and insufficient signal-to-noise ratios can cause limitations in this methodology and thus obscure scientific results. Furthermore, in cases where a higher intensity can be measured, the charge clouds can quickly coalesce, making this problem intractable.
In this work, we expand the applicability of this ultrafast method by demonstrating robust single-shot prediction using an AI-assisted algorithm in the soft X-ray regime for data with relatively high average count rates and significant charge sharing. This is carried out using a fully convolutional neural network architecture 34, which we compare against the GGG method, currently the best algorithm for soft X-ray analysis using small pixel-size detectors 27. We find that we are able to access a new phase space of measurement parameters that, until now, has not been accessible in structural dynamics studies using this method. Our algorithm enables a two order of magnitude speedup on appropriate hardware, is relatively accurate for low contrast cases, and is stable at higher intensities than the GGG algorithm. We first describe the machine learning model and the simulator used to train it, specifying the architecture, how the model is trained, and the evaluation metric used. This is followed by our main results and the three areas in which the model was shown to return excellent results relative to the current state-of-the-art models. Finally, we end with a discussion of uncertainty quantification and how one can judge the error for different models.

II. MODELLING AND ANALYSIS APPROACH

A. Simulator Description
One key requirement in the development of supervised machine learning algorithms is a robust simulator which can adequately describe the data. To describe the simulator here, we denote an input XPFS frame as $x_i \in \mathbb{R}^{90\times90}$ and the corresponding output photon map as $p_i \in \mathbb{R}^{30\times30}$. The 3x3 reduction in dimensionality between $x_i$ and $p_i$ is used to mimic the speckle oversampling factor that is typically used in LCLS experiments. The final calculations are performed on a 30x30 image to allow for the proper photon events to be expressed per speckle.
The detector images and corresponding photon maps are simulated according to the exact parameters described in 27, which were tuned to mimic a previous experiment by matching the overall pixel and droplet histogram. To simulate ground truth photon maps, the following ranges were used: $\bar{k} \in [0.025, 2.0]$ and $C(q,t) \in [0.1, 1.0]$. The relevant detector parameters are the probabilities ($w_i$) and sizes ($\sigma_G$) of the photon charge clouds, the variance of the zero-mean Gaussian background detector noise ($\sigma_N$) and the total number of analog-to-digital units (ADUs) per photon.
These parameter values are reproduced below in Table I and an example of a detector image / photon map pair is shown in Figure 1. For comparison, we used the Gaussian Greedy Guess (GGG) algorithm with relevant parameters which were optimized for the specific simulation parameters described above 27.
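To make the structure of such a simulator concrete, the following is a simplified sketch that generates one detector image / photon map pair. It is not the simulator of 27: it assumes a single charge cloud size instead of a weighted mixture, draws speckle intensities from a gamma distribution with $M$ modes followed by Poisson counting (which yields negative binomial photon statistics), and uses placeholder values for `sigma_g`, `sigma_n` and `adu_per_photon` rather than the values of Table I. All function and parameter names are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Poisson sample via Knuth's method; adequate for the small means used here."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def deposit(image, y, x, sigma, adu):
    """Spread one photon's charge over a 5x5 pixel neighbourhood with Gaussian weights."""
    size = len(image)
    cy, cx = int(y), int(x)
    cells = [(i, j) for i in range(cy - 2, cy + 3) for j in range(cx - 2, cx + 3)
             if 0 <= i < size and 0 <= j < size]
    weights = [math.exp(-((i + 0.5 - y) ** 2 + (j + 0.5 - x) ** 2) / (2 * sigma ** 2))
               for i, j in cells]
    total = sum(weights)
    for (i, j), w in zip(cells, weights):
        image[i][j] += adu * w / total  # normalized so the photon's ADU is conserved


def simulate_pair(k_bar, M, sigma_g=0.6, sigma_n=5.0, adu_per_photon=340.0,
                  n_speckle=30, oversample=3, seed=0):
    """Return (detector image, photon map); all default values are placeholders."""
    rng = random.Random(seed)
    size = n_speckle * oversample
    photon_map = [[0] * n_speckle for _ in range(n_speckle)]
    # Start from zero-mean Gaussian background noise in every detector pixel.
    image = [[rng.gauss(0.0, sigma_n) for _ in range(size)] for _ in range(size)]
    for r in range(n_speckle):
        for c in range(n_speckle):
            # Gamma-distributed speckle intensity with mean k_bar, then photon counting.
            n_photons = poisson(rng, rng.gammavariate(M, k_bar / M))
            photon_map[r][c] = n_photons
            for _ in range(n_photons):
                # Each photon lands at a random position inside its 3x3 pixel block.
                y = oversample * r + rng.uniform(0, oversample)
                x = oversample * c + rng.uniform(0, oversample)
                deposit(image, y, x, sigma_g, adu_per_photon)
    return image, photon_map


image, photon_map = simulate_pair(k_bar=0.5, M=4.0)
print(len(image), len(image[0]), len(photon_map), len(photon_map[0]))
```

The pure-Python loops keep the sketch dependency-free; a production simulator would vectorize these steps.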

B. Model Architecture
In this problem, we seek a supervised machine learning model which learns the functional mapping $f : x_i \to p_i$ from $N_f$ paired simulated data points $(x_{i=1:N_f}, p_{i=1:N_f})$. In this case, the functional mapping was chosen to be a U-Net neural network (Figure 2), a fully convolutional autoencoder architecture, which is characterized by having skip-connections between different layers of resolution and has been shown to perform well on various image segmentation tasks 36.
In the schematic in Figure 2, the architecture is outlined via successive "convolutional blocks".
Each such convolutional block consists of two convolutional layers sequentially applied to the input. After each convolution, we utilize batch normalization 37 to ensure robust optimization, followed by a Rectified Linear Unit (ReLU) activation. Convolutional block layers are shown between the input and output images using NN-SVG 38.

C. Training Details and Validation Metrics
To train the model, we use a loss based on the Frobenius norm between the predicted photon maps ($\hat{P}_i$) and the true photon maps ($P_i$). This loss function measures the average squared deviation between the predicted photon map and the true photon map, where the average is taken both within a given frame (which contains $N_p$ pixels) and between frames ($N_f$) in the dataset,

$$L(P, \hat{P}) = \frac{1}{N_f N_p} \sum_{i=1}^{N_f} \lVert P_i - \hat{P}_i \rVert_F^2 \qquad (3)$$

The U-Net model is trained by minimizing $L(P, \hat{P})$ with respect to the model parameters. To train the neural network, we use the following hyperparameters: Adaptive Moment Estimation (ADAM) algorithm for optimization 39, batch size = 128, learning rate = 0.001, and batch normalization. We used NVIDIA A100 GPU hardware with the Keras API 40.
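The loss described above, an average of squared deviations over all pixels and all frames, can be written out explicitly as in the following sketch. It uses nested Python lists in place of tensors, and the function name `mean_squared_loss` is illustrative; in practice a framework's built-in mean squared error would play this role.

```python
def mean_squared_loss(true_maps, predicted_maps):
    """Average squared deviation over N_f frames and N_p pixels per frame."""
    total, count = 0.0, 0
    for true_frame, predicted_frame in zip(true_maps, predicted_maps):
        for true_row, predicted_row in zip(true_frame, predicted_frame):
            for t, p in zip(true_row, predicted_row):
                total += (t - p) ** 2
                count += 1
    return total / count


# One 2x2 frame: two pixels are wrong, by 1 and by 2 photons respectively.
true_maps = [[[0, 1], [2, 0]]]
predicted_maps = [[[0, 0], [2, 2]]]
print(mean_squared_loss(true_maps, predicted_maps))  # (1 + 4) / 4 = 1.25
```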
We performed analysis at low and high count rates ($\bar{k}$) and trained respective models. For the low-$\bar{k}$ data, 100,000 training data points were simulated based on the detector parameters in Table I, uniformly selecting $\bar{k}$ in the range [0.025, 0.2] and $C(q,t)$ over the range [0.1, 1.0]. For the high-$\bar{k}$ analysis, 300,000 data points were used for training, with an equal proportion of data points coming from $\bar{k} \in$ [0.025, 0.2], [1.0, 2.0] and [0.025, 2.0], respectively, and with $C(q,t)$ randomly chosen from the range [0.1, 1.0]. To select between competing trained models, the optimal neural network was selected by maximizing the correlation between the estimated and the true contrast on held-out validation sets of size 5000 for contrast values in the range [0.1, 1.0] with increments of 0.05. Here, it is worth emphasizing that the metric used to evaluate the photonizing task is important. For example, the overall accuracy is not necessarily a good metric since many photon maps have a small number of photons. Therefore, a model which uniformly predicts 0 for each pixel will show an uninformatively high accuracy, which is clearly not the desired performance and will lead to poor statistics. Similar issues have been documented in problems with high class imbalances 41, and correlation-based similarity metrics for evaluation are recommended therein 42.
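The pitfall of per-pixel accuracy can be seen in a toy example. The sketch below builds a sparse, flattened photon map in which 5% of the pixels hold one photon, scores an all-zero predictor, and defines a Pearson correlation that can be applied to lists of true and estimated contrasts. The data and the function names `accuracy` and `pearson` are made up for illustration.

```python
import math


def accuracy(true, predicted):
    """Fraction of pixels whose photon count is predicted exactly."""
    return sum(t == p for t, p in zip(true, predicted)) / len(true)


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


# Sparse toy photon map: one photon in every 20th pixel.
true_pixels = [1 if i % 20 == 0 else 0 for i in range(400)]
all_zero = [0] * 400
print(accuracy(true_pixels, all_zero))  # high, although no photon was found

# Contrast-contrast parity: a predictor returning a constant carries no information.
true_contrast = [0.1, 0.4, 0.7, 1.0]
print(pearson(true_contrast, [0.3, 0.3, 0.3, 0.3]))
```

The all-zero predictor scores well on accuracy while its contrast estimates would have no defined correlation with the truth, which is why a correlation-based metric is the more informative choice here.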
Since our final goal is to obtain a good estimate of the contrast, it is useful to use this information directly in the evaluation metric. For the low-$\bar{k}$ analysis, validation datasets were simulated in the range [0.025, 0.2]. For the high-$\bar{k}$ analysis, two additional datasets corresponding to $\bar{k}$ in the ranges [1.0, 2.0] and [0.025, 2.0] were used. The evaluation metric for this analysis was the average correlation for the three different $\bar{k}$ ranges. An example of a sample validation plot is shown in Appendix A.
Finally, to obtain an estimate of the contrast from the predicted photon maps, we use the maximum likelihood estimation (MLE) procedure on the negative binomial distribution. In general, the negative binomial distribution is a function of both $C(q,t)$ and $\bar{k}$. However, we directly use a per-image estimate for $\bar{k}$ and therefore the MLE procedure reduces to a 1D optimization in $C(q,t)$ 43.
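A one-dimensional maximum likelihood fit of this kind can be sketched as follows. Several simplifications are assumed: a single fixed $\bar{k}$ is used for a pooled histogram of photon counts rather than a per-image estimate, the contrast is related to the mode number through the convention $M = 1/C^2$ (other conventions exist in the literature), and the likelihood is assumed to have a single maximum in the search interval so that a golden-section search applies. Function names are hypothetical.

```python
import math


def nb_log_likelihood(histogram, k_bar, M):
    """Log-likelihood of photon counts {k: number of pixels} under P(k; k_bar, M)."""
    log_likelihood = 0.0
    for k, n in histogram.items():
        log_likelihood += n * (
            math.lgamma(k + M) - math.lgamma(M) - math.lgamma(k + 1)
            + k * math.log(k_bar / (k_bar + M))
            + M * math.log(M / (k_bar + M))
        )
    return log_likelihood


def fit_contrast(histogram, k_bar, c_min=0.05, c_max=1.0, tol=1e-6):
    """Golden-section search for the contrast C maximizing the likelihood at fixed k_bar."""
    def objective(c):
        return nb_log_likelihood(histogram, k_bar, 1.0 / c ** 2)  # assumed M = 1/C^2

    inv_phi = (math.sqrt(5) - 1) / 2
    lo, hi = c_min, c_max
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if objective(a) < objective(b):
            lo = a  # the maximum lies to the right of a
        else:
            hi = b  # the maximum lies to the left of b
    return (lo + hi) / 2


# Idealized histogram: expected pixel fractions for C = 0.5 (M = 4) and k_bar = 0.5.
ideal = {k: math.exp(nb_log_likelihood({k: 1}, 0.5, 4.0)) for k in range(60)}
print(fit_contrast(ideal, 0.5))
```

Fixing $\bar{k}$ in advance is what reduces the fit to a search over a single parameter.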

A. Speed of Inference
As X-ray sources and detectors move towards faster repetition rates nearing 1 MHz, it is important to preserve the possibility of live data analysis. Here, we compare the speed of an optimized GGG droplet algorithm 44 against the trained CNN model (see Table II). On 1 CPU, the CNN outperforms the GGG algorithm by roughly an order of magnitude. This advantage stretches to two orders of magnitude when comparing GGG parallelized across multiple CPU cores to the CNN running on one NVIDIA A100 GPU. The observed speedup presented here is consistent with intuition. At inference, the trained neural network, which consists primarily of matrix multiplication operations, is efficiently parallelized over thousands of GPU processes 45. In contrast, the GGG algorithm requires for-loop operations at the level of each droplet. For this reason, one additional beneficial property of the CNN model is that the prediction rate does not depend on the content of the XPFS frames and is consequently independent of $\bar{k}$. In contrast, for the GGG algorithm, the run time scales linearly with $\bar{k}$ (Figure 3).
Here, it is worth mentioning that the GGG algorithm is already orders of magnitude faster than the Droplet Least Squares algorithm 24, which is exponential in computational complexity.
Finally, since the neural network architecture follows a fully convolutional paradigm, it is possible to make predictions on larger input / detector sizes than those used in the training set. This is enabled by the fact that fully convolutional architectures learn local spatial filters which apply to the full image, making the learning process relatively efficient. For this analysis, the neural network can handle input sizes of ($N_f$, 90a, 90b, 1), where a, b are positive integers and $N_f$ denotes a variable number of frames. Note that this allows for any detector size to be used after zero-padding to the nearest (90a, 90b) frame size. We calculate the average time to make predictions on datasets of dimensionality [100, 90, 90, 1], [100, 270, 270, 1] and [100, 900, 900, 1]. We observe rates of 3.4, 3.1 and 0.3 kHz, respectively. As the rate only decreases by a factor of roughly 10 between a frame size of (90, 90) and (900, 900), it appears that we do not observe the quadratic scaling that would be observed using droplet-based algorithms. Furthermore, the ability to analyze such data in a single-shot manner is a significant advantage over the sliding approach for droplet analysis which has been developed 33. A representative example of the CNN prediction using an input resolution of 270x270 pixels is shown in Figure 4.
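The zero-padding step mentioned above amounts to rounding each frame dimension up to the next multiple of 90. A minimal sketch follows; padding on the bottom and right edges is an assumption of this example (the text does not specify where the padding is placed), and the function names are illustrative.

```python
def padded_shape(height, width, tile=90):
    """Smallest (tile * a, tile * b) shape that contains a height x width frame."""
    return height + (-height % tile), width + (-width % tile)


def zero_pad(frame, tile=90):
    """Pad a 2D frame (list of rows) with zeros on the bottom and right edges."""
    height, width = len(frame), len(frame[0])
    new_height, new_width = padded_shape(height, width, tile)
    padded = [list(row) + [0.0] * (new_width - width) for row in frame]
    padded += [[0.0] * new_width for _ in range(new_height - height)]
    return padded


print(padded_shape(100, 200))  # (180, 270)
```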

B. Accuracy
In this section, we compare the prediction quality of the CNN model against the GGG algorithm on data which simulates LCLS experiments and for which the ground truth contrast is known 27 .
We begin with an analysis of low count-rate ($\bar{k} \in [0.025, 0.2]$) data and subsequently present results for higher count-rate data ($\bar{k} \in [0.2, 2.0]$). To quantify performance, we show parity plots for the predicted and true contrast on datasets with varying contrasts.
At low $\bar{k}$ and high contrasts, the CNN and GGG algorithm both give good predictions for the contrast. It is worth pointing out, however, that in this regime the CNN algorithm systematically underpredicts the contrast and has a slightly larger bias than the GGG algorithm. At low contrast levels, on the other hand, the GGG algorithm exhibits much greater bias than the CNN (Figure 5a). One possible reason for the overall superior performance of the CNN is that it takes into account variation in photon charge cloud sizes during training. Here, it is worth emphasizing that even after optimizing droplet parameters on simulated data with known detector parameters, the GGG algorithm still exhibits large bias for lower contrast values, indicating that the algorithm may not have the complexity required to fully treat such data.
As $\bar{k}$ increases, the CNN performance decreases on an average photon map accuracy basis (Figure 5c). To further examine the CNN errors, we clipped the output photon map to the range of [0, 8] (i.e. no photon map has more than eight photons or fewer than zero photons) and analyzed the confusion matrix ($CF_{i,j}$) of the predictions (Figure 5d). The diagonal of the confusion matrix represents per-class accuracy. For instance, $CF_{2,2}$ represents the accuracy of prediction for pixels containing two photons. By examining the diagonal elements, it is clear that the CNN model makes a greater proportion of errors for higher photon counts; note that the trend does not hold for the eight photon event due to the clipping operation. The off-diagonals of the confusion matrix indicate how the model makes errors. For example, the $CF_{3,6}$ term indicates the probability of the model assigning three photons to a pixel when the true number of photons was actually equal to six. From these elements, we see that the CNN tends to systematically under-predict high photon events. Taken together, these observations suggest that there is a dataset imbalance issue owing to the fact that low photon events are more probable in the training set.
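The confusion matrix convention used above (first index for the predicted count, second index for the true count, normalized per true class) can be sketched as follows. Rounding the predictions to the nearest integer before clipping is an assumption of this example, as is the function name `confusion_matrix`.

```python
def confusion_matrix(true_counts, predicted_counts, max_photons=8):
    """CF[i][j]: probability of predicting i photons when the true count is j."""
    n = max_photons + 1
    counts = [[0] * n for _ in range(n)]
    for t, p in zip(true_counts, predicted_counts):
        # Round to integer photon numbers and clip to [0, max_photons].
        t = min(max(int(round(t)), 0), max_photons)
        p = min(max(int(round(p)), 0), max_photons)
        counts[p][t] += 1  # row = predicted, column = true
    column_totals = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    return [[counts[i][j] / column_totals[j] if column_totals[j] else 0.0
             for j in range(n)] for i in range(n)]


true_counts = [6, 6, 6, 6, 2, 2, 0, 12]
predicted_counts = [3, 6, 6, 5, 2, 2, 0, 9]
cf = confusion_matrix(true_counts, predicted_counts)
print(cf[3][6], cf[6][6])  # 0.25 0.5
```

In this toy data the counts of 12 and 9 both fall into the clipped eight-photon class, which shows how clipping can inflate the apparent accuracy of the last class.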
Although the CNN is less accurate on a per-photon map basis at high $\bar{k}$ (relative to its low-$\bar{k}$ performance), this does not necessarily imply inferior contrast predictions. In fact, the parity plots are similar for the low- and high-$\bar{k}$ cases (Figure 5b). This observation stems from the trade-off between information content and accuracy at high $\bar{k}$ (Appendix B). For a fixed dataset size, it is harder to estimate photon counts correctly, but the counts carry significantly more information about the unknown $C(q,t)$ parameter. In Figure 6, we quantify the performance of the CNN model and the GGG algorithm at different $\bar{k}$ ranges using the correlation in the contrast-contrast parity plot as our metric. It is evident that the GGG algorithm is slightly biased across all $\bar{k}$ levels and performs poorly for $\bar{k}$ > 2.0. This is unsurprising, as droplet-based algorithms were designed to cope with small droplets with relatively few overlapping charge clouds. Furthermore, this suggests that, with further development, the CNN algorithm will be capable of handling large-$\bar{k}$ data sets.

C. Uncertainty Quantification
In this section, we consider a neural network ensemble approach to quantify the uncertainty in the predicted photon maps and contrasts. The motivation for such an analysis is that, while deep learning models have exhibited significant successes in their application to scientific problems, they have a tendency to engender overconfident predictions that may be inexact. As an example, neural networks are unable to recognize Out Of Distribution (OOD) instances and habitually make erroneous predictions for such cases with high confidence [46][47][48]. In reliability-critical tasks, such errors and uncertainties in model predictions have led to undesirable outcomes [49][50][51]. In this context, quantifying the uncertainties in deep learning model predictions is highly desirable.
There are two sources of predictive uncertainty that need to be considered: epistemic and aleatoric. Epistemic uncertainty 52 (reducible or subjective uncertainty) arises due to lack of knowledge regarding the dynamics of the system under consideration, or an inability to express the underlying dynamics accurately using models. Epistemic uncertainties can lead to biases in the predictions. Aleatoric uncertainty 52 (irreducible or stochastic uncertainty) arises due to noise in the training data, projection of data onto a lower-dimensional space, absence of important features, etc. Aleatoric sources can lead to variances in the predictions.
For our analysis, we use an ensemble of neural networks to make a point prediction of the contrast $C(q,t)$ as well as to give an estimate of statistical uncertainty. This is in line with model-ensembling-based uncertainty quantification (UQ) methods validated in the literature 53,54. Such ensembling accounts for aleatoric uncertainties due to the data and weight uncertainties. In our investigation, the neural network ensemble is formed via sequential sampling, wherein ten partially decorrelated models were sampled during the model training. Contiguous samples were spaced by ten optimization epochs each. The contrast is calculated for each model via a maximum likelihood procedure and the contrast point prediction is taken as the median predicted value. To estimate the model uncertainty, we provide a 95% contrast prediction interval using the standard deviation of the predicted contrasts and making the assumption that the predictive distribution follows a t-distribution with nine degrees of freedom (Figure 7).
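The point prediction and interval described above can be sketched for one dataset as follows. The interval is taken here as the median plus or minus the t critical value times the sample standard deviation; whether an additional factor enters the interval in the original analysis is not stated, so this form is an assumption, as are the made-up contrast values and the function name.

```python
import statistics

# Two-sided 95% critical value of Student's t-distribution with 9 degrees of freedom.
T_CRITICAL_DOF9 = 2.262


def ensemble_interval(contrasts):
    """Return (median, lower, upper) for contrasts predicted by a ten-member ensemble."""
    if len(contrasts) != 10:
        raise ValueError("the critical value above assumes exactly ten ensemble members")
    center = statistics.median(contrasts)
    half_width = T_CRITICAL_DOF9 * statistics.stdev(contrasts)
    return center, center - half_width, center + half_width


# Hypothetical contrast predictions from ten ensemble members.
predicted = [0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50]
print(ensemble_interval(predicted))
```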
We see that the error bars are larger at lower contrasts and correctly capture the notion that the prediction task is harder at lower contrasts 44. We also notice a systematic bias in the CNN models at high contrasts. This bias may arise from epistemic uncertainties due to the model form (structural uncertainty). Such structural uncertainties in deep learning models cannot be accounted for by any extant approach, including the procedure used in this investigation. However, this indicates a need for more refinement of the present approach (for instance, more fine-grained optimization of the model architecture) and will be explored more fully in future work.
Another interesting avenue is to look at the median predicted photon map and the predicted standard deviation map to examine where the neural networks lack consensus. An example of this pair of outputs is shown in Figure 8. Evidently, the ensemble predicts insignificant uncertainty for the majority of the image, with the exception of a few pixels with relatively high uncertainty.
An interesting future strategy could involve using the CNN model as a fast, initial approach and subsequently running more complex fitting algorithms on regions of the image with high predicted uncertainty.
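A sketch of how the per-pixel median and standard deviation maps could be computed from an ensemble, and how pixels with high disagreement could be flagged for a slower follow-up fit, is given below. The threshold value and the function names are hypothetical, and nested lists stand in for image arrays.

```python
import statistics


def ensemble_maps(predictions):
    """Per-pixel median and sample standard deviation over a list of photon maps."""
    rows, cols = len(predictions[0]), len(predictions[0][0])
    median_map = [[statistics.median([p[r][c] for p in predictions])
                   for c in range(cols)] for r in range(rows)]
    std_map = [[statistics.stdev([p[r][c] for p in predictions])
                for c in range(cols)] for r in range(rows)]
    return median_map, std_map


def uncertain_pixels(std_map, threshold):
    """Coordinates of pixels where the ensemble members disagree strongly."""
    return [(r, c) for r, row in enumerate(std_map)
            for c, s in enumerate(row) if s > threshold]


# Three hypothetical ensemble members predicting a 1x2 photon map.
predictions = [[[0, 1]], [[0, 1]], [[0, 3]]]
median_map, std_map = ensemble_maps(predictions)
print(median_map, uncertain_pixels(std_map, threshold=0.5))
```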

IV. CONCLUSIONS
In this work, we have developed a convolutional neural network architecture which is capable of analysing single-photon X-ray speckle data in non-optimal situations, such as for small pixel size detectors or with soft x-ray energies. We have benchmarked this algorithm on realistic simulated data and found that it outperforms the conventional Gaussian Greedy Guess (GGG) droplet algorithm in terms of speed and computational complexity. Furthermore, the algorithm is able to extract the contrast information for new ranges that were previously inaccessible, such as low contrast, which is relevant for systems that scatter weakly, as well as the high-$\bar{k}$ regime. Both of the latter developments will create new opportunities to study fluctuations using XPFS in novel systems, such as quantum or topological materials.

FIG. 2. A schematic for the U-Net neural network developed here for single photon counting detection. The input is given by the 90x90 detector image, with the output shown as the resultant 30x30 speckle photon map.

FIG. 3. Time to make predictions on 2000 XPFS frames for the CNN vs. the GGG algorithm as a function of $\bar{k}$. Error bars are obtained from the standard error in the slope of the linear regression fit. The CNN exhibits constant scaling with $\bar{k}$ while the GGG scaling is observed to be linear.

FIG. 4. (a) A larger input detector image with 270x270 pixels and (b) the corresponding predicted photon map (90x90 pixels). The CNN is able to make predictions on larger inputs than it was trained on.

FIG. 5. Predicted contrast versus the true contrast for the CNN and GGG algorithms for (a) $\bar{k} \in [0.025, 0.2]$ and (b) $\bar{k} \in [0.025, 2.0]$. Each datapoint corresponds to prediction on a dataset of 2000 datapoints and subsequent maximum likelihood estimation for the contrast parameter. Notably, the CNN model exhibits much smaller bias than the GGG algorithm at low contrasts and high $\bar{k}$. (c) Average accuracy as a function of contrast level for three separate $\bar{k}$ ranges. Error bars are obtained from the standard error in the slope of the linear regression fit. (d) Confusion matrices for CNN predictions on testing data with $\bar{k} \in [0.025, 2.0]$. The CNN model makes a higher proportion of errors on high photon count events and asymmetrically under-predicts high photon events.

FIG. 6. Contrast-contrast parity correlation for datasets generated using different $\bar{k}$ ranges. Note that the ground truth photon maps do not have perfect correlation due to finite sampling statistics.

FIG. 7. Contrast-contrast parity plot for data in the $\bar{k}$ range of [0.025, 2.0] using a vertical neural network ensemble. The contrast point prediction is obtained from the median contrast prediction and the 95% prediction intervals follow from assumed t-distribution statistics.
FIG. 8. CNN ensemble used to visualize (a) the median prediction photon map and (b) the standard deviation prediction in the photon map. This analysis allows users to visualize the regions of the image where the photon assignment may be challenging.

TABLE I. Simulation parameters used to generate detector images based on previous XPFS data collected by Seaberg et al. at the iron L-edge for magnetic scattering 35.

TABLE II. Speed comparison between the CNN and the GGG algorithm. Rates are reported for a prediction on 1000 XPFS shots. Using a CNN deployed on GPU hardware yields a speedup of two orders of magnitude relative to a multi-CPU droplet implementation.