Progress on Statistical Issues in Searches

Abstracts for Contributed Talks and Posters

Total number of Abstracts: 15

Presenter Name Email Title Brief Abstract
Chattopadhyay, Ishanu ishanu.chattopadhyay@cornell.edu Semantic Circuits We introduce the theory of semantic circuits, aimed at efficient algorithmic manipulation of statistical information contained in observed time-series. Key notions from formal languages, information theory, infinite Abelian groups, and stochastic processes are brought together to formulate a rich mathematical theory of semantic information manipulation. The developed information processing algorithms are realized as logical machines, composed of complex interconnections between copies of a few basic or atomic sub-units; thus reflecting a strong analogy with circuit theory. A few basic elements are discussed, such as the semantic copy, inversion, and annihilation. This opens up a new approach to model-free disambiguation and classification of complex stochastic processes, to be carried out without a priori knowledge of the causal process, and without even attempting to infer the underlying model directly. In particular, we answer the following statistical questions: 1) Given two finitely observed traces, is it possible to quantify and estimate the similarity, or the lack thereof, between the underlying hidden stochastic processes? 2) Given a set of finite traces, is it possible to cluster them based on their underlying causal generators? Our ideas are predicated on the ability of probabilistic automata over finite alphabets to model a wide class of quantized stochastic processes. We show that, for a large class of statistical processes, these questions can be answered simply via semantic manipulation of quantized versions of the observation sets; we never have to infer the underlying model, we only have to assume that one exists. Our approach is in sync with the emerging school of thought championing model-free inference and decision-making. The ability to disambiguate causally significant changes in the underlying generator from noise artifacts, and the normal randomness of stochastic dynamics, will allow model-free anomaly detection. In the context of analyzing data from astronomical observations, such model-free causal clustering can lead to the potential discovery of new kinds of astronomical objects, or “variables”, via identification of incipient anomalies in the observed light curves. Additionally, the principles established here directly address key issues in digital communication and coding theory, and open up new research avenues aimed at developing semantic codes with potentially high resilience to data corruption.
Cowan, Glen g.cowan@rhul.ac.uk Two developments in tests for discovery: use of weighted Monte Carlo events and an improved measure In particle physics a search for a new signal process often takes the form of a frequentist statistical test based on observed numbers of collision events. In general the test makes reference to the numbers of events expected under different hypotheses, e.g., that they were produced by certain background processes, and in some cases these numbers are estimated using weighted Monte Carlo events. Methods for incorporating such events into a statistical test are examined and the properties of several approaches are investigated with numerical examples. In addition, an approximate expression is derived for the expected discovery significance for a Poisson counting experiment in which the background rate is constrained by a Poisson control measurement. The new expression is validated against Monte Carlo results and compared with other formulae for expected significance often used in particle physics.
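
For orientation, a minimal Python sketch of the standard Asimov-style approximation for the median discovery significance of a counting experiment with a known background; the abstract's extension to a background constrained by a Poisson control measurement, and the treatment of weighted Monte Carlo events, are not reproduced here.

    import numpy as np

    def asimov_significance(s, b):
        """Approximate median discovery significance for s expected signal
        events on top of b expected background events (background known)."""
        return np.sqrt(2.0 * ((s + b) * np.log(1.0 + s / b) - s))

    # Example: 10 signal events over 100 background events.
    print(asimov_significance(10.0, 100.0))   # ~0.98, versus the naive s/sqrt(b) = 1.0
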
Fung, Russell russellfung@mac.com Towards Extracting Dynamics below Timing Jitter by Manifold-based Machine Learning R. Fung [1], J.M. Glownia [2], J. Cryan [2], A. Natan [2], P.H. Bucksbaum [2], and A. Ourmazd [1] [1] University of Wisconsin, 1900 E. Kenwood Blvd, Milwaukee, WI 53211, USA [2] PULSE, SLAC National Accelerator Laboratory, 2575 Sand Hill Rd, Menlo Park, CA 94025, USA We report an on-going effort to extract dynamics in the presence of substantial pump-probe timing jitter due to the SASE nature of unseeded X-ray Free Electron Lasers. Our preliminary results indicate that it may be possible to reconstruct processes with periods as short as 3fs from data obtained with ~300fs jitter. The data, collected at the LCLS, consist of noisy time-of-flight spectra of the fragments of photodissociated N2 molecules in the presence of a nonionizing, strong-field dressing optical laser. Manifold-embedding techniques are first used to determine the intrinsic manifold spanned by the data and thus the fundamental degrees of freedom open to the system. Nonlinear singular-value decomposition on the intrinsic manifold is then employed to determine the specific dynamics followed by the interaction of the molecules with the laser fields on the fs timescale. Our approach may open the way to extracting dynamics from large, noisy datasets at time-resolutions up to two orders of magnitude below pump-probe timing jitter.
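
As an illustration of a manifold-embedding step, a minimal diffusion-map sketch in Python; the authors' specific embedding and the nonlinear SVD on the manifold are not reproduced, and the kernel scale eps is an invented parameter.

    import numpy as np

    def diffusion_map(X, eps, n_components=2):
        """X: one observation (e.g., a time-of-flight spectrum) per row.
        Returns low-dimensional coordinates spanning the intrinsic manifold."""
        d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
        K = np.exp(-d2 / eps)                                       # Gaussian affinities
        P = K / K.sum(axis=1, keepdims=True)                        # row-stochastic Markov matrix
        evals, evecs = np.linalg.eig(P)
        order = np.argsort(-evals.real)
        # Drop the trivial constant eigenvector; keep the next few coordinates.
        return evecs.real[:, order[1:n_components + 1]]

    X = np.random.rand(200, 50)          # placeholder for 200 noisy spectra
    coords = diffusion_map(X, eps=5.0)   # 2D embedding of the dataset
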
Kashyap, Vinay vkashyap@cfa.harvard.edu Challenges to Source Detection in High-Energy Astronomical Images High-energy astronomy data usually consist of lists of photons that are registered in a detector. Each photon can be located in a 4-dimensional space of sky position (two coordinates), energy, and arrival time. This imposes several challenges on the process of detecting X-ray emitters in the data. I will describe the history of source detection in X-ray images, and point out various instrumental and astronomical effects that play a prominent role in limiting the efficiency of the process. With the accumulation of high-quality imaging data from the Chandra and XMM observatories, problems previously ignored have risen to the forefront. I will illustrate these issues (multiple mosaic observations, upper limits, overlapping and extended sources) with examples and point out tactics that may be brought to bear to analyze such cases.
Lipson, Hod hod.lipson@cornell.edu Machine Science: Distilling Natural Laws from Experimental Data, from nuclear physics to biology Can machines discover scientific laws automatically? For centuries, scientists have attempted to identify and document analytical laws that underlie physical phenomena in nature. Despite the prevalence of computing power, the process of finding natural laws and their corresponding equations has resisted automation. This talk will outline a series of recent research projects, starting with self-reflecting robotic systems, and ending with machines that can formulate hypotheses, design experiments, and interpret the results, to discover new scientific laws. While the computer can discover new laws, will we still understand them? Our ability to have insight into science may not keep pace with the rate and complexity of automatically-generated discoveries. Are we entering a post-singularity scientific age, where computers not only discover new science, but now also need to find ways to explain it in a way that humans can understand? We will see examples from psychology to cosmology, from classical physics to modern physics, from big science to small science.
Loh, Duane duaneloh@slac.stanford.edu Effects of extraneous noise in Cryptotomography (reconstructions from random, unoriented tomograms) X-ray pulses produced by free-electron lasers can be focused to produce high-resolution diffraction signal from single nanoparticles before the onset of considerable radiation damage [1]. These 2D diffraction patterns are inherently noisy and have no direct means of signal-averaging because the particles themselves are currently injected at random, unknown 3D orientations into the particle-radiation interaction region. Simulations have successfully recovered 3D reconstructions from such remarkably noisy and fully unoriented 2D diffraction data [2]. However, actual experimental data [3] show that extraneous noise (either from background scattering or detector noise) can limit the resolution of the reconstruction or even jeopardize reconstruction attempts. We study the second and more severe of these two effects through a simplified version of this reconstruction problem. A straightforward consideration of conditional probabilities [2, 4] can help define when the extraneous noise overwhelms reconstruction attempts. Nevertheless, an ensemble of data with considerable numbers of bright fluctuations may still reconstruct successfully. Incidentally, we also extend a specialized reconstruction algorithm [2, 4] to recover distinct sub-species within an ensemble of illuminated samples. We expect our simplified simulations to provide insights that would have taken considerably longer to develop when restricted to the full 3D reconstruction problem. [1] R. Neutze, et al. Nature 406, 752-757 (2000). [2] N.D. Loh, V. Elser. Phys. Rev. E 80, 026705 (2009). [3] N.D. Loh, et al. Phys. Rev. Lett. 104, 225501 (2010). [4] V. Elser. IEEE Trans. Inf. Theory 55(10), 4715-4722 (2009).
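
A minimal sketch of the conditional-probability step referred to above [2, 4], assuming for illustration a small set of candidate orientations with known model intensities; the full expand-maximize-compress reconstruction is not reproduced.

    import numpy as np

    def orientation_posteriors(K, W):
        """K: photon counts per detector pixel for one noisy pattern, shape (npix,).
        W: model intensities for each candidate orientation, shape (nrot, npix).
        Returns P(orientation | pattern) assuming Poisson noise and a flat prior."""
        # log P(K | r) = sum_pix (K log W - W), up to K-dependent constants
        logL = K @ np.log(W).T - W.sum(axis=1)
        logL -= logL.max()                 # stabilize before exponentiating
        p = np.exp(logL)
        return p / p.sum()

    W = np.random.rand(60, 1024) + 0.01    # toy intensity models for 60 orientations
    K = np.random.poisson(W[17])           # a pattern generated from orientation 17
    print(orientation_posteriors(K, W).argmax())   # typically recovers 17
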
Lopez-Caniego, Marcos caniego@ifca.unican.es Biparametric adaptive filter: detection of compact sources in complex microwave backgrounds We consider the detection of compact sources in maps of the Cosmic Microwave Background following the philosophy behind the Mexican hat wavelet family (MHWn) of linear filters. We present a new analytical filter, the biparametric adaptive filter (BAF), that is able to adapt itself to the statistical properties of the background as well as to the profile of the compact sources, maximizing the amplification and improving the detection process. We have tested the performance of this filter using realistic simulations of the microwave sky between 30 and 857 GHz as observed by the Planck satellite, where complex backgrounds can be found. We demonstrate that doing a local analysis on flat patches allows one to find a combination of the optimal scale of the filter R and the index of the filter g that will produce a global maximum in the amplification, enhancing the signal-to-noise ratio (SNR) of the detected sources in the filtered map and improving the total number of detections above a threshold. The improvement of the new filter in terms of SNR is particularly important in the vicinity of the Galactic plane and in the presence of strong Galactic emission.
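
For orientation, a minimal sketch of the plain Mexican hat (Laplacian-of-Gaussian) filtering step that the MHWn family builds on; the BAF's second shape parameter and its local optimization over flat patches are not reproduced, and R is simply an assumed filter scale in pixels.

    import numpy as np
    from scipy.ndimage import gaussian_laplace

    def mexican_hat_filter(patch, R):
        """Filter a flat sky patch at scale R (pixels); the sign flip keeps
        point-like peaks positive in the filtered map."""
        return -gaussian_laplace(patch, sigma=R)

    patch = np.random.normal(size=(256, 256))   # toy background patch
    patch[128, 128] += 50.0                     # a bright compact "source"
    filtered = mexican_hat_filter(patch, R=2.0)
    print(np.unravel_index(filtered.argmax(), filtered.shape))   # peaks at/near (128, 128)
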
Loredo, Thomas loredo@astro.cornell.edu Search and discovery in a population context Astronomers undertake surveys of the sky, not merely to discover or monitor individual objects considered in isolation, but also to understand cosmic populations as aggregates. For example, galaxy surveys may seek to measure the luminosity function of different galaxy types---the distribution of luminosities, as a function of distance or redshift---which reflects the physics of galaxy formation and evolution. Stellar surveys may seek to understand the stellar luminosity or initial mass function, to probe star formation and its dependence on cosmic environment. Statistically, object discovery and population inference are coupled tasks; an optimal analysis cannot treat them separately. I will discuss statistical approaches to object discovery that attempt to integrate the object and population levels of statistical inference. My emphasis will be on the Bayesian multilevel modeling (hierarchical Bayes) viewpoint, but I will also touch on empirical Bayes and (frequentist) multiple testing viewpoints.
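
A minimal sketch of the simplest multilevel (normal-normal) setting, with empirical-Bayes shrinkage of per-object flux estimates toward the population mean; the population parameters and noise level below are invented for illustration and stand in for a real survey's selection and measurement model.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    pop_mu, pop_sig = 10.0, 2.0                       # population-level parameters
    true_flux = rng.normal(pop_mu, pop_sig, n)        # object-level latent values
    noise = 3.0
    obs = true_flux + rng.normal(0.0, noise, n)       # one noisy measurement per object

    # Empirical Bayes: estimate the population from the ensemble of measurements,
    # then shrink each object's estimate toward the population mean.
    mu_hat = obs.mean()
    var_hat = max(obs.var(ddof=1) - noise**2, 1e-12)  # method-of-moments population variance
    shrink = var_hat / (var_hat + noise**2)
    posterior_mean = mu_hat + shrink * (obs - mu_hat)

    print(np.mean((obs - true_flux) ** 2))            # raw estimates: MSE ~ noise^2 = 9
    print(np.mean((posterior_mean - true_flux) ** 2)) # shrunk estimates: noticeably smaller
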
Lubow, Steve lubow@stsci.edu Hubble Source Catalog We (Tamas Budavari and I) have created an initial catalog of objects observed by the WFPC2 and ACS instruments on the Hubble Space Telescope (HST). The catalog is based on observations taken on more than 6000 visits (telescope pointings) of ACS/WFC and more than 25000 visits of WFPC2. The catalog is obtained by cross-matching, by sky position, all Hubble Legacy Archive (HLA) Source Extractor source lists for these instruments. The source lists describe properties of source detections within a visit. The calculations are performed on a SQL Server database system. First we collect overlapping images into groups, e.g., Eta Car, and determine nearby (approximately matching) pairs of sources from different images within each group. We then apply a novel algorithm for improving the cross-matching of pairs of sources by adjusting image orientations. Next, we combine pairwise matches into maximal sets of possible multi-source matches. We apply a greedy Bayesian method to split the maximal matches into more reliable matches. We test the accuracy of the matches by comparing the fluxes of the matched sources. The result is a set of information that ties together multiple observations of the same object. A byproduct of the catalog is greatly improved relative astrometry of many of the HST images. We also provide information on nondetections that can be used to determine dropouts. With the catalog, for the first time, one can carry out time domain, multi-wavelength studies across a large set of HST data. The catalog will be made publicly available. Much more can be done to expand the catalog capabilities.
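
A minimal sketch of the pairwise positional matching step only, on made-up tangent-plane coordinates; the orientation adjustment, the maximal-match construction, and the greedy Bayesian splitting described above are not reproduced.

    import numpy as np
    from scipy.spatial import cKDTree

    def match_pairs(xy1, xy2, radius):
        """xy1, xy2: (N, 2) arrays of tangent-plane positions in arcsec.
        Returns index pairs (i, j) with separation <= radius arcsec."""
        tree = cKDTree(xy2)
        pairs = []
        for i, p in enumerate(xy1):
            for j in tree.query_ball_point(p, r=radius):
                pairs.append((i, j))
        return pairs

    xy1 = np.random.uniform(0, 100, size=(500, 2))
    xy2 = xy1 + np.random.normal(scale=0.05, size=xy1.shape)  # same sources, jittered
    print(len(match_pairs(xy1, xy2, radius=0.2)))             # ~500 matches
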
Lyons, Louis l.lyons@physics.ox.ac.uk p1 versus p0 plots We often want to compare experimental data with two competing hypotheses H0 and H1 (e.g. the Standard Model of Particle Physics, and exciting New Physics, respectively). A data statistic t is chosen, and p-values p0 and p1 are defined as the probabilities under H0 and H1 respectively of obtaining a value of t as extreme as the observed one, or more so. Plots of p1 versus p0 are very useful for understanding many topics that arise in trying to choose between H0 and H1.
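
A minimal sketch, assuming the test statistic t is a unit-width Gaussian centered at 0 under H0 and at mu under H1, with large t favoring H1; sweeping the observed t then traces out the p1 versus p0 curve.

    import numpy as np
    from scipy.stats import norm
    import matplotlib.pyplot as plt

    mu = 3.0                                  # assumed separation of the two hypotheses
    t_obs = np.linspace(-4, mu + 4, 400)
    p0 = norm.sf(t_obs)                       # right-tail p-value under H0
    p1 = norm.cdf(t_obs - mu)                 # left-tail p-value under H1

    plt.plot(p0, p1)
    plt.xlabel("p0"); plt.ylabel("p1")
    plt.title("p1 versus p0 for two Gaussian hypotheses separated by 3 sigma")
    plt.savefig("p1_vs_p0.png")
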
Morgan, Adam amorgan@astro.berkeley.edu Rapid, Machine-Learned Resource Allocation: Application to High-redshift GRB Follow-up As the number of observed Gamma-Ray Bursts (GRBs) continues to grow, follow-up resources need to be used more efficiently in order to maximize science output from limited telescope time. As such, it is becoming increasingly important to rapidly identify bursts of interest as soon as possible after the event, before the afterglows fade beyond detectability. Studying the most distant (highest redshift) events, for instance, remains a primary goal for many in the field. Here we present our Random forest Automated Triage Estimator for GRB redshifts (RATE GRB-z) for rapid identification of high-redshift candidates using early-time metrics from the three telescopes onboard Swift. While the basic RATE methodology is generalizable to a number of resource allocation problems, here we demonstrate its utility for telescope-constrained follow-up efforts with the primary goal to identify and study high-z GRBs.
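
A minimal sketch of the classification core, using scikit-learn's random forest on made-up features and labels; the actual Swift early-time metrics and the RATE GRB-z training set are not reproduced here.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(300, 5))                               # placeholder early-time metrics
    y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 1).astype(int)   # toy "high-z" label

    clf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
    clf.fit(X_train, y_train)

    X_new = rng.normal(size=(3, 5))                # newly triggered bursts
    print(clf.predict_proba(X_new)[:, 1])          # rank by P(high-z); follow up the top-ranked bursts
    print(clf.oob_score_)                          # out-of-bag estimate of classification accuracy
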
Rolke, Wolfgang wolfgang.rolke@upr.edu Estimating a Signal In the Presence of an Unknown Background We describe a method for fitting distributions to data which only requires knowledge of the parametric form of either the signal or the background but not both. The unknown distribution is fit using a non-parametric kernel density estimator. The method returns parameter estimates as well as errors on those estimates. Simulation studies show that these estimates are unbiased and that the errors are correct.
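
A minimal sketch in the same spirit, alternating between a parametric (Gaussian) signal and a background shape re-estimated with a weighted kernel density estimator; this is a generic semi-parametric illustration on invented toy data, not the author's fitting procedure, and its convergence and error estimates are not examined here.

    import numpy as np
    from scipy.stats import norm, gaussian_kde

    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(5.0, 0.3, 200),       # parametric signal
                           rng.exponential(4.0, 2000)])     # "unknown" background

    frac, mu, sig = 0.2, 4.5, 0.5                            # starting values
    bkg = gaussian_kde(data)                                 # initial background shape
    for _ in range(20):
        s_dens = frac * norm.pdf(data, mu, sig)
        b_dens = (1.0 - frac) * bkg(data)
        r = s_dens / (s_dens + b_dens)                       # P(signal | x_i)
        frac = r.mean()
        mu = np.average(data, weights=r)
        sig = np.sqrt(np.average((data - mu) ** 2, weights=r))
        bkg = gaussian_kde(data, weights=1.0 - r)            # re-estimate the background shape
    print(frac * data.size, mu, sig)                         # compare with the 200 injected events at mean 5.0
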
Snyder, Arthur snyder@slac.stanford.edu Trials "factor" in "bump" hunts This talk presents a method for estimating the trials factor for a classic "bump" hunt in which a distribution is scanned for enhancements by fitting for a narrow signal at a set of scan points. When scan points are well separated, each point can be considered a trial for which the chance of an apparently significant fluctuation must be counted. In practice we want to scan in small steps, to avoid missing any "bumps", and the observations at different points will be correlated. This correlation must be taken into account when estimating the effective number of trials. The basic procedure is to treat the scan points as a vector space in which the correlation between points is interpreted as the inner (dot) product. A simple diagonalization can then produce a space in which the new coordinates are uncorrelated. A "toy" Monte Carlo done in the diagonalized space can be transformed back to the original scan-point space, and the number of times the simulated scan points exceed the significance of interest is counted. The method does not directly speed up the Monte Carlo generation of "toy" experiments, but eliminates the need to scan each experiment. In a complex fit, eliminating the scanning can yield substantial savings in the CPU resources needed. The heart of the method is the estimation of the correlations between scan points. In simple cases this can be done as sums based on the signal and background PDFs. The calculation is illustrated in the case of a recent claim of a narrow peak at ~130 GeV in Fermi Space Telescope data.
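
A minimal sketch of the toy-generation step, with an invented Gaussian-shaped correlation matrix standing in for the one computed from the signal and background PDFs; the diagonalization gives independent coordinates, and rotating back produces toy scans with the right correlations without any per-experiment fitting.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)

    # Invented correlation matrix: nearby scan points are strongly correlated.
    npts = 100
    x = np.arange(npts)
    C = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 5.0) ** 2)

    evals, evecs = np.linalg.eigh(C)                 # diagonalize the correlations
    A = evecs * np.sqrt(np.clip(evals, 0.0, None))   # so that C = A @ A.T

    z_local = 3.0                                    # observed local significance
    g = rng.standard_normal((50_000, npts))          # independent toys in the diagonal basis
    z = g @ A.T                                      # correlated toy scans
    p_global = np.mean(z.max(axis=1) >= z_local)
    print(p_global, p_global / norm.sf(z_local))     # global p-value and effective trials factor
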
Tenenbaum, Peter peter.tenenbaum@nasa.gov Detection of Transiting Planet Candidates in Kepler Mission Data The Kepler Mission dataset includes photometric measurements made at 30 minute intervals of over 150,000 stars, yielding over 50,000 measurements per target star. The photometric time series are searched for periodic reductions in intensity which indicate a potential transiting planet, with a required sensitivity of approximately 4 standard deviations per transit in order to detect Earth-size planets orbiting Sun-like stars. In addition to sensitivity, the search must meet requirements on throughput, false positive rejection, and true positive retention in order to meet the mission goals. We briefly review the details of the Kepler Mission design which impact the search for transiting planet signatures, followed by a description of the principal challenges in areas of sensitivity, throughput, false-positive rejection, and true positive retention, and the current and future algorithmic features which allow the search algorithm to meet these requirements.
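
A minimal sketch of the underlying detection statistic: fold a detrended light curve at a trial period and compute the signal-to-noise of a box-shaped dip at each trial epoch. The Kepler pipeline's wavelet-based whitening, full period search, and false-positive vetoes are not reproduced, and the toy numbers below are invented.

    import numpy as np

    def box_search_snr(t, flux, period, duration, sigma):
        """Best (epoch, SNR) for a box-shaped transit of the given duration in a
        light curve folded at `period`; sigma is the per-point noise level."""
        phase = t % period
        best = (None, -np.inf)
        for phi0 in np.arange(0.0, period, duration / 3.0):      # trial transit centers
            dphi = np.abs((phase - phi0 + period / 2) % period - period / 2)
            in_transit = dphi < duration / 2
            n = in_transit.sum()
            if n == 0:
                continue
            depth = flux[~in_transit].mean() - flux[in_transit].mean()
            snr = depth / (sigma / np.sqrt(n))
            if snr > best[1]:
                best = (phi0, snr)
        return best

    # Toy light curve: 90 days at 30-minute cadence with a 0.1%-deep transit.
    t = np.arange(0.0, 90.0, 0.02)
    flux = 1.0 + np.random.normal(0.0, 5e-4, t.size)
    flux[(t % 10.0) < 0.25] -= 1e-3                  # period 10 d, duration 0.25 d
    print(box_search_snr(t, flux, period=10.0, duration=0.25, sigma=5e-4))
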
Trigo, Mariano mtrigo@slac.stanford.edu Singular value decomposition applied to femtosecond X-ray scattering from phonons in solids In time-resolved diffuse scattering experiments, a laser perturbs the material sample and an area detector records the scattering of a delayed femtosecond x-ray pulse. The typical LCLS data contain components from beam fluctuations, detector noise and diffuse scattering from laser-induced phonons in the solid. We present a method to analyze time-resolved diffuse x-ray scattering data from LCLS based on singular value decomposition (SVD). We will show that the SVD is effective in separating the diffuse scattering from phonons from other sources of noise. We will also discuss methods for filtering low frequency beam energy fluctuations that are uncorrelated with time-delay.
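
A minimal sketch of the SVD step on made-up detector frames: arrange the images as a pixels-by-delays matrix, factorize, and keep the leading spatial modes with their time-delay traces. Deciding which components carry phonon diffuse scattering and which carry beam or detector noise is the physics step and is not reproduced here.

    import numpy as np

    def svd_components(frames, n_keep=3):
        """frames: array of shape (ndelays, ny, nx). Returns the first n_keep
        spatial modes, their singular values, and their time-delay patterns."""
        ndelays = frames.shape[0]
        M = frames.reshape(ndelays, -1).T                 # pixels x delays
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U[:, :n_keep], s[:n_keep], Vt[:n_keep]

    frames = np.random.normal(size=(80, 64, 64))          # placeholder detector images
    modes, svals, time_traces = svd_components(frames)
    denoised = (modes * svals) @ time_traces              # rank-3 approximation of the data
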