Machine learning has rapidly become a tool of choice for the astronomical community. It is being applied across a wide range of wavelengths and problems, from the classification of transients to neural network emulators of cosmological simulations, and is shifting paradigms about how we generate and report scientific results. At the same time, this class of methods comes with its own set of best practices, challenges, and drawbacks, which, at present, are often incompletely reported in the astrophysical literature. With this paper, we aim to provide a primer to the astronomical community, including authors, reviewers, and editors, on how to implement machine learning models and report their results in a way that ensures the accuracy of the results, reproducibility of the findings, and usefulness of the method.
Up to 150,000 asteroids will be visible in the images of the ESA Euclid space telescope, and the instruments of Euclid offer multiband visual to near-infrared photometry and slitless spectra of these objects. Most asteroids will appear as streaks in the images. Due to the large number of images and asteroids, automated detection methods are needed. A non-machine-learning approach based on the StreakDet software was previously tested, but the results were not optimal for short and/or faint streaks. We set out to improve the capability to detect asteroid streaks in Euclid images by using deep learning. We built, trained, and tested a three-step machine-learning pipeline with simulated Euclid images. First, a convolutional neural network (CNN) detected streaks and their coordinates in full images, aiming to maximize the completeness (recall) of detections. Then, a recurrent neural network (RNN) merged snippets of long streaks detected in several parts by the CNN. Lastly, gradient-boosted trees (XGBoost) linked detected streaks between different Euclid exposures to reduce the number of false positives and improve the purity (precision) of the sample. The deep-learning pipeline surpasses the completeness of a non-machine-learning pipeline based on the StreakDet software and reaches a similar level of purity. Additionally, the deep-learning pipeline can detect asteroids 0.25-0.5 magnitudes fainter than StreakDet. The deep-learning pipeline could result in a 50% increase in the number of detected asteroids compared to the StreakDet software. There is still scope for further refinement, particularly in improving the accuracy of streak coordinates and enhancing the completeness of the final stage of the pipeline, which involves linking detections across multiple exposures.
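To make the terms completeness (recall) and purity (precision) concrete, the following minimal sketch computes both for a toy set of detections before and after a linking step. Everything in it is an assumption for illustration: the object identifiers, the two hypothetical exposures, and the simple rule "keep an object seen in at least two exposures", which is only a stand-in for the XGBoost-based linking described above and not the method used in the pipeline.

```python
from collections import Counter


def completeness_and_purity(true_ids, detected_ids):
    """Return (completeness, purity) for a set of detections.

    completeness = true positives / all true objects   (recall)
    purity       = true positives / all detections     (precision)
    """
    true_set = set(true_ids)
    detected_set = set(detected_ids)
    true_positives = len(true_set & detected_set)
    completeness = true_positives / len(true_set) if true_set else 0.0
    purity = true_positives / len(detected_set) if detected_set else 0.0
    return completeness, purity


def link_across_exposures(exposures, min_exposures=2):
    """Toy linking rule (assumption): keep objects seen in >= min_exposures."""
    counts = Counter(obj for exposure in exposures for obj in set(exposure))
    return {obj for obj, n in counts.items() if n >= min_exposures}


# Hypothetical data: "a*" are real asteroids, "f*" are false positives.
true_asteroids = {"a1", "a2", "a3", "a4"}
exposures = [{"a1", "a2", "f1"}, {"a1", "a2", "a3", "f2"}]

all_detections = set().union(*exposures)
linked = link_across_exposures(exposures)

before = completeness_and_purity(true_asteroids, all_detections)
after = completeness_and_purity(true_asteroids, linked)
print("before linking (completeness, purity):", before)
print("after linking  (completeness, purity):", after)
```

In this toy case the linking step removes both false positives, so purity rises, but it also drops a real asteroid seen in only one exposure, so completeness falls. This is the same kind of trade-off the abstract points to when it names the completeness of the final linking stage as an area for refinement.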
We present a catalog of quasars selected from broad-band photometric ugri data of the Kilo-Degree Survey Data Release 3 (KiDS DR3). The QSOs are identified by the Random Forest supervised machine learning model, trained on SDSS DR14 spectroscopic data. We first clean the input KiDS data of entries with excessively noisy, missing or otherwise problematic measurements. Applying a feature importance analysis, we then tune the algorithm and identify in the KiDS multiband catalog the 17 most useful features for the classification, namely magnitudes, colors, magnitude ratios, and the stellarity index. We use the t-SNE algorithm to map the multi-dimensional photometric data onto 2D planes and compare the coverage of the training and inference sets. We limit the inference set to r<22 to avoid extrapolation beyond the feature space covered by training, as the SDSS spectroscopic sample is considerably shallower than KiDS. This gives 3.4 million objects in the final inference sample, from which the Random Forest identified 190,000 quasar candidates. Accuracy of 97%, purity of 91%, and completeness of 87%, as derived from a test set extracted from SDSS and not used in the training, are confirmed by comparison with external spectroscopic and photometric QSO catalogs overlapping with the KiDS footprint. The robustness of our results is strengthened by number counts of the quasar candidates in the r band, as well as by their mid-infrared colors available from WISE. An analysis of parallaxes and proper motions of our QSO candidates also found in Gaia DR2 suggests that a probability cut of p(QSO)>0.8 is optimal for purity, whereas p(QSO)>0.7 is preferable for better completeness. Our study presents the first comprehensive quasar selection from deep high-quality KiDS data and will serve as the basis for versatile studies of the QSO population detected by this survey.
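The effect of the probability cut on purity and completeness can be illustrated with a short sketch. The probabilities and labels below are invented toy values, not numbers from the catalog, and the function name is hypothetical. The sketch also assumes that true labels are available, whereas the analysis described above infers the cut from Gaia DR2 parallaxes and proper motions.

```python
def purity_completeness_at_cut(probabilities, is_quasar, cut):
    """Return (purity, completeness) for candidates with p(QSO) > cut.

    probabilities: classifier outputs p(QSO), one per object
    is_quasar:     true labels (1 = quasar, 0 = not a quasar)
    """
    selected = [label for p, label in zip(probabilities, is_quasar) if p > cut]
    true_positives = sum(selected)
    total_quasars = sum(is_quasar)
    purity = true_positives / len(selected) if selected else 0.0
    completeness = true_positives / total_quasars if total_quasars else 0.0
    return purity, completeness


# Toy data (assumption): ten objects with made-up probabilities and labels.
p_qso = [0.95, 0.90, 0.85, 0.78, 0.75, 0.72, 0.60, 0.30, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

results = {cut: purity_completeness_at_cut(p_qso, labels, cut) for cut in (0.7, 0.8)}
for cut, (purity, completeness) in results.items():
    print(f"p(QSO) > {cut}: purity = {purity:.2f}, completeness = {completeness:.2f}")
```

With these toy values, raising the cut from 0.7 to 0.8 removes the one contaminant but also discards two genuine quasars. That mirrors the qualitative behavior reported above, where the stricter cut favors purity and the looser one favors completeness.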
Modern scientific data mainly consist of huge datasets gathered by a very large number of techniques and stored in very diversified and often incompatible data repositories. More generally, in the e-science environment, it is considered a critical and urgent requirement to integrate services across distributed, heterogeneous, dynamic "virtual organizations" formed by different resources within a single enterprise. In the last decade, astronomy has become an immensely data-rich field due to the evolution of detectors (plates to digital to mosaics), telescopes and space instruments. The Virtual Observatory approach consists of the federation, under common standards, of all astronomical archives available worldwide, as well as data analysis, data mining and data exploration applications. The main drive behind this effort is that, once the infrastructure is completed, it will allow a new type of multi-wavelength, multi-epoch science that can barely be imagined. Data Mining, or Knowledge Discovery in Databases, while being the main methodology to extract the scientific information contained in such Massive Data Sets (MDS), poses crucial problems, since it has to orchestrate complex issues posed by transparent access to different computing environments, scalability of algorithms, reusability of resources, etc. In the present paper we summarize the present status of the MDS in the Virtual Observatory and what is currently being done and planned to bring advanced Data Mining methodologies to bear on them, in the case of the DAME (DAta Mining & Exploration) project.