Dark matter halos are the building blocks of cosmic large-scale structure and the bridge between the dark and luminous sectors of the Universe. They are diverse in internal structure, mass assembly history, and interaction with their environment. The figure at the top illustrates quantities that are commonly adopted in the literature to describe the structural properties of a typical dark matter halo, as well as its formation history and environment. A crucial step toward a full picture of dark matter halo formation is to understand the intrinsic relationship between the structure of a halo and its assembly history and environment.
Traditionally, halo properties are extracted by parameterized fitting, and different halo properties are linked by simple statistics such as histograms and low-order moments. Although many discoveries have been made this way, these methods offer only limited interpretive and predictive power, for the following reasons: (1) a halo's formation history is often complicated and may not be captured by a small number of parameters; (2) multiple parameters can be defined to describe different aspects of a halo, but they are often degenerate with one another; (3) the bias-variance trade-off is hard to balance with simple statistics, which tend to be either too simple to capture highly non-linear relations in a high-dimensional space or too complex to avoid over-fitting.
Recently, Mr. Yangyao Chen and Prof. Cheng Li of the Department of Astronomy at Tsinghua University, in collaboration with the ELUCID team, investigated the relations among halo properties using a classic linear dimension-reduction method, Principal Component Analysis (PCA), and a modern ensemble method built on decision trees, Random Forest (RF). With PCA, there is no need to choose a parameterized form for the halo assembly history by hand. Figure 1 shows the reconstruction of halo assembly histories from the first 3 principal components. The assembly history is well captured by this small set of parameters, and, in principle, more subtle features of the histories can be recovered by including more principal components.
With the Random Forest method, non-linear patterns in the feature space are resolved, the contribution of each predictor is quantified, and the degeneracy between parameters is broken. Figure 2 shows contours in the (halo concentration, assembly) plane. It is hard to tell by eye which assembly quantity is more strongly correlated with concentration, whereas with Random Forest the contribution of each predictor is much clearer. These methods can help simplify the characterization of the halo population and clarify the degeneracies among halo properties, and they will be employed in future work to explore the galaxy-halo connection.
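As a rough illustration of the two techniques described above, the following Python sketch applies PCA and a Random Forest to synthetic data. It is not the authors' pipeline. The exponential toy histories, the tanh relation used for a mock concentration, and all variable names (alpha, noisy_copy, unrelated, and so on) are assumptions made for this example only. It relies on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# --- Synthetic "assembly histories" (illustrative only, not simulation data) ---
# Each history is M(z)/M(z=0) sampled at n_snap redshifts, modeled here as
# exp(-alpha * z) with a halo-dependent rate alpha plus small noise.
n_halos, n_snap = 2000, 30
z = np.linspace(0.0, 5.0, n_snap)
alpha = rng.uniform(0.3, 1.5, size=n_halos)
histories = np.exp(-alpha[:, None] * z[None, :])
histories += 0.01 * rng.standard_normal(histories.shape)

# --- PCA via singular value decomposition of the mean-centered data ---
mean_history = histories.mean(axis=0)
centered = histories - mean_history
_, sing_vals, components = np.linalg.svd(centered, full_matrices=False)
explained = sing_vals**2 / np.sum(sing_vals**2)  # variance fraction per PC


def reconstruct(n_pc):
    """Rebuild every history from its first n_pc principal components."""
    coeffs = centered @ components[:n_pc].T  # PC coefficients for each halo
    return mean_history + coeffs @ components[:n_pc]


# Root-mean-square reconstruction error using 1 and 3 components.
rms_error = {k: float(np.sqrt(np.mean((reconstruct(k) - histories) ** 2)))
             for k in (1, 3)}

# --- Random Forest: which predictor matters most for a mock concentration? ---
# Assumed toy setup: the first PC coefficient drives the target non-linearly,
# a second predictor is a noisy copy of it (degenerate with it), and a third
# predictor is pure noise.
pc1 = centered @ components[0]
pc1 = pc1 / pc1.std()
noisy_copy = pc1 + 0.5 * rng.standard_normal(n_halos)
unrelated = rng.standard_normal(n_halos)
concentration = np.tanh(pc1) + 0.1 * rng.standard_normal(n_halos)

X = np.column_stack([pc1, noisy_copy, unrelated])
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, concentration)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = dict(zip(["pc1", "noisy_copy", "unrelated"],
                       forest.feature_importances_))

print("variance explained by first 3 PCs:", explained[:3].sum())
print("RMS reconstruction error:", rms_error)
print("feature importances:", importances)
```

In this toy setup both pc1 and noisy_copy correlate strongly with the mock concentration, so a contour plot alone would not separate them easily. The forest's importances are intended to show which of the two actually carries the predictive information. Impurity-based importance is only one possible measure, chosen here for brevity.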
Links to relevant publications:
Yangyao Chen et al., 2020, ApJ, submitted, https://arxiv.org/abs/2003.05137
Huiyuan Wang et al., 2016, ApJ, https://arxiv.org/abs/1608.01763
Karl Pearson, 1901, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, http://doi.org/10.1080/14786440109462720
Harold Hotelling, 1933, Journal of Educational Psychology, https://doi.org/10.1037/h0071325
Leo Breiman, 2001, Machine Learning, http://doi.org/10.1023/A:1010933404324