Background

Introduction

In high-throughput studies one often uses databases such as the CoRE-COF [TLYZ17] or CoRE-MOF [NCC+17] database. Over the course of our work, we noticed that these databases contain a non-negligible number of duplicates, hence we wrote this tool to easily (and more or less efficiently) find and eliminate them.

When we started doing machine learning with these databases, we also noticed that we needed tools for comparing them: we wanted to quantify the ‘diversity’ of a database and the ‘distance’ between databases. In particular, the DistExampleComparison class was written because ML models are often good at interpolation but bad at extrapolation [MAC+18], hence we needed tools to detect this.

For expensive simulations – or more efficient ML training – one also wants tools for clever sampling. This is what the sampling module tries to provide.

Duplicate removal

The approach in the main duplicate removal routines is the following:

  1. get the Niggli reduced cell to have the smallest possible, well-defined structure

  2. compute some cheap scalar features that describe the composition and use them to identify structures that probably have the same composition.

  3. after we have identified the structures with identical composition (which are usually not too many), we can run a more expensive structural comparison using structure graphs or the Kabsch algorithm. As the Kabsch algorithm is relatively inexpensive (but it needs a threshold), there is also an option to first use the Kabsch algorithm and then do a comparison based on the structure graphs.
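The following is a minimal sketch of steps 1 and 2 using pymatgen; it is not the package's actual implementation, and the file paths, the particular choice of cheap features, and the rounding are illustrative:

    from collections import defaultdict

    from pymatgen.core import Structure

    # Group structures by cheap composition features of the Niggli-reduced
    # cell; only structures within the same group need the expensive
    # comparison of step 3.
    candidates = defaultdict(list)
    for path in ["mof_a.cif", "mof_b.cif"]:  # illustrative paths
        reduced = Structure.from_file(path).get_reduced_structure("niggli")
        key = (
            reduced.composition.reduced_formula,  # cheap scalar features
            len(reduced),
            round(reduced.density, 2),
        )
        candidates[key].append(path)

    # candidate duplicate groups for the graph/Kabsch comparison
    duplicate_candidates = [v for v in candidates.values() if len(v) > 1]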

Kabsch algorithm

The Kabsch algorithm [Kab76][Kab78] attempts to solve the problem of calculating the RMSD between two structures, which is in general not well defined as it depends on the relative orientation of the structures with respect to each other. The algorithm calculates the optimal rotation matrix that minimizes the RMSD between the structures and is often used in visualizations, e.g. to align structures. The basic idea behind the algorithm is the following:

  1. First we center the structures at the origin

  2. Then, we calculate the covariance matrix

    \[\mathbf{H}_{ij} = \sum_{k=1}^{N} \mathbf{P}_{ki} \mathbf{Q}_{kj}\]

    where \(\mathbf{P}\) and \(\mathbf{Q}\) are the point clouds (position coordinates) of the two structures.

  3. The optimal rotation matrix is then given by

    \[\mathbf{R} = \left(\mathbf{H}^\mathsf{T}\mathbf{H} \right)^{\frac{1}{2}} \mathbf{H}^{-1}\]

    which can be implemented using a singular value decomposition (SVD).

The implementation in this package is based on the rmsd package from Jimmy Charnley Kromann [Kro19]; we only added routines for the periodic case.
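For illustration, a minimal NumPy sketch of the SVD-based procedure for the non-periodic case (independent of the actual implementation) might look like this:

    import numpy as np

    def kabsch_rmsd(P, Q):
        """Minimal RMSD via the Kabsch algorithm for two (N, 3) arrays
        of matching coordinates (non-periodic sketch)."""
        # 1. center both point clouds at the origin
        P = P - P.mean(axis=0)
        Q = Q - Q.mean(axis=0)
        # 2. covariance matrix H_ij = sum_k P_ki Q_kj
        H = P.T @ Q
        # 3. optimal rotation via SVD; the sign correction prevents
        #    improper rotations (reflections)
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        # 4. rotate P onto Q and compute the RMSD
        diff = P @ R.T - Q
        return np.sqrt((diff ** 2).sum() / len(P))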

Graph based

Hashing

Statistics

The statistics module (comparators.py) implements three different classes that allow you to

  • measure the structural diversity of one database (DistStatistic)

  • compare two databases (DistComparison)

  • compare one sample with a database (DistExample)

The DistStatistic class implements parallelized versions of randomized (with resampling) RMSD and structure graph comparisons within a database. The DistComparison class implements the same comparisons between two databases and additionally several statistical tests, such as:

  • maximum mean discrepancy

  • Anderson-Darling

  • Kolmogorov-Smirnov

  • mutual information

that operate on a list of lists or a dataframe of features.
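For a single feature column, two of these tests can for instance be run with scipy (the feature arrays here are made up for illustration):

    import numpy as np
    from scipy import stats

    # illustrative feature columns from two databases
    feature_db1 = np.random.default_rng(0).normal(size=500)
    feature_db2 = np.random.default_rng(1).normal(loc=0.3, size=500)

    print(stats.ks_2samp(feature_db1, feature_db2))          # Kolmogorov-Smirnov
    print(stats.anderson_ksamp([feature_db1, feature_db2]))  # Anderson-Darling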

The main Statistics class also implements further statistical metrics, such as measures of central tendency like the trimean, which are (unfortunately) not that commonly used.
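For reference, Tukey's trimean weights the median and the two quartiles, \(TM = (Q_1 + 2 Q_2 + Q_3) / 4\); a short NumPy version (not necessarily the package's implementation) is:

    import numpy as np

    def trimean(values):
        """Tukey's trimean (Q1 + 2 * median + Q3) / 4, a robust measure
        of central tendency."""
        q1, q2, q3 = np.percentile(values, [25, 50, 75])
        return (q1 + 2 * q2 + q3) / 4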

The DistExample class clusters the database (e.g. based on the property space) and then compares the sample to the \(k\) samples closest to the centroids.

maximum mean discrepancy (MMD)

MMD measures the distance between two distributions as the distance between their mean embeddings in a reproducing kernel Hilbert space; it basically uses the kernel trick to evaluate this distance without constructing the embeddings explicitly.
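A sketch of a (biased) estimator of the squared MMD with a Gaussian kernel, \(\widehat{\mathrm{MMD}}^2 = \langle k(x, x') \rangle + \langle k(y, y') \rangle - 2 \langle k(x, y) \rangle\), could look as follows; the kernel width gamma is a free parameter (see the warning below):

    import numpy as np

    def mmd2(X, Y, gamma=1.0):
        """Biased estimate of the squared MMD with a Gaussian kernel
        k(a, b) = exp(-gamma * ||a - b||^2) for X (n, d) and Y (m, d)."""
        def kernel(A, B):
            # pairwise squared Euclidean distances via broadcasting
            sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
            return np.exp(-gamma * sq)
        return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()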

Warning

There are better implementations of MMD, especially regarding the choice of the kernel width. In a future release, we might introduce shogun as an optional dependency and use it if it is installed.

Sampling

For all sampling, we standardize the features by default to avoid overly large effects of e.g. different units [TFT17]. In case you want to use different weights on different features, you can manually multiply the columns of the dataframe with weight factors and then turn the standardization off.
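A sketch of this manual weighting (the column names and weights are made up; the standardization mimics what would otherwise happen by default):

    import pandas as pd

    # illustrative feature dataframe and per-feature weights
    df = pd.DataFrame({"density": [0.4, 1.2, 0.9],
                       "surface_area": [800.0, 1500.0, 1100.0]})
    weights = pd.Series({"density": 2.0, "surface_area": 1.0})

    # standardize manually, then weight; pass the result to the sampler
    # with its built-in standardization turned off
    weighted = (df - df.mean()) / df.std() * weights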

Farthest point sampling

The greedy farthest point sampling (FPS) [PeyrePechaudKC10] tries to find a good sampling of the point set \(S\) by selecting points according to

\[x_{k+1} = \text{argmax}_{x \in S} \min_{0\le i \le k} d(x_i, x)\]

where \(d(x_i, x)\) is an appropriate distance metric, which in our case is by default Euclidean. We initialize \(x_0\) by choosing a random point from \(S\).
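A greedy implementation can keep, for every point, the distance to its nearest already-selected point, so each iteration reduces to a single argmax; a NumPy sketch (not the package's exact code):

    import numpy as np

    def farthest_point_sampling(points, n_samples, seed=None):
        """Greedy FPS on an (N, d) array; returns indices of the samples."""
        rng = np.random.default_rng(seed)
        selected = [int(rng.integers(len(points)))]  # random x_0
        # distance of every point to the closest selected point so far
        dist = np.linalg.norm(points - points[selected[0]], axis=1)
        for _ in range(n_samples - 1):
            nxt = int(np.argmax(dist))  # x_{k+1} = argmax_x min_i d(x_i, x)
            selected.append(nxt)
            dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
        return selected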

KNN based

The \(k\)-nearest neighbor based sample selection clusters \(S\) into \(k\) clusters and then selects the examples closest to the centroids. This is based on the rationale that examples within a cluster are similar to each other, hence we sample from different clusters as we want a diverse set.
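A sketch of this idea with scikit-learn's k-means (not the package's exact code):

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_sampling(X, k, seed=None):
        """Cluster X (N, d) into k clusters and return the index of the
        point closest to each centroid, yielding k diverse examples."""
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        return [int(np.argmin(np.linalg.norm(X - c, axis=1)))
                for c in km.cluster_centers_]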

Cleaning

A problem when attempting high-throughput studies with experimental structures, e.g. from the Cambridge Structural Database, is that the structures [SHK+19]

  • contain unbound water

  • are disordered (e.g. methyl groups in two positions, or aromatic carbons that exist in several configurations in the .cif file)

  • contain a lot of information that is not necessarily useful for the simulation and can cause problems when using the structure as an input file for simulation packages. Also, dropping unnecessary information can significantly reduce the filesize.

There has already been work done on this topic: the authors of the CoRE-MOF database described their approach in the accompanying paper [CCH+14], and the group around David Fairen-Jimenez published small scripts that use Mercury and a pre-defined list of solvents to remove unbound solvents [MLW+17].

Unfortunately, to our knowledge, there exist no open-source tools that try to address all three of the issues mentioned above.

Warning

We are well aware of the problems of automatic structure sanitation tools [ZPM19] and advise using them with care and reporting issues so that we can improve the tools.

Rewriting the cif files

For the first stage of rewriting the .cif files, we use the PyCifRW package [Hes06], which is the most robust .cif parser we are aware of. We keep only the lattice constants and the most important loops (fractional coordinates, types, and labels, as well as the symmetry operations) whilst also using the atomic types as labels, as this is imperative for some simulation packages.
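Reading a file and accessing the parts we keep looks roughly like this with PyCifRW (the file name is illustrative):

    from CifFile import ReadCif  # PyCifRW

    cif = ReadCif("structure.cif")      # illustrative file name
    block = cif[list(cif.keys())[0]]    # first data block
    # lattice constants survive the rewrite ...
    a = block["_cell_length_a"]
    alpha = block["_cell_angle_alpha"]
    # ... as do the atom_site loop (coordinates, types, labels) and the
    # symmetry operations
    atom_loop = block.GetLoop("_atom_site_label")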

Furthermore, we remove all uncertainty indications and sanitize the filename (e.g. remove non-unicode and unsafe characters such as parentheses).
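For illustration, stripping the uncertainty indication from a single value could be done like this (only a sketch, not the package's exact code):

    import re

    def strip_uncertainty(value):
        """Remove a trailing crystallographic standard uncertainty,
        e.g. '1.2345(6)' -> '1.2345'."""
        return re.sub(r"\(\d+\)$", "", value.strip())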

Optionally, we also remove all disorder groups other than . and 1. This assumes that the disorder groups were properly labelled by the crystallographer.

Removing unbound solvent

For removal of unbound solvent, we construct the structure graph and query for the molecular subgraphs (pymatgen internally constructs a supercell to distinguish molecules from e.g. 2D layers). If the composition of one of the molecular subgraphs is in our list of solvent molecules, we delete the molecule from the structure.
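A sketch of the detection step with pymatgen; the solvent list and file name are illustrative, and the actual package also deletes the matched sites:

    from pymatgen.core import Structure
    from pymatgen.analysis.graphs import StructureGraph
    from pymatgen.analysis.local_env import JmolNN

    SOLVENTS = {"H2 O1", "C3 H7 N1 O1"}  # water, DMF (illustrative)

    structure = Structure.from_file("mof.cif")
    graph = StructureGraph.with_local_env_strategy(structure, JmolNN())
    # pymatgen builds a supercell internally to tell molecules from
    # e.g. 2D layers
    for molecule in graph.get_subgraphs_as_molecules():
        if str(molecule.composition) in SOLVENTS:
            print("unbound solvent found:", molecule.composition)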

Removing disorder

Warning

Please note that this module is experimental and does not work in all cases.

Checking

The Checker class allows you to use the run_flagging function to run several tests on a set of structures:

  • It checks if there are clashing atoms (which might be due to disorder)

  • It checks if there are hydrogens in the structure, at three different strictness levels:

      * check if there are any hydrogens at all

      * check if there are carbons and any hydrogens at all

      * check, for carbons with two or fewer non-hydrogen neighbors, whether they contain any hydrogens (i.e. flag something like aromatic carbons that contain no H)

  • It checks if there are unbound (solvent) molecules/atoms in the structure, in two versions:

      * threshold-based: is there any atom that has no neighbor closer than the threshold?

      * graph-based: are there unbound molecules in the structure graph (basically a fancier nearest-neighbor analysis)?

  • It checks if pymatgen can read the structure
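A sketch of the threshold-based variant with pymatgen (the 3 Å default threshold is made up):

    import numpy as np
    from pymatgen.core import Structure

    def has_isolated_atoms(structure, threshold=3.0):
        """Flag the structure if any atom has no neighbor closer than
        `threshold` angstrom (the default value is illustrative)."""
        dmat = structure.distance_matrix.copy()  # PBC-aware distances
        np.fill_diagonal(dmat, np.inf)           # ignore self-distances
        return bool((dmat.min(axis=1) > threshold).any())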