Fellows Research Meetings

Cloning Models - Something for free?

Stephen Haslett
The Australian National University, Canberra

Cloning in genetics means creation of an exact genetic copy of an organism. In a similar sense, it is possible to generate new datasets that have exactly the same estimated linear model parameters as the original dataset, via what are called “equivalent” or cloned models. When collecting data is expensive, small datasets are common. Examples include DNA data at particular sites or for genes, quality control, ecology studies, agricultural field trials, and animal-based studies (where there are also ethical issues). In other situations, government statistics agencies cannot release original data for confidentiality reasons, or require data encryption.

Although simulation is used routinely to test statistical models (McCullagh, 2002), or to provide alternative datasets that protect confidentiality, such simulated datasets contain both potential model mis-specification and modelling errors. Simulated and original data do not have the same fitted model. Nevertheless, such approximate methods are used routinely in statistics (e.g. in simulation studies) to supplement and test models, and to provide alternative datasets that can be publicly released (e.g. CURFs, Confidentialised Unit Record Files; see UNECE, 2014). In contrast, datasets generated using cloned models have zero modelling error. Cloning in its various forms thus offers potential improvement to standard bootstrap and jackknife methods, which generate simulated data that inevitably contain model error. Bootstrapping and jackknifing have created enormous interest and an extensive literature since publication of Efron and Tibshirani (1993). Cloning may provide better methods where model error is not negligible, for example for testing of saturated models in which there are as many model parameters as observations. Cloning can also be used to remove random variation via smoothing to elucidate underlying phenomena, to better visualise an underlying fitted model, and to detect model aberrations.

Such supplementary data might be called cloned data, but the term already has multiple meanings (cf. Haslett and Govindaraju, 2012, with Lele et al., 2010, where cloning for maximum likelihood estimation using Bayesian software is achieved by the simple device of replicating the original data many times).

To date, model cloning has been studied only for certain types of statistical model. For example, for any p-dimensional dataset, via model cloning we can generate 2p-1 further datasets that have identical multiple linear regression parameter estimates and hence model fit. See Haslett and Govindaraju (2009, 2012), which utilise orthogonal subspaces within the model-design matrix. Despite its novelty, model cloning already has known applications in data confidentiality, encryption, smoothing, and data visualisation, and relatively unexplored potential to improve hypothesis testing. Even the current cloning methods for regression and general linear models form a wide class, including the types of experimental designs used routinely in agriculture and industry, and the mixed linear models used in genetics, epidemiology and small area estimation. Cloning can also be used for database encryption, even if there is no interest at all in the underlying regression model - see Haslett and Govindaraju (2012).
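As a minimal sketch of what "identical parameter estimates and model fit" can mean in practice, the Python example below builds a second dataset whose least-squares fit matches the original exactly. It uses a generic device, multiplying the design matrix and the response by the same random orthogonal matrix, and is not claimed to be the construction used in Haslett and Govindaraju (2009, 2012). The function names `fit_ols` and `rotate_clone`, the seeds, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)


def rotate_clone(X, y, seed=0):
    """Return (Q X, Q y) for a random orthogonal matrix Q (illustrative only).

    Because Q'Q = I, the normal equations are unchanged:
    (QX)'(QX) = X'X and (QX)'(Qy) = X'y.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # QR factorisation of a Gaussian matrix yields an orthogonal Q.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q @ X, Q @ y


# Toy data (assumed): intercept column plus two predictors.
rng = np.random.default_rng(42)
n = 30
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.standard_normal(n)

X_clone, y_clone = rotate_clone(X, y, seed=1)
beta_orig, rss_orig = fit_ols(X, y)
beta_clone, rss_clone = fit_ols(X_clone, y_clone)

print("coefficients agree:", np.allclose(beta_orig, beta_clone))
print("residual sums of squares agree:", np.isclose(rss_orig, rss_clone))
```

In this sketch `X` is the full design matrix, so the rotated intercept column is no longer constant and the clone must be fitted without adding a new intercept. The individual records in the clone bear no obvious resemblance to the originals, which hints at why this kind of construction is of interest for confidentiality and encryption.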

Cloning for linear models can be achieved in a number of ways. See, for example, Haslett and Govindaraju (2012) and Haslett and Puntanen (2011). Non-full rank methods (via generalized inverses and matrix column spaces), as discussed in Haslett et al. (2013), Haslett and Puntanen (2010a-c, 2011), Haslett, Puntanen and Arendacká (2015), and in the appendix to Haslett and Haslett (2007), are extensions of research by Rao (1967, 1968, 1971, 1973), Zyskind (1967), and Mitra and Moore (1973), among others. Using subspace arguments, even for non-full rank models it is possible to use model cloning to construct different datasets for which not only does a full fixed parameter linear model have identical estimates and estimated covariances, but so do all its submodels. By using a given error covariance structure, or with a relatively mild restriction on estimation of error covariance matrices, these results can be extended to linear mixed models.

Cloning via residuals, mentioned in the initial sections of Haslett and Govindaraju (2012), raises one of several further possibilities for extending model cloning methods beyond linear models. However, model cloning can already provide a straightforward but secure method of data encryption. For linear models, it also has the potential to underpin better practical methods of dealing with the all too common situations in which there is too little data or the original data cannot be publicly released.
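One simple way to picture cloning via residuals in the linear case is sketched below in Python. The response is rebuilt as the original fitted values plus a new residual vector that is orthogonal to the columns of the design matrix and has the same length as the original residuals. This is an illustrative construction under stated assumptions (full column rank, more observations than parameters), not necessarily the procedure of Haslett and Govindaraju (2012). The names `ols` and `clone_via_residuals` and the toy data are hypothetical.

```python
import numpy as np


def ols(X, y):
    """Return least-squares coefficients and the residual vector."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta


def clone_via_residuals(X, y, seed=0):
    """Return a new response with the same OLS fit on X (illustrative only).

    Assumes X has full column rank and more rows than columns, so that a
    non-zero vector orthogonal to the columns of X exists.
    """
    rng = np.random.default_rng(seed)
    beta, resid = ols(X, y)
    # Residuals from regressing fresh noise on X are orthogonal to col(X).
    _, z_perp = ols(X, rng.standard_normal(len(y)))
    # Rescale so the residual sum of squares matches the original fit.
    new_resid = z_perp * np.linalg.norm(resid) / np.linalg.norm(z_perp)
    return X @ beta + new_resid


# Toy data (assumed): a small sample with an intercept and one predictor.
rng = np.random.default_rng(3)
n = 12
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, n)])
y = X @ np.array([4.0, 1.5]) + rng.standard_normal(n)

y_clone = clone_via_residuals(X, y, seed=11)
beta_orig, resid_orig = ols(X, y)
beta_clone, resid_clone = ols(X, y_clone)

print("coefficients agree:", np.allclose(beta_orig, beta_clone))
print("RSS agrees:", np.isclose(resid_orig @ resid_orig, resid_clone @ resid_clone))
```

Because the design matrix and the residual sum of squares are both unchanged, the usual estimated covariance matrix of the coefficients is also the same for the clone, even though the individual responses differ from the originals.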

Last reviewed: 7 September, 2017