--- title: "Parameter estimation and distribution selection by ExtDist" author: "Haizhen Wu, A. Jonathan R. Godfrey and Sarah Pirikahu" affiliation: "Massey University" date: "`r Sys.Date()`" output: rmarkdown::pdf_document: includes: in_header: header.tex bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{Parameter-Estimation-and-Distribution-Selection-by-ExtDist} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction Parameter estimation and distribution selections are common tasks in statistical analysis. In the context of variables acceptance sampling [see @Wu.Govindaraju.2014 etc.], when the underlying distribution model of the quality characteristic is determined, the estimated quality of a product batch measured by the proportion nonconforming, is computed through the estimated parameter(s) of the underlying distribution based on a sample. Conversely, if a collection of candidate distributions are considered to be eligible, then selection of a distribution that best describes the data becomes necessary. The \CRANpkg{ExtDist} package is devoted to provide a consistent and unified framework for these tasks. ```{r, message=FALSE} library(ExtDist) ``` ## Parameter Estimation Suppose we have a set of data, which has been randomly generated from a Weibull distributed population, ```{r Data, comment=""} set.seed(1234) head(X <- rWeibull(50, shape = 2, scale = 3)) ``` It is possible to write code to obtain parameters via a maximum likelihood estimation procedure for the data. However, it is more convenient to use a single function to achieve this task. ```{r Weibull est, comment=""} est.par <- eWeibull(X) ``` The $e-$ prefix we introduced in \CRANpkg{ExtDist} is a logical extension to the $d-$, $p-$, $q-$, $r-$ prefixes of the distribution-related functions in R base package. Moreover, the output of $e-$ functions is defined as a S3 class object. ```{r class, comment=""} class(est.par) ``` The estimated parameters stored in the "eDist" object can be easily plugged into other $d-$, $p-$, $q-$, $r-$ functions in \CRANpkg{ExtDist} to get the density, percentile, quantile and random variates for a distribution with estimated parameters. ```{r dpqr egs, results='hide'} dWeibull(seq(0,2,0.4), params = est.par) pWeibull(seq(0,2,0.4), params = est.par) qWeibull(seq(0,1,0.2), params = est.par) rWeibull(10, params = est.par) ``` To remain compatible with current convention, these functions also accept the parameters as individual arguments, so the following code variations are also permissible. ```{r, results='hide'} dWeibull(seq(0,2,0.4), shape = est.par$shape, scale = est.par$scale) pWeibull(seq(0,2,0.4), shape = est.par$shape, scale = est.par$scale) qWeibull(seq(0,1,0.2), shape = est.par$shape, scale = est.par$scale) rWeibull(10, shape = est.par$shape, scale = est.par$scale) ``` ## Distribution selection As a S3 class object, several S3 methods have been developed in \CRANpkg{ExtDist} to extract the distribution selection criteria and other relevant information. ```{r selection criterion, comment=""} logLik(est.par) # log likelihood AIC(est.par) # Akaike information criterion AICc(est.par) # corrected Akaike information criterion BIC(est.par) # Bayes' Information Criterion. MDL(est.par) # minimum description length vcov(est.par) # variance-covariance matrix of the parameters of the fitted distribution ``` Based on these criteria, for any sample, the best fitting distribution can be obtained from a list of candidate distributions. For example: ```{r Example, comment=""} Ozone <- airquality$Ozone Ozone <- Ozone[!is.na(Ozone)] # Removing the NA's from Ozone data summary(Ozone) best <- bestDist(Ozone, candDist=c("Gamma", "Weibull", "Normal", "Exp"), criterion = "logLik");best ``` When more than one set of results are of interest for a list of candidate distribution fits, a summary table can be output by using the function DistSelCriteria. ```{r DistSelCriteria, comment=""} DistSelCriteria(Ozone, candDist = c("Gamma", "Weibull", "Normal", "Exp"), criteria = c("logLik","AIC","AICc", "BIC")) ``` Multiple distributions can also be compared visually using the compareDist function. ```{r CompareDist, comment=""} compareDist(Ozone, attributes(best)$best.dist.par, eNormal(Ozone)) ``` If the best fit distribution has been determined plot.eDist can provide the corresponding histogram with fitted density curve, Q-Q and P-P plot for this distribution. ```{r ploteDist, error=FALSE} plot(attributes(best)$best.dist.par) ``` ## Weighted sample Another notable feature of the \CRANpkg{ExtDist} package is that it can deal with weighted samples. Weighted samples appear in many contexts, e.g.: in non-parametric and semi-parametric deconvolution [see e.g. @Hazelton.Turlach.2010 etc.] the deconvoluted distribution can be represented as a pair $(Y,w)$ where $w$ is a vector of weights with same length as $Y$. In size-biased (unequal probability) sampling, the true population is more appropriately described by the weighted (with reciprocal of the inclusion probability as weights) observations rather than the observations themselves. In Bayesian inference, the posterior distribution can be regarded as a weighted version of the prior distribution. Weighted distributions can also play an interesting role in stochastic population dynamics. In \CRANpkg{ExtDist}, the parameter estimation was conducted by maximum weighted likelihood estimation, with the estimate $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$ being defined by \begin{align}\label{eq:1} \hat{\boldsymbol{\theta}}^{w}=\arg\max_{\boldsymbol{\theta}}\sum_{i=1}^n w_i\ln f(Y_i;\boldsymbol{\theta}), \end{align} where $f$ is the density function of the ditstribution to be fitted. For example, for a weighted sample with ```{r Chunk10} Y <- c(0.1703, 0.4307, 0.6085, 0.0503, 0.4625, 0.479, 0.2695, 0.2744, 0.2713, 0.2177, 0.2865, 0.2009, 0.2359, 0.3877, 0.5799, 0.3537, 0.2805, 0.2144, 0.2261, 0.4016) w <- c(0.85, 1.11, 0.88, 1.34, 1.01, 0.96, 0.86, 1.34, 0.87, 1.34, 0.84, 0.84, 0.83, 1.09, 0.95, 0.77, 0.96, 1.24, 0.78, 1.12) ``` the parameter estimation and distribution selection for weighted samples can be achieved by including the extra argument $w$: ```{r Chunk11, comment=""} eBeta(Y,w) bestDist(Y, w, candDist = c("Beta_ab","Laplace","Normal"), criterion = "AIC") DistSelCriteria(Y, w, candDist = c("Beta_ab","Laplace","Normal"), criteria = c("logLik","AIC","AICc", "BIC")) ``` ## References