Planet R

July 02, 2015

Removed CRANberries

Package tdm (with last version 3.0.3) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2014-12-15 3.0.3
2014-05-14 3.0.2
2013-10-21 3.0.1
2013-09-23 3.0.0
2013-05-20 2.2.6
2013-03-04 2.2.5
2012-02-24 2.2.4
2012-02-22 2.2.3

July 02, 2015 11:13 PM

Package stab (with last version 0.1.9) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2014-12-11 0.1.9
2013-12-30 0.1.8
2013-11-28 0.1.7
2013-09-23 0.1.6
2013-05-07 0.1.5
2013-03-11 0.1.4
2010-11-22 0.1.3
2010-11-18 0.1.2
2010-04-29 0.0.8
2010-02-23 0.0.7
2009-11-24 0.0.6
2008-11-04 0.0.5
2008-10-29 0.0.4
2008-09-28 0.0.3
2008-05-20 0.0.2
2008-05-02 0.0.1

July 02, 2015 11:13 PM

Package ivivc (with last version 0.2.2) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2014-11-21 0.2.2
2014-09-23 0.2.1
2014-05-21 0.2.0
2013-10-25 0.1.9
2013-09-23 0.1.8
2013-05-07 0.1.7
2013-03-08 0.1.6
2009-10-12 0.1.5
2008-11-04 0.1.4
2008-10-30 0.1.3
2008-10-16 0.1.2
2008-04-10 0.1.1

July 02, 2015 11:13 PM

Package bear (with last version 2.6.4) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2014-12-09 2.6.4
2014-04-22 2.6.3
2014-02-24 2.6.2
2014-01-10 2.6.1
2013-12-04 2.6.0
2013-11-01 2.5.8
2013-09-23 2.5.7
2013-06-27 2.5.6
2013-05-07 2.5.5
2013-04-04 2.5.4
2010-06-29 2.5.3
2010-05-28 2.5.2
2010-05-22 2.5.1
2010-05-10 2.5.0
2010-05-01 2.4.4
2010-04-28 2.4.3
2010-03-23 2.4.2
2009-09-18 2.4.1
2009-07-21 2.4.0
2009-04-14 2.3.1
2009-03-09 2.3.0
2009-02-12 2.2.0
2008-12-18 2.1.0
2008-10-29 2.0.2
2008-10-14 2.0.1
2008-10-02 1.1.5
2008-09-11 1.1.4
2008-08-06 1.1.3
2008-07-28 1.1.2
2008-07-11 1.0.0

July 02, 2015 11:13 PM

July 01, 2015

Bioconductor Project Working Papers

Distance Correlation Measures Applied to Analyze Relation between Variables in Liver Cirrhosis Marker Data

Distance Correlation is another newer choice to compute the relation between variables. However, the Bayesian counterpart of Distance Correlation is not established. In this paper, Bayesian counterpart of Distance Correlation is pro- posed. Proposed method is illustrated with Liver Chirrhosis Marker data. The relevant studies information about relation between AST and ALT is used to formulate the prior information for Bayesian computation. The Distance Correlation between AST and ALT (both are liver performance marker) is computed with 0.44. The credible interval is observed with (0.41, 0.46).Bayesian counter- part to compute Distance correlation is simple and handy.

by Atanu Bhattacharjee Dr. at July 01, 2015 08:02 PM

June 29, 2015

Bioconductor Project Working Papers

June 21, 2015

Journal of Statistical Software

ergm.graphlets: A Package for ERG Modeling Based on Graphlet Statistics

Vol. 65, Issue 12, Jun 2015


Exponential-family random graph models are probabilistic network models that are parametrized by sufficient statistics based on structural (i.e., graph-theoretic) properties. The ergm package for the R statistical computing environment is a collection of tools for the analysis of network data within an exponential-family random graph model framework. Many different network properties can be employed as sufficient statistics for exponential- family random graph models by using the model terms defined in the ergm package; this functionality can be expanded by the creation of packages that code for additional network statistics. Here, our focus is on the addition of statistics based on graphlets. Graphlets are classes of small, connected, induced subgraphs that can be used to describe the topological structure of a network. We introduce an R package called ergm.graphlets that enables the use of graphlet properties of a network within the ergm package of R. The ergm.graphlets package provides a complete list of model terms that allows to incorporate statistics of any 2-, 3-, 4- and 5-node graphlets into exponential-family random graph models. The new model terms of the ergm.graphlets package enable both exponential-family random graph modeling of global structural properties and investigation of relationships between node attributes (i.e., covariates) and local topologies around nodes.

June 21, 2015 07:00 AM

DiceDesign and DiceEval: Two R Packages for Design and Analysis of Computer Experiments

Vol. 65, Issue 11, Jun 2015


This paper introduces two R packages available on the Comprehensive R Archive network. The main application concerns the study of computer code output. Package DiceDesign is dedicated to numerical design of experiments, from the construction to the study of the design properties. Package DiceEval deals with the fit, the validation and the comparison of metamodels. After a brief presentation of the context, we focus on the architecture of these two packages. A two-dimensional test function will be a running example to illustrate the main functionalities of these packages and an industrial case study in five dimensions will also be detailed.

June 21, 2015 07:00 AM

remote: Empirical Orthogonal Teleconnections in R

Vol. 65, Issue 10, Jun 2015


In climate science, teleconnection analysis has a long standing history as a means for describing regions that exhibit above average capability of explaining variance over time within a certain spatial domain (e.g., global). The most prominent example of a global coupled ocean-atmosphere teleconnection is the El Nin ̃o Southern Oscillation. There are numerous signal decomposition methods for identifying such regions, the most widely used of which are (rotated) empirical orthogonal functions. First introduced by van den Dool, Saha, and Johansson (2000), empirical orthogonal teleconnections (EOT) denote a regression based approach that allows for straight-forward interpretation of the extracted modes. In this paper we present the R implementation of the original algorithm in the remote package. To highlight its usefulness, we provide three examples of potential use- case scenarios for the method including the replication of one of the original examples from van den Dool et al. (2000). Furthermore, we highlight the algorithm’s use for cross- correlations between two different geographic fields (identifying sea surface temperature drivers for precipitation), as well as statistical downscaling from coarse to fine grids (using Normalized Difference Vegetation Index fields).

June 21, 2015 07:00 AM

Mann-Whitney Type Tests for Microarray Experiments: The R Package gMWT

Vol. 65, Issue 9, Jun 2015


We present the R package gMWT which is designed for the comparison of several treatments (or groups) for a large number of variables. The comparisons are made using certain probabilistic indices (PI). The PIs computed here tell how often pairs or triples of observations coming from different groups appear in a specific order of magnitude. Classical two and several sample rank test statistics such as the Mann-Whitney-Wilcoxon, Kruskal-Wallis, or Jonckheere-Terpstra test statistics are simple functions of these PI. Also new test statistics for directional alternatives are provided. The package gMWT can be used to calculate the variable-wise PI estimates, to illustrate their multivariate distribution and mutual dependence with joint scatterplot matrices, and to construct several classical and new rank tests based on the PIs. The aim of the paper is first to briefly explain the theory that is necessary to understand the behavior of the estimated PIs and the rank tests based on them. Second, the use of the package is described and illustrated with simulated and real data examples. It is stressed that the package provides a new flexible toolbox to analyze large gene or microRNA expression data sets, collected on microarrays or by other high-throughput technologies. The testing procedures can be used in an eQTL analysis, for example, as implemented in the package GeneticTools.

June 21, 2015 07:00 AM

June 16, 2015

Bioconductor Project Working Papers

One-Step Targeted Minimum Loss-based Estimation Based on Universal Least Favorable One-Dimensional Submodels

Consider a study in which one observes n independent and identically distributed random variables whose probability distribution is known to be an element of a particular statistical model, and one is concerned with estimation of a particular real valued pathwise differentiable target parameter of this data probability distribution. The canonical gradient of the pathwise derivative of the target parameter, also called the efficient influence curve, defines an asymptotically efficient estimator as an estimator that is asymptotically linear with influence curve equal to the efficient influence curve.The targeted maximum likelihood estimator is a two stage estimator obtained by constructing a so called least favorable parametric submodel through an initial estimator with score, at zero fluctuation of the initial estimator, that spans the efficient influence curve, and iteratively maximizing the corresponding parametric likelihood till no more updates occur, at which point the updated initial estimator solves the so called efficient influence curve equation. The latter property establishes the asymptotic efficiency of the TMLE under appropriate conditions, including that the initial estimator is within a neighborhood of the true data distribution.

In this article we construct a one-dimensional universal least favorable submodel for which the TMLE only takes one step, and thereby requires minimal extra fitting with data to achieve its goal of solving the efficient influence curve equation. We generalize these to universal least favorable submodels through the relevant part of the data distribution as required for targeted minimum loss-based estimation, and to universal score-specific submodels for solving any other desired equation beyond the efficient influence curve equation. We demonstrate the one-step targeted minimum loss-based estimators based on such universal least favorable submodels for a variety of examples showing that any of the goals for TMLE we previously achieved with local (typically multivariate) least favorable parametric submodels and an iterative TMLE can also be achieved with our new one-dimensional universal least favorable submodels, resulting in new one-step TMLEs for a large class of estimation problems previously addressed. Finally, remarkably, given a multidimensional target parameter, we develop a universal canonical one-dimensional submodel such that the one-step TMLE, only maximizing the log-likelihood over a univariate parameter, solves the multivariate efficient influence curve equation. This allows us to construct a one-step TMLE based on a one-dimensional parametric submodel through the initial estimator, that solves any multivariate desired set of estimating equations.

by Mark J. van der Laan at June 16, 2015 11:37 PM

June 13, 2015

Dirk Eddelbuettel

pkgKitten 0.1.3: Still creating R Packages that purr

A new release, now at version 0.1.3, of pkgKitten arrived on CRAN this morning.

The main change is (optional) support of the excellent whoami package by Gabor which allows us to fill in the Author: and Maintainer: fields of the DESCRIPTION file with automatically discovered values. This is however only a Suggests: and not a Depends: to not force the added dependencies on everywhere. We also alter the default values of Title: and Description: so that they actually pass the current level of tests enforced by R CMD check --as-cran.

Changes in version 0.1.3 (2015-06-12)

  • The fields Title: and Description: in the file DESCRIPTION file are now updated such that they actually pass R CMD check on current versions of R.

  • If installed, the whoami package (version 1.1.0 or later) is now used to discover the username and email in the DESCRIPTION file.

More details about the package are at the pkgKitten webpage and the pkgKitten GitHub repo.

Courtesy of CRANberries, there is also a diffstat report for this release

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

June 13, 2015 02:13 PM

June 08, 2015

Bioconductor Project Working Papers

June 06, 2015

Dirk Eddelbuettel


Conrad put out a new minor release 5.200.1 of Armadillo yesterday. Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to a Matlab.

Our corresponding RcppArmadillo release is now on CRAN and on its way into Debian. See below for the brief list of changes.

Changes in RcppArmadillo version (2015-06-04)

  • Upgraded to Armadillo release 5.200.1 ("Boston Tea Smuggler")

    • added orth() for finding the orthonormal basis of the range space of a matrix

    • expanded element initialisation to handle nested initialiser lists (C++11)

    • workarounds for bugs in GCC, Intel and MSVC C++ compilers

  • Added another example to inst/examples/fastLm.r

Courtesy of CRANberries, there is also a diffstat report for the most recent CRAN release. As always, more detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

June 06, 2015 12:18 AM

May 31, 2015

Dirk Eddelbuettel

rfoaas 0.1.6

After a few quiet months, a new version of rfoaas is now on CRAN. As before, it shadows upstream release of FOAAS from which carried over the version number 0.1.6.

The rfoaas package provides an interface for R to the most excellent FOAAS service--which provides a modern, scalable and RESTful web service for the frequent need to tell someone to f$#@ off.

Release 0.1.6 of FOAAS builds on the initial support for filters and now adds working internationalization. This is best shown by example:

R> library(rfoaas)
R> off("Tom", "Everyone")                                      # standard upstream example
[1] "Fuck off, Tom. - Everyone"
R> off("Tom", "Everyone", language="fr")                       # now in French
[1] "Va te faire foutre, Tom. - Everyone"
R> off("Tom", "Everyone", language="de", filter="shoutcloud")  # or in German and LOUD

We start with a standard call to off(), add the (now fully functional upstream) language support and finally illustrate the filter added in the preceding 0.1.3 release.

As usual, CRANberries provides a diff to the previous CRAN release. Questions, comments etc should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

May 31, 2015 11:11 PM

May 27, 2015

Dirk Eddelbuettel

drat 0.0.4: Yet more features and documentation

A new version, now at 0.0.4, of the drat package arrived on CRAN yesterday. Its name stands for drat R Archive Template, and it helps with easy-to-create and easy-to-use repositories for R packages, and is finding increasing by other projects.

Version 0.0.4 brings both new code and more documentation:

  • support for binary repos on Windows and OS X thanks to Jan Schulz;
  • new (still raw) helper functions initRepo() to create a git-based repository, and pruneRepo() to remove older versions of packages;
  • the insertRepo() functions now uses tryCatch() around git commands (with thanks to Carl Boettiger);
  • when adding a file to a drat repo we ensure that the repo path does not contains spaces (with thank to Stefan Bache);
  • stress that file-based repos need a URL of the form file:/some/path with one colon but not two slashes (also thanks to Stefan Bache);
  • new Using Drat with Travis CI vignette thanks to Colin Gillespie;
  • new Drat FAQ vignette;
  • other fixes and extensions.

Courtesy of CRANberries, there is a comparison to the previous release. More detailed information is on the drat page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

May 27, 2015 12:06 PM

May 25, 2015

Statistical Modelling

The functional linear array model

The functional linear array model (FLAM) is a unified model class for functional regression models including function-on-scalar, scalar-on-function and function-on-function regression. Mean, median, quantile as well as generalized additive regression models for functional or scalar responses are contained as special cases in this general framework. Our implementation features a broad variety of covariate effects, such as, linear, smooth and interaction effects of grouping variables, scalar and functional covariates. Computational efficiency is achieved by representing the model as a generalized linear array model. While the array structure requires a common grid for functional responses, missing values are allowed. Estimation is conducted using a boosting algorithm, which allows for numerous covariates and automatic, data-driven model selection. To illustrate the flexibility of the model class we use three applications on curing of resin for car production, heat values of fossil fuels and Canadian climate data (the last one in the electronic supplement). These require function-on-scalar, scalar-on-function and function-on-function regression models, respectively, as well as additional capabilities such as robust regression, spatial functional regression, model selection and accommodation of missings. An implementation of our methods is provided in the R add-on package FDboost.

by Brockhaus, S., Scheipl, F., Hothorn, T., Greven, S. at May 25, 2015 09:27 AM

Cox regression models with functional covariates for survival data

We extend the Cox proportional hazards model to cases when the exposure is a densely sampled functional process, measured at baseline. The fundamental idea is to combine penalized signal regression with methods developed for mixed effects proportional hazards models. The model is fit by maximizing the penalized partial likelihood, with smoothing parameters estimated by a likelihood-based criterion such as AIC or EPIC. The model may be extended to allow for multiple functional predictors, time varying coefficients, and missing or unequally spaced data. Methods were inspired by and applied to a study of the association between time to death after hospital discharge and daily measures of disease severity collected in the intensive care unit, among survivors of acute respiratory distress syndrome.

by Gellar, J. E., Colantuoni, E., Needham, D. M., Crainiceanu, C. M. at May 25, 2015 09:27 AM

Two Cholesky-log-GARCH models for multivariate volatilities

Parsimonious estimation of high-dimensional covariance matrices is of fundamental importance in multivariate statistics. Typical examples occur in finance, where the instantaneous dependence among several asset returns should be taken into account. Multivariate GARCH processes have been established as a standard approach for modelling such data. However, the majority of GARCH-type models are either based on strong assumptions that may not be realistic or require restrictions that are often too hard to be satisfied in practice. We consider two alternative decompositions of time-varying covariance matrices $${\mathbf{\Sigma }}_{t}$$ . The first is based on the modified Cholesky decomposition of the covariance matrices and second relies on the hyperspherical parametrization of the standard Cholesky factor of their correlation matrices $${\hbox{ R }}_{t}$$ . Then, we combine each Cholesky factor with the log-GARCH models for the corresponding time–varying volatilities and use a quasi maximum likelihood approach to estimate the parameters. Using log-GARCH models is quite natural for achieving the positive definiteness of $${\mathbf{\Sigma }}_{t}$$ and this is a novelty of this work. Application of the proposed methodologies to two real financial datasets reveals their usefulness in terms of parsimony, ease of implementation and stresses the choice of the appropriate models using familiar data-driven processes such as various forms of the exploratory data analysis and regression.

by Pedeli, X., Fokianos, K., Pourahmadi, M. at May 25, 2015 09:27 AM

Dirichlet Lasso: A Bayesian approach to variable selection

Selection of the most important predictor variables in regression analysis is one of the key problems statistical research has been concerned with for long time. In this article, we propose the methodology, Dirichlet Lasso (abbreviated as DLASSO) to address this issue in a Bayesian framework. In many modern regression settings, large set of predictor variables are grouped and the coefficients belonging to any one of these groups are either all redundant or all important in predicting the response; we say in those cases that the predictors exhibit a group structure. We show that DLASSO is particularly useful where the group structure is not fully known. We exploit the clustering property of Dirichlet Process priors to infer the possibly missing group information. The Dirichlet Process has the advantage of simultaneously clustering the variable coefficients and selecting the best set of predictor variables. We compare the predictive performance of DLASSO to Group Lasso and ordinary Lasso with real data and simulation studies. Our results demonstrate that the predictive performance of DLASSO is almost as good as that of Group Lasso when group label information is given; and superior to the ordinary Lasso for missing group information. For high dimensional data (e.g., genetic data) with missing group information, DLASSO will be a powerful approach of variable selection since it provides a superior predictive performance and higher statistical accuracy.

by Das, K., Sobel, M. at May 25, 2015 09:27 AM

May 23, 2015


Parametric Inference: Likelihood Ratio Test Problem 2

More on Likelihood Ratio Test, the following problem is originally from Casella and Berger (2001), exercise 8.12.


For samples of size $n=1,4,16,64,100$ from a normal population with mean $\mu$ and known variance $\sigma^2$, plot the power function of the following LRTs (Likelihood Ratio Tests). Take $\alpha = .05$.
  1. $H_0:\mu\leq 0$ versus $H_1:\mu>0$
  2. $H_0:\mu=0$ versus $H_1:\mu\neq 0$


  1. The LRT statistic is given by $$ \lambda(\mathbf{x})=\frac{\displaystyle\sup_{\mu\leq 0}\mathcal{L}(\mu|\mathbf{x})}{\displaystyle\sup_{-\infty<\mu<\infty}\mathcal{L}(\mu|\mathbf{x})}, \;\text{since }\sigma^2\text{ is known}. $$ The denominator can be expanded as follows: $$ \begin{aligned} \sup_{-\infty<\mu<\infty}\mathcal{L}(\mu|\mathbf{x})&=\sup_{-\infty<\mu<\infty}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\ &=\sup_{-\infty<\mu<\infty}\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right],\\ &\quad\text{since }\bar{x}\text{ is the MLE of }\mu.\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{n-1}{n-1}\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right],\\ \end{aligned} $$ while the numerator is evaluated as follows: $$ \begin{aligned} \sup_{\mu\leq 0}\mathcal{L}(\mu|\mathbf{x})&=\sup_{\mu\leq 0}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\ &=\sup_{\mu\leq 0}\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right]. \end{aligned} $$ Above expression will attain its maximum if the value inside the exponential function is small. And for negative values of $\mu\in(-\infty,0)$ the quantity $(x_i-\mu)^2$ would be large, implies that the exponential term would become small. Therefore, the only value that will give us the supremum likelihood is $\mu=\mu_0=0$. Hence, $$ \begin{aligned} \sup_{\mu\leq 0}\mathcal{L}(\mu|\mathbf{x})&=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\mu_0)^2}{2\sigma^2}\right]\\ =\frac{1}{(2\pi\sigma^2)^{1/n}}&\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x}+\bar{x}-\mu_0)^2}{2\sigma^2}\right]\\ =\frac{1}{(2\pi\sigma^2)^{1/n}}&\exp\left\{-\displaystyle\sum_{i=1}^{n}\left[\frac{(x_i-\bar{x})^2+2(x_i-\bar{x})(\bar{x}-\mu_0)+(\bar{x}-\mu_0)^2}{2\sigma^2}\right]\right\}\\ =\frac{1}{(2\pi\sigma^2)^{1/n}}&\exp\left[-\frac{(n-1)s^2+n(\bar{x}-\mu_0)^2}{2\sigma^2}\right], \\ &\text{since the middle term is 0.}\\ =\frac{1}{(2\pi\sigma^2)^{1/n}}&\exp\left[-\frac{(n-1)s^2+n\bar{x}^2}{2\sigma^2}\right], \text{since }\mu_0=0.\\ \end{aligned} $$ So that $$ \begin{equation} \label{eq:lrtre} \begin{aligned} \lambda(\mathbf{x})&=\frac{\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{(n-1)s^2+n\bar{x}^2}{2\sigma^2}\right]}{\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right]}\\ &=\exp\left[-\frac{n\bar{x}^2}{2\sigma^2}\right].\\ \end{aligned} \end{equation} $$ And we reject the null hypothesis if $\lambda(\mathbf{x})\leq c$, that is $$ \begin{aligned} \exp\left[-\frac{n\bar{x}^2}{2\sigma^2}\right]&\leq c\\ -\frac{n\bar{x}^2}{2\sigma^2}&\leq \log c\\ \frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}&\geq\sqrt{-2\log c}=c'. \end{aligned} $$
    Figure 1: Plot of Likelihood Ratio Test Statistic for $n = 4,\sigma = 1$.

    Hence, rejecting the null hypothesis if $\lambda(\mathbf{x})\leq c$, is equivalent to rejecting $H_0$ if $\frac{\bar{x}}{\sigma/\sqrt{n}}\geq c'\in[0,\infty)$. Figure 1 depicts the plot of the LRT, the shaded region is on the positive side because that's where the alternative region is, $H_1:\mu>0$, in a sense that if the LRT is small enough to reject $H_0$, then it simply tells us that the plausibility of the parameter in the alternative in explaining the sample is higher compared to the null hypothesis. And if that's the case, we expect the sample to come from the model proposed by $H_1$, so that the sample mean $\bar{x}$, being an unbiased estimator of the population mean $\mu$, a function of the LRT statistic, should fall on the side (shaded region) of the alternative.

    So that the power function, that is the probability of rejecting the null hypothesis given that it is true (the probability of Type I error) is, $$ \begin{aligned} \beta(\mu)&=\mathrm{P}\left[\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\geq c'\right],\quad\mu_0=0\\ &=1-\mathrm{P}\left[\frac{\bar{x}+\mu-\mu-\mu_0}{\sigma/\sqrt{n}}< c'\right]\\ &=1-\mathrm{P}\left[\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} + \frac{\mu-\mu_0}{\sigma/\sqrt{n}}< c'\right]\\ &=1-\mathrm{P}\left[\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}< c'+ \frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]\\ &=1-\Phi\left[c'+ \frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]. \end{aligned} $$ Values taken by $\Phi$ are negative and so it decreases, but since we subtracted it to 1, then $\beta(\mu)$ is an increasing function. So that for $\alpha=.05$, $$ \begin{aligned} \alpha&=\sup_{\mu\leq \mu_0}\beta(\mu)\\ .05&=\beta(\mu_0)\Rightarrow\beta(\mu_0)=1-\Phi(c')\\ .95&=\Phi(c')\Rightarrow c'=1.645. \end{aligned} $$ Since, $$ \begin{aligned} \Phi(1.645)=\int_{-\infty}^{1.645}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x^2}{2}\right]\operatorname{d}x=.9500151. \end{aligned} $$ Therefore for $c'=1.645,\mu_0=0,\sigma=1$, the plot of the power function as a function of $\mu$ for different sample size, $n$, is shown in Figure 2. For example, for $n=1$ we compute for the function \begin{equation} \label{eq:powcomp} \begin{aligned} \beta(\mu)&=1-\Phi\left[c'+ \frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]\\ &=1-\Phi\left[1.645+ \frac{0-\mu}{1/\sqrt{1}}\right]\\ &=1-\int_{-\infty}^{\left(1.645+ \frac{0-\mu}{1/\sqrt{1}}\right)}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x^2}{2}\right]\operatorname{d}x. \end{aligned} \end{equation} The obtained values would be the $y$. For $n = 64$ $$ \begin{aligned} \beta(\mu)&=1-\Phi\left[c'+ \frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]\\ &=1-\Phi\left[1.645+ \frac{0-\mu}{1/\sqrt{64}}\right]\\ &=1-\int_{-\infty}^{\left(1.645+ \frac{0-\mu}{1/\sqrt{64}}\right)}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x^2}{2}\right]\operatorname{d}x, \end{aligned} $$ and so on.
    Figure 2: Power Function for Different Values of $n$.

  2. The LRT statistic is given by $$ \lambda(\mathbf{x})=\frac{\displaystyle\sup_{\mu= 0}\mathcal{L}(\mu|\mathbf{x})}{\displaystyle\sup_{-\infty<\mu<\infty}\mathcal{L}(\mu|\mathbf{x})}, \;\text{since }\sigma^2\text{ is known}. $$ The denominator can be expanded as follows: $$ \begin{aligned} \sup_{-\infty<\mu<\infty}\mathcal{L}(\mu|\mathbf{x})&=\sup_{-\infty<\mu<\infty}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\ &=\sup_{-\infty<\mu<\infty}\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right],\\ &\quad\;\text{since }\bar{x}\text{ is the MLE of }\mu.\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{n-1}{n-1}\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right],\\ \end{aligned} $$ and the numerator is evaluated as follows: $$ \begin{aligned} \sup_{\mu=0}\mathcal{L}(\mu|\mathbf{x})&=\sup_{\mu=0}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\ &=\sup_{\mu=0}\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-0)^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{(n-1)s^2+n\bar{x}^2}{2\sigma^2}\right], \end{aligned} $$ we skip some lines in the above simplification since we've done this already in part (a). And by Equation (1), $\lambda(\mathbf{x})=\exp\left[-\frac{n\bar{x}^2}{2\sigma^2}\right]$. So that $\lambda(\mathbf{x})\leq c$ would be $$ \begin{aligned} \exp\left[-\frac{n\bar{x}^2}{2\sigma^2}\right]&\leq c\\ -\frac{n\bar{x}^2}{2\sigma^2}&\leq \log c\\ \frac{\lvert\bar{x}-\mu_0\rvert}{\sigma/\sqrt{n}}&\geq\sqrt{-2\log c}=c',\quad \mu_0=0. \end{aligned} $$ So rejecting the null hypothesis if $\lambda(\mathbf{x})\leq c'$ is equivalent to rejecting $H_0$ if $\frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}\geq c'$. And since $H_1$ is two-sided, then we reject $H_0$ if $\frac{\bar{x}}{\sigma/\sqrt{n}}\geq c'$ or $\frac{\bar{x}}{\sigma/\sqrt{n}}\leq -c'$. To illustrate this, consider Figure 3 where the two shaded regions are the lower and upper rejection regions.
    Figure 3: Plot of Likelihood Ratio Test Statistic for $n = 4,\sigma = 1$.

    So that the power function is, $$ \begin{aligned} \beta(\mu)&=\mathrm{P}\left[\frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}\geq c'\right]\\ &=1 - \mathrm{P}\left[\frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}< c'\right]\\ &=1 - \mathrm{P}\left[-c'<\frac{\bar{x}}{\sigma/\sqrt{n}}< c'\right]\\ &=1 - \left\{\mathrm{P}\left[\frac{\bar{x}}{\sigma/\sqrt{n}}< c'\right]-\mathrm{P}\left[\frac{\bar{x}}{\sigma/\sqrt{n}}< -c'\right]\right\}\\ &=1 - \left\{\mathrm{P}\left[\frac{\bar{x}+\mu-\mu}{\sigma/\sqrt{n}}< c'\right]-\mathrm{P}\left[\frac{\bar{x}+\mu-\mu}{\sigma/\sqrt{n}}< -c'\right]\right\}\\ &=1 - \mathrm{P}\left[\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}< c'-\frac{\mu}{\sigma/\sqrt{n}}\right]+\mathrm{P}\left[\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}< -c'-\frac{\mu}{\sigma/\sqrt{n}}\right]\\ &=\underbrace{1 - \Phi\left[c'-\frac{\mu}{\sigma/\sqrt{n}}\right]}_{\Phi_1}+\underbrace{\Phi\left[-c'-\frac{\mu}{\sigma/\sqrt{n}}\right]}_{\Phi_2}. \end{aligned} $$ Notice that $\Phi_1$ is an increasing function, while $\Phi_2$ is decreasing as a function of $\mu$. We expect this since the alternative hypothesis is a two-sided one, so does the power. To see this, consider Figure 4 for different values of $n$.
    Figure 4: Two-Sided Power Function for Different $n$.

    The points in the plot are computed by substituting values of $\mu=0,\sigma=1$ and $n$ to the power function just like we did in Equation (2).


  1. Casella, G. and Berger, R.L. (2001). Statistical Inference. Thomson Learning, Inc.
  2. Felix Schönbrodt. Shading regions of the normal: The Stanine scale. Retrieved May 2015.

by Al-Ahmadgaid Asaad ( at May 23, 2015 08:42 AM

May 22, 2015


Parametric Inference: Likelihood Ratio Test Problem 1

Another post for mathematical statistics, the problem below is originally from Casella and Berger (2001) (see Reference 1), exercise 8.6.


  1. Suppose that we have two independent random samples $X_1,\cdots, X_n$ are exponential($\theta$), and $Y_1,\cdots, Y_m$ are exponential($\mu$).
    1. Find the LRT (Likelihood Ratio Test) of $H_0:\theta=\mu$ versus $H_1:\theta\neq\mu$.
    2. Show that the test in part (a) can be based on the statistic
    3. $$ T=\frac{\sum X_i}{\sum X_i+\sum Y_i}. $$
    4. Find the distribution of $T$ when $H_0$ is true.


    1. The Likelihood Ratio Test is given by $$ \lambda(\mathbf{x},\mathbf{y}) = \frac{\displaystyle\sup_{\theta = \mu,\mu>0}\mathrm{P}(\mathbf{x},\mathbf{y}|\theta,\mu)}{\displaystyle\sup_{\theta > 0,\mu>0}\mathrm{P}(\mathbf{x}, \mathbf{y}|\theta,\mu)}, $$ where the denominator is evaluated as follows: $$ \sup_{\theta > 0,\mu>0}\mathrm{P}(\mathbf{x}, \mathbf{y}|\theta,\mu)= \sup_{\theta > 0}\mathrm{P}(\mathbf{x}|\theta)\sup_{\mu > 0}\mathrm{P}(\mathbf{y}|\mu),\quad\text{by independence.} $$ So that, $$ \begin{aligned} \sup_{\theta > 0}\mathrm{P}(\mathbf{x}|\theta)&=\sup_{\theta>0}\prod_{i=1}^{n}\frac{1}{\theta}\exp\left[-\frac{x_i}{\theta}\right]=\sup_{\theta>0}\frac{1}{\theta^n}\exp\left[-\frac{\sum_{i=1}^{n}x_i}{\theta}\right]\\ &=\frac{1}{\bar{x}^n}\exp\left[-\frac{\sum_{i=1}^{n}x_i}{\bar{x}}\right]=\frac{1}{\bar{x}^n}\exp[-n], \end{aligned} $$ since $\bar{x}$, or the sample mean is the MLE of $\theta$. Also, $$ \begin{aligned} \sup_{\mu > 0}\mathrm{P}(\mathbf{y}|\mu)&=\sup_{\mu>0}\prod_{j=1}^{m}\frac{1}{\mu}\exp\left[-\frac{y_j}{\mu}\right]=\sup_{\mu>0}\frac{1}{\mu^m}\exp\left[-\frac{\sum_{j=1}^{m}y_j}{\mu}\right]\\ &=\frac{1}{\bar{y}^m}\exp\left[-\frac{\sum_{j=1}^{m}y_j}{\bar{y}}\right]=\frac{1}{\bar{y}^m}\exp[-m]. \end{aligned} $$ Now the numerator is evaluated as follows, $$ \begin{aligned} \sup_{\theta = \mu,\mu>0}\mathrm{P}(\mathbf{x},\mathbf{y}|\theta,\mu)&=\sup_{\theta=\mu,\mu>0}\mathrm{P}(\mathbf{x}|\theta)\mathrm{P}(\mathbf{y}|\mu),\quad\text{by independence.}\\ &=\sup_{\theta=\mu,\mu>0}\prod_{i=1}^{n}\frac{1}{\theta}\exp\left[-\frac{x_i}{\theta}\right]\prod_{j=1}^{m}\frac{1}{\mu}\exp\left[-\frac{y_j}{\mu}\right]\\ &=\sup_{\theta=\mu,\mu>0}\frac{1}{\theta^n}\exp\left[-\frac{\sum_{i=1}^nx_i}{\theta}\right]\frac{1}{\mu^m}\exp\left[-\frac{\sum_{j=1}^m y_j}{\mu}\right]\\ &=\sup_{\mu>0}\frac{1}{\mu^n}\exp\left[-\frac{\sum_{i=1}^nx_i}{\mu}\right]\frac{1}{\mu^m}\exp\left[-\frac{\sum_{j=1}^m y_j}{\mu}\right]\\ &=\sup_{\mu>0}\frac{1}{\mu^{n+m}}\exp\left\{-\frac{1}{\mu}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]\right\} \end{aligned} $$ Note that $\mu$ is a nuisance parameter, and so we will also maximize this over its domain. And to do that we take the log-likeihood function first, $$ \begin{aligned} \ell(\mu|\mathbf{x},\mathbf{y})&=-\log(\mu^{n+m})-\frac{1}{\mu}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]\\ &=-(n+m)\log(\mu)-\frac{1}{\mu}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]. \end{aligned} $$ Taking the derivative with respect to $\mu$, gives us $$ \frac{\operatorname{d}}{\operatorname{d}\mu}\ell(\mu|\mathbf{x},\mathbf{y})=-(n+m)\frac{1}{\mu}+\frac{1}{\mu^2}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right], $$ equate this to zero to obtain the stationary point, $$ \begin{aligned} -(n+m)\frac{1}{\mu}+\frac{1}{\mu^2}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]&=0\\ -(n+m)\mu+\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]&=0\\ \mu&=\frac{1}{n+m}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]. \end{aligned} $$ To verify if this is the MLE, we take the second derivative test for the log-likelihood function, $$ \frac{\operatorname{d}^2}{\operatorname{d}\mu^2}\ell(\mu|\mathbf{x},\mathbf{y})=(n+m)\frac{1}{\mu^2}-\frac{2}{\mu^3}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]<0, $$ since $\frac{1}{\mu^2}<\frac{2}{\mu^3}$, implying $\hat{\mu}=\displaystyle\frac{1}{n+m}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]$ is the MLE of $\mu$. Thus the LRT, $\lambda(\mathbf{x},\mathbf{y})$ would be, $$ \begin{aligned} \lambda(\mathbf{x},\mathbf{y})&=\frac{\sup_{\mu>0}\displaystyle\frac{1}{\mu^{n+m}}\exp\left\{-\frac{1}{\mu}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]\right\}}{\displaystyle\frac{1}{\bar{x}^n}\frac{1}{\bar{y}^m}\exp[-(n+m)]}\\ &=\left(\frac{1}{\frac{1}{{(n+m)}^{n+m}}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]^{n+m}}\times\right.\\ &\qquad\left.\exp\left\{-\frac{1}{\frac{1}{n+m}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]\right\}\right)\bigg/\\ &\qquad\qquad\qquad\displaystyle\frac{1}{\bar{x}^n}\frac{1}{\bar{y}^m}\exp[-(n+m)] \end{aligned} $$ $$ \begin{aligned} &=\frac{\displaystyle\frac{1}{\displaystyle\frac{1}{{(n+m)}^{n+m}}\left[\displaystyle\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]^{n+m}}\times\exp[-(n+m)]}{\displaystyle\frac{1}{\bar{x}^n}\frac{1}{\bar{y}^m}\exp[-(n+m)]}\\[.3cm] &=\frac{\displaystyle \bar{x}^n \bar{y}^m}{\displaystyle\frac{1}{{(n+m)}^{n+m}}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]^{n+m}}. \end{aligned} $$ And we say that $H_0$ is rejected if $\lambda(\mathbf{x},\mathbf{y})\leq c$.
    2. If we do some algebra on the LRT in part (a), we obtain the following: $$ \begin{aligned} \lambda(\mathbf{x},\mathbf{y})&=\frac{\displaystyle \bar{x}^n \bar{y}^m}{\displaystyle\frac{1}{{(n+m)}^{n+m}}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]^{n+m}}\\ &=\frac{\displaystyle\frac{1}{n^n}\left(\sum_{i=1}^{n}x_i\right)^{n}\frac{1}{m^{m}}\left(\sum_{j=1}^{m}y_j\right)^{m}}{\displaystyle\frac{1}{(n+m)^{n+m}}\left[\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j\right]^{n+m}}\\ &=\frac{\displaystyle (n+m)^{n+m}\left(\sum_{i=1}^{n}x_i\right)^{n}\left(\sum_{j=1}^{m}y_j\right)^{m}}{\displaystyle n^{n}m^{m}\left[\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j\right]^{n+m}}\\ &=\frac{(n+m)^{n+m}}{n^nm^{m}}\left[\frac{\displaystyle \sum_{j=1}^{m}y_j}{\displaystyle\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j}\right]^{m}\left[\frac{\displaystyle \sum_{i=1}^{n}x_i}{\displaystyle\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j}\right]^{n}\\ &=\frac{(n+m)^{n+m}}{n^nm^{m}}\left[1-\frac{\displaystyle \sum_{i=1}^{n}x_i}{\displaystyle\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j}\right]^{m}\left[\frac{\displaystyle \sum_{i=1}^{n}x_i}{\displaystyle\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j}\right]^{n}\\ &=\frac{(n+m)^{n+m}}{n^nm^{m}}\left[1-T\right]^{m}\left[T\right]^{n}. \end{aligned} $$ Hence, the LRT can be based on the statistic $T$.
    3. The distribution of $\sum X_i$ is obtain using the MGF (Moment Generating Function) technique, that is $$ \begin{aligned} \mathrm{M}_{\Sigma X_i}(t)&=\mathrm{E}\exp[t\Sigma X_i]=\mathrm{E}\exp[tX_1 +\cdots + tX_n]\\ &=\mathrm{E}\exp[tX_1]\times\cdots\times\mathrm{E}\exp[tX_n],\quad\text{by independence.}\\ &=\frac{1}{1-\theta t}\times\cdots\times\frac{1}{1-\theta t}\\ &=\left(\frac{1}{1-\theta t}\right)^{n}=\text{MGF of gamma}(n,\theta). \end{aligned} $$ Now, when $H_0$ is true then $\sum X_i$ is gamma($m,\theta$). For brevity, let $X=\sum_{i=1}^{n} X_i$ and $Y=\sum_{j=1}^{m}Y_j$. The joint distribution of $X$ and $Y$ is given below, $$ f_{XY}(x, y)=\frac{1}{\Gamma (n)\theta^{n}}x^{n-1}\exp[-x/\theta]\times\frac{1}{\Gamma (m)\theta^{m}}y^{m-1}\exp[-y/\theta]. $$ Let $U=\frac{X}{X+Y}$ and $V=X+Y$, then the support of $(X,Y)$ is $\mathcal{A}=\left\{(x,y)\in \mathbb{R}^{+}\times \mathbb{R}^{+}\right\}$. Since the transformations $U$ and $V$ is one-to-one and onto, then $\mathcal{B}=\left\{(u,v)\in [0,1]\times \mathbb{R}^{+}\right\}$. Consider the following transformations $$ u=g_{1}(x,y)=\frac{x}{x+y}\quad\text{and}\quad v=g_{2}(x,y)=x+y. $$ Then, \begin{equation} \label{eq:bvt1} u=\frac{x}{x+y}\Rightarrow x=\frac{uy}{1-u} \end{equation} and \begin{equation} \label{eq:bvt2} v=x+y\Rightarrow y = v-x. \end{equation} Substitute Equation (\ref{eq:bvt2}) to Equation (\ref{eq:bvt1}), then $$ \begin{aligned} x=\frac{u(v-x)}{1-u}&\Rightarrow x(1-u)=u(v-x)\\ x-ux=uv-ux&\Rightarrow x=uv=h_{1}(u,v). \end{aligned} $$ Substitute $x$ above to Equation (\ref{eq:bvt2}) to obtain, $$y=v(1-u)=h_2(u,v).$$ And the Jacobian matrix is, $$ \mathbf{J}=\bigg| \begin{array}{cc} v&u\\[.2cm] -v&1-u \end{array} \bigg|=v(1-u)+uv=v. $$ So that, $$ \begin{aligned} f_{UV}(u,v)&=f_{XY}(h_1(u,v),h_2(u,v))\lvert \mathbf{J}\rvert=f_{XY}(uv,v(1-u))\lvert v\rvert\\ &=\frac{1}{\Gamma (n)\theta^{n}}(uv)^{n-1}\exp[-uv/\theta]\times\\ &\quad\;\frac{1}{\Gamma (m)\theta^{m}}(v(1-u))^{m-1}\exp[-v(1-u)/\theta]v\\ &=\frac{1}{\Gamma (n)\theta^{n}}(uv)^{n-1}\exp[-uv/\theta]\times\\ &\quad\;\frac{1}{\Gamma (m)\theta^{m}}(v(1-u))^{m-1}\exp[-v/\theta]\exp[uv/\theta]v\\ &=\frac{1}{\Gamma (n)\theta^{n}}u^{n-1}v^{n-1}\times\frac{1}{\Gamma (m)\theta^{m}}v^{m-1}(1-u)^{m-1}\exp[-v/\theta]v\\ &=\frac{1}{\Gamma (n)}\underbrace{u^{n-1}(1-u)^{m-1}}_{\text{Beta}(n,m)\text{ kernel}}\frac{1}{\Gamma (m)\theta^{m+n}}v^{m-1}v^{n-1}\exp[-v/\theta]v\\ &=\frac{\Gamma(m)\Gamma(m+n)}{\Gamma(m)\Gamma(m+n)}\frac{u^{n-1}(1-u)^{m-1}}{\Gamma (n)}\times\\ &\quad\;\frac{1}{\Gamma (m)\theta^{m+n}}v^{m-1}v^{n}\exp[-v/\theta]\\ &=\underbrace{\frac{\Gamma(m+n)}{\Gamma (n)\Gamma(m)}u^{n-1}(1-u)^{m-1}}_{\text{Beta}(n,m)}\times\\ &\quad\;\underbrace{\frac{1}{\Gamma(m+n)\theta^{m+n}}v^{m+n-1}\exp[-v/\theta]}_{\text{Gamma}(m+n,\theta)}. \end{aligned} $$ So that the marginal density of $U=\displaystyle\frac{\sum X_i}{\sum X_i +\sum Y_i}$ is Beta($n,m$).


  1. Casella, G. and Berger, R.L. (2001). Statistical Inference. Thomson Learning, Inc.

by Al-Ahmadgaid Asaad ( at May 22, 2015 06:19 AM

May 15, 2015


Parametric Inference: The Power Function of the Test

In Statistics, we model random phenomenon and make conclusions about its population. For example, in an experiment of determining the true heights of the students in the university. Suppose we take sample from the population of the students, and consider testing the null hypothesis that the average height is 5.4 ft against an alternative hypothesis that the average height is greater than 5.4 ft. Mathematically, we can represent this as $H_0:\theta=\theta_0$ vs $H_1:\theta>\theta_0$, where $\theta$ is the true value of the parameter, and $\theta_0=5.4$ is the testing value set by the experimenter. And because we only consider subset (the sample) of the population for testing the hypotheses, then we expect for errors we commit. To understand these errors, consider if the above test results into rejecting $H_0$ given that $\theta\in\Theta_0$, where $\Theta_0$ is the parameter space of the null hypothesis, in other words we mistakenly reject $H_0$, then in this case we committed a Type I error. Another is, if the above test results into accepting $H_0$ given that $\theta\in\Theta_0^c$, where $\Theta_0^c$ is the parameter space of the alternative hypothesis, then we committed a Type II error. To summarize this consider the following table,

Table 1: Two Types of Errors in Hypothesis Testing.
Accept $H_0$Reject $H_0$
$H_0$Correct DecisionType I Error
$H_1$Type II ErrorCorrect Decision

Let's formally define the power function, from Casella and Berger (2001), see reference 1.
Definition 1. The power function of a hypothesis test with rejection region $R$ is the function of $\theta$ defined by $\beta(\theta)=\mathrm{P}_{\theta}(\mathbf{X}\in R)$.
To relate the definition to the above problem, if $R$ is the rejection region of $H_0$. Then we make mistake if the sample observed, $\mathbf{x}$, $\mathbf{x}\in R$ given that $\theta\in\Theta_0$. That is, $\beta(\theta)=\mathrm{P}_{\theta}(\mathbf{X}\in R)$ is the probability of Type I error. Let's consider an example, one that is popularly used in testing the sample mean. The example below is the combined problem of Example 8.3.3 and Exercise 8.37 (a) of reference 1.

Example 1. Let $X_1,\cdots, X_n\overset{r.s.}{\sim}N(\mu,\sigma^2)$ -- normal population where $\sigma^2$ is known. Consider testing $H_0:\theta\leq \theta_0$ vs $H_1:\theta> \theta_0$, obtain the likelihood ratio test (LRT) statistic and its power function.

Solution:The LRT statistic is given by $$ \lambda(\mathbf{x})=\frac{\displaystyle\sup_{\theta\leq\theta_0}L(\theta|\mathbf{x})}{\displaystyle\sup_{-\infty<\theta<\infty}L(\theta|\mathbf{x})}, $$ where $$ \begin{aligned} \sup_{\theta\leq\theta_0}L(\theta|\mathbf{x})&=\sup_{\theta\leq\theta_0}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\theta)^2}{2\sigma^2}\right]\\ &=\sup_{\theta\leq\theta_0}\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\theta)^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\theta_0)^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x}+\bar{x}-\theta_0)^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left\{-\displaystyle\sum_{i=1}^{n}\left[\frac{(x_i-\bar{x})^2+2(x_i-\bar{x})(\bar{x}-\theta_0)+(\bar{x}-\theta_0)^2}{2\sigma^2}\right]\right\}\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{(n-1)s^2+n(\bar{x}-\theta_0)^2}{2\sigma^2}\right], \text{since the middle term is 0.} \end{aligned} $$ And $$ \begin{aligned} \sup_{-\infty<\theta<\infty}L(\theta|\mathbf{x})&=\sup_{-\infty<\theta<\infty}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\theta)^2}{2\sigma^2}\right]\\ &=\sup_{-\infty<\theta<\infty}\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\theta)^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right],\quad\text{since }\bar{x}\text{ is the MLE of }\theta.\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{n-1}{n-1}\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right]\\ &=\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right],\\ \end{aligned} $$ so that $$ \begin{aligned} \lambda(\mathbf{x})&=\frac{\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{(n-1)s^2+n(\bar{x}-\theta_0)^2}{2\sigma^2}\right]}{\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right]}\\ &=\exp\left[-\frac{n(\bar{x}-\theta_0)^2}{2\sigma^2}\right].\\ \end{aligned} $$ And from my previous entry, $\lambda(\mathbf{x})$ is rejected if it is small, such that $\lambda(\mathbf{x})\leq c$ for some $c\in[0,1]$. Hence, $$ \begin{aligned} \lambda(\mathbf{x})&=\exp\left[-\frac{n(\bar{x}-\theta_0)^2}{2\sigma^2}\right]< c\\&\Rightarrow-\frac{n(\bar{x}-\theta_0)^2}{2\sigma^2}<\log c\\ &\Rightarrow\frac{\bar{x}-\theta_0}{\sigma/\sqrt{n}}>\sqrt{-2\log c}. \end{aligned} $$ So that $H_0$ is rejected if $\frac{\bar{x}-\theta_0}{\sigma/\sqrt{n}}> c'$ for some $c'=\sqrt{-2\log c}\in[0,\infty)$. Now the power function of the test, is the probability of rejecting the null hypothesis given that it is true, or the probability of the Type I error given by, $$ \begin{aligned} \beta(\theta)&=\mathrm{P}\left[\frac{\bar{x}-\theta_0}{\sigma/\sqrt{n}}> c'\right]\\ &=\mathrm{P}\left[\frac{\bar{x}-\theta+\theta-\theta_0}{\sigma/\sqrt{n}}> c'\right]\\ &=\mathrm{P}\left[\frac{\bar{x}-\theta}{\sigma/\sqrt{n}}+\frac{\theta-\theta_0}{\sigma/\sqrt{n}}> c'\right]\\ &=\mathrm{P}\left[\frac{\bar{x}-\theta}{\sigma/\sqrt{n}}> c'-\frac{\theta-\theta_0}{\sigma/\sqrt{n}}\right]\\ &=1-\mathrm{P}\left[\frac{\bar{x}-\theta}{\sigma/\sqrt{n}}\leq c'+\frac{\theta_0-\theta}{\sigma/\sqrt{n}}\right]\\ &=1-\Phi\left[c'+\frac{\theta_0-\theta}{\sigma/\sqrt{n}}\right]. \end{aligned} $$ To illustrate this, consider $\theta_0=5.4,\sigma = 1,n=30$ and $c'=1.645$. Then the plot of the power function as a function of $\theta$ is,
Power Function
Since $\beta$ is an increasing function with unit range, then $$ \alpha = \sup_{\theta\leq\theta_0}\beta(\theta)=\beta(\theta_0)=1-\Phi(c'). $$ So that using values we set for the above graph, $\alpha=0.049985\approx 0.05$, $\alpha$ here is called the size of the test since it is the supremum of the power function over $\theta\leq\theta_0$, see reference 1 for level of the test. Now let's investigate the power function above, the probability of committing Type I error, $\beta(\theta), \forall \theta\leq \theta_0$, is acceptably small. However, the probability of committing Type II error, $1-\beta(\theta), \forall \theta > \theta_0$, is too high as we can see in the following plot,
Type II Error
Therefore, it's better to investigate the error structure when considering the power of the test. From Casella and Berger (2001), the ideal power function is 0 $\forall\theta\in\Theta_0$ and 1 $\forall\theta\in\Theta_0^c$. Except in trivial situations, this ideal cannot be attained. Qualitatively, a good test has power function near 1 for most $\theta\in\Theta_0^c$ and $\theta\in\Theta_0$. Implying, one that has steeper power curve.

Now an interesting fact about power function is that it depends on the sample size $n$. Suppose in our experiment above we want the Type I error to be 0.05 and the Type II error to be 0.1 if $\theta\geq \theta_0+\sigma/2$. Since the power function is increasing, then we have $$ \beta(\theta_0)=0.05\Rightarrow c'=1.645\quad\text{and}\quad 1 - \beta(\theta_0+\sigma/2)=0.1\Rightarrow\beta(\theta_0+\sigma/2)=0.9. $$ Where $$ \begin{aligned} \beta(\theta_0+\sigma/2)&=1-\Phi\left[c' +\frac{\theta_0-\sigma/2-\theta_0}{\sigma/\sqrt{n}}\right]\\ &=1-\Phi\left[c' - \frac{\sqrt{n}}{2}\right]\\ 0.9&=1-\Phi\left[1.645 - \frac{\sqrt{n}}{2}\right]\\ 0.1&=\Phi\left[1.645 - \frac{\sqrt{n}}{2}\right].\\ \end{aligned} $$ Hence, $n$ is chosen such that it solves the above equation. That is, $$ \begin{aligned} 1.645 - \frac{\sqrt{n}}{2}&=-1.28155,\quad\text{since }\Phi(-1.28155)=0.1\\ \frac{3.29 - \sqrt{n}}{2}&=-1.28155\\ 3.29 - \sqrt{n}&=-2.5631\\ n&=(3.29+2.5631)^2=34.25878,\;\text{take }n=35. \end{aligned} $$ For purpose of illustration, we'll consider the non-rounded value of $n$. Below is the plot of this,
Power Function with Sample Size
And for different values of $n$, consider the following power functions
Effect of Sample Size on Power Function
From the above plot, the larger the sample size, $n$, the steeper the curve implying a better error structure. To see this, try hovering over the lines in the plot, and you'll witness a fast departure for values of large $n$ on the unit range, this characteristics contribute to the sensitivity of the test.

Plot's Python Codes

In case you want to reproduce the above plots, click here for the source code.


  1. Casella, G. and Berger, R.L. (2001). Statistical Inference. Thomson Learning, Inc.
  2. Plotly Python Library Documentation

by Al-Ahmadgaid Asaad ( at May 15, 2015 12:13 PM

May 12, 2015

Chris Lawrence

That'll leave a mark

Here’s a phrase you never want to see in print (in a legal decision, no less) pertaining to your academic research: “The IRB process, however, was improperly engaged by the Dartmouth researcher and ignored completely by the Stanford researchers.”

Whole thing here; it’s a doozy.

by Chris Lawrence at May 12, 2015 12:00 AM

April 27, 2015


Parametric Inference: Likelihood Ratio Test by Example

Hypothesis testing have been extensively used on different discipline of science. And in this post, I will attempt on discussing the basic theory behind this, the Likelihood Ratio Test (LRT) defined below from Casella and Berger (2001), see reference 1.
Definition. The likelihood ratio test statistic for testing $H_0:\theta\in\Theta_0$ versus $H_1:\theta\in\Theta_0^c$ is \begin{equation} \label{eq:lrt} \lambda(\mathbf{x})=\frac{\displaystyle\sup_{\theta\in\Theta_0}L(\theta|\mathbf{x})}{\displaystyle\sup_{\theta\in\Theta}L(\theta|\mathbf{x})}. \end{equation} A likelihood ratio test (LRT) is any test that has a rejection region of the form $\{\mathbf{x}:\lambda(\mathbf{x})\leq c\}$, where $c$ is any number satisfying $0\leq c \leq 1$.
The numerator of equation (\ref{eq:lrt}) gives us the supremum probability of the parameter, $\theta$, over the restricted domain (null hypothesis, $\Theta_0$) of the parameter space $\Theta$, that maximizes the joint probability of the sample, $\mathbf{x}$. While the denominator of the LRT gives us the supremum probability of the parameter, $\theta$, over the unrestricted domain, $\Theta$, that maximizes the joint probability of the sample, $\mathbf{x}$. Therefore, if the value of $\lambda(\mathbf{x})$ is small such that $\lambda(\mathbf{x})\leq c$, for some $c\in [0, 1]$, then the true value of the parameter that is plausible in explaining the sample is likely to be in the alternative hypothesis, $\Theta_0^c$.

Example 1. Let $X_1,X_2,\cdots,X_n\overset{r.s.}{\sim}f(x|\theta)=\frac{1}{\theta}\exp\left[-\frac{x}{\theta}\right],x>0,\theta>0$. From this sample, consider testing $H_0:\theta = \theta_0$ vs $H_1:\theta<\theta_0$.

The parameter space $\Theta$ is the set $(0,\Theta_0]$, where $\Theta_0=\{\theta_0\}$. Hence, using the likelihood ratio test, we have $$ \lambda(\mathbf{x})=\frac{\displaystyle\sup_{\theta=\theta_0}L(\theta|\mathbf{x})}{\displaystyle\sup_{\theta\leq\theta_0}L(\theta|\mathbf{x})}, $$ where, $$ \begin{aligned} \sup_{\theta=\theta_0}L(\theta|\mathbf{x})&=\sup_{\theta=\theta_0}\prod_{i=1}^{n}\frac{1}{\theta}\exp\left[-\frac{x_i}{\theta}\right]\\ &=\sup_{\theta=\theta_0}\left(\frac{1}{\theta}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta}\right]\\ &=\left(\frac{1}{\theta_0}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta_0}\right], \end{aligned} $$ and $$ \begin{aligned} \sup_{\theta\leq\theta_0}L(\theta|\mathbf{x})&=\sup_{\theta\leq\theta_0}\prod_{i=1}^{n}\frac{1}{\theta}\exp\left[-\frac{x_i}{\theta}\right]\\ &=\sup_{\theta\leq\theta_0}\left(\frac{1}{\theta}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta}\right]=\sup_{\theta\leq\theta_0}f(\mathbf{x}|\theta). \end{aligned} $$ Now the supremum of $f(\mathbf{x}|\theta)$ over all values of $\theta\leq\theta_0$ is the MLE (maximum likelihood estimator) of $f(x|\theta)$, which is $\bar{x}$, provided that $\bar{x}\leq \theta_0$.

So that, $$ \begin{aligned} \lambda(\mathbf{x})&=\frac{\left(\frac{1}{\theta_0}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta_0}\right]} {\left(\frac{1}{\bar{x}}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\bar{x}}\right]},\quad\text{provided that}\;\bar{x}\leq \theta_0\\ &=\left(\frac{\bar{x}}{\theta_0}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta_0}\right]\exp[n]. \end{aligned} $$ And we say that, if $\lambda(\mathbf{x})\leq c$, $H_0$ is rejected. That is, $$ \begin{aligned} \left(\frac{\bar{x}}{\theta_0}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta_0}\right]\exp[n]&\leq c\\ \left(\frac{\bar{x}}{\theta_0}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta_0}\right]&\leq c',\quad\text{where}\;c'=\frac{c}{\exp[n]}\\ n\log\left(\frac{\bar{x}}{\theta_0}\right)-\frac{n}{\theta_0}\bar{x}&\leq \log c'\\ \log\left(\frac{\bar{x}}{\theta_0}\right)-\frac{\bar{x}}{\theta_0}&\leq \frac{1}{n}\log c'\\ \log\left(\frac{\bar{x}}{\theta_0}\right)-\frac{\bar{x}}{\theta_0}&\leq \frac{1}{n}\log c-1. \end{aligned} $$ Now let $h(x)=\log x - x$, then $h'(x)=\frac{1}{x}-1$. So the critical point of $h'(x)$ is $x=1$. And to test if this is maximum or minimum, we apply second derivative test. That is, $$ h''(x)=-\frac{1}{x^2}<0,\forall x. $$ Thus, $x=1$ is a maximum. Hence, $ \log\left(\frac{\bar{x}}{\theta_0}\right)-\frac{\bar{x}}{\theta_0} $ is maximized if $\frac{\bar{x}}{\theta_0}=1\Rightarrow\bar{x}=\theta_0$. To see this consider the following plot,
LRT and its Critical Value
Above figure is the plot of $h(\bar{x})$ function with $\theta_0=1$. Given the assumption that $\bar{x}\leq \theta_0$ then assuming $R=\frac{1}{n}\log c-1$ designates the orange line above, then we reject $H_0$ if $h(\bar{x})\leq R$, if and only if $\bar{x}\leq k$. In practice, $k$ is specified to satisfy, $$ \mathrm{P}(\bar{x}\leq k|\theta=\theta_0)\leq \alpha, $$ where $\alpha$ is called the level of the test.

It follows that $X_i|\theta = \theta_0\overset{r.s.}{\sim}\exp[\theta_0]$, then $\mathrm{E}X_i=\theta_0$ and $\mathrm{Var}X_i=\theta_0^2$. If $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}X_i$ and if $G_n$ is the distribution of $\frac{(\bar{x}_n-\theta_0)}{\sqrt{\frac{\theta_0^2}{n}}}$. By CLT (central limit theorem) $\lim_{n\to\infty}G_n$ converges to standard normal distribution. That is, $\bar{x}|\theta = \theta_0\overset{r.s.}{\sim}AN\left(\theta_0,\frac{\theta_0^2}{n}\right)$. $AN$ - assymptotically normal.

Thus, $$ \mathrm{P}(\bar{x}\leq k|\theta=\theta_0)=\Phi\left(\frac{k-\theta_0}{\theta_0/\sqrt{n}}\right),\quad\text{for large }n. $$ So that, $$ \mathrm{P}(\bar{x}\leq k|\theta=\theta_0)=\Phi\left(\frac{k-\theta_0}{\theta_0/\sqrt{n}}\right)\leq \alpha. $$ Plotting this gives us,
with corresponding PDF given by,
Density Function
Implying, $$ \frac{k-\theta_0}{\theta_0/\sqrt{n}}=z_{\alpha}\Rightarrow k=\theta_0+z_{\alpha}\frac{\theta_0}{\sqrt{n}}. $$ Therefore, a level-$\alpha$ test of $H_0:\theta=\theta_0$ vs $H_1:\theta<\theta_0$ is the test that rejects $H_0$ when $\bar{x}\leq\theta_0+z_{\alpha}\frac{\theta_0}{\sqrt{n}}$.

Plot's Python Codes

In case you might ask how above plots were generated:


  1. Casella, G. and Berger, R.L. (2001). Statistical Inference. Thomson Learning, Inc.
  2. Plotly Python Library Documentation

by Al-Ahmadgaid Asaad ( at April 27, 2015 12:44 PM

April 25, 2015

RCpp Gallery

Stochastic SIR Epidemiological Compartment Model


This post is a simple introduction to Rcpp for disease ecologists, epidemiologists, or dynamical systems modelers - the sorts of folks who will benefit from a simple but fully-working example. My intent is to provide a complete, self-contained introduction to modeling with Rcpp. My hope is that this model can be easily modified to run any dynamical simulation that has dependence on the previous time step (and can therefore not be vectorized).

This post uses a classic Susceptible-Infected-Recovered (SIR) epidemiological compartment model. Compartment models are simple, commonly-used dynamical systems models. Here I demonstrate the tau-leap method, where a discrete number of individuals move probabilistically between compartments at fixed intervals in time. In this model, the wait-times within class are exponentially distributed, and the number of transitions between states in a fixed time step are Poisson distributed.

This model is parameterized for the spread of measles in a closed population, where the birth rate (nu) = death rate (mu). The transmission rate (beta) describes how frequently susceptible (S) and infected (I) individuals come into contact, and the recovery rate (gamma) describes the the average time an individual spends infected before recovering.

C++ Code

Note: C++ Functions must be marked with the following comment for use in R: // [[Rcpp::export]].

When functions are exported in this way via sourceCpp(), RNG setup is automatically handled to use R’s engine. For details on random number generation with Rcpp, see the this Rcpp Gallery post.

#include <Rcpp.h>
using namespace Rcpp;

// This function will be used in R! Evaluates the number of events 
// and updates the states at each time step
// [[Rcpp::export]]
List tauleapCpp(List params) {
    // chained operations are tricky in cpp
    // pull out list w/in list into its own object
    List init = params["init"];
    // use Rcpp as() function to "cast" R vector to cpp scalar
    int nsteps = as<int>(params["nsteps"]);

    // initialize each state vector in its own vector
    // set all vals to initial vals
    // I use doubles (NumericVector) rather than 
    // ints (IntegerVector), since rpois returns double,
    // and the domain of double is a superset of int
    NumericVector SS(nsteps, init["S"]);
    NumericVector II(nsteps, init["I"]);
    NumericVector RR(nsteps, init["R"]);
    NumericVector NN(nsteps, init["pop"]);
    // fill time w/zeros
    NumericVector time(nsteps);

    // pull out params for easy reading 
    double nu = params["nu"];
    double mu = params["mu"];
    double beta = params["beta"];
    double gamma = params["gamma"];
    double tau = params["tau"];

    // Calculate the number of events for each step, update state vectors
    for (int istep = 0; istep < (nsteps-1); istep++) {
        // pull out this step's scalars for easier reading
        // and to avoid compiler headaches
        double iS = SS[istep];
        double iI = II[istep];
        double iR = RR[istep];
        double iN = NN[istep];
        // State Equations
        // R::rpois always returns a single value
        // to return multiple (e.g. Integer/NumericVector, 
        // use Rcpp::rpois(int ndraw, param) and friends
        double births = R::rpois(nu*iN*tau);
        // Prevent negative states
        double Sdeaths = std::min(iS, R::rpois(mu*iS*tau));
        double maxtrans = R::rpois(beta*(iI/iN)*iS*tau);
        double transmission = std::min(iS-Sdeaths, maxtrans);
        double Ideaths = std::min(iI, R::rpois(mu*iI*tau));
        double recovery = std::min(iI-Ideaths, R::rpois(gamma*iI*tau));
        double Rdeaths = std::min(iR, R::rpois(mu*iR*tau));
        // Calculate the change in each state variable
        double dS = births-Sdeaths-transmission;
        double dI = transmission-Ideaths-recovery;
        double dR = recovery-Rdeaths;
        // Update next timestep
        SS[istep+1] = iS + dS;
        II[istep+1] = iI + dI;
        RR[istep+1] = iR + dR;
        // Sum population
        NN[istep+1] = iS + iI + iR + dS + dI + dR;
        // time in fractional years (ie units parameters are given in)
        time[istep+1] = (istep+1)*tau;
    // Return results as data.frame
    DataFrame sim = DataFrame::create(
        Named("time") = time,
        Named("S") = SS,
        Named("I") = II,
        Named("R") = RR,
        Named("N") = NN
    return sim;

R Code

Next we need to parameterize the model. Modelers often deal with many named parameters, some of which are dependent on each other. My goal here is to specify parameters in R once (and only once), and then pass all of them together to the main cpp function.

## Specify model parameters use within() to make assignments *inside* an 
## empty (or existing) list. Yhis is a handy R trick that allows you to 
## refer to existing list elements on right hand side (RHS)
## Note the braces, <-, and and no commas here:  everything in braces is a
## regular code block, except that assignments happen *inside* the list
params <- list()
params <- within(params, {
    ## set rng state
    seed <- 0
    tau <- 0.001 # in years
    nyears <- 10
    ## total number of steps
    nsteps <- nyears/tau
    mu <- 1/70 #death rate
    gamma <- 365/10 #recovery rate
    R0 <- 10
    ## refers to R0 above
    beta <- R0*(gamma+mu) #transmission rate
    nu <- mu #birth rate
    ## initial conditions, list within list
    ## use within() to modify empty list, as above
    init <- within(list(), {
        pop <- 1e6
        S <- round(pop/R0)
        I <- round(pop*mu*(1-1/R0)/(gamma+mu))
        ## refers to S,I above
        R <- pop-S-I


## run the model once
result.df <- tauleapCpp(params)
nsim <- 12

## run many sims, combine all results into one data.frame
## plyr will combine results for us
result.rep <- ldply(1:nsim, function(.nn) {
    ## run the model
    result <- tauleapCpp(params)
    ## this wastes space, but is very simple and aids plotting
    result$nsim <- .nn

Plot Results

Note that the model contains no seasonality. Rather, the system experiences stochastic resonance, where the “noise” of stochastic state transitions stimulates a resonant frequency of the system (here, 2-3 years). For more information see here.

Sometimes epidemics die out. In fact, for this model, they will die out with probability = 1 as time goes to infinity!


## lattice plot of results
    xyplot(I ~ time | sprintf("Simulation %02d",nsim), 
        data=result.rep, type=c('l','g'), as.table=T,
        ylab='Infected', xlab='Year',

plot of chunk unnamed-chunk-4

April 25, 2015 12:00 AM

April 14, 2015

R you ready?

Beautiful plots while simulating loss in two-part procrustes problem

loss2Today I was working on a two-part procrustes problem and wanted to find out why my minimization algorithm sometimes does not converge properly or renders unexpected results. The loss function to be minimized is

\displaystyle  L(\mathbf{Q},c) = \| c \mathbf{A_1Q} - \mathbf{B_1} \|^2 + \| \mathbf{A_2Q} - \mathbf{B_2} \|^2 \rightarrow min

with \| \cdot \| denoting the Frobenius norm, c is an unknown scalar and \mathbf{Q} an unknown rotation matrix, i.e. \mathbf{Q}^T\mathbf{Q}=\mathbf{I}. \;\mathbf{A_1}, \mathbf{A_2}, \mathbf{B_1}, and \mathbf{B_1} are four real valued matrices. The minimum for c is easily found by setting the partial derivation of L(\mathbf{Q},c) w.r.t c equal to zero.

\displaystyle  c = \frac {tr \; \mathbf{Q}^T \mathbf{A_1}^T \mathbf{B_1}}  { \| \mathbf{A_1} \|^2 }

By plugging c into the loss function L(\mathbf{Q},c) we get a new loss function L(\mathbf{Q}) that only depends on \mathbf{Q}. This is the starting situation.

When trying to find out why the algorithm to minimize L(\mathbf{Q}) did not work as expected, I got stuck. So I decided to conduct a small simulation and generate random rotation matrices to study the relation between the parameter c and the value of the loss function L(\mathbf{Q}). Before looking at the results for the entire two-part procrustes problem from above, let’s visualize the results for the first part of the loss function only, i.e.

\displaystyle  L(\mathbf{Q},c) = \| c \mathbf{A_1Q} - \mathbf{B_1} \|^2 \rightarrow min

Here, c has the same minimum as for the whole formula above. For the simulation I used

\mathbf{A_1}= \begin{pmatrix}  0.0 & 0.4 & -0.5 \\  -0.4 & -0.8 & -0.5 \\  -0.1 & -0.5 & 0.2 \\  \end{pmatrix} \mkern18mu \qquad \text{and} \qquad \mkern36mu \mathbf{B_1}= \begin{pmatrix}  -0.1 & -0.8 & -0.1 \\  0.3 & 0.2 & -0.9 \\  0.1 & -0.3 & -0.5 \\  \end{pmatrix}

as input matrices. Generating many random rotation matrices \mathbf{Q} and plotting c against the value of the loss function yields the following plot.

This is a well behaved relation, for each scaling parameter c the loss is identical. Now let’s look at the full two-part loss function. As input matrices I used

\displaystyle  A1= \begin{pmatrix}  0.0 & 0.4 & -0.5 \\  -0.4 & -0.8 & -0.5 \\  -0.1 & -0.5 & 0.2 \\  \end{pmatrix} \mkern18mu , \mkern36mu B1= \begin{pmatrix}  -0.1 & -0.8 & -0.1 \\  0.3 & 0.2 & -0.9 \\  0.1 & -0.3 & -0.5 \\  \end{pmatrix}
A2= \begin{pmatrix}  0 & 0 & 1 \\  1 & 0 & 0 \\  0 & 1 & 0 \\  \end{pmatrix} \mkern18mu , \mkern36mu B2= \begin{pmatrix}  0 & 0 & 1 \\  1 & 0 & 0 \\  0 & 1 & 0 \\  \end{pmatrix}

and the following R-code.

# trace function
tr <- function(X) sum(diag(X))

# random matrix type 1
rmat_1 <- function(n=3, p=3, min=-1, max=1){
  matrix(runif(n*p, min, max), ncol=p)

# random matrix type 2, sparse
rmat_2 <- function(p=3) {
  diag(p)[, sample(1:p, p)]

# generate random rotation matrix Q. Based on Q find 
# optimal scaling factor c and calculate loss function value
one_sample <- function(n=2, p=2)
  Q <- mixAK::rRotationMatrix(n=1, dim=p) %*%         # random rotation matrix det(Q) = 1
    diag(sample(c(-1,1), p, rep=T))                   # additional reflections, so det(Q) in {-1,1}
  s <- tr( t(Q) %*% t(A1) %*% B1 ) / norm(A1, "F")^2  # scaling factor c
  rss <- norm(s*A1 %*% Q - B1, "F")^2 +               # get residual sum of squares
         norm(A2 %*% Q - B2, "F")^2 
  c(s=s, rss=rss)

# find c and rss or many random rotation matrices
set.seed(10)  # nice case for 3 x 3
n <- 3
p <- 3
A1 <- round(rmat_1(n, p), 1)
B1 <- round(rmat_1(n, p), 1)
A2 <- rmat_2(p)
B2 <- rmat_2(p)

x <- plyr::rdply(40000, one_sample(3,3)) 
plot(x$s, x$rss, pch=16, cex=.4, xlab="c", ylab="L(Q)", col="#00000010")

This time the result turns out to be very different and … beautiful :)

Here, we do not have a one to one relation between the scaling parameter and the loss function any more. I do not quite know what to make of this yet. But for now I am happy that it has aestethic value. Below you find some more beautiful graphics with different matrices as inputs.


by markheckmann at April 14, 2015 04:53 PM

April 02, 2015

RCpp Gallery

Call matplotlib from R


I often use Python and matplotlib for exploring measurement data (from e.g. accelerometers), even if I use R for the actual analysis. The reason is that I like to be able to flexibly zoom into different parts of the plot using the mouse and this works well for me with matplotlib. So I decided to try to call matplotlib from R using Rcpp and Python/C API.

It was surprisingly simple to get it working and I put together a small R-package Rpyplot. The package seems to work well on Ubuntu and Windows 7 for my use cases.

A lot of the code is based on the informative Call Python from R through Rcpp post in Rcpp gallery. I decided not use Boost.Python to make compiling on Windows simpler.

This post explains how I implemented the package and hopefully it will also allow others to expand the package for their needs. If you do implement additional functionality for Rpyplot I’d appreaciate a pull request on Github.

Set up

You’ll need to have Python in your path and Python headers and matplotlib installed (id sudo apt-get install python-dev python-matplotlib in Ubuntu). In Windows I have used the Anaconda Python distribution.

The following sets the compiler flags for Ubuntu:

py_cflags <- system("python2.7-config --cflags", intern=TRUE)
Sys.setenv("PKG_CXXFLAGS"=sprintf("%s %s", Sys.getenv("PKG_CXXFLAGS"), py_cflags))
py_ldflags <- system("python2.7-config --ldflags", intern=TRUE)
Sys.setenv("PKG_LIBS"=sprintf("%s", py_ldflags))

You can have a look at the in Rpyplot-package to see how to set the flags for Windows.

Calling Python from R


The snippet below contains code required to initialize Python and imports pyplot from matplotlib and pyrun function that can be used to call Python from R. All code executed with pyrun (or PyRun_SimpleString in C++) runs Python source code in the scope of __main__ module.

#include <Rcpp.h>
#include <Python.h>
#include <stdlib.h>
#ifndef WIN32
#include <dlfcn.h>

using namespace Rcpp;

//Run Python commands from R
void pyrun(std::string command) {

//You need to call this first
void initialize_python() {
#ifndef WIN32
   dlopen("", RTLD_LAZY |RTLD_GLOBAL); //Required to import matplotlib
    pyrun("import matplotlib");
    pyrun("import matplotlib.pyplot as plt");

//Call after you're done
void finalize_python() {

Copying data

It is not enough to be able to just run Python commands from strings, but we also need to pass data from R to Python. The numvec_to_python function below copies a numeric vector from R to a Python list and adds it to Python’s __main__ module. It is then accessible to Python commands executed with pyrun.

#include <Rcpp.h>
#include <Python.h>
#include <stdlib.h>n
using namespace Rcpp;

//Convert NumericVector to Python List
PyObject* numvec_to_list(NumericVector x) {
    int n = x.length();
    PyObject *xpy = PyList_New(n); //Make new list
    PyObject *f;
    for (int i=0; i<n; i++) {
      f = PyFloat_FromDouble(x[i]);
      PyList_SetItem(xpy, i, f); //Fill list from NumericVector

//Copy a numeric vector from R to Python
void numvec_to_python(std::string name, NumericVector x) {
    PyObject *xpy = numvec_to_list(x);
    PyObject *m = PyImport_AddModule("__main__");
    PyObject *main = PyModule_GetDict(m); //Get the locals dictionary of __main__ module
    PyDict_SetItemString(main, name.c_str(), xpy); //Add variable to that dictionary


Using the functions defined above makes calling matplotlib simple. First you will need to copy a vector to Python and then you are able to plot it running Python commands using pyrun. You can see how different plots are created in Rpyplot by looking at Plot.R. The implemenation of pycontourf function also shows how to copy an R matrix to Python and convert it to a NumPy array.

Here is a small example on how the above functions can be used to create two line plots. You’ll need to call in order to open the plot, but when you do the program will hang until all opened figure windows are closed so make sure to only call it at the end of a script.

x <- seq(0, 2*pi, length = 100)
sx <- sin(x)
cx <- cos(x)

#Copy variables to Python
numvec_to_python("x", x) 
numvec_to_python("sx", sx)
numvec_to_python("cx", cx)
#Set plot size
pyrun("plt.rcParams.update({'figure.figsize' : (7,4)})") 
#Create plots
pyrun("plt.plot(x, sx)")
pyrun("plt.plot(x, cx, '--r', linewidth=2) ")
pyrun("plt.legend(('sin(x)', 'cos(x)'))")
#pyrun("") #Uncomment this line to show the plot

And here is the generated plot:

April 02, 2015 12:00 AM

March 22, 2015

RCpp Gallery

Parsing Dates and Times


R has excellent support for dates and times via the built-in Date and POSIXt classes. Their usage, however, is not always as straightforward as one would want. Certain conversions are more cumbersome than we would like: while as.Date("2015-03-22"), would it not be nice if as.Date("20150322") (a format often used in logfiles) also worked, or for that matter as.Date(20150322L) using an integer variable, or even as.Date("2015-Mar-22") and as.Date("2015Mar22")?

Similarly, many date and time formats suitable for POSIXct (the short form) and POSIXlt (the long form with accessible components) often require rather too much formatting, and/or defaults. Why for example does as.POSIXct(as.numeric(Sys.time()), origin="1970-01-01") require the origin argument on the conversion back (from fractional seconds since the epoch) into datetime—when it is not required when creating the double-precision floating point representation of time since the epoch?

But thanks to Boost and its excellent Boost Date_Time library—which we already mentioned in this post about the BH package— we can address parsing of dates and times. It permitted us to write a new function toPOSIXct() which now part of the RcppBDT package (albeit right now just the GitHub version but we expect this to migrate to CRAN “soon” as well).


We will now discuss the outline of this implementation. For full details, see the source file.

Headers and Constants

#include <boost/date_time.hpp>
#include <boost/lexical_cast.hpp>
#include <Rcpp.h>

// [[Rcpp::depends(BH)]]

namespace bt = boost::posix_time;

const std::locale formats[] = {    // this shows a subset only, see the source file for full list
    std::locale(std::locale::classic(), new bt::time_input_facet("%Y-%m-%d %H:%M:%S%f")),
    std::locale(std::locale::classic(), new bt::time_input_facet("%Y/%m/%d %H:%M:%S%f")),

    std::locale(std::locale::classic(), new bt::time_input_facet("%Y-%m-%d")),
    std::locale(std::locale::classic(), new bt::time_input_facet("%b/%d/%Y")),
const size_t nformats = sizeof(formats)/sizeof(formats[0]);

Note that we show only two datetime formats along with two date formats. The actual implementation has many more.

Core Converter

The actual conversion from string to a double (the underlying format in POSIXct) is done by the following function. It loops over all given formats, and returns the computed value after the first match. In case of failure, a floating point NA is returned.

double stringToTime(const std::string s) {

    bt::ptime pt, ptbase;

    // loop over formats and try them til one fits
    for (size_t i=0; pt == ptbase && i < nformats; ++i) {
        std::istringstream is(s);
        is >> pt;
    if (pt == ptbase) {
        return NAN;
    } else { 
        const bt::ptime timet_start(boost::gregorian::date(1970,1,1));
        bt::time_duration diff = pt - timet_start;

        // Define BOOST_DATE_TIME_POSIX_TIME_STD_CONFIG to use nanoseconds
        // (and then use diff.total_nanoseconds()/1.0e9;  instead)
        return diff.total_microseconds()/1.0e6;

Convenience Wrappers

We want to be able to convert from numeric as well as string formats. For this, we write a templated (and vectorised) function which invokes the actual conversion function for each argument. It also deals (somewhat heuristically) with two corner cases: we want 20150322 be converted from either integer or numeric, but need in the latter case distinguish this value and its rangue from the (much larger) value for seconds since the epoch. That creates a minir ambiguity: we will not be able to convert back for inputs from seconds since the epoch for the first few years since January 1, 1970. But as these are rare in the timestamp form we can accept the trade-off.

template <int RTYPE>
Rcpp::DatetimeVector toPOSIXct_impl(const Rcpp::Vector<RTYPE>& sv) {

    int n = sv.size();
    Rcpp::DatetimeVector pv(n);
    for (int i=0; i<n; i++) {
        std::string s = boost::lexical_cast<std::string>(sv[i]);
        //Rcpp::Rcout << sv[i] << " -- " << s << std::endl;

        // Boost Date_Time gets the 'YYYYMMDD' format wrong, even
        // when given as an explicit argument. So we need to test here.
        // While we are at it, may as well test for obviously wrong data.
        int l = s.size();
        if ((l < 8) ||          // impossibly short
            (l == 9)) {         // 8 or 10 works, 9 cannot
            Rcpp::stop("Inadmissable input: %s", s);
        } else if (l == 8) {    // turn YYYYMMDD into YYYY/MM/DD
            s = s.substr(0, 4) + "/" + s.substr(4, 2) + "/" + s.substr(6,2);
        pv[i] = stringToTime(s);
    return pv;

User-facing Function

Finally, we can look at the user-facing function. It accepts input in either integer, numeric or character vector form, and then dispatches accordingly to the templated internal function we just discussed. Other inputs are unsuitable and trigger an error.

// [[Rcpp::export]]
Rcpp::DatetimeVector toPOSIXct(SEXP x) {
    if (Rcpp::is<Rcpp::CharacterVector>(x)) {
        return toPOSIXct_impl<STRSXP>(x);
    } else if (Rcpp::is<Rcpp::IntegerVector>(x)) {
        return toPOSIXct_impl<INTSXP>(x); 
    } else if (Rcpp::is<Rcpp::NumericVector>(x)) {
        // here we have two cases: either we are an int like
        // 200150315 'mistakenly' cast to numeric, or we actually
        // are a proper large numeric (ie as.numeric(Sys.time())
        Rcpp::NumericVector v(x);
        if (v[0] < 21990101) {  // somewhat arbitrary cuttoff
            // actual integer date notation: convert to string and parse
            return toPOSIXct_impl<REALSXP>(x);
        } else {
            // we think it is a numeric time, so treat it as one
            return Rcpp::DatetimeVector(x);
    } else {
        Rcpp::stop("Unsupported Type");
        return R_NilValue;//not reached


A simply illustration follows. A fuller demonstration is part of the RcppBDT package. This already shows support for subsecond granularity and a variety of date formats.

## parsing character
s <- c("2004-03-21 12:45:33.123456",    # ISO
       "2004/03/21 12:45:33.123456",    # variant
       "20040321",                      # just dates work fine as well
       "Mar/21/2004",                   # US format, also support month abbreviation or full
       "rapunzel")                      # will produce a NA

p <- toPOSIXct(s)

options("digits.secs"=6)                # make sure we see microseconds in output
print(format(p, tz="UTC"))              # format UTC times as UTC (helps for Date types too)
[1] "2004-03-21 12:45:33.123456" "2004-03-21 12:45:33.123456"
[3] "2004-03-21 00:00:00.000000" "2004-03-21 00:00:00.000000"
[5] NA                          

We can also illustrate integer and numeric inputs:

## parsing integer types
s <- c(20150315L, 20010101L, 20141231L)
p <- toPOSIXct(s)
print(format(p, tz="UTC"))
[1] "2015-03-15" "2001-01-01" "2014-12-31"
## parsing numeric types
s <- c(20150315, 20010101, 20141231)
print(format(p, tz="UTC"))
[1] "2015-03-15" "2001-01-01" "2014-12-31"

Note that we always forced display using UTC rather local time, the R default.

March 22, 2015 12:00 AM

March 09, 2015

Journal of the Royal Statistical Society: Series C

February 24, 2015

Douglas Bates

RCall: Running an embedded R in Julia

I have used R (and S before it) for a couple of decades. In the last few years most of my coding has been in Julia, a language for technical computing that can provide remarkable performance for a dynamically typed language via Just-In-Time (JIT) compilation of functions and via multiple dispatch.

Nonetheless there are facilities in R that I would like to have access to from Julia. I created the RCall package for Julia to do exactly that. This IJulia notebook provides an introduction to RCall.

This is not a novel idea by any means. Julia already has PyCall and JavaCall packages that provide access to Python and to Java. These packages are used extensively and are much more sophisticated than RCall, at present. Many other languages have facilities to run an embedded instance of R. In fact, Python has several such interfaces.

The things I plan to do using RCall is to access datasets from R and R packages, to fit models that are not currently implemented in Julia and to use R graphics, especially the ggplot2 and lattice packages. Unfortunately I am not currently able to start a graphics device from the embedded R but I expect that to be fixed soon.

I can tell you the most remarkable aspect of RCall although it may not mean much if you haven't tried to do this kind of thing. It is written entirely in Julia. There is absolutely no "glue" code written in a compiled language like C or C++. As I said, this may not mean much to you unless you have tried to do something like this, in which case it is astonishing.

by Douglas Bates ( at February 24, 2015 11:05 PM

January 16, 2015

Modern Toolmaking


My package caretEnsemble, for making ensembles of caret models, is now on CRAN.

Check it out, and let me know what you think! (Submit bug reports and feature requests to the issue tracker)

by Zachary Deane-Mayer ( at January 16, 2015 10:22 PM

January 15, 2015

Gregor Gorjanc

cpumemlog: Monitor CPU and RAM usage of a process (and its children)

Long time no see ...

Today I pushed the cpumemlog script to GitHub Read more about this useful utility at the GitHub site.

by Gregor Gorjanc ( at January 15, 2015 10:16 PM

January 01, 2015

Journal of the Royal Statistical Society: Series B

Journal of the Royal Statistical Society: Series A

December 27, 2014

RCpp Gallery

Create an R-tree data structure using Rcpp and Boost::Geometry


The purpose of this post is to show how to use Boost::Geometry library which was introduced recently in Rcpp. Especially, we focus on R-tree data structure for searching objects in space because only one spatial index is implemented - R-tree Currently in this library.

Boost.Geometry which is part of the Boost C++ Libraries gives us algorithms for solving geometry problems. In this library, the Boost.Geometry.Index which is one of components is intended to gather data structures called spatial indexes which are often used to searching for objects in space quickly. Generally speaking, spatial indexes stores geometric objects’ representations and allows searching for objects occupying some space or close to some point in space.


R-tree is a tree data structure used for spatial searching, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. R-tree was proposed by Antonin Guttman in 1984 as an expansion of B-tree for multi-dimensional data and plays significant role in both theoretical and applied contexts. It is only one spatial index implemented in Boost::Geometry.

As a real application of this, It is often used to store spatial objects such as restaurant locations or the polygons which typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc in order to perform a spatial query like “Find all stations within 1 km of my current location”, “Let me know all road segments in 2 km of my location” or “find the nearest gas station” which we often ask google seach by your voice recenlty. In this way, the R-tree can be used (nearest neighbor) search for some places.

You can find more explanations about R-tree in Wikipedia.

Write a wrapper class of rtree in Boost::Geometry using Rcpp

Now, we write a simple C++ wrapper class of rtree class in Boost::Geometry::Index that we can use in R.

The most important feature to mention here is the use of Rcpp module to expose your own class to R. Although almost all classes in Boost library have a lot of functions, , you do not use all in many cases. In that case, you should write your wrapper class for making your code simple.

Rcpp code

// [[Rcpp::depends(BH)]]

// Enable C++11 via this plugin to suppress 'long long' errors
// [[Rcpp::plugins("cpp11")]]

#include <vector>
#include <Rcpp.h>
#include <boost/geometry.hpp>
#include <boost/geometry/geometries/point.hpp>
#include <boost/geometry/geometries/box.hpp>
#include <boost/geometry/index/rtree.hpp>
using namespace Rcpp;

// Mnemonics
namespace bg = boost::geometry;
namespace bgi = boost::geometry::index;
typedef bg::model::point<float, 2, bg::cs::cartesian> point_t;
typedef bg::model::box<point_t> box;
typedef std::pair<box, unsigned int> value_t;

class RTreeCpp {
  // Constructor.
  // You have to give spatial data as a data frame.
  RTreeCpp(DataFrame df) {
      int size = df.nrows();
      NumericVector id   = df[0]; 
      NumericVector bl_x = df[1]; 
      NumericVector bl_y = df[2]; 
      NumericVector ur_x = df[3]; 
      NumericVector ur_y = df[4]; 
      for(int i = 0; i < size; ++i) {        
          // create a box
          box b(point_t(bl_x[i], bl_y[i]), point_t(ur_x[i], ur_y[i]));
          // insert new value
          rtree_.insert(std::make_pair(b, static_cast<unsigned int>(id[i])));   
  // This method(query) is k-nearest neighbor search. 
  // It returns some number of values nearest to some point(point argument) in space.

  std::vector<int> knn(NumericVector point, unsigned int n) {
      std::vector<value_t> result_n;
      rtree_.query(bgi::nearest(point_t(point[0], point[1]), n), std::back_inserter(result_n));    
      std::vector<int> indexes;
      std::vector<value_t>::iterator itr;
      for ( itr = result_n.begin(); itr != result_n.end(); ++itr ) {
          value_t value = *itr;
          indexes.push_back( value.second );
      return indexes;
    // R-tree can be created using various algorithm and parameters
    // You can change the algorithm by template parameter. 
    // In this example we use quadratic algorithm. 
    // Maximum number of elements in nodes in R-tree is set to 16.
    bgi::rtree<value_t, bgi::quadratic<16> > rtree_;

// [[Rcpp::export]]
std::vector<int> showKNN(Rcpp::DataFrame df, NumericVector point, unsigned int n) {
  RTreeCpp tree(df);            // recreate tree each time
  return tree.knn(point, n);

R code using RTreeCpp

First, we create a sample data set of spatial data.

# Sample spatial data(boxes)
points <- data.frame(
    bl_x=c(0, 2, 4), 
    bl_y=c(0, 2, 4), 
    ur_x=c(1, 3, 5), 
    ur_y=c(1, 3, 5))
 * To visually the data, we use the following code: 
size <- nrow(points)
#colors for rectangle area
colors <- rainbow(size)
#Plot these points
plot(c(0, 5), c(0, 5), type= "n", xlab="", ylab="")
for(i in 1:size){
    rect(points[i, "bl_x"], points[i, "bl_y"], points[i, "ur_x"], points[i, "ur_y"], col=colors[i])  
legend(4, 2, legend=points$id, fill=colors)

plot of chunk unnamed-chunk-5

One can use the RTreeCpp class as follows:

# new RTreeCpp object
# Search nearest neighbor points(return value : id of points data)
showKNN(points, c(0, 0), 1)
[1] 0
showKNN(points, c(0, 0), 2)
[1] 1 0
showKNN(points, c(0, 0), 3)
[1] 2 1 0

Note the re-creation of the RTreeCpp object is of course inefficient, but the Rcpp Gallery imposes some constraints on how we present code. For actual application a stateful and persistent object would be created. This could be done via Rcpp Modules as well a number of different ways. Here, however, we need to recreate the object for each call as knitr (which is used behind the scenes) cannot persist objects between code chunks. This is simply a technical limitation of the Rcpp Gallery—but not of Rcpp itself.

December 27, 2014 12:00 AM

December 15, 2014

R you ready?

QQ-plots in R vs. SPSS – A look at the differences


We teach two software packages, R and SPSS, in Quantitative Methods 101 for psychology freshman at Bremen University (Germany). Sometimes confusion arises, when the software packages produce different results. This may be due to specifics in the implemention of a method or, as in most cases, to different default settings. One of these situations occurs when the QQ-plot is introduced. Below we see two QQ-plots, produced by SPSS and R, respectively. The data used in the plots were generated by:

x <- sample(0:9, 100, rep=T)    


QQ-plot in SPSS using Blom's method


qqnorm(x, datax=T)      # uses Blom's method by default
qqline(x, datax=T)

There are some obvious differences:

  1. The most obvious one is that the R plot seems to contain more data points than the SPSS plot. Actually, this is not the case. Some data points are plotted on top of each in SPSS while they are spread out vertically in the R plot. The reason for this difference is that SPSS uses a different approach assigning probabilities to the values. We will expore the two approaches below.
  2. The scaling of the y-axis differs. R uses quantiles from the standard normal distribution. SPSS by default rescales these values using the mean and standard deviation from the original data. This allows to directly compare the original and theoretical values. This is a simple linear transformation and will not be explained any further here.
  3. The QQ-lines are not identical. R uses the 1st and 3rd quartile from both distributions to draw the line. This is different in SPSS where of a line is drawn for identical values on both axes. We will expore the differences below.

QQ-plots from scratch

To get a better understanding of the difference we will build the R and SPSS-flavored QQ-plot from scratch.

R type

In order to calculate theoretical quantiles corresponding to the observed values, we first need to find a way to assign a probability to each value of the original data. A lot of different approaches exist for this purpose (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012b). They usually build on the ranks of the observed data points to calculate corresponding p-values, i.e. the plotting positions for each point. The qqnorm function uses two formulae for this purpose, depending on the number of observations n (Blom’s mfethod, see ?qqnorm; Blom, 1958). With r being the rank, for n > 10 it will use the formula p = (r - 1/2) / n, for n \leq 10 the formula p = (r - 3/8) / (n + 1/4) to determine the probability value p for each observation (see the help files for the functions qqnorm and ppoint). For simplicity reasons, we will only implement the n > 10 case here.

n <- length(x)          # number of observations
r <- order(order(x))    # order of values, i.e. ranks without averaged ties
p <- (r - 1/2) / n      # assign to ranks using Blom's method
y <- qnorm(p)           # theoretical standard normal quantiles for p values
plot(x, y)              # plot empirical against theoretical values

Before we take at look at the code, note that our plot is identical to the plot generated by qqnorm above, except that the QQ-line is missing. The main point that makes the difference between R and SPSS is found in the command order(order(x)). The command calculates ranks for the observations using ordinal ranking. This means that all observations get different ranks and no average ranks are calculated for ties, i.e. for observations with equal values. Another approach would be to apply fractional ranking and calculate average values for ties. This is what the function rank does. The following codes shows the difference between the two approaches to assign ranks.

v <- c(1,1,2,3,3)
order(order(v))     # ordinal ranking used by R
## [1] 1 2 3 4 5
rank(v)             # fractional ranking used by SPSS
## [1] 1.5 1.5 3.0 4.5 4.5

R uses ordinal ranking and SPSS uses fractional ranking by default to assign ranks to values. Thus, the positions do not overlap in R as each ordered observation is assigned a different rank and therefore a different p-value. We will pick up the second approach again later, when we reproduce the SPSS-flavored plot in R.1

The second difference between the plots concerned the scaling of the y-axis and was already clarified above.

The last point to understand is how the QQ-line is drawn in R. Looking at the probs argument of qqline reveals that it uses the 1st and 3rd quartile of the original data and theoretical distribution to determine the reference points for the line. We will draw the line between the quartiles in red and overlay it with the line produced by qqline to see if our code is correct.

plot(x, y)                      # plot empirical against theoretical values
ps <- c(.25, .75)               # reference probabilities
a <- quantile(x, ps)            # empirical quantiles
b <- qnorm(ps)                  # theoretical quantiles
lines(a, b, lwd=4, col="red")   # our QQ line in red
qqline(x, datax=T)              # R QQ line

The reason for different lines in R and SPSS is that several approaches to fitting a straight line exist (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012a). Each approach has different advantages. The method used by R is more robust when we expect values to diverge from normality in the tails, and we are primarily interested in the normality of the middle range of our data. In other words, the method of fitting an adequate QQ-line depends on the purpose of the plot. An explanation of the rationale of the R approach can e.g. be found here.

SPSS type

The default SPSS approach also uses Blom’s method to assign probabilities to ranks (you may choose other methods is SPSS) and differs from the one above in the following aspects:

  • a) As already mentioned, SPSS uses ranks with averaged ties (fractional rankings) not the plain order ranks (ordinal ranking) as in R to derive the corresponding probabilities for each data point. The rest of the code is identical to the one above, though I am not sure if SPSS distinguishes between the n  10 case.
  • b) The theoretical quantiles are scaled to match the estimated mean and standard deviation of the original data.
  • c) The QQ-line goes through all quantiles with identical values on the x and y axis.
n <- length(x)                # number of observations
r <- rank(x)                  # a) ranks using fractional ranking (averaging ties)
p <- (r - 1/2) / n            # assign to ranks using Blom's method
y <- qnorm(p)                 # theoretical standard normal quantiles for p values
y <- y * sd(x) + mean(x)      # b) transform SND quantiles to mean and sd from original data
plot(x, y)                    # plot empirical against theoretical values

Lastly, let us add the line. As the scaling of both axes is the same, the line goes through the origin with a slope of 1.

abline(0,1)                   # c) slope 0 through origin

The comparison to the SPSS output shows that they are (visually) identical.

Function for SPSS-type QQ-plot

The whole point of this demonstration was to pinpoint and explain the differences between a QQ-plot generated in R and SPSS, so it will no longer be a reason for confusion. Note, however, that SPSS offers a whole range of options to generate the plot. For example, you can select the method to assign probabilities to ranks and decide how to treat ties. The plots above used the default setting (Blom’s method and averaging across ties). Personally I like the SPSS version. That is why I implemented the function qqnorm_spss in the ryouready package, that accompanies the course. The formulae for the different methods to assign probabilities to ranks can be found in Castillo-Gutiérrez et al. (2012b). The implentation is a preliminary version that has not yet been thoroughly tested. You can find the code here. Please report any bugs or suggestions for improvements (which are very welcome) in the github issues section.

install_github("markheckmann/ryouready")                # install from github repo
library(ryouready)                                      # load package
qq <- qqnorm_spss(x, method=1, ties.method="average")   # Blom's method with averaged ties
plot(qq)                                                # generate QQ-plot
ggplot(qq)                                              # use ggplot2 to generate QQ-plot


  1. Technical sidenote: Internally, qqnorm uses the function ppoints to generate the p-values. Type in stats:::qqnorm.default to the console to have a look at the code. 

by markheckmann at December 15, 2014 08:55 AM

October 20, 2014

Modern Toolmaking

For faster R on a mac, use veclib

Update: The links to all my github gists on blogger are broken, and I can't figure out how to fix them.  If you know how to insert gitub gists on a dynamic blogger template, please let me known.

In the meantime, here are instructions with links to the code:
First of all, use homebrew to compile openblas.  It's easy!  Second of all, you can also use homebrew to install R! (But maybe stick with the CRAN version unless you really want to compile your own R binary)

To use openblas with R, follow these instructions:

To use veclib with R, follow these intructions:


Inspired by this post, I decided to try using OpenBLAS for R on my mac.  However, it turns out there's a simpler option, using the vecLib BLAS library, which is provided by Apple as part of the accelerate framework.

If you are using R 2.15, follow these instructions to change your BLAS from the default to vecLib:

However, as noted in r-sig-mac, these instructions do not work for R 3.0.  You have to directly link to the accelerate framework's version of vecLib:

Finally, test your new blas using this script:

On my system (a retina macbook pro), the default BLAS takes 141 seconds and vecLib takes 43 seconds, which is a significant speedup.  If you plan to use vecLib, note the following warning from the R development team "Although fast, it is not under our control and may possibly deliver inaccurate results."

So far, I have not encountered any issues using vecLib, but it's only been a few hours :-).

UPDATE: you can also install OpenBLAS on a mac:

If you do this, make sure to change the directories to point to the correct location on your system  (e.g. change /users/zach/source to whatever directory you clone the git repo into).  On my system, the benchmark script takes ~41 seconds when using openBLAS, which is a small but significant speedup.

by Zachary Deane-Mayer ( at October 20, 2014 04:24 PM

September 19, 2014

Chris Lawrence

What could a federal UK look like?

Assuming that the “no” vote prevails in the Scottish independence referendum, the next question for the United Kingdom is to consider constitutional reform to implement a quasi-federal system and resolve the West Lothian question once and for all. In some ways, it may also provide an opportunity to resolve the stalled reform of the upper house as well. Here’s the rough outline of a proposal that might work.

  • Devolve identical powers to England, Northern Ireland, Scotland, and Wales, with the proviso that local self-rule can be suspended if necessary by the federal legislature (by a supermajority).

  • The existing House of Commons becomes the House of Commons for England, which (along with the Sovereign) shall comprise the English Parliament. This parliament would function much as the existing devolved legislatures in Scotland and Wales; the consociational structure of the Northern Ireland Assembly (requiring double majorities) would not be replicated.

  • The House of Lords is abolished, and replaced with a directly-elected Senate of the United Kingdom. The Senate will have authority to legislate on the non-devolved powers (in American parlance, “delegated” powers) such as foreign and European Union affairs, trade and commerce, national defense, and on matters involving Crown dependencies and territories, the authority to legislate on devolved matters in the event self-government is suspended in a constituent country, and dilatory powers including a qualified veto (requiring a supermajority) over the legislation proposed by a constituent country’s parliament. The latter power would effectively replace the review powers of the existing House of Lords; it would function much as the Council of Revision in Madison’s original plan for the U.S. Constitution.

    As the Senate will have relatively limited powers, it need not be as large as the existing Lords or Commons. To ensure the countries other than England have a meaningful voice, given that nearly 85% of the UK’s population is in England, two-thirds of the seats would be allocated proportionally based on population and one-third allocated equally to the four constituent countries. This would still result in a chamber with a large English majority (around 64.4%) but nonetheless would ensure the other three countries would have meaningful representation as well.

by Chris Lawrence at September 19, 2014 12:00 AM

September 12, 2014

R you ready?

Using colorized PNG pictograms in R base plots

Today I stumbled across a figure in an explanation on multiple factor analysis which contained pictograms.



Figure 1 from Abdi & Valentin (2007), p. 8.

I wanted to reproduce a similar figure in R using pictograms and additionally color them e.g. by group membership . I have almost no knowledge about image processing, so I tried out several methods of how to achieve what I want. The first thing I did was read in an PNG file and look at the data structure. The package png allows to read in PNG files. Note that all of the below may not work on Windows machines, as it does not support semi-transparency (see ?readPNG).

img <- readPNG(system.file("img", "Rlogo.png", package="png"))
## [1] "array"
## [1]  76 100   4

The object is a numerical array with four layers (red, green, blue, alpha; short RGBA). Let’s have a look at the first layer (red) and replace all non-zero entries by a one and the zeros by a dot. This will show us the pattern of non-zero values and we already see the contours.

l4 <- img[,,1]
l4[l4 > 0] <- 1
l4[l4 == 0] <- "."
d <- apply(l4, 1, function(x) {
 cat(paste0(x, collapse=""), "\n") 

To display the image in R one way is to raster the image (i.e. the RGBA layers are collapsed into a layer of single HEX value) and print it using rasterImage.


rimg <- as.raster(img) # raster multilayer object
r <- nrow(rimg) / ncol(rimg) # image ratio
plot(c(0,1), c(0,r), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(rimg, 0, 0, 1, r) 


Let’s have a look at a small part the rastered image object. It is a matrix of HEX values.

rimg[40:50, 1:6]
## [1,] "#C4C5C202" "#858981E8" "#838881FF" "#888D86FF" "#8D918AFF" "#8F938CFF"
## [2,] "#00000000" "#848881A0" "#80847CFF" "#858A83FF" "#898E87FF" "#8D918BFF"
## [3,] "#00000000" "#8B8E884C" "#7D817AFF" "#82867EFF" "#868B84FF" "#8A8E88FF"
## [4,] "#00000000" "#9FA29D04" "#7E827BE6" "#7E817AFF" "#838780FF" "#878C85FF"
## [5,] "#00000000" "#00000000" "#81857D7C" "#797E75FF" "#7F827BFF" "#838781FF"
## [6,] "#00000000" "#00000000" "#898C8510" "#787D75EE" "#797E76FF" "#7F837BFF"
## [7,] "#00000000" "#00000000" "#00000000" "#7F837C7B" "#747971FF" "#797E76FF"
## [8,] "#00000000" "#00000000" "#00000000" "#999C9608" "#767C73DB" "#747971FF"
## [9,] "#00000000" "#00000000" "#00000000" "#00000000" "#80847D40" "#71766EFD"
## [10,] "#00000000" "#00000000" "#00000000" "#00000000" "#00000000" "#787D7589"
## [11,] "#00000000" "#00000000" "#00000000" "#00000000" "#00000000" "#999C9604"

And print this small part.

plot(c(0,1), c(0,.6), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(rimg[40:50, 1:6], 0, 0, 1, .6) 


Now we have an idea of how the image object and the rastered object look like from the inside. Let’s start to modify the images to suit our needs.

In order to change the color of the pictograms, my first idea was to convert the graphics to greyscale and remap the values to a color ramp of may choice. To convert to greyscale there are tons of methods around (see e.g. here). I just pick one of them I found on SO by chance. With R=Red, G=Green and B=Blue we have

brightness = sqrt(0.299 * R^2 + 0.587 * G^2 + 0.114 * B^2)

This approach modifies the PNG files after they have been coerced into a raster object.

# function to calculate brightness values
brightness <- function(hex) {
  v <- col2rgb(hex)
  sqrt(0.299 * v[1]^2 + 0.587 * v[2]^2 + 0.114 * v[3]^2) /255

# given a color ramp, map brightness to ramp also taking into account 
# the alpha level. The defaul color ramp is grey
img_to_colorramp <- function(img, ramp=grey) {
  cv <- as.vector(img)
  b <- sapply(cv, brightness)
  g <- ramp(b)
  a <- substr(cv, 8,9)     # get alpha values
  ga <- paste0(g, a)       # add alpha values to new colors
  img.grey <- matrix(ga, nrow(img), ncol(img), byrow=TRUE)  

# read png and modify
img <- readPNG(system.file("img", "Rlogo.png", package="png"))
img <- as.raster(img)           # raster multilayer object
r <- nrow(img) / ncol(img)      # image ratio
s <- 3.5                        # size

plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)

rasterImage(img, 0, 0, 0+s/r, 0+s)  # original
img2 <- img_to_colorramp(img)       # modify using grey scale
rasterImage(img2, 5, 0, 5+s/r, 0+s)


Great, it works! Now Let’s go and try out some other color palettes using colorRamp to create a color ramp.

plot(c(0,10),c(0,8.5), type = "n", xlab = "", ylab = "", asp=1)

img1 <- img_to_colorramp(img)
rasterImage(img1, 0, 5, 0+s/r, 5+s)

reds <- function(x) 
  rgb(colorRamp(c("darkred", "white"))(x), maxColorValue = 255)
img2 <- img_to_colorramp(img, reds)
rasterImage(img2, 5, 5, 5+s/r, 5+s)

greens <- function(x) 
  rgb(colorRamp(c("darkgreen", "white"))(x), maxColorValue = 255)
img3 <- img_to_colorramp(img, greens)
rasterImage(img3, 0, 0, 0+s/r, 0+s)

single_color <- function(...) "#0000BB"
img4 <- img_to_colorramp(img, single_color)
rasterImage(img4, 5, 0, 5+s/r, 0+s)


Okay, that basically does the job. Now we will apply it to the wine pictograms.
Let’s use this wine glass from Wikimedia Commons. It’s quite big so I uploaded a reduced size version to imgur . We will use it for our purposes.

# load file from web
f <- tempfile()
download.file("", f)
img <- readPNG(f)
img <- as.raster(img)
r <- nrow(img) / ncol(img)
s <- 1

# let's create a function that returns a ramp function to save typing
ramp <- function(colors) 
  function(x) rgb(colorRamp(colors)(x), maxColorValue = 255)

# create dataframe with coordinates and colors
x <- data.frame(x=rnorm(16, c(2,2,4,4)), 
                y=rnorm(16, c(1,3)), 
                colors=c("black", "darkred", "garkgreen", "darkblue"))

plot(c(1,6), c(0,5), type="n", xlab="", ylab="", asp=1)
for (i in 1L:nrow(x)) {
  colorramp <- ramp(c(x[i,3], "white"))
  img2 <- img_to_colorramp(img, colorramp)
  rasterImage(img2, x[i,1], x[i,2], x[i,1]+s/r, x[i,2]+s)


Another approach would be to modifying the RGB layers before rastering to HEX values.

img <- readPNG(system.file("img", "Rlogo.png", package="png"))
img2 <- img
img[,,1] <- 0    # remove Red component
img[,,2] <- 0    # remove Green component
img[,,3] <- 1    # Set Blue to max
img <- as.raster(img)
r <- nrow(img) / ncol(img)  # size ratio
s <- 3.5   # size
plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(img, 0, 0, 0+s/r, 0+s)

img2[,,1] <- 1   # Red to max
img2[,,2] <- 0
img2[,,3] <- 0
rasterImage(as.raster(img2), 5, 0, 5+s/r, 0+s)


To just colorize the image, we could weight each layer.

# wrap weighting into function
weight_layers <- function(img, w) {
  for (i in seq_along(w))
    img[,,i] <- img[,,i] * w[i]

plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
img <- readPNG(system.file("img", "Rlogo.png", package="png"))
img2 <- weight_layers(img, c(.2, 1,.2))
rasterImage(img2, 0, 0, 0+s/r, 0+s)

img3 <- weight_layers(img, c(1,0,0))
rasterImage(img3, 5, 0, 5+s/r, 0+s)


After playing around and hard-coding the modifications I started to search and found the EBimage package which has a lot of features for image processing that make ones life (in this case only a bit) easier.

f <- system.file("img", "Rlogo.png", package="png")
img <- readImage(f) 
img2 <- img

img[,,2] = 0      # zero out green layer
img[,,3] = 0      # zero out blue layer
img <- as.raster(img)

img2[,,1] = 0
img2[,,3] = 0
img2 <- as.raster(img2)

r <- nrow(img) / ncol(img)
s <- 3.5
plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(img, 0, 0, 0+s/r, 0+s)
rasterImage(img2, 5, 0, 5+s/r, 0+s)


EBImage is a good choice and fairly easy to handle. Now let’s again print the pictograms.

f <- tempfile(fileext=".png")
download.file("", f)
img <- readImage(f)

# will replace whole image layers by one value
# only makes sense if there is a alpha layer that 
# gives the contours
mod_color <- function(img, col) {
  v <- col2rgb(col) / 255
  img = channel(img, 'rgb')
  img[,,1] = v[1]   # Red
  img[,,2] = v[2]   # Green
  img[,,3] = v[3]   # Blue

r <- nrow(img) / ncol(img)  # get image ratio
s <- 1                      # size

# create random data
x <- data.frame(x=rnorm(16, c(2,2,4,4)), 
                y=rnorm(16, c(1,3)), 

# plot pictograms 
plot(c(1,6), c(0,5), type="n", xlab="", ylab="", asp=1)
for (i in 1L:nrow(x)) {
  img2 <- mod_color(img, x[i, 3])
  rasterImage(img2, x[i,1], x[i,2], x[i,1]+s*r, x[i,2]+s)


Note, that above I did not bother to center each pictogram to position it correctly. This still needs to be done. Anyway, that’s it! Mission completed.


Abdi, H., & Valentin, D. (2007). Multiple factor analysis (MFA). In N. Salkind (Ed.), Encyclopedia of Measurement and Statistics (pp. 1–14). Thousand Oaks, CA: Sage Publications. Retrieved from

by markheckmann at September 12, 2014 09:19 AM

June 18, 2014

Chris Lawrence

Soccer queries answered

Kevin Drum asks a bunch of questions about soccer:

  1. Outside the penalty area there’s a hemisphere about 20 yards wide. I can’t recall ever seeing it used for anything. What’s it for?
  2. On several occasions, I’ve noticed that if the ball goes out of bounds at the end of stoppage time, the referee doesn’t whistle the match over. Instead, he waits for the throw-in, and then immediately whistles the match over. What’s the point of this?
  3. Speaking of stoppage time, how has it managed to last through the years? I know, I know: tradition. But seriously. Having a timekeeper who stops the clock for goals, free kicks, etc. has lots of upside and no downside. Right? It wouldn’t change the game in any way, it would just make timekeeping more accurate, more consistent, and more transparent for the fans and players. Why keep up the current pretense?
  4. What’s the best way to get a better sense of what’s a foul and what’s a legal tackle? Obviously you can’t tell from the players’ reactions, since they all writhe around like landed fish if they so much as trip over their own shoelaces. Reading the rules provides the basics, but doesn’t really help a newbie very much. Maybe a video that shows a lot of different tackles and explains why each one is legal, not legal, bookable, etc.?

The first one’s easy: there’s a general rule that no defensive player can be within 10 yards of the spot of a direct free kick. A penalty kick (which is a type of direct free kick) takes place in the 18-yard box, and no players other than the player taking the kick and the goalkeeper are allowed in the box. However, owing to geometry, the 18 yard box and the 10 yard exclusion zone don’t fully coincide, hence the penalty arc. (That’s also why there are two tiny hash-marks on the goal line and side line 10 yards from the corner flag. And why now referees have a can of shaving cream to mark the 10 yards for other free kicks, one of the few MLS innovations that has been a good idea.)

Second one’s also easy: the half and the game cannot end while the ball is out of play.

Third one’s harder. First, keeping time inexactly forestalls the silly premature celebrations that are common in most US sports. You’d never see the Stanford-Cal play happen in a soccer game. Second, it allows some slippage for short delays and doesn’t require exact timekeeping; granted, this was more valuable before instant replays and fourth officials, but most US sports require a lot of administrative record-keeping by ancillary officials. A soccer game can be played with one official (and often is, particularly at the amateur level) without having to change timing rules;* in developing countries in particular this lowers the barriers to entry for the sport (along with the low equipment requirements) without changing the nature of the game appreciably. Perhaps most importantly, if the clock was allowed to stop regularly it would create an excuse for commercial timeouts and advertising breaks, which would interrupt the flow of the game and potentially reduce the advantages of better-conditioned and more skilled athletes. (MLS tried this, along with other exciting American ideas like “no tied games,” and it was as appealing to actual soccer fans as ketchup on filet mignon would be to a foodie, and perhaps more importantly didn’t make any non-soccer fans watch.)

Fourth, the key distinction is usually whether there was an obvious attempt to play the ball; in addition, in the modern game, even some attempts to play the ball are considered inherently dangerous (tackling from behind, many sliding tackles, etc.) and therefore are fouls even if they are successful in getting more ball than human.

* To call offside, you’d also probably need what in my day we called a “linesman.”

by Chris Lawrence at June 18, 2014 12:00 AM

May 07, 2014

Chris Lawrence

The mission and vision thing

Probably the worst-kept non-secret is that the next stage of the institutional evolution of my current employer is to some ill-defined concept of “university status,” which mostly involves the establishment of some to-be-determined master’s degree programs. In the context of the University System of Georgia, it means a small jump from the “state college” prestige tier (a motley collection of schools that largely started out as two-year community colleges and transfer institutions) to the “state university” tier (which is where most of the ex-normal schools hang out these days). What is yet to be determined is how that transition will affect the broader institution that will be the University of Middle Georgia.* People on high are said to be working on these things; in any event, here are my assorted random thoughts on what might be reasonable things to pursue:

  • Marketing and positioning: Unlike the situation facing many of the other USG institutions, the population of the two anchor counties of our core service area (Bibb and Houston) is growing, and Houston County in particular has a statewide reputation for the quality of its public school system. Rather than conceding that the most prepared students from these schools will go to Athens or Atlanta or Valdosta, we should strongly market our institutional advantages over these more “prestigious” institutions, particularly in terms of the student experience in the first two years and the core curriculum: we have no large lecture courses, no teaching assistants, no lengthy bus rides to and from class every day, and the vast majority of the core is taught by full-time faculty with terminal degrees. Not to mention costs to students are much lower, particularly in the case of students who do not qualify for need-based aid. Even if we were to “lose” these students as transfers to the top-tier institutions after 1–4 semesters, we’d still benefit from the tuition and fees they bring in and we would not be penalized in the upcoming state performance funding formula. Dual enrollment in Warner Robins in particular is an opportunity to showcase our institution as a real alternative for better prepared students rather than a safety school.
  • Comprehensive offerings at the bachelor’s level: As a state university, we will need to offer a comprehensive range of options for bachelor’s students to attract and retain students, both traditional and nontraditional. In particular, B.S. degrees in political science and sociology with emphasis in applied empirical skills would meet public and private employer demand for workers who have research skills and the ability to collect, manage, understand, and use data appropriately. There are other gaps in the liberal arts and sciences as well that need to be addressed to become a truly comprehensive state university.
  • Create incentives to boost the residential population: The college currently has a heavy debt burden inherited from the overbuilding of dorms at the Cochran campus. We need to identify ways to encourage students to live in Cochran, which may require public-private partnerships to try to build a “college town” atmosphere in the community near campus. We also need to work with wireless providers like Sprint and T-Mobile to ensure that students from the “big city” can fully use their cell phones and tablets in Cochran and Eastman without roaming fees or changing wireless providers.
  • Tie the institution more closely to the communities we serve: This includes both physical ties and psychological ties. The Macon campus in particular has poor physical links to the city itself for students who might walk or ride bicycles; extending the existing bike/walking trail from Wesleyan to the Macon campus should be a priority, as should pedestrian access and bike facilities along Columbus Road. Access to the Warner Robins campus is somewhat better but still could be improved. More generally, the institution is perceived as an afterthought or alternative of last resort in the community. Improving this situation and perception among community leaders and political figures may require a physical presence in or near downtown Macon, perhaps in partnership with the GCSU Graduate Center.

* There is no official name-in-waiting, but given that our former interim president seemed to believe he could will this name into existence by repeating it enough I’ll stick with it. The straw poll of faculty trivia night suggests that it’s the least bad option available, which inevitably means the regents will choose something else instead (if the last name change is anything to go by).

by Chris Lawrence at May 07, 2014 12:00 AM

February 17, 2014

Seth Falcon

Have Your SHA and Bcrypt Too


I've been putting off sharing this idea because I've heard the rumors about what happens to folks who aren't security experts when they post about security on the internet. If this blog is replaced with cat photos and rainbows, you'll know what happened.

The Sad Truth

It's 2014 and chances are you have accounts on websites that are not properly handling user passwords. I did no research to produce the following list of ways passwords are mishandled in decreasing order of frequency:

  1. Site uses a fast hashing algorithm, typically SHA1(salt + plain-password).
  2. Site doesn't salt password hashes
  3. Site stores raw passwords

We know that sites should be generating secure random salts and using an established slow hashing algorithm (bcrypt, scrypt, or PBKDF2). Why are sites not doing this?

While security issues deserve a top spot on any site's priority list, new features often trump addressing legacy security concerns. The immediacy of the risk is hard to quantify and it's easy to fall prey to a "nothing bad has happened yet, why should we change now" attitude. It's easy for other bugs, features, or performance issues to win out when measured by immediate impact. Fixing security or other "legacy" issues is the Right Thing To Do and often you will see no measurable benefit from the investment. It's like having insurance. You don't need it until you do.

Specific to the improper storage of user password data is the issue of the impact to a site imposed by upgrading. There are two common approaches to upgrading password storage. You can switch cold turkey to the improved algorithms and force password resets on all of your users. Alternatively, you can migrate incrementally such that new users and any user who changes their password gets the increased security.

The cold turkey approach is not a great user experience and sites might choose to delay an upgrade to avoid admitting to a weak security implementation and disrupting their site by forcing password resets.

The incremental approach is more appealing, but the security benefit is drastically diminished for any site with a substantial set of existing users.

Given the above migration choices, perhaps it's (slightly) less surprising that businesses choose to prioritize other work ahead of fixing poorly stored user password data.

The Idea

What if you could upgrade a site so that both new and existing users immediately benefited from the increased security, but without the disruption of password resets? It turns out that you can and it isn't very hard.

Consider a user table with columns:


Where the hashed_pass column is computed using a weak fast algorithm, for example SHA1(salt + plain_pass).

The core of the idea is to apply a proper algorithm on top of the data we already have. I'll use bcrypt to make the discussion concrete. Add columns to the user table as follows:


Process the existing user table by computing bcrypt(salt2 + hashed_pass) and storing the result in the hashed_pass column (overwriting the less secure value); save the new salt value to salt2 and set hash_type to bycrpt+sha1.

To verify a user where hash_type is bcrypt+sha1, compute bcrypt(salt2 + SHA1(salt + plain_pass)) and compare to the hashed_pass value. Note that bcrypt implementations encode the salt as a prefix of the hashed value so you could avoid the salt2 column, but it makes the idea easier to explain to have it there.

You can take this approach further and have any user that logs in (as well as new users) upgrade to a "clean" bcrypt only algorithm since you can now support different verification algorithms using hash_type. With the proper application code changes in place, the upgrade can be done live.

This scheme will also work for sites storing non-salted password hashes as well as those storing plain text passwords (THE HORROR).

Less Sadness, Maybe

Perhaps this approach makes implementing a password storage security upgrade more palatable and more likely to be prioritized. And if there's a horrible flaw in this approach, maybe you'll let me know without turning this blog into a tangle of cat photos and rainbows.

February 17, 2014 07:08 PM

December 26, 2013

Seth Falcon

A Rebar Plugin for Locking Deps: Reproducible Erlang Project Builds For Fun and Profit

What's this lock-deps of which you speak?

If you use rebar to generate an OTP release project and want to have reproducible builds, you need the rebar_lock_deps_plugin plugin. The plugin provides a lock-deps command that will generate a rebar.config.lock file containing the complete flattened set of project dependencies each pegged to a git SHA. The lock file acts similarly to Bundler's Gemfile.lock file and allows for reproducible builds (*).

Without lock-deps you might rely on the discipline of using a tag for all of your application's deps. This is insufficient if any dep depends on something not specified as a tag. It can also be a problem if a third party dep doesn't provide a tag. Generating a rebar.config.lock file solves these issues. Moreover, using lock-deps can simplify the work of putting together a release consisting of many of your own repos. If you treat the master branch as shippable, then rather than tagging each subproject and updating rebar.config throughout your project's dependency chain, you can run get-deps (without the lock file), compile, and re-lock at the latest versions throughout your project repositories.

The reproducibility of builds when using lock-deps depends on the SHAs captured in rebar.config.lock. The plugin works by scanning the cloned repos in your project's deps directory and extracting the current commit SHA. This works great until a repository's history is rewritten with a force push. If you really want reproducible builds, you need to not nuke your SHAs and you'll need to fork all third party repos to ensure that someone else doesn't screw you over in this fashion either. If you make a habit of only depending on third party repos using a tag, assume that upstream maintainers are not completely bat shit crazy, and don't force push your master branch, then you'll probably be fine.

Getting Started

Install the plugin in your project by adding the following to your rebar.config file:

%% Plugin dependency
{deps, [
    {rebar_lock_deps_plugin, ".*",
     {git, "git://", {branch, "master"}}}

%% Plugin usage
{plugins, [rebar_lock_deps_plugin]}.

To test it out do:

rebar get-deps
# the plugin has to be compiled so you can use it
rebar compile
rebar lock-deps

If you'd like to take a look at a project that uses the plugin, take a look at CHEF's erchef project.

Bonus features

If you are building an OTP release project using rebar generate then you can use rebar_lock_deps_plugin to enhance your build experience in three easy steps.

  1. Use rebar bump-rel-version version=$BUMP to automate the process of editing rel/reltool.config to update the release version. The argument $BUMP can be major, minor, or patch (default) to increment the specified part of a semver X.Y.Z version. If $BUMP is any other value, it is used as the new version verbatim. Note that this function rewrites rel/reltool.config using ~p. I check-in the reformatted version and maintain the formatting when editing. This way, the general case of a version bump via bump-rel-version results in a minimal diff.

  2. Autogenerate a change summary commit message for all project deps. Assuming you've generated a new lock file and bumped the release version, use rebar commit-release to commit the changes to rebar.config.lock and rel/reltool.config with a commit message that summarizes the changes made to each dependency between the previously locked version and the newly locked version. You can get a preview of the commit message via rebar log-changed-deps.

  3. Finally, create an annotated tag for your new release with rebar tag-release which will read the current version from rel/reltool.config and create an annotated tag named with the version.

The dependencies, they are ordered

Up to version 2.0.1 of rebar_lock_deps_plugin, the dependencies in the generated lock file were ordered alphabetically. This was a side-effect of using filelib:wildcard/1 to list the dependencies in the top-level deps directory. In most cases, the order of the full dependency set does not matter. However, if some of the code in your project uses parse transforms, then it will be important for the parse transform to be compiled and on the code path before attempting to compile code that uses the parse transform.

This issue was recently discovered by a colleague who ran into build issues using the lock file for a project that had recently integrated lager for logging. He came up with the idea of maintaining the order of deps as they appear in the various rebar.config files along with a prototype patch proving out the idea. As of rebar_lock_deps_plugin 3.0.0, the lock-deps command will (mostly) maintain the relative order of dependencies as found in the rebar.config files.

The "mostly" is that when a dep is shared across two subprojects, it will appear in the expected order for the first subproject (based on the ordering of the two subprojects). The deps for the second subproject will not be in strict rebar.config order, but the resulting order should address any compile-time dependencies and be relatively stable (only changing when project deps alter their deps with larger impact when shared deps are introduced or removed).

Digression: fun with dependencies

There are times, as a programmer, when a real-world problem looks like a text book exercise (or an interview whiteboard question). Just the other day at work we had to design some manhole covers, but I digress.

Fixing the order of the dependencies in the generated lock file is (nearly) the same as finding an install order for a set of projects with inter-dependencies. I had some fun coding up the text book solution even though the approach doesn't handle the constraint of respecting the order provided by the rebar.config files. Onward with the digression.

We have a set of "packages" where some packages depend on others and we want to determine an install order such that a package's dependencies are always installed before the package. The set of packages and the relation "depends on" form a directed acyclic graph or DAG. The topological sort of a DAG produces an install order for such a graph. The ordering is not unique. For example, with a single package C depending on A and B, valid install orders are [A, B, C] and [B, A, C].

To setup the problem, we load all of the project dependency information into a proplist mapping each package to a list of its dependencies extracted from the package's rebar.config file.

read_all_deps(Config, Dir) ->
    TopDeps = rebar_config:get(Config, deps, []),
    Acc = [{top, dep_names(TopDeps)}],
    DepDirs = filelib:wildcard(filename:join(Dir, "*")),
    Acc ++ [
     {filename:basename(D), dep_names(extract_deps(D))}
     || D <- DepDirs ].

Erlang's standard library provides the digraph and digraph_utils modules for constructing and operating on directed graphs. The digraph_utils module includes a topsort/1 function which we can make use of for our "exercise". The docs say:

Returns a topological ordering of the vertices of the digraph Digraph if such an ordering exists, false otherwise. For each vertex in the returned list, there are no out-neighbours that occur earlier in the list.

To figure out which way to point the edges when building our graph, consider two packages A and B with A depending on B. We know we want to end up with an install order of [B, A]. Rereading the topsort/1 docs, we must want an edge B => A. With that, we can build our DAG and obtain an install order with the topological sort:

load_digraph(Config, Dir) ->
    AllDeps = read_all_deps(Config, Dir),
    G = digraph:new(),
    Nodes = all_nodes(AllDeps),
    [ digraph:add_vertex(G, N) || N <- Nodes ],
    %% If A depends on B, then we add an edge A <= B
      [ digraph:add_edge(G, Dep, Item)
        || Dep <- DepList ]
      || {Item, DepList} <- AllDeps, Item =/= top ],

%% extract a sorted unique list of all deps
all_nodes(AllDeps) ->
    lists:usort(lists:foldl(fun({top, L}, Acc) ->
                                    L ++ Acc;
                               ({K, L}, Acc) ->
                                    [K|L] ++ Acc
                            end, [], AllDeps)).

The digraph module manages graphs using ETS giving it a convenient API, though one that feels un-erlang-y in its reliance on side-effects.

The above gives an install order, but doesn't take into account the relative order of deps as specified in the rebar.config files. The solution implemented in the plugin is a bit less fancy, recursing over the deps and maintaining the desired ordering. The only tricky bit being that shared deps are ignored until the end and the entire linearized list is de-duped which required a . Here's the code:

order_deps(AllDeps) ->
    Top = proplists:get_value(top, AllDeps),
    order_deps(lists:reverse(Top), AllDeps, []).

order_deps([], _AllDeps, Acc) ->
order_deps([Item|Rest], AllDeps, Acc) ->
    ItemDeps = proplists:get_value(Item, AllDeps),
    order_deps(lists:reverse(ItemDeps) ++ Rest, AllDeps, [Item | Acc]).

de_dup(AccIn) ->
    WithIndex = lists:zip(AccIn, lists:seq(1, length(AccIn))),
    UWithIndex = lists:usort(fun({A, _}, {B, _}) ->
                                     A =< B
                             end, WithIndex),
    Ans0 = lists:sort(fun({_, I1}, {_, I2}) ->
                              I1 =< I2
                      end, UWithIndex),
    [ V || {V, _} <- Ans0 ].

Conclusion and the end of this post

The great thing about posting to your blog is, you don't have to have a proper conclusion if you don't want to.

December 26, 2013 04:20 PM

December 09, 2013

Leandro Penz

Probabilistic bug hunting

Probabilistic bug hunting

Have you ever run into a bug that, no matter how careful you are trying to reproduce it, it only happens sometimes? And then, you think you've got it, and finally solved it - and tested a couple of times without any manifestation. How do you know that you have tested enough? Are you sure you were not "lucky" in your tests?

In this article we will see how to answer those questions and the math behind it without going into too much detail. This is a pragmatic guide.

The Bug

The following program is supposed to generate two random 8-bit integer and print them on stdout:

  #include <stdio.h>
  #include <fcntl.h>
  /* Returns -1 if error, other number if ok. */
  int get_random_chars(char *r1, char*r2)
  	int f = open("/dev/urandom", O_RDONLY);
  	if (f < 0)
  		return -1;
  	if (read(f, r1, sizeof(*r1)) < 0)
  		return -1;
  	if (read(f, r2, sizeof(*r2)) < 0)
  		return -1;
  	return *r1 & *r2;
  int main(void)
  	char r1;
  	char r2;
  	int ret;
  	ret = get_random_chars(&r1, &r2);
  	if (ret < 0)
  		fprintf(stderr, "error");
  		printf("%d %d\n", r1, r2);
  	return ret < 0;

On my architecture (Linux on IA-32) it has a bug that makes it print "error" instead of the numbers sometimes.

The Model

Every time we run the program, the bug can either show up or not. It has a non-deterministic behaviour that requires statistical analysis.

We will model a single program run as a Bernoulli trial, with success defined as "seeing the bug", as that is the event we are interested in. We have the following parameters when using this model:

  • \(n\): the number of tests made;
  • \(k\): the number of times the bug was observed in the \(n\) tests;
  • \(p\): the unknown (and, most of the time, unknowable) probability of seeing the bug.

As a Bernoulli trial, the number of errors \(k\) of running the program \(n\) times follows a binomial distribution \(k \sim B(n,p)\). We will use this model to estimate \(p\) and to confirm the hypotheses that the bug no longer exists, after fixing the bug in whichever way we can.

By using this model we are implicitly assuming that all our tests are performed independently and identically. In order words: if the bug happens more ofter in one environment, we either test always in that environment or never; if the bug gets more and more frequent the longer the computer is running, we reset the computer after each trial. If we don't do that, we are effectively estimating the value of \(p\) with trials from different experiments, while in truth each experiment has its own \(p\). We will find a single value anyway, but it has no meaning and can lead us to wrong conclusions.

Physical analogy

Another way of thinking about the model and the strategy is by creating a physical analogy with a box that has an unknown number of green and red balls:

  • Bernoulli trial: taking a single ball out of the box and looking at its color - if it is red, we have observed the bug, otherwise we haven't. We then put the ball back in the box.
  • \(n\): the total number of trials we have performed.
  • \(k\): the total number of red balls seen.
  • \(p\): the total number of red balls in the box divided by the total number of green balls in the box.

Some things become clearer when we think about this analogy:

  • If we open the box and count the balls, we can know \(p\), in contrast with our original problem.
  • Without opening the box, we can estimate \(p\) by repeating the trial. As \(n\) increases, our estimate for \(p\) improves. Mathematically: \[p = \lim_{n\to\infty}\frac{k}{n}\]
  • Performing the trials in different conditions is like taking balls out of several different boxes. The results tell us nothing about any single box.

Estimating \(p\)

Before we try fixing anything, we have to know more about the bug, starting by the probability \(p\) of reproducing it. We can estimate this probability by dividing the number of times we see the bug \(k\) by the number of times we tested for it \(n\). Let's try that with our sample bug:

  $ ./hasbug
  67 -68
  $ ./hasbug
  79 -101
  $ ./hasbug

We know from the source code that \(p=25%\), but let's pretend that we don't, as will be the case with practically every non-deterministic bug. We tested 3 times, so \(k=1, n=3 \Rightarrow p \sim 33%\), right? It would be better if we tested more, but how much more, and exactly what would be better?

\(p\) precision

Let's go back to our box analogy: imagine that there are 4 balls in the box, one red and three green. That means that \(p = 1/4\). What are the possible results when we test three times?

Red balls Green balls \(p\) estimate
0 3 0%
1 2 33%
2 1 66%
3 0 100%

The less we test, the smaller our precision is. Roughly, \(p\) precision will be at most \(1/n\) - in this case, 33%. That's the step of values we can find for \(p\), and the minimal value for it.

Testing more improves the precision of our estimate.

\(p\) likelihood

Let's now approach the problem from another angle: if \(p = 1/4\), what are the odds of seeing one error in four tests? Let's name the 4 balls as 0-red, 1-green, 2-green and 3-green:

The table above has all the possible results for getting 4 balls out of the box. That's \(4^4=256\) rows, generated by this python script. The same script counts the number of red balls in each row, and outputs the following table:

k rows %
0 81 31.64%
1 108 42.19%
2 54 21.09%
3 12 4.69%
4 1 0.39%

That means that, for \(p=1/4\), we see 1 red ball and 3 green balls only 42% of the time when getting out 4 balls.

What if \(p = 1/3\) - one red ball and two green balls? We would get the following table:

k rows %
0 16 19.75%
1 32 39.51%
2 24 29.63%
3 8 9.88%
4 1 1.23%

What about \(p = 1/2\)?

k rows %
0 1 6.25%
1 4 25.00%
2 6 37.50%
3 4 25.00%
4 1 6.25%

So, let's assume that you've seen the bug once in 4 trials. What is the value of \(p\)? You know that can happen 42% of the time if \(p=1/4\), but you also know it can happen 39% of the time if \(p=1/3\), and 25% of the time if \(p=1/2\). Which one is it?

The graph bellow shows the discrete likelihood for all \(p\) percentual values for getting 1 red and 3 green balls:

The fact is that, given the data, the estimate for \(p\) follows a beta distribution \(Beta(k+1, n-k+1) = Beta(2, 4)\) (1) The graph below shows the probability distribution density of \(p\):

The R script used to generate the first plot is here, the one used for the second plot is here.

Increasing \(n\), narrowing down the interval

What happens when we test more? We obviously increase our precision, as it is at most \(1/n\), as we said before - there is no way to estimate that \(p=1/3\) when we only test twice. But there is also another effect: the distribution for \(p\) gets taller and narrower around the observed ratio \(k/n\):

Investigation framework

So, which value will we use for \(p\)?

  • The smaller the value of \(p\), the more we have to test to reach a given confidence in the bug solution.
  • We must, then, choose the probability of error that we want to tolerate, and take the smallest value of \(p\) that we can.

    A usual value for the probability of error is 5% (2.5% on each side).
  • That means that we take the value of \(p\) that leaves 2.5% of the area of the density curve out on the left side. Let's call this value \(p_{min}\).
  • That way, if the observed \(k/n\) remains somewhat constant, \(p_{min}\) will raise, converging to the "real" \(p\) value.
  • As \(p_{min}\) raises, the amount of testing we have to do after fixing the bug decreases.

By using this framework we have direct, visual and tangible incentives to test more. We can objectively measure the potential contribution of each test.

In order to calculate \(p_{min}\) with the mentioned properties, we have to solve the following equation:

\[\sum_{k=0}^{k}{n\choose{k}}p_{min} ^k(1-p_{min})^{n-k}=\frac{\alpha}{2} \]

\(alpha\) here is twice the error we want to tolerate: 5% for an error of 2.5%.

That's not a trivial equation to solve for \(p_{min}\). Fortunately, that's the formula for the confidence interval of the binomial distribution, and there are a lot of sites that can calculate it:

Is the bug fixed?

So, you have tested a lot and calculated \(p_{min}\). The next step is fixing the bug.

After fixing the bug, you will want to test again, in order to confirm that the bug is fixed. How much testing is enough testing?

Let's say that \(t\) is the number of times we test the bug after it is fixed. Then, if our fix is not effective and the bug still presents itself with a probability greater than the \(p_{min}\) that we calculated, the probability of not seeing the bug after \(t\) tests is:

\[\alpha = (1-p_{min})^t \]

Here, \(\alpha\) is also the probability of making a type I error, while \(1 - \alpha\) is the statistical significance of our tests.

We now have two options:

  • arbitrarily determining a standard statistical significance and testing enough times to assert it.
  • test as much as we can and report the achieved statistical significance.

Both options are valid. The first one is not always feasible, as the cost of each trial can be high in time and/or other kind of resources.

The standard statistical significance in the industry is 5%, we recommend either that or less.

Formally, this is very similar to a statistical hypothesis testing.

Back to the Bug

Testing 20 times

This file has the results found after running our program 5000 times. We must never throw out data, but let's pretend that we have tested our program only 20 times. The observed \(k/n\) ration and the calculated \(p_{min}\) evolved as shown in the following graph:

After those 20 tests, our \(p_{min}\) is about 12%.

Suppose that we fix the bug and test it again. The following graph shows the statistical significance corresponding to the number of tests we do:

In words: we have to test 24 times after fixing the bug to reach 95% statistical significance, and 35 to reach 99%.

Now, what happens if we test more before fixing the bug?

Testing 5000 times

Let's now use all the results and assume that we tested 5000 times before fixing the bug. The graph bellow shows \(k/n\) and \(p_{min}\):

After those 5000 tests, our \(p_{min}\) is about 23% - much closer to the real \(p\).

The following graph shows the statistical significance corresponding to the number of tests we do after fixing the bug:

We can see in that graph that after about 11 tests we reach 95%, and after about 16 we get to 99%. As we have tested more before fixing the bug, we found a higher \(p_{min}\), and that allowed us to test less after fixing the bug.

Optimal testing

We have seen that we decrease \(t\) as we increase \(n\), as that can potentially increases our lower estimate for \(p\). Of course, that value can decrease as we test, but that means that we "got lucky" in the first trials and we are getting to know the bug better - the estimate is approaching the real value in a non-deterministic way, after all.

But, how much should we test before fixing the bug? Which value is an ideal value for \(n\)?

To define an optimal value for \(n\), we will minimize the sum \(n+t\). This objective gives us the benefit of minimizing the total amount of testing without compromising our guarantees. Minimizing the testing can be fundamental if each test costs significant time and/or resources.

The graph bellow shows us the evolution of the value of \(t\) and \(t+n\) using the data we generated for our bug:

We can see clearly that there are some low values of \(n\) and \(t\) that give us the guarantees we need. Those values are \(n = 15\) and \(t = 24\), which gives us \(t+n = 39\).

While you can use this technique to minimize the total number of tests performed (even more so when testing is expensive), testing more is always a good thing, as it always improves our guarantee, be it in \(n\) by providing us with a better \(p\) or in \(t\) by increasing the statistical significance of the conclusion that the bug is fixed. So, before fixing the bug, test until you see the bug at least once, and then at least the amount specified by this technique - but also test more if you can, there is no upper bound, specially after fixing the bug. You can then report a higher confidence in the solution.


When a programmer finds a bug that behaves in a non-deterministic way, he knows he should test enough to know more about the bug, and then even more after fixing it. In this article we have presented a framework that provides criteria to define numerically how much testing is "enough" and "even more." The same technique also provides a method to objectively measure the guarantee that the amount of testing performed provides, when it is not possible to test "enough."

We have also provided a real example (even though the bug itself is artificial) where the framework is applied.

As usual, the source code of this page (R scripts, etc) can be found and downloaded in

December 09, 2013 12:00 AM

December 01, 2013

Gregor Gorjanc

Read line by line of a file in R

Are you using R for data manipulation for later use with other programs, i.e., a workflow something like this:
  1. read data sets from a disk,
  2. modify the data, and
  3. write it back to a disk.
All fine, but of data set is really big, then you will soon stumble on memory issues. If data processing is simple and you can read only chunks, say only line by line, then the following might be useful:

## File
file <- "myfile.txt"
## Create connection
con <- file(description=file, open="r")
## Hopefully you know the number of lines from some other source or
com <- paste("wc -l ", file, " | awk '{ print $1 }'", sep="")
n <- system(command=com, intern=TRUE)
## Loop over a file connection
for(i in 1:n) {
tmp <- scan(file=con, nlines=1, quiet=TRUE)
## do something on a line of data
Created by Pretty R at

by Gregor Gorjanc ( at December 01, 2013 10:55 PM

November 20, 2013

R you ready?

Sending data from client to server and back using shiny

post-logoAfter some time of using shiny I got to the point where I needed to send some arbitrary data from the client to the server, process it with R and return some other data to the client. As a client/server programming newbie this was a challenge for me as I did not want to dive too deep into the world of web programming. I wanted to get the job done using shiny and preferably as little JS/PHP etc. scripting as possible.

It turns out that the task is quite simple as shiny comes with some currently undocumented functions under the hood that will make this task quite easy. You can find some more information on these functions here.

As mentioned above, I am a web programming newbie. So this post may be helpful for people with little web programming experience (just a few lines of JavaScript are needed) and who want to see a simple way of how to get the job done.

Sending data from client to server

Sending the data from the client to the server is accomplished by the JS function Shiny.onInputChange. This function takes a JS object and sends it to the shiny server. On the server side the object will be accessible as an R object under the name which is given as the second argument to the Shiny.onInputChange function. Let’s start by sending a random number to the server. The name of the object on the server side will be mydata.

Let’s create the shiny user interface file (ui.R). I will add a colored div, another element for verbatim text output called results and add the JavaScript code to send the data. The workhorse line is Shiny.onInputChange(“mydata”, number);. The JS code is included by passing it as a string to the tags$script function.

# ui.R

shinyUI( bootstrapPage(

  # a div named mydiv
  tags$div(id="mydiv", style="width: 50px; height :50px;
           left: 100px; top: 100px;
           background-color: gray; position: absolute"),

  # a shiny element to display unformatted text

  # javascript code to send data to shiny server
    document.getElementById("mydiv").onclick = function() {
      var number = Math.random();
      Shiny.onInputChange("mydata", number);


Now, on the server side, we can simply access the data that was sent by addressing it the usual way via the input object (i.e. input$mydata. The code below will make the verbatimTextOutput element results show the value that was initially passed to the server.

# server.R

shinyServer(function(input, output, session) {

    output$results = renderPrint({


You can copy the above files from here or run the code directly. When you run the code you will find that the random value in the upper box is updated if you click on the div.


What we have achieved so far is to pass some data to the server, access it and pass it back to a display on the client side. For the last part however, we have used a standard shiny element to send back the data to the client.

Sending data from server to client

Now let’s add a component to send custom data from the server back to the client. This task has two parts. On the client side we need to define a handler function. This is a function that will receive the data from the server and perform some task with it. In other words, the function will handle the received data. To register a handler the function Shiny.addCustomMessageHandler is used. I will name our handler function myCallbackHandler. Our handler function will use the received data and execute some JS code. In our case it will change the color of our div called mydiv according to the color value that is passed from the server to the handler. Let’s add the JS code below to the ui.R file.

# ui.R

# handler to receive data from server
        function(color) {
          document.getElementById("mydiv").style.backgroundColor = color;

Let’s move to the server side. I want the server to send the data to the handler function whenever the div is clicked, i.e. when the value of input$mydata changes. The sending of the data to the client is accomplished by an R function called sendCustomMessage which can be found in the session object. The function is passed the name of the client side handler function and the R object we want to pass to the function. Here, I create a random hex color value string that gets sent to a client handler function myCallbackHandler. The line sending the data to the client is contained in an observer. The observer includes the reactive object input$mydata, so the server will send someting to the client side handler function whenever the values of input$mydata changes. And it changes each time we click on the div. Let’s add the code below to the server.R file.

# server.R

# observes if value of mydata sent from the client changes.  if yes
# generate a new random color string and send it back to the client
# handler function called 'myCallbackHandler'
    color = rgb(runif(1), runif(1), runif(1))
    session$sendCustomMessage(type = "myCallbackHandler", color)

You can copy the above files from here or run the code directly. When you run the code you will see that the div changes color when you click on it.


That’s it. We have passed custom data from the client to the server and back. The following graphics sums up the functions that were used.


Passing more complex objects

The two functions also do a good job passing more complex JS or R objects. If you modify your code to send a JS object to shiny, it will be converted into an R list object on the server side. Let’s replace the JS object we send to the server (in ui.R) with following lines. On the server side, we will get a list.

document.getElementById("mydiv").onclick = function() {
  var obj = {one: [1,2,3,4],
             two: ["a", "b", "c"]};
  Shiny.onInputChange("mydata", obj);

Note that now however the shiny server will only execute the function once (on loading), not each time the click event is fired. The reason is, that now the input data is static, i.e. the JS object we send via onInputChange does not change. To reduce workload on the server side, the code in the observer will only be executed if the reactive value under observation (i.e. the value of input$mydata) changes. As this is not the case anymore as the value we pass is static, the observer that sends back the color information to the client to change the color of the div is not executed a second time.

The conversion also works nicely the other way round. We can pass an R list object to the sendCustomMessage function and on the client side it will appear as a JS object. So we are free to pass almost any type of data we need to.

Putting the JS code in a separate file

To keep things simple I included the JS code directly into the ui.R file using tags$script. This does not look very nice and you may want to put the JS code in a separate file instead. For this purpose I will create a JS file called mycode.js and include all the above JS code in it. Additionally, this file has another modification: All the code is wrapped into some JS/jQuery code ($(document).ready(function() { })that will make sure the JS code is run after the DOM (that is all the HTML elements) is loaded. Before, I simply placed the JS code below the HTML elements to make sure they are loaded, but I guess this is no good practice.

// mycode.js

$(document).ready(function() {

  document.getElementById("mydiv").onclick = function() {
    var number = Math.random();
    Shiny.onInputChange("mydata", number);

    function(color) {
      document.getElementById("mydiv").style.backgroundColor = color;


To include the JS file shiny offers the includeScript function to include JS files. The server.R file has not changed, the ui.R file now looks like this.

# server.R


shinyUI( bootstrapPage(

  # include the js code

  # a div named mydiv
           style="width: 50px; height :50px; left: 100px; top: 100px;
           background-color: gray; position: absolute"),

  # an element for unformatted text

You can copy the above files from here or run the gist directly from within R.


The above examples are purely artifical as it will not make much sense to let the server generate a random color value and send it back to the client. JS might just do all this on the client side without any need for client/server communiation at all. The examples are just for demonstration purposes to outline the mechanisms you may use for sending custom data to the server or client using the functions supplied by the marvellous shiny package. Winston Chang (one of the RStudio and shiny guys) has some more examples in his testapp repo. Have a look at the message-handler-inline and the message-handler-jsfile folders.


by markheckmann at November 20, 2013 06:26 PM

August 13, 2013

Gregor Gorjanc

Setup up the inverse of additive relationship matrix in R

Additive genetic covariance between individuals is one of the key concepts in (quantitative) genetics. When doing the prediction of additive genetic values for pedigree members, we need the inverse of the so called numerator relationship matrix (NRM) or simply A. Matrix A has off-diagonal entries equal to numerator of Wright's relationship coefficient and diagonal elements equal to 1 + inbreeding coefficient. I have blogged before about setting up such inverse in R using routine from the ASReml-R program or importing the inverse from the CFC program. However, this is not the only way to "skin this cat" in R. I am aware of the following attempts to provide this feature in R for various things (the list is probably incomplete and I would grateful if you point me to other implementations):
  • pedigree R package has function makeA() and makeAinv() with obvious meanings; there is also calcG() if you have a lot of marker data instead of pedigree information; there are also some other very handy functions calcInbreeding(), orderPed(), trimPed(), etc.
  • pedigreemm R package does not have direct implementation to get A inverse, but has all the needed ingredients, which makes the package even more interesting
  • MCMCglmm R package has function inverseA() which works with pedigree or phlyo objects; there are also handy functions such as prunePed(), rbv()sm2asreml(), etc.
  • kinship and kinship2 R packages have function kinship() to setup kinship matrix, which is equal to the half of A; there are also nice functions for plotting pedigrees etc. (see also here)
  • see also a series of R scripts for relationship matrices 
As I described before, the interesting thing is that setting up inverse of A is easier and cheaper than setting up A and inverting it. This is very important for large applications. This is an old result using the following matrix theory. We can decompose symmetric positive definite matrix as A = LU = LL' (Cholesky decomposition) or as A = LDU = LDL' (Generalized Cholesky decomposition), where L (U) is lower (upper) triangular, and D is diagonal matrix. Note that L and U in previous two equations are not the same thing (L from Cholesky is not equal to L from Generalized Cholesky decomposition)! Sorry for sloppy notation. In order to confuse you even more note that Henderson usually wrote A = TDT'. We can even do A = LSSU, where S diagonal is equal to the square root of D diagonal. This can get us back to A = LU = LL' as LSSU = LSSL' = LSS'L' = LS(LS)' = L'L (be ware of sloppy notation)! The inverse rule says that inv(A) = inv(LDU) = inv(U) inv(D) inv(L) = inv(L)' inv(D) inv(L) = inv(L)' inv(S) inv(S) inv(L). I thank to Martin Maechler for pointing out to the last (obviously) bit to me. In Henderson's notation this would be inv(A) = inv(T)' inv(D) inv(T) = inv(T)' inv(S) inv(S) inv(T) Uf ... The important bit is that with NRM (aka A) inv(L) has nice simple structure - it shows the directed graph of additive genetic values in pedigree, while inv(D) tells us about the precision (inverse variance) of additive genetic values given the additive genetic values of parents and therefore depends on knowledge of parents and their inbreeding (the more they are inbred less variation can we expect in their progeny). Both inv(L) and inv(D) are easier to setup.

Packages MCMCglmm and pedigree give us inv(A) directly (we can also get inv(D) in MCMCglmm), but pedigreemm enables us to play around with the above matrix algebra and graph theory. First we need a small example pedigree. Bellow is an example with 10 members and there is also some inbreeding and some individuals have both, one, or no parents known. It is hard to see inbreeding directly from the table, but we will improve that later (see also here).

ped <- data.frame( id=c(  1,   2,   3,   4,   5,   6,   7,   8,   9,  10),
fid=c( NA, NA, 2, 2, 4, 2, 5, 5, NA, 8),
mid=c( NA, NA, 1, NA, 3, 3, 6, 6, NA, 9))

Now we will create an object of a pedigree class and show the A = U'U stuff:

## install.packages(pkgs="pedigreemm")
ped2 <- with(ped, pedigree(sire=fid, dam=mid, label=id))
U <-
A &lt
;- crossprod(U)
round(U, digits=2)
## 10 x 10 sparse Matrix of class "dtCMatrix"
## [1,] 1 . 0.50 . 0.25 0.25 0.25 0.25 . 0.12
## [2,] . 1 0.50 0.50 0.50 0.75 0.62 0.62 . 0.31
## [3,] . . 0.71 . 0.35 0.35 0.35 0.35 . 0.18
## [4,] . . . 0.87 0.43 . 0.22 0.22 . 0.11
## [5,] . . . . 0.71 . 0.35 0.35 . 0.18
## [6,] . . . . . 0.71 0.35 0.35 . 0.18
## [7,] . . . . . . 0.64 . . .
## [8,] . . . . . . . 0.64 . 0.32
## [9,] . . . . . . . . 1 0.50
## [10,] . . . . . . . . . 0.66

## To check
U - chol(A)

round(A, digits=2)
## 10 x 10 sparse Matrix of class "dsCMatrix"
## [1,] 1.00 . 0.50 . 0.25 0.25 0.25 0.25 . 0.12
## [2,] . 1.00 0.50 0.50 0.50 0.75 0.62 0.62 . 0.31
## [3,] 0.50 0.50 1.00 0.25 0.62 0.75 0.69 0.69 . 0.34
## [4,] . 0.50 0.25 1.00 0.62 0.38 0.50 0.50 . 0.25
## [5,] 0.25 0.50 0.62 0.62 1.12 0.56 0.84 0.84 . 0.42
## [6,] 0.25 0.75 0.75 0.38 0.56 1.25 0.91 0.91 . 0.45
## [7,] 0.25 0.62 0.69 0.50 0.84 0.91 1.28 0.88 . 0.44
## [8,] 0.25 0.62 0.69 0.50 0.84 0.91 0.88 1.28 . 0.64
## [9,] . . . . . . . . 1.0 0.50
## [10,] 0.12 0.31 0.34 0.25 0.42 0.45 0.44 0.64 0.5

ote tha
m package uses Matrix classes in order to store only what we need to store, e.g., matrix U is triangular (t in "dtCMatrix") and matrix A is symmetric (s in "dsCMatrix"). To show the generalized Cholesky A = LDU (or using Henderson notation A = TDT') we use gchol() from the bdsmatrix R package. Matrix T shows the "flow" of genes in pedigree.

## install.packages(pkgs="bdsmatrix")
&lt;- gchol(as.matrix(A))
D &lt;- diag(tmp)
(T <- as(as.matrix(tmp), "dtCMatrix"))
## 10 x 10 sparse Matrix of class "dtCMatrix"
## [1,] 1.000 . . . . . . . . .
## [2,] . 1.0000 . . . . . . . .
## [3,] 0.500 0.5000 1.00 . . . . . . .
## [4,] . 0.5000 . 1.000 . . . . . .
## [5,] 0.250 0.5000 0.50 0.500 1.00 . . . . .
## [6,] 0.250 0.7500 0.50 . . 1.00 . . . .
## [7,] 0.250 0.6250 0.50 0.250 0.50 0.50 1 . . .
## [8,] 0.250 0.6250 0.50 0.250 0.50 0.50 . 1.0 . .
## [9,] . . . . . . . . 1.0 .
## [10,] 0.125 0.3125 0.25 0.125 0.25 0.25 . 0.5 0.5 1

## To chec
- T %*% diag(sqrt(D))
L - t(U)
Now the A inverse part (inv(A) = inv(T)' inv(D) inv(T) = inv(T)' inv(S) inv(S) inv(T) using Henderson's notation, note that ). The nice thing is that pedigreemm authors provided functions to get inv(T) and D.

(TInv <- as(ped2, "sparseMatrix"))
## 10 x 10 sparse Matrix of class "dtCMatrix" (unitriangular)
## 1 1.0 . . . . . . . . .
## 2 . 1.0 . . . . . . . .
## 3 -0.5 -0.5 1.0 . . . . . . .
## 4 . -0.5 . 1.0 . . . . . .
## 5 . . -0.5 -0.5 1.0 . . . . .
## 6 . -0.5 -0.5 . . 1.0 . . . .
## 7 . . . . -0.5 -0.5 1 . . .
## 8 . . . . -0.5 -0.5 . 1.0 . .
## 9 . . . . . . . . 1.0 .
## 10 . . . . . . . -0.5 -0.5 1
round(DInv <- Diagonal(x=1/Dmat(ped2)), digits=2)
## 10 x 10 diagonal matrix of class "ddiMatrix"
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 . . . . . . . . .
## [2,] . 1 . . . . . . . .
## [3,] . . 2 . . . . . . .
## [4,] . . . 1.33 . . . . . .
## [5,] . . . . 2 . . . . .
## [6,] . . . . . 2 . . . .
## [7,] . . . . . . 2.46 . . .
## [8,] . . . . . . . 2.46 . .
## [9,] . . . . . . . . 1 .
## [10,] . . . . . . . . . 2.33
round(t(TInv) %*% DInv %*% TInv, digits=2)
## 10 x 10 sparse Matrix of class "dgCMatrix"
## .
und(crossprod(sqrt(DInv) %*% TInv), digits=2)
## 10 x 10 sparse Matrix of class "dsCMatrix"
##  [1,]  1.5  0.50 -1.0  .     .     .     .     .     .     .   
## [2,] 0.5 2.33 -0.5 -0.67 . -1.00 . . . .
## [3,] -1.0 -0.50 3.0 0.50 -1.00 -1.00 . . . .
## [4,] . -0.67 0.5 1.83 -1.00 . . . . .
## [5,] . . -1.0 -1.00 3.23 1.23 -1.23 -1.23 . .
## [6,] . -1.00 -1.0 . 1.23 3.23 -1.23 -1.23 . .
## [7,] . . . . -1.23 -1.23 2.46 . . .
## [8,] . . . . -1.23 -1.23 . 3.04 0.58 -1.16
## [9,] . . . . . . . 0.58 1.58 -1.16
## [10,] . . . . . . . -1.16 -1.16

# T
o c
) - crossprod(sqrt(DInv) %*% TInv)

The second method (using crossprod) is preferred as it leads directly to symmetric matrix (dsCMatrix), which stores only upper or lower triangle. And make sure you do not do crossprod(TInv %*% sqrt(DInv)) as it is the wrong order of matrices.

As promised we will display (plot) pedigree by use of conversion functions of matrix objects to graph objects using the following code. Two examples are provided using the graph and igraph packages. The former does a very good job on this example, but otherwise igraph seems to have much nicer support for editing etc.

## source("")
## biocLite(pkgs=c("graph", "Rgraphviz"))
g <- as(t(TInv), "graph")

## install.packages(pkgs="igraph")
i &l
t;- igraph.from.graphNEL(graphNEL=g)
i)$label <- 1:10
(i, layout=layout.kamada.kawai)
## tkplot(i)

by Gregor Gorjanc ( at August 13, 2013 02:28 PM

July 02, 2013

Gregor Gorjanc

Parse arguments of an R script

R can be used also as a scripting tool. We just need to add shebang in the first line of a file (script):


and then the R code should follow.

Often we want to pass arguments to such a script, which can be collected in the script by the commandArgs() function. Then we need to parse the arguments and conditional on them do something. I came with a rather general way of parsing these arguments using simply these few lines:

## Collect arguments
args <- commandArgs(TRUE)
## Default setting when no arguments passed
if(length(args) < 1) {
args <- c("--help")
## Help section
if("--help" %in% args) {
The R Script
--arg1=someValue - numeric, blah blah
--arg2=someValue - character, blah blah
--arg3=someValue - logical, blah blah
--help - print this text
./test.R --arg1=1 --arg2="
output.txt" --arg3=TRUE \n\n")
## Parse arguments (we expect the form --arg=value)
parseArgs <- function(x) strsplit(sub("^--", "", x), "=")
argsDF <-"rbind", parseArgs(args)))
argsL <- as.list(as.character(argsDF$V2))
names(argsL) <- argsDF$V1
## Arg1 default
if(is.null(args$arg1)) {
## do something
## Arg2 default
if(is.null(args$arg2)) {
## do something
## Arg3 default
if(is.null(args$arg3)) {
## do something
## ... your code here ...
Created by Pretty R at

It is some work, but I find it pretty neat and use it for quite a while now. I do wonder what others have come up for this task. I hope I did not miss some very general solution.

by Gregor Gorjanc ( at July 02, 2013 04:55 PM

March 24, 2013

Romain Francois



This blog is moving to The new one is powered by wordpress and gets a subdomain of

See you there

by romain francois at March 24, 2013 03:52 PM

March 17, 2013

Modern Toolmaking

caretEnsemble Classification example

Here's a quick demo of how to fit a binary classification model with caretEnsemble.  Please note that I haven't spent as much time debugging caretEnsemble for classification models, so there's probably more bugs than my last post.  Also note that multi class models are not yet supported.

Right now, this code fails for me if I try a model like a nnet or an SVM for stacking, so there's clearly bugs to fix.

The greedy model relies 100% on the gbm, which makes sense as the gbm has an AUC of 1 on the training set.  The linear model uses all of the models, and achieves an AUC of .5.  This is a little weird, as the gbm, rf, SVN, and knn all achieve an AUC of close to 1.0 on the training set, and I would have expected the linear model to focus on these predictions. I'm not sure if this is a bug, or a failure of my stacking model.

by Zachary Deane-Mayer ( at March 17, 2013 04:04 AM