Planet R

March 27, 2015

Removed CRANberries

Package rsprng (with last version 1.0) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2010-03-09 1.0
2008-06-08 0.4
2007-05-10 0.3-4

March 27, 2015 09:13 AM

Package hgm (with last version 1.10) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2015-03-27 1.10

March 27, 2015 09:13 AM

New package PAFit with initial version 0.7.2

Package: PAFit
Type: Package
Title: Nonparametric Estimation of Preferential Attachment and Node Fitness in Temporal Complex Networks
Version: 0.7.2
Date: 2015-03-28
Author: Thong Pham, Paul Sheridan, Hidetoshi Shimodaira
Maintainer: Thong Pham
Description: A statistically sound method for estimating jointly the attachment function and node fitness in a temporal complex network by maximizing a suitable penalized log-likelihood function is implemented in this package.
License: GPL-3
Depends: R(>= 2.10.0)
Imports: Rcpp (>= 0.11.3)
LinkingTo: Rcpp
LazyData: True
NeedsCompilation: yes
Suggests: R.rsp
VignetteBuilder: R.rsp
Packaged: 2015-03-27 06:49:35 UTC; pham
Repository: CRAN
Date/Publication: 2015-03-27 08:31:40

More information about PAFit at CRAN

March 27, 2015 09:13 AM

New package ELYP with initial version 0.7-1

Package: ELYP
Maintainer: Mai Zhou
Version: 0.7-1
Depends: R (>= 2.15.0), survival
Suggests: emplik
Author: Mai Zhou
Description: Empirical likelihood ratio tests for the Yang and Prentice (short/long term hazard ratio) models. Empirical likelihood tests within a Cox model, for parameters defined via both baseline hazard function and regression parameters.
Title: Empirical Likelihood Analysis for the Cox Model and Yang-Prentice (2005) Model
License: GPL (>= 2)
NeedsCompilation: yes
Packaged: 2015-03-27 06:41:46 UTC; mai
Repository: CRAN
Date/Publication: 2015-03-27 08:31:37

More information about ELYP at CRAN

March 27, 2015 09:13 AM

March 26, 2015

Statistical Modelling

Regression with compositional response having unobserved components or below detection limit values

The typical way to deal with zeros and missing values in compositional data sets is to impute them with a reasonable value, and then the desired statistical model is estimated with the imputed data set, e.g., a regression model. This contribution aims at presenting alternative approaches to this problem within the framework of Bayesian regression with a compositional response. In the first step, a compositional data set with missing data is considered to follow a normal distribution on the simplex, which mean value is given as an Aitchison affine linear combination of some fully observed explanatory variables. Both the coefficients of this linear combination and the missing values can be estimated with standard Gibbs sampling techniques. In the second step, a normally distributed additive error is considered superimposed on the compositional response, and values are taken as ‘below the detection limit’ (BDL) if they are ‘too small’ in comparison with the additive standard deviation of each variable. Within this framework, the regression parameters and all missing values (including BDL) can be estimated with a Metropolis-Hastings algorithm. Both methods estimate the regression coefficients without need of any preliminary imputation step, and adequately propagate the uncertainty derived from the fact that the missing values and BDL are not actually observed, something imputation methods cannot achieve.

by van den Boogaart, K. G., Tolosana-Delgado, R., Templ, M. at March 26, 2015 05:51 AM

Tools for compositional data with a total

Compositional data analysis usually deals with relative information between parts where the total (abundances, mass, amount, etc.) is unknown or uninformative. This article addresses the question of what to do when the total is known and is of interest. Tools used in this case are reviewed and analysed, in particular the relationship between the positive orthant of D-dimensional real space, the product space of the real line times the D-part simplex, and their Euclidean space structures. The first alternative corresponds to data analysis taking logarithms on each component, and the second one to treat a log-transformed total jointly with a composition describing the distribution of component amounts. Real data about total abundances of phytoplankton in an Australian river motivated the present study and are used for illustration.

by Pawlowsky-Glahn, V., Egozcue, J. J., Lovell, D. at March 26, 2015 05:51 AM

Sparse principal balances

Compositional data analysis deals with situations where the relevant information is contained only in the ratios between the measured variables, and not in the reported values. This article focuses on high-dimensional compositional data (in the sense of hundreds or even thousands of variables), as they appear in chemometrics (e.g., mass spectral data), proteomics or genomics. The goal of this contribution is to perform a dimension reduction of such data, where the new directions should allow for interpretability. An approach named principal balances turned out to be successful for low dimensions. Here, the concept of sparse principal component analysis is proposed for constructing principal directions, the so-called sparse principal balances. They are sparse (contain many zeros), build an orthonormal basis in the sample space of the compositional data, are efficient for dimension reduction and are applicable to high-dimensional data.

by Mert, M. C., Filzmoser, P., Hron, K. at March 26, 2015 05:51 AM

March 22, 2015

RCpp Gallery

Parsing Dates and Times


R has excellent support for dates and times via the built-in Date and POSIXt classes. Their usage, however, is not always as straightforward as one would want. Certain conversions are more cumbersome than we would like: while as.Date("2015-03-22"), would it not be nice if as.Date("20150322") (a format often used in logfiles) also worked, or for that matter as.Date(20150322L) using an integer variable, or even as.Date("2015-Mar-22") and as.Date("2015Mar22")?

Similarly, many date and time formats suitable for POSIXct (the short form) and POSIXlt (the long form with accessible components) often require rather too much formatting, and/or defaults. Why for example does as.POSIXct(as.numeric(Sys.time()), origin="1970-01-01") require the origin argument on the conversion back (from fractional seconds since the epoch) into datetime—when it is not required when creating the double-precision floating point representation of time since the epoch?

But thanks to Boost and its excellent Boost Date_Time library—which we already mentioned in this post about the BH package— we can address parsing of dates and times. It permitted us to write a new function toPOSIXct() which now part of the RcppBDT package (albeit right now just the GitHub version but we expect this to migrate to CRAN “soon” as well).


We will now discuss the outline of this implementation. For full details, see the source file.

Headers and Constants

#include <boost/date_time.hpp>
#include <boost/lexical_cast.hpp>
#include <Rcpp.h>

// [[Rcpp::depends(BH)]]

namespace bt = boost::posix_time;

const std::locale formats[] = {    // this shows a subset only, see the source file for full list
    std::locale(std::locale::classic(), new bt::time_input_facet("%Y-%m-%d %H:%M:%S%f")),
    std::locale(std::locale::classic(), new bt::time_input_facet("%Y/%m/%d %H:%M:%S%f")),

    std::locale(std::locale::classic(), new bt::time_input_facet("%Y-%m-%d")),
    std::locale(std::locale::classic(), new bt::time_input_facet("%b/%d/%Y")),
const size_t nformats = sizeof(formats)/sizeof(formats[0]);

Note that we show only two datetime formats along with two date formats. The actual implementation has many more.

Core Converter

The actual conversion from string to a double (the underlying format in POSIXct) is done by the following function. It loops over all given formats, and returns the computed value after the first match. In case of failure, a floating point NA is returned.

double stringToTime(const std::string s) {

    bt::ptime pt, ptbase;

    // loop over formats and try them til one fits
    for (size_t i=0; pt == ptbase && i < nformats; ++i) {
        std::istringstream is(s);
        is >> pt;
    if (pt == ptbase) {
        return NAN;
    } else { 
        const bt::ptime timet_start(boost::gregorian::date(1970,1,1));
        bt::time_duration diff = pt - timet_start;

        // Define BOOST_DATE_TIME_POSIX_TIME_STD_CONFIG to use nanoseconds
        // (and then use diff.total_nanoseconds()/1.0e9;  instead)
        return diff.total_microseconds()/1.0e6;

Convenience Wrappers

We want to be able to convert from numeric as well as string formats. For this, we write a templated (and vectorised) function which invokes the actual conversion function for each argument. It also deals (somewhat heuristically) with two corner cases: we want 20150322 be converted from either integer or numeric, but need in the latter case distinguish this value and its rangue from the (much larger) value for seconds since the epoch. That creates a minir ambiguity: we will not be able to convert back for inputs from seconds since the epoch for the first few years since January 1, 1970. But as these are rare in the timestamp form we can accept the trade-off.

template <int RTYPE>
Rcpp::DatetimeVector toPOSIXct_impl(const Rcpp::Vector<RTYPE>& sv) {

    int n = sv.size();
    Rcpp::DatetimeVector pv(n);
    for (int i=0; i<n; i++) {
        std::string s = boost::lexical_cast<std::string>(sv[i]);
        //Rcpp::Rcout << sv[i] << " -- " << s << std::endl;

        // Boost Date_Time gets the 'YYYYMMDD' format wrong, even
        // when given as an explicit argument. So we need to test here.
        // While we are at it, may as well test for obviously wrong data.
        int l = s.size();
        if ((l < 8) ||          // impossibly short
            (l == 9)) {         // 8 or 10 works, 9 cannot
            Rcpp::stop("Inadmissable input: %s", s);
        } else if (l == 8) {    // turn YYYYMMDD into YYYY/MM/DD
            s = s.substr(0, 4) + "/" + s.substr(4, 2) + "/" + s.substr(6,2);
        pv[i] = stringToTime(s);
    return pv;

User-facing Function

Finally, we can look at the user-facing function. It accepts input in either integer, numeric or character vector form, and then dispatches accordingly to the templated internal function we just discussed. Other inputs are unsuitable and trigger an error.

// [[Rcpp::export]]
Rcpp::DatetimeVector toPOSIXct(SEXP x) {
    if (Rcpp::is<Rcpp::CharacterVector>(x)) {
        return toPOSIXct_impl<STRSXP>(x);
    } else if (Rcpp::is<Rcpp::IntegerVector>(x)) {
        return toPOSIXct_impl<INTSXP>(x); 
    } else if (Rcpp::is<Rcpp::NumericVector>(x)) {
        // here we have two cases: either we are an int like
        // 200150315 'mistakenly' cast to numeric, or we actually
        // are a proper large numeric (ie as.numeric(Sys.time())
        Rcpp::NumericVector v(x);
        if (v[0] < 21990101) {  // somewhat arbitrary cuttoff
            // actual integer date notation: convert to string and parse
            return toPOSIXct_impl<REALSXP>(x);
        } else {
            // we think it is a numeric time, so treat it as one
            return Rcpp::DatetimeVector(x);
    } else {
        Rcpp::stop("Unsupported Type");
        return R_NilValue;//not reached


A simply illustration follows. A fuller demonstration is part of the RcppBDT package. This already shows support for subsecond granularity and a variety of date formats.

## parsing character
s <- c("2004-03-21 12:45:33.123456",    # ISO
       "2004/03/21 12:45:33.123456",    # variant
       "20040321",                      # just dates work fine as well
       "Mar/21/2004",                   # US format, also support month abbreviation or full
       "rapunzel")                      # will produce a NA

p <- toPOSIXct(s)

options("digits.secs"=6)                # make sure we see microseconds in output
print(format(p, tz="UTC"))              # format UTC times as UTC (helps for Date types too)
[1] "2004-03-21 12:45:33.123456" "2004-03-21 12:45:33.123456"
[3] "2004-03-21 00:00:00.000000" "2004-03-21 00:00:00.000000"
[5] NA                          

We can also illustrate integer and numeric inputs:

## parsing integer types
s <- c(20150315L, 20010101L, 20141231L)
p <- toPOSIXct(s)
print(format(p, tz="UTC"))
[1] "2015-03-15" "2001-01-01" "2014-12-31"
## parsing numeric types
s <- c(20150315, 20010101, 20141231)
print(format(p, tz="UTC"))
[1] "2015-03-15" "2001-01-01" "2014-12-31"

Note that we always forced display using UTC rather local time, the R default.

March 22, 2015 12:00 AM

March 20, 2015

Journal of Statistical Software

BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments

Vol. 64, Issue 11, Mar 2015


Empirical analysis of statistical algorithms often demands time-consuming experiments. We present two R packages which greatly simplify working in batch computing environments. The package BatchJobs implements the basic objects and procedures to control any batch cluster from within R. It is structured around cluster versions of the well-known higher order functions Map, Reduce and Filter from functional programming. Computations are performed asynchronously and all job states are persistently stored in a database, which can be queried at any point in time. The second package, BatchExperiments, is tailored for the still very general scenario of analyzing arbitrary algorithms on problem instances. It extends package BatchJobs by letting the user define an array of jobs of the kind “apply algorithm A to problem instance P and store results”. It is possible to associate statistical designs with parameters of problems and algorithms and therefore to systematically study their influence on the results.

The packages’ main features are: (a) Convenient usage: All relevant batch system operations are either handled internally or mapped to simple R functions. (b) Portability: Both packages use a clear and well-defined interface to the batch system which makes them applicable in most high-performance computing environments. (c) Reproducibility: Every computational part has an associated seed to ensure reproducibility even when the underlying batch system changes. (d) Abstraction and good software design: The code layers for algorithms, experiment definitions and execution are cleanly separated and enable the writing of readable and maintainable code.

March 20, 2015 07:00 AM

gems: An R Package for Simulating from Disease Progression Models

Vol. 64, Issue 10, Mar 2015


Mathematical models of disease progression predict disease outcomes and are useful epidemiological tools for planners and evaluators of health interventions. The R package gems is a tool that simulates disease progression in patients and predicts the effect of different interventions on patient outcome. Disease progression is represented by a series of events (e.g., diagnosis, treatment and death), displayed in a directed acyclic graph. The vertices correspond to disease states and the directed edges represent events. The package gems allows simulations based on a generalized multistate model that can be described by a directed acyclic graph with continuous transition-specific hazard functions. The user can specify an arbitrary hazard function and its parameters. The model includes parameter uncertainty, does not need to be a Markov model, and may take the history of previous events into account. Applications are not limited to the medical field and extend to other areas where multistate simulation is of interest. We provide a technical explanation of the multistate models used by gems, explain the functions of gems and their arguments, and show a sample application.

March 20, 2015 07:00 AM

nparcomp: An R Software Package for Nonparametric Multiple Comparisons and Simultaneous Con*dence Intervals

Vol. 64, Issue 9, Mar 2015


One-way layouts, i.e., a single factor with several levels and multiple observations at each level, frequently arise in various fields. Usually not only a global hypothesis is of interest but also multiple comparisons between the different treatment levels. In most practical situations, the distribution of observed data is unknown and there may exist a number of atypical measurements and outliers. Hence, use of parametric and semiparametric procedures that impose restrictive distributional assumptions on observed samples becomes questionable. This, in turn, emphasizes the demand on statistical procedures that enable us to accurately and reliably analyze one-way layouts with minimal conditions on available data. Nonparametric methods offer such a possibility and thus become of particular practical importance. In this article, we introduce a new R package nparcomp which provides an easy and user-friendly access to rank-based methods for the analysis of unbalanced one-way layouts. It provides procedures performing multiple comparisons and computing simultaneous confidence intervals for the estimated effects which can be easily visualized. The special case of two samples, the nonparametric Behrens-Fisher problem, is included. We illustrate the implemented procedures by examples from biology and medicine.

March 20, 2015 07:00 AM

R Package multgee: A Generalized Estimating Equations Solver for Multinomial Responses

Vol. 64, Issue 8, Mar 2015


The R package multgee implements the local odds ratios generalized estimating equations (GEE) approach proposed by Touloumis, Agresti, and Kateri (2013), a GEE approach for correlated multinomial responses that circumvents theoretical and practical limitations of the GEE method. A main strength of multgee is that it provides GEE routines for both ordinal (ordLORgee) and nominal (nomLORgee) responses, while relevant other softwares in R and SAS are restricted to ordinal responses under a marginal cumulative link model specification. In addition, multgee offers a marginal adjacent categories logit model for ordinal responses and a marginal baseline category logit model for nominal responses. Further, utility functions are available to ease the local odds ratios structure selection ( and to perform a Wald type goodness-of-fit test between two nested GEE models (waldts). We demonstrate the application of multgee through a clinical trial with clustered ordinal multinomial responses.

March 20, 2015 07:00 AM

March 14, 2015

Dirk Eddelbuettel

littler 0.2.3

max-heap image

A new minor release of littler is available now.

It adds or extends a number of things:

  • added support for drat by adding a new example installDrat.r;

  • the install.r, install2.r and check.r scripts now use getOption("repos") to set the default repos; this works well with drat and multiple repos set via, e.g. ~/.littler.r or /etc/littler.r;

  • added support for installing Debian binaries as part of a check.r run, this can be particularly useful for one-command checks as done by some of the Rocker containers;

  • added support for reproducible builds: if REPRODUCIBLE_BUILD is defined, no date and time stamp is added to the binary;

  • added new command-line option -L|--libpath to expand the library path used for packages;

  • added support for setting multiple repos from the command-line in the install2.r script;

  • the manual page was updated with respect to recent additions;

  • a link to the examples web page was added to the --usage output display;

See the littler examples page for more details.

Full details for the littler release are provided as usual at the ChangeLog page.

The code is available via the GitHub repo, from tarballs off my littler page and the local directory here. A fresh package has gone to the incoming queue at Debian; Michael Rutter will probably have new Ubuntu binaries at CRAN in a few days too.

Comments and suggestions are welcome via the mailing list or issue tracker at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

March 14, 2015 03:36 PM

March 13, 2015

Dirk Eddelbuettel

Why Drat? A Guest Post by Steven Pav

Editorial Note: The following post was kindly contributed by Steven Pav.

Why Drat?

After playing around with drat for a few days now, my impressions of it are best captured by Dirk's quote:

It just works.


To get some idea of what I mean by this, suppose you are a happy consumer of R packages, but want access to, say, the latest, greatest releases of my distribution package, sadist. You can simply add the following to your .Rprofile file:


After this, you instantly have access to new releases in the github/shabbychef drat store via the package tools you already know and tolerate. You can use


to install the sadists package from the drat store, for example. Similarly, if you issue


all the drat stores you have added will be checked for package updates, along with their dependencies which may well come from other repositories including CRAN.

Use cases

The most obvious use cases are:

  1. Micro releases. For package authors, this provides a means to get feedback from the early adopters, but also allows one to push small changes and bug fixes without burning through your CRAN karma (if you have any left). My personal drat store tends to be a few minor releases ahead of my CRAN releases.

  2. Local repositories. In my professional life, I write and maintain proprietary packages. Pushing package updates used to involve saving the package .tar.gz to a NAS, then calling something like R CMD INSTALL package_name_0.3.1.9001.tar.gz. This is not something I wanted to ask of my colleagues. With drat, they can instead add the following stanza to .Rprofile: drat:::addRepo('localRepo','file:///mnt/NAS/r/local/drat'), and then rely on update.packages to do the rest.

I suspect that in the future, drat might be (ab)used in the following ways:

  1. Rolling your own vanilla CRAN mirror, though I suspect there are better existing ways to accomplish this.

  2. Patching CRAN. Suppose you found a bug in a package on CRAN (inconceivable!). As it stands now, you email the maintainer, and wait for a fix. Maybe the patch is trivial, but suppose it is never delivered. Now, you can simply make the patch yourself, pick a higher revision number, and stash it in your drat store. The only downside is that eventually the package maintainer might bump their revision number without pushing a fix, and you are stuck in an arms race of version numbers.

  3. Forgoing CRAN altogether. While some package maintainers might find this attractive, I think I would prefer a single huge repository, warts and all, to a landscape of a million microrepos. Perhaps some enterprising group will set up a CRAN-like drat store on github, and accept packages by pull request (whether github CDN can or will support the traffic that CRAN does is another matter), but this seems a bit too futuristic for me now.

My wish list

In exchange for writing this blog post, I get to lobby Dirk for some features in drat:

  1. I shudder at the thought of hundreds of tiny drat stores. Perhaps there should be a way to aggregate addRepo commands in some way. This would allow curators to publish their suggested lists of repos.

  2. Drat stores are served in the gh-pages branch of a github repo. I wish there were some way to keep the index.html file in that directory reflect the packages present in the sources. Maybe this could be achieved with some canonical RMarkdown code that most people use.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

March 13, 2015 11:47 PM

Bioconductor Project Working Papers

Targeted Estimation and Inference for the Sample Average Treatment Effect

While the population average treatment effect has been the subject of extensive methods and applied research, less consideration has been given to the sample average treatment effect: the mean difference in the counterfactual outcomes for the study units. The sample parameter is easily interpretable and is arguably the most relevant when the study units are not representative of a greater population or when the exposure's impact is heterogeneous. Formally, the sample effect is not identifiable from the observed data distribution. Nonetheless, targeted maximum likelihood estimation (TMLE) can provide an asymptotically unbiased and efficient estimate of both the population and sample parameters. In this paper, we study the asymptotic and finite sample properties of the TMLE for the sample effect and provide a conservative variance estimator. In most settings, the sample parameter can be estimated more efficiently than the population parameter. Finite sample simulations illustrate the potential gains in precision and power from selecting the sample effect as the target of inference. As a motivating example, we discuss the Sustainable East Africa Research in Community Health (SEARCH) study, an ongoing cluster randomized trial for HIV prevention and treatment.

by Laura B. Balzer et al. at March 13, 2015 06:20 PM


The primary analysis in many randomized controlled trials focuses on the average treatment effect and does not address whether treatment benefits are widespread or limited to a select few. This problem affects many disease areas, since it stems from how randomized trials, often the gold standard for evaluating treatments, are designed and analyzed. Our goal is to estimate the fraction who benefit from a treatment, based on randomized trial data. We consider cases where the primary outcome is continuous, discrete, or ordinal. In general, the fraction who benefit is a non-identifiable parameter, and the best that can be obtained are sharp lower and upper bounds on it. We develop a method to estimate these bounds using a novel application of linear programming, which allows fast implementation. MATLAB software is provided. The method can incorporate information from prognostic baseline variables in order to improve precision, without requiring parametric model assumptions. Also, assumptions based on subject matter knowledge can be incorporated to improve the bounds. We apply our general method to estimate lower and upper bounds on the fraction who benefit from a new surgical intervention for stroke.

by Emily Huang et al. at March 13, 2015 05:07 PM

March 10, 2015

Bioconductor Project Working Papers

Statistical Estimation of T1 Relaxation Time Using Conventional Magnetic Resonance Imaging

Quantitative T1 maps (qT1) are often used to study diffuse tissue abnormalities that may be difficult to assess on standard clinical sequences. While qT1 maps can provide valuable information for studying the progression and treatment of diseases like multiple sclerosis, the additional scan time required and multi-site implementation issues have limited their inclusion in many standard clinical and research protocols. Hence, the availability of qT1 maps has historically been limited.

In this paper, we propose a new method of estimating T1 maps retroactively that only requires the acquisition or availability of four conventional MRI sequences. For these sequences, we employ a novel normalization method using cerebellar gray matter as a reference tissue, which allows diffuse differences in cerebral normal-appearing white matter (NAWM) to be detected. We use a regression model, fit separately to each tissue class, that relates the normalized intensities of each sequence to the acquired qT1 map value at each voxel using smooth functions. We test our model on a set of 63 subjects, including primary progressive (PPMS), relapsing-remitting (RRMS) and secondary progressive multiple sclerosis (SPMS) patients and healthy controls, and generate statistical qT1 maps using cross-validation. We find the estimation error of these maps to be similar to the measurement error of the acquired qT1 maps, and we find the prediction error of the statistical and acquired qT1 maps to be similar. Visually, the statistical qT1 maps are similar to but less noisy than the acquired qT1 maps. Nonparametric tests of group differences in NAWM relative to healthy controls show similar results whether acquired or statistical qT1 maps are used, but the statistical qT1 maps have more power to detect group differences than the acquired maps.

by Amanda Mejia et al. at March 10, 2015 10:42 PM

March 09, 2015

Journal of the Royal Statistical Society: Series C

March 06, 2015


Probability Theory: Convergence in Distribution Problem

Let's solve some theoretical problem in probability, specifically on convergence. The problem below is originally from Exercise 5.42 of Casella and Berger (2001). And I just want to share my solution on this. If there is an incorrect argument below, I would be happy if you could point that to me.


Let $X_1, X_2,\cdots$ be iid (independent and identically distributed) and $X_{(n)}=\max_{1\leq i\leq n}x_i$.
  1. If $x_i\sim$ beta(1,$\beta$), find a value of $\nu$ so that $n^{\nu}(1-X_{(n)})$ converges in distribution;
  2. If $x_i\sim$ exponential(1), find a sequence $a_n$ so that $X_{(n)}-a_n$ converges in distribution.


  1. Let $Y_n=n^{\nu}(1-X_{(n)})$, we say that $Y_n\rightarrow Y$ in distribution. If $$\lim_{n\rightarrow \infty}F_{Y_n}(y)=F_Y(y).$$ Then, $$ \begin{aligned} \lim_{n\rightarrow\infty}F_{Y_n}(y)&=\lim_{n\rightarrow\infty}P(Y_n\leq y)=\lim_{n\rightarrow\infty}P(n^{\nu}(1-X_{(n)})\leq y)\\ &=\lim_{n\rightarrow\infty}P\left(1-X_{(n)}\leq \frac{y}{n^{\nu}}\right)\\ &=\lim_{n\rightarrow\infty}P\left(-X_{(n)}\leq \frac{y}{n^{\nu}}-1\right)=\lim_{n\rightarrow\infty}\left[1-P\left(-X_{(n)}> \frac{y}{n^{\nu}}-1\right)\right]\\ &=\lim_{n\rightarrow\infty}\left[1-P\left(\max\{X_1,X_2,\cdots,X_n\}< 1-\frac{y}{n^{\nu}}\right)\right]\\ &=\lim_{n\rightarrow\infty}\left[1-P\left(X_1< 1-\frac{y}{n^{\nu}},X_2< 1-\frac{y}{n^{\nu}},\cdots,X_n< 1-\frac{y}{n^{\nu}}\right)\right]\\ &=\lim_{n\rightarrow\infty}\left[1-P\left(X_1< 1-\frac{y}{n^{\nu}}\right)^n\right],\;\text{since}\;X_i's\;\text{are iid.} \end{aligned} $$ And because $x_i\sim$ beta(1,$\beta$), the density is $$ f_{X_1}(x)=\begin{cases} \beta(1-x)^{\beta - 1}&\beta>0, 0\leq x\leq 1\\ 0,&\mathrm{Otherwise} \end{cases} $$ Implies, $$ \begin{aligned} \lim_{n\to \infty}P(Y_n\leq y)&=\lim_{n\to \infty}\left\{1-\left[\int_0^{1-\frac{y}{n^{\nu}}}\beta(1-t)^{\beta-1}\,\mathrm{d}t\right]^n\right\}\\ &=\lim_{n\to \infty}\left\{1-\left[-\int_1^{\frac{y}{n^{\nu}}}\beta u^{\beta-1}\,\mathrm{d}u\right]^{n}\right\}\\ &=\lim_{n\to \infty}\left\{1-\left[-\beta\frac{u^{\beta}}{\beta}\bigg|_{u=1}^{u=\frac{y}{n^{\nu}}}\right]^{n}\right\}\\ &=1-\lim_{n\to \infty}\left[1-\left(\frac{y}{n^{\nu}}\right)^{\beta}\right]^{n} \end{aligned} $$ We can simplify the limit if $\nu=\frac{1}{\beta}$, that is $$ \lim_{n\to\infty}P(Y_n\leq y)=1-\lim_{n\to\infty}\left[1-\frac{y^{\beta}}{n}\right]^{n}=1-e^{-y^{\beta}} $$ To confirm this in Python, run the following code using the sympy module

    Therefore, if $1-e^{-y^{\beta}}$ is a distribution function of $Y$, then $Y_n=n^{\nu}(1-X_{(n)})$ converges in distribution to $Y$ for $\nu=\frac{1}{\beta}$.
  2. $$ \begin{aligned} P(X_{(n)}-a_{n}\leq y) &= P(X_{(n)}\leq y + a_n)=P(\max\{X_1,X_2,\cdots,X_n\}\leq y+a_n)\\ &=P(X_1\leq y+a_n,X_2\leq y+a_n,\cdots,X_n\leq y+a_n)\\ &=P(X_1\leq y+a_n)^n,\;\text{since}\;x_i's\;\text{are iid}\\ &=\left[\int_{-\infty}^{y+a_n}f_{X_1}(t)\,\mathrm{d}t\right]^n \end{aligned} $$ Since $X_i\sim$ exponential(1), then the density is $$ f_{X_1}=\begin{cases} e^{-x},&0\leq x\leq \infty\\ 0,&\mathrm{otherwise} \end{cases} $$ So that, $$ \begin{aligned} P(X_{(n)}-a_{n}\leq y)&=\left[\int_{0}^{y+a_n}e^{-t}\,\mathrm{d}t\right]=\left\{-\left[e^{-(y+a_n)}-1\right]\right\}^n\\ &=\left[1-e^{-(y+a_n)}\right]^n \end{aligned} $$ If we let $Y_n=X_{(n)}-a_n$, then we say that $Y_n\rightarrow Y$ in distribution if $$ \lim_{n\to\infty}P(Y_n\leq y)=P(Y\leq y) $$ Therefore, $$ \begin{aligned} \lim_{n\to\infty}P(Y_n\leq y) &= \lim_{n\to\infty}P(X_{(n)}-a_n\leq y)=\lim_{n\to \infty}\left[1-e^{-y-a_n}\right]^n\\ &=\lim_{n\to\infty}\left[1-\frac{e^{-y}}{e^{a_n}}\right]^n \end{aligned} $$ We can simplify the limit if $a_n=\log n$, that is $$ \lim_{n\to\infty}\left[1-\frac{e^{-y}}{e^{\log n}}\right]^n=\lim_{n\to\infty}\left[1-\frac{e^{-y}}{n}\right]^n=e^{-e^{-y}} $$ Check this in Python by running the following code,

    In conclusion, if $e^{-e^{-y}}$ is a distribution function of Y, then $Y_n=X_{(n)}-a_n$ converges in distribution to $Y$ for sequence $a_n=\log n$.


  1. Casella, G. and Berger, R.L. (2001). Statistical Inference. Thomson Learning, Inc.

by Al-Ahmadgaid Asaad ( at March 06, 2015 11:27 AM

Dirk Eddelbuettel

Rcpp 0.11.5

The new release 0.11.5 of Rcpp just reached the CRAN network for GNU R, and a Debian package has also been be uploaded.

Rcpp has become the most popular way of enhancing GNU R with C++ code. As of today, 345 packages on CRAN depend on Rcpp for making analyses go faster and further; BioConductor adds another 41 packages, and casual searches on GitHub suggests dozens mores.

This release continues the 0.11.* release cycle, adding another large number of small bug fixes, polishes and enhancements. Since the previous release in January, we incorporated a number of pull requests and changes from several contributors. This time, JJ deserves a special mention as he is responsible for a metric ton of the changes listed below, making Rcpp Attributes even more awesome. As always, you can follow the development via the GitHub repo and particularly the Issue tickets and Pull Requests. And any discussions, questions, ... regarding Rcpp are always welcome at the rcpp-devel mailing list.

See below for a detailed list of changes extracted from the NEWS file.

Changes in Rcpp version 0.11.5 (2015-03-04)

  • Changes in Rcpp API:

    • An error handler for tinyformat was defined to prevent the assert() macro from spilling.

    • The Rcpp::warning function was added as a wrapper for Rf_warning.

    • The XPtr class was extended with new checked_get and release functions as well as improved behavior (throw an exception rather than crash) when a NULL external pointer is dereferenced.

    • R code is evaluated within an R_toplevelExec block to prevent user interrupts from bypassing C++ destructors on the stack.

    • The Rcpp::Environment constructor can now use a supplied parent environment.

    • The Rcpp::Function constructor can now use a supplied environment or namespace.

    • The attributes_hidden macro from R is used to shield internal functions; the R_ext/Visibility.h header is now included as well.

    • A Rcpp::print function was added as a wrapper around Rf_PrintValue.

  • Changes in Rcpp Attributes:

    • The pkg_types.h file is now included in RcppExports.cpp if it is present in either the inst/include or src.

    • sourceCpp was modified to allow includes of local files (e.g. #include "foo.hpp"). Implementation files (.cc; .cpp) corresponding to local includes are also automatically built if they exist.

    • The generated attributes code was simplified with respect to RNGScope and now uses RObject and its destructor rather than SEXP protect/unprotect.

    • Support addition of the rng parameter in Rcpp::export to suppress the otherwise automatic inclusion of RNGScope in generated code.

    • Attributes code was made more robust and can e.g. no longer recurse.

    • Version 3.2 of the Rtools is now correctly detected as well.

    • Allow 'R' to come immediately after '***' for defining embedded R code chunks in sourceCpp.

    • The attributes vignette has been updated with documentation on new features added over the past several releases.

  • Changes in Rcpp tests:

    • On Travis CI, all build dependencies are installed as binary .deb packages resulting in faster tests.

  • Thanks to CRANberries, you can also look at a diff to the previous release As always, even fuller details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads page, the browseable doxygen docs and zip files of doxygen output for the standard formats. A local directory has source and documentation too. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

    March 06, 2015 11:26 AM

    March 01, 2015

    Dirk Eddelbuettel

    drat 0.0.2: Improved Support for Lightweight R Repositories

    A few weeks ago we introduced the drat package. Its name stands for drat R Archive Template, and it helps with easy-to-create and easy-to-use repositories for R packages. Two early blog posts describe drat: First Steps Towards Lightweight Repositories, and Publishing a Package.

    A new version 0.0.2 is now on CRAN. It adds several new features:

    • beginnings of native git support via the excellent new git2r package,
    • a new helper function to prune a repo of older versions of packages (as R repositories only show the newest release of a package),
    • improved core functionality in inserting a package, and adding a repo.

    Courtesy of CRANberries, there is a comparison to the previous release. More detailed information is on the drat page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

    March 01, 2015 03:39 PM

    February 27, 2015


    R: How to Layout and Design an Infographic

    As promised from my recent article, here's my tutorial on how to layout and design an infographic in R. This article will serve as a template for more infographic design that I plan to share on future posts. Hence, we will go through the following sections:
    1. Layout - mainly handles by grid package.
    2. Design - style of the elements in the layout.
      • Texts - use extrafont package for custom fonts;
      • Shapes (lines and point characters) - use grid, although this package has been removed from CRAN (as of February 26, 2015), the compressed file of the source code of the package is still available. But if I am not mistaken, by default this package is included in R. You might check it first before installing.
      • Plots - several choices for plotting data in R: base plot, lattice, or ggplot2 package.

    The Infographic

    We aim to obtain the following layout and design in the final output of our code:
    To start with, we need to setup our data first. And for illustration purposes, we will use a simulated data:

    Design: Colour

    The aesthetic of an infographic not only depends on the shapes and plots, but also on the colours. So if you are not an artist, I suggest to look first for a list of sample infographics to get some inspiration. Once you have found the theme for your chart, grab the colour of it. To grab the colour, use eyedropper tool from software such as photoshop, affinity designer, etc. There is also free add ons for Mozilla Firefox called ColorZilla, I haven't tried it but maybe you could explore that. For the above theme, there are five colours with the following hexadecimal colour code:

    Colour NameHexadecimal
    Table 1: Colours Used in the Chart.
    Dark Violet#552683
    Dark Yellow#E7A922
    Gray (Infographic Text)#A9A8A7
    Dark Yellow (Crime Text)#CA8B01

    Design: Data Visualization

    At this point, we'll prepare the elements in the layout, and we begin with the plots. Below is the bar plot of y1 in the data frame, dat, in three groupings, grp. Note that the plot you'll obtain will not be the same with the one below since the data changes every time we run the simulation above.
    So that's the default theme of ggplot2, and we want to customize this using the theme function. One of the elements in the plot that will be tweaked is the font. To deal with this we need to import the fonts using the extrafont package. That is,

    What happens above is that all fonts installed in your machine will be imported. It's better to import all of it so that we'll have several choices to play on. For the above infographic, the font used is called Impact, which is available on windows and I think on mac as well. If you don't have that, then download and install it first before running the above codes. To arrive on the design of the bar plot in the infographic we use the following theme,

    I named it kobe_theme since if you recall from my previous article, the above chart is inspired by Kobe Bryant Infographic. So applying this to the plot we'll have the following,
    Obtain by running p1 + kobe_theme(). If in case you want to reorder the ticks in the x-axis, by starting with A from the top and ending with L in the bottom, simply run the following,

    And you'll have
    So that's our first plot, next is to plot y2 from dat data frame, this time using the line plot.
    Obtain by running the following code:

    Applying kobe_theme, will give us
    Above plot is generated by running p2 + kobe_theme(). We should expect this since the kobe_theme that was applied in the bar plot with coord_flip option enabled, affects the orientation of the grids. So instead, we do a little tweak on the current theme, and see for yourself the difference:

    So that we have the following result:
    Now that's better, one more issue is the title label for the legend. To change the label, run the following code:

    And that will give us
    Finally, y3 variable is plotted using the following codes:

    by default we have,
    Applying kobe_theme2(),


    All plots are now set, next is to place it in the layout. The following steps explain the procedure:
    1. Start by creating new grid plot, grid.newpage();
    2. Next define the layout of the grid. Think of this as a matrix of plots, where a 2 by 2 matrix plot will give us 4 windows (two rows and two columns). These windows will serve as a placeholder of the plots. So to achieve a matrix plot with 4 rows and 3 columns, we run

    3. Next is the background colour, this will be the background colour of the infographic. For the given chart, we run the following:

    4. Next is to insert texts in the layout, use the grid.text function. The position of objects/elements such as texts in the grid is defined by the (x, y) coordinates. The bound of the grid by default is a unit square, of course the aspect ratio of the square can be modified. So the support of x and y is $[0,1]^2$;
    5. To insert the plot into a specific window in the matrix plot use the vplayout function for the coordinates of the placeholder, and print for pasting. Say we want to insert the first plot in first row, second column, we code it this way

      Now to place it in first row and stretched it over all (three) columns, run

    Using the above procedure, we have the following codes for the infographic. Enjoy!

    PNG Output

    PDF Output


    1. ggplot2 Documentation.
    2. Cookbook for R.

    by Al-Ahmadgaid Asaad ( at February 27, 2015 11:58 AM

    February 25, 2015


    Philippine Infographic: Recapitulation on Incidents Involving Motorcycle Riding in Tandem Criminals for 2011-2013

    The Philippine government has launched Open Data Philippines ( last year, January 16, 2014. Accordingly, the aims to make national government data searchable, accessible, and useful, with the help of the different agencies of government, and with the participation of the public. This website consolidates the data sets of different government agencies, allowing users to find specific information from a rich and continuously growing collection of public data sets. provides information on how to access these datasets and tools, such infographics and other applications, to make the information easy to understand. Users may not only view the datasets, but also share and download them as spreadsheets and other formats, for their own use.

    The primary goal of is to foster a citizenry empowered to make informed decisions, and to promote efficiency and transparency in government. For more, check out the video:

    Although admittedly I accidentally discovered this few weeks ago, but still good news for me. I mean I've been frustrated about our government data since college, it was difficult to do case study and research about crimes, rainfall, and other interesting variables to model due to the lack of data available online. With the launch of Open Data Philippines, and for believing that data can improve our country, it's a win win for me. So as a first exploitation on it. I decided to use the data from Philippine National Police (PNP) agency about the incidents involving motorcycle riding in tandem criminals. Check my first infographic below,

    The Infographic (PNG)

    PDF Version

    What software did I use for creating this infographic?

    Well I designed it entirely using R, with the help of ggplot2, grid, and extrafont packages. The above infographic is inspired by Kobe Bryant Infographic. The hexadecimal color codes from the said chart were extracted using the eyedropper tool from Affinity Designer.

    I will not share any code in this post, but will do a tutorial on how to create one. So be notified by subscribing.

    by Al-Ahmadgaid Asaad ( at February 25, 2015 10:02 AM

    February 24, 2015

    Douglas Bates

    RCall: Running an embedded R in Julia

    I have used R (and S before it) for a couple of decades. In the last few years most of my coding has been in Julia, a language for technical computing that can provide remarkable performance for a dynamically typed language via Just-In-Time (JIT) compilation of functions and via multiple dispatch.

    Nonetheless there are facilities in R that I would like to have access to from Julia. I created the RCall package for Julia to do exactly that. This IJulia notebook provides an introduction to RCall.

    This is not a novel idea by any means. Julia already has PyCall and JavaCall packages that provide access to Python and to Java. These packages are used extensively and are much more sophisticated than RCall, at present. Many other languages have facilities to run an embedded instance of R. In fact, Python has several such interfaces.

    The things I plan to do using RCall is to access datasets from R and R packages, to fit models that are not currently implemented in Julia and to use R graphics, especially the ggplot2 and lattice packages. Unfortunately I am not currently able to start a graphics device from the embedded R but I expect that to be fixed soon.

    I can tell you the most remarkable aspect of RCall although it may not mean much if you haven't tried to do this kind of thing. It is written entirely in Julia. There is absolutely no "glue" code written in a compiled language like C or C++. As I said, this may not mean much to you unless you have tried to do something like this, in which case it is astonishing.

    by Douglas Bates ( at February 24, 2015 11:05 PM

    February 22, 2015


    Python: Getting Started with Data Analysis

    Analysis with Programming has recently been syndicated to Planet Python. And as a first post being a contributing blog on the said site, I would like to share how to get started with data analysis on Python. Specifically, I would like to do the following:
    1. Importing the data
      • Importing CSV file both locally and from the web;
    2. Data transformation;
    3. Descriptive statistics of the data;
    4. Hypothesis testing
      • One-sample t test;
    5. Visualization; and
    6. Creating custom function.

    Importing the data

    This is the crucial step, we need to import the data in order to proceed with the succeeding analysis. And often times data are in CSV format, if not, at least can be converted to CSV format. In Python we can do this using the following codes:

    To read CSV file locally, we need the pandas module which is a python data analysis library. The read_csv function can read data both locally and from the web.

    Data transformation

    Now that we have the data in the workspace, next is to do transformation. Statisticians and scientists often do this step to remove unnecessary data not included in the analysis. Let's view the data first:

    To R programmers, above is the equivalent of print(head(df)) which prints the first six rows of the data, and print(tail(df)) -- the last six rows of the data, respectively. In Python, however, the number of rows for head of the data by default is 5 unlike in R, which is 6. So that the equivalent of the R code head(df, n = 10) in Python, is df.head(n = 10). Same goes for the tail of the data.

    Column and row names of the data are extracted using the colnames and rownames functions in R, respectively. In Python, we extract it using the columns and index attributes. That is,

    Transposing the data is obtain using the T method,

    Other transformations such as sort can be done using sort attribute. Now let's extract a specific column. In Python, we do it using either iloc or ix attributes, but ix is more robust and thus I prefer it. Assuming we want the head of the first column of the data, we have

    By the way, the indexing in Python starts with 0 and not 1. To slice the index and first three columns of the 11th to 21st rows, run the following

    Which is equivalent to print df.ix[10:20, ['Abra', 'Apayao', 'Benguet']]

    To drop a column in the data, say columns 1 (Apayao) and 2 (Benguet), use the drop attribute. That is,

    axis argument above tells the function to drop with respect to columns, if axis = 0, then the function drops with respect to rows.

    Descriptive Statistics

    Next step is to do descriptive statistics for preliminary analysis of our data using the describe attribute:

    Hypothesis Testing

    Python has a great package for statistical inference. And that's the stats library of scipy. The one sample t-test is implemented in ttest_1samp function. So that, if we want to test the mean of the Abra's volume of palay production against the null hypothesis with 15000 assumed population mean of the volume of palay production, we have

    The values returned are tuple of the following values:
    • t : float or array
    • prob : float or array
          two-tailed p-value
    From the above numerical output, we see that the p-value = 0.2627 is greater than $\alpha=0.05$, hence there is no sufficient evidence to conclude that the average volume of palay production is not equal to 15000. Applying this test for all variables against the population mean 15000 volume of production, we have

    The first array returned is the t-statistic of the data, and the second array is the corresponding p-values.


    There are several module for visualization in Python, and the most popular one is the matplotlib library. To mention few, we have bokeh and seaborn modules as well to choose from. In my previous post, I've demonstrated the matplotlib package which has the following graphic for box-whisker plot,
    Now plotting using pandas module can beautify the above plot into the theme of the popular R plotting package, the ggplot. To use the ggplot theme just add one more line to the above code,

    And you'll have the following,
    Even neater than the default matplotlib.pyplot theme. But in this post, I would like to introduce the seaborn module which is a statistical data visualization library. So that, we have the following
    Sexy boxplot, scroll down for more.

    Creating custom function

    To define a custom function in Python, we use the def function. For example, say we define a function that will add two numbers, we do it as follows,

    By the way, in Python indentation is important. Use indentation for scope of the function, which in R we do it with braces {...}. Now here's an algorithm from my previous post,
    1. Generate samples of size 10 from Normal distribution with $\mu$ = 3 and $\sigma^2$ = 5;
    2. Compute the $\bar{x}$ and $\bar{x}\mp z_{\alpha/2}\displaystyle\frac{\sigma}{\sqrt{n}}$ using the 95% confidence level;
    3. Repeat the process 100 times; then
    4. Compute the percentage of the confidence intervals containing the true mean.
    Coding this in Python we have,

    Above code might be easy to read, but it's slow in replication. Below is the improvement of the above code, thanks to Python gurus, see comments on my previous post.


    For those who are interested in the ipython notebook of this article, please click here. This article was converted to ipython notebook by of Nuttens Claude.

    Data Source


    1. Pandas, Scipy, and Seaborn Documentations.
    2. Wes McKinney & PyData Development Team (2014). pandas: powerful Python data analysis toolkit.

    by Al-Ahmadgaid Asaad ( at February 22, 2015 12:35 PM

    February 04, 2015

    Bioconductor Project Working Papers

    Simulation of Semicompeting Risk Survival Data and Estimation Based on Multistate Frailty Model

    We develop a simulation procedure to simulate the semicompeting risk survival data. In addition, we introduce an EM algorithm and a B–spline based estimation procedure to evaluate and implement Xu et al. (2010)’s nonparametric likelihood es- timation approach. The simulation procedure provides a route to simulate samples from the likelihood introduced in Xu et al. (2010)’s. Further, the EM algorithm and the B–spline methods stabilize the estimation and gives accurate estimation results. We illustrate the simulation and the estimation procedure with simluation examples and real data analysis.

    by Fei Jiang et al. at February 04, 2015 07:57 PM

    January 16, 2015

    Modern Toolmaking


    My package caretEnsemble, for making ensembles of caret models, is now on CRAN.

    Check it out, and let me know what you think! (Submit bug reports and feature requests to the issue tracker)

    by Zachary Deane-Mayer ( at January 16, 2015 10:22 PM

    January 15, 2015

    Gregor Gorjanc

    cpumemlog: Monitor CPU and RAM usage of a process (and its children)

    Long time no see ...

    Today I pushed the cpumemlog script to GitHub Read more about this useful utility at the GitHub site.

    by Gregor Gorjanc ( at January 15, 2015 10:16 PM

    January 01, 2015

    Journal of the Royal Statistical Society: Series B

    Journal of the Royal Statistical Society: Series A

    December 27, 2014

    RCpp Gallery

    Create an R-tree data structure using Rcpp and Boost::Geometry


    The purpose of this post is to show how to use Boost::Geometry library which was introduced recently in Rcpp. Especially, we focus on R-tree data structure for searching objects in space because only one spatial index is implemented - R-tree Currently in this library.

    Boost.Geometry which is part of the Boost C++ Libraries gives us algorithms for solving geometry problems. In this library, the Boost.Geometry.Index which is one of components is intended to gather data structures called spatial indexes which are often used to searching for objects in space quickly. Generally speaking, spatial indexes stores geometric objects’ representations and allows searching for objects occupying some space or close to some point in space.


    R-tree is a tree data structure used for spatial searching, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. R-tree was proposed by Antonin Guttman in 1984 as an expansion of B-tree for multi-dimensional data and plays significant role in both theoretical and applied contexts. It is only one spatial index implemented in Boost::Geometry.

    As a real application of this, It is often used to store spatial objects such as restaurant locations or the polygons which typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc in order to perform a spatial query like “Find all stations within 1 km of my current location”, “Let me know all road segments in 2 km of my location” or “find the nearest gas station” which we often ask google seach by your voice recenlty. In this way, the R-tree can be used (nearest neighbor) search for some places.

    You can find more explanations about R-tree in Wikipedia.

    Write a wrapper class of rtree in Boost::Geometry using Rcpp

    Now, we write a simple C++ wrapper class of rtree class in Boost::Geometry::Index that we can use in R.

    The most important feature to mention here is the use of Rcpp module to expose your own class to R. Although almost all classes in Boost library have a lot of functions, , you do not use all in many cases. In that case, you should write your wrapper class for making your code simple.

    Rcpp code

    // [[Rcpp::depends(BH)]]
    // Enable C++11 via this plugin to suppress 'long long' errors
    // [[Rcpp::plugins("cpp11")]]
    #include <vector>
    #include <Rcpp.h>
    #include <boost/geometry.hpp>
    #include <boost/geometry/geometries/point.hpp>
    #include <boost/geometry/geometries/box.hpp>
    #include <boost/geometry/index/rtree.hpp>
    using namespace Rcpp;
    // Mnemonics
    namespace bg = boost::geometry;
    namespace bgi = boost::geometry::index;
    typedef bg::model::point<float, 2, bg::cs::cartesian> point_t;
    typedef bg::model::box<point_t> box;
    typedef std::pair<box, unsigned int> value_t;
    class RTreeCpp {
      // Constructor.
      // You have to give spatial data as a data frame.
      RTreeCpp(DataFrame df) {
          int size = df.nrows();
          NumericVector id   = df[0]; 
          NumericVector bl_x = df[1]; 
          NumericVector bl_y = df[2]; 
          NumericVector ur_x = df[3]; 
          NumericVector ur_y = df[4]; 
          for(int i = 0; i < size; ++i) {        
              // create a box
              box b(point_t(bl_x[i], bl_y[i]), point_t(ur_x[i], ur_y[i]));
              // insert new value
              rtree_.insert(std::make_pair(b, static_cast<unsigned int>(id[i])));   
      // This method(query) is k-nearest neighbor search. 
      // It returns some number of values nearest to some point(point argument) in space.
      std::vector<int> knn(NumericVector point, unsigned int n) {
          std::vector<value_t> result_n;
          rtree_.query(bgi::nearest(point_t(point[0], point[1]), n), std::back_inserter(result_n));    
          std::vector<int> indexes;
          std::vector<value_t>::iterator itr;
          for ( itr = result_n.begin(); itr != result_n.end(); ++itr ) {
              value_t value = *itr;
              indexes.push_back( value.second );
          return indexes;
        // R-tree can be created using various algorithm and parameters
        // You can change the algorithm by template parameter. 
        // In this example we use quadratic algorithm. 
        // Maximum number of elements in nodes in R-tree is set to 16.
        bgi::rtree<value_t, bgi::quadratic<16> > rtree_;
    // [[Rcpp::export]]
    std::vector<int> showKNN(Rcpp::DataFrame df, NumericVector point, unsigned int n) {
      RTreeCpp tree(df);            // recreate tree each time
      return tree.knn(point, n);

    R code using RTreeCpp

    First, we create a sample data set of spatial data.

    # Sample spatial data(boxes)
    points <- data.frame(
        bl_x=c(0, 2, 4), 
        bl_y=c(0, 2, 4), 
        ur_x=c(1, 3, 5), 
        ur_y=c(1, 3, 5))
     * To visually the data, we use the following code: 
    size <- nrow(points)
    #colors for rectangle area
    colors <- rainbow(size)
    #Plot these points
    plot(c(0, 5), c(0, 5), type= "n", xlab="", ylab="")
    for(i in 1:size){
        rect(points[i, "bl_x"], points[i, "bl_y"], points[i, "ur_x"], points[i, "ur_y"], col=colors[i])  
    legend(4, 2, legend=points$id, fill=colors)

    plot of chunk unnamed-chunk-5

    One can use the RTreeCpp class as follows:

    # new RTreeCpp object
    # Search nearest neighbor points(return value : id of points data)
    showKNN(points, c(0, 0), 1)
    [1] 0
    showKNN(points, c(0, 0), 2)
    [1] 1 0
    showKNN(points, c(0, 0), 3)
    [1] 2 1 0

    Note the re-creation of the RTreeCpp object is of course inefficient, but the Rcpp Gallery imposes some constraints on how we present code. For actual application a stateful and persistent object would be created. This could be done via Rcpp Modules as well a number of different ways. Here, however, we need to recreate the object for each call as knitr (which is used behind the scenes) cannot persist objects between code chunks. This is simply a technical limitation of the Rcpp Gallery—but not of Rcpp itself.

    December 27, 2014 12:00 AM

    December 15, 2014

    R you ready?

    QQ-plots in R vs. SPSS – A look at the differences


    We teach two software packages, R and SPSS, in Quantitative Methods 101 for psychology freshman at Bremen University (Germany). Sometimes confusion arises, when the software packages produce different results. This may be due to specifics in the implemention of a method or, as in most cases, to different default settings. One of these situations occurs when the QQ-plot is introduced. Below we see two QQ-plots, produced by SPSS and R, respectively. The data used in the plots were generated by:

    x <- sample(0:9, 100, rep=T)    


    QQ-plot in SPSS using Blom's method


    qqnorm(x, datax=T)      # uses Blom's method by default
    qqline(x, datax=T)

    There are some obvious differences:

    1. The most obvious one is that the R plot seems to contain more data points than the SPSS plot. Actually, this is not the case. Some data points are plotted on top of each in SPSS while they are spread out vertically in the R plot. The reason for this difference is that SPSS uses a different approach assigning probabilities to the values. We will expore the two approaches below.
    2. The scaling of the y-axis differs. R uses quantiles from the standard normal distribution. SPSS by default rescales these values using the mean and standard deviation from the original data. This allows to directly compare the original and theoretical values. This is a simple linear transformation and will not be explained any further here.
    3. The QQ-lines are not identical. R uses the 1st and 3rd quartile from both distributions to draw the line. This is different in SPSS where of a line is drawn for identical values on both axes. We will expore the differences below.

    QQ-plots from scratch

    To get a better understanding of the difference we will build the R and SPSS-flavored QQ-plot from scratch.

    R type

    In order to calculate theoretical quantiles corresponding to the observed values, we first need to find a way to assign a probability to each value of the original data. A lot of different approaches exist for this purpose (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012b). They usually build on the ranks of the observed data points to calculate corresponding p-values, i.e. the plotting positions for each point. The qqnorm function uses two formulae for this purpose, depending on the number of observations n (Blom’s mfethod, see ?qqnorm; Blom, 1958). With r being the rank, for n > 10 it will use the formula p = (r - 1/2) / n, for n \leq 10 the formula p = (r - 3/8) / (n + 1/4) to determine the probability value p for each observation (see the help files for the functions qqnorm and ppoint). For simplicity reasons, we will only implement the n > 10 case here.

    n <- length(x)          # number of observations
    r <- order(order(x))    # order of values, i.e. ranks without averaged ties
    p <- (r - 1/2) / n      # assign to ranks using Blom's method
    y <- qnorm(p)           # theoretical standard normal quantiles for p values
    plot(x, y)              # plot empirical against theoretical values

    Before we take at look at the code, note that our plot is identical to the plot generated by qqnorm above, except that the QQ-line is missing. The main point that makes the difference between R and SPSS is found in the command order(order(x)). The command calculates ranks for the observations using ordinal ranking. This means that all observations get different ranks and no average ranks are calculated for ties, i.e. for observations with equal values. Another approach would be to apply fractional ranking and calculate average values for ties. This is what the function rank does. The following codes shows the difference between the two approaches to assign ranks.

    v <- c(1,1,2,3,3)
    order(order(v))     # ordinal ranking used by R
    ## [1] 1 2 3 4 5
    rank(v)             # fractional ranking used by SPSS
    ## [1] 1.5 1.5 3.0 4.5 4.5

    R uses ordinal ranking and SPSS uses fractional ranking by default to assign ranks to values. Thus, the positions do not overlap in R as each ordered observation is assigned a different rank and therefore a different p-value. We will pick up the second approach again later, when we reproduce the SPSS-flavored plot in R.1

    The second difference between the plots concerned the scaling of the y-axis and was already clarified above.

    The last point to understand is how the QQ-line is drawn in R. Looking at the probs argument of qqline reveals that it uses the 1st and 3rd quartile of the original data and theoretical distribution to determine the reference points for the line. We will draw the line between the quartiles in red and overlay it with the line produced by qqline to see if our code is correct.

    plot(x, y)                      # plot empirical against theoretical values
    ps <- c(.25, .75)               # reference probabilities
    a <- quantile(x, ps)            # empirical quantiles
    b <- qnorm(ps)                  # theoretical quantiles
    lines(a, b, lwd=4, col="red")   # our QQ line in red
    qqline(x, datax=T)              # R QQ line

    The reason for different lines in R and SPSS is that several approaches to fitting a straight line exist (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012a). Each approach has different advantages. The method used by R is more robust when we expect values to diverge from normality in the tails, and we are primarily interested in the normality of the middle range of our data. In other words, the method of fitting an adequate QQ-line depends on the purpose of the plot. An explanation of the rationale of the R approach can e.g. be found here.

    SPSS type

    The default SPSS approach also uses Blom’s method to assign probabilities to ranks (you may choose other methods is SPSS) and differs from the one above in the following aspects:

    • a) As already mentioned, SPSS uses ranks with averaged ties (fractional rankings) not the plain order ranks (ordinal ranking) as in R to derive the corresponding probabilities for each data point. The rest of the code is identical to the one above, though I am not sure if SPSS distinguishes between the n  10 case.
    • b) The theoretical quantiles are scaled to match the estimated mean and standard deviation of the original data.
    • c) The QQ-line goes through all quantiles with identical values on the x and y axis.
    n <- length(x)                # number of observations
    r <- rank(x)                  # a) ranks using fractional ranking (averaging ties)
    p <- (r - 1/2) / n            # assign to ranks using Blom's method
    y <- qnorm(p)                 # theoretical standard normal quantiles for p values
    y <- y * sd(x) + mean(x)      # b) transform SND quantiles to mean and sd from original data
    plot(x, y)                    # plot empirical against theoretical values

    Lastly, let us add the line. As the scaling of both axes is the same, the line goes through the origin with a slope of 1.

    abline(0,1)                   # c) slope 0 through origin

    The comparison to the SPSS output shows that they are (visually) identical.

    Function for SPSS-type QQ-plot

    The whole point of this demonstration was to pinpoint and explain the differences between a QQ-plot generated in R and SPSS, so it will no longer be a reason for confusion. Note, however, that SPSS offers a whole range of options to generate the plot. For example, you can select the method to assign probabilities to ranks and decide how to treat ties. The plots above used the default setting (Blom’s method and averaging across ties). Personally I like the SPSS version. That is why I implemented the function qqnorm_spss in the ryouready package, that accompanies the course. The formulae for the different methods to assign probabilities to ranks can be found in Castillo-Gutiérrez et al. (2012b). The implentation is a preliminary version that has not yet been thoroughly tested. You can find the code here. Please report any bugs or suggestions for improvements (which are very welcome) in the github issues section.

    install_github("markheckmann/ryouready")                # install from github repo
    library(ryouready)                                      # load package
    qq <- qqnorm_spss(x, method=1, ties.method="average")   # Blom's method with averaged ties
    plot(qq)                                                # generate QQ-plot
    ggplot(qq)                                              # use ggplot2 to generate QQ-plot


    1. Technical sidenote: Internally, qqnorm uses the function ppoints to generate the p-values. Type in stats:::qqnorm.default to the console to have a look at the code. 

    by markheckmann at December 15, 2014 08:55 AM

    October 23, 2014

    RCpp Gallery

    Sampling Importance Resampling (SIR) and social revolution.


    The purpose of this gallery post is several fold:

    • to demonstrate the use of the new and improved C++-level implementation of R’s sample() function (see here)
    • to demonstrate the Gallery’s new support for images in contributed posts
    • to demonstrate the usefulness of SIR for updating posterior beliefs given a sample from an arbitrary prior distribution

    Application: Foreign Threats and Social Revolution

    The application in this post uses an example from Jackman’s Bayesian Analysis for the Social Sciences (page 72) which now has a 30-year history in the Political Science (See Jackman for more references). The focus is on the extent to which the probability of revolution varies with facing a foreign threat or not. Facing a foreign threat is measured by “defeated …” or “not defeated …” over a span of 20 years. The countries come from in Latin America. During this period of time, there are only three revolutions: Bolivia (1952), Mexico (1910), and Nicaragua (1979).

      Revolution No Revolution
    Defeated and invaded or lost territory 1 7
    Not defeated for 20 years 2 74

    The goal is to learn about the true, unobservable probabilities of revolution given a recent defeat or the absence of one. That is, we care about


    And, beyond that, we care about whether and differ.

    These data are assumed to arise from a Binomial process, where the likelihood of the probability parameter value, , is

    where is the total number of revolutions and non-revolutions and is the number of revolutions. The MLE for this model is just the sample proportion, so a Frequentist statistician would be wondering whether was sufficiently larger than to be unlikely to have happened by chance alone (given the null hypothesis that the two proportions were identical).

    A Bayesian statistician could approach the question a bit more directly and compute the probability that To do this, we first need samples from the posterior distribution of and . In this post, we will get these samples via Sampling Importance Resampling.

    Sampling Importance Resampling

    Sampling Importance Resampling allows us to sample from the posterior distribution, where

    by resampling from a series of draws from the prior, . Denote one of those draws from the prior distribution, , as . Then draw from the prior sample is drawn with replacement into the posterior sample with probability

    Generating Samples from the Prior Distributions

    We begin by drawing many samples from a series of prior distributions. Although using a prior Beta prior distribution on the parameter admits a closed-form solution, the point here is to demonstrate a simulation based approach. On the other hand, a Gamma prior distribution over is very much not conjugate and simulation is the best approach.

    In particular, we will consider our posterior beliefs about the different in probabilities under five different prior distributions.

    dfPriorInfo <- data.frame(id = 1:5,
                              dist = c("beta", "beta", "gamma", "beta", "beta"),
                              par1 = c(1, 1, 3, 10, .5),
                              par2 = c(1, 5, 20, 10, .5),
                              stringsAsFactors = FALSE)
      id  dist par1 par2
    1  1  beta  1.0  1.0
    2  2  beta  1.0  5.0
    3  3 gamma  3.0 20.0
    4  4  beta 10.0 10.0
    5  5  beta  0.5  0.5

    Using the data frame dfPriorInfo and the plyr package, we will draw a total of 20,000 values from each of the prior distributions. This can be done in any number of ways and is completely independent of using Rcpp for the SIR magic.

    MC1 <- 20000
    dfPriors <- ddply(dfPriorInfo, "id",
                      .fun = (function(X) data.frame(draws = ("r", X$dist, sep = ""),
                                                                      list(MC1, X$par1, X$par2))))))

    However, we can confirm that our draws are as we expect and that we have the right number of them (5 * 20k = 100k).

      id     draws
    1  1 0.7124225
    2  1 0.5910231
    3  1 0.0595327
    4  1 0.4718945
    5  1 0.4485650
    6  1 0.0431667
    [1] 100000      2

    Re-Sampling from the Prior

    Now, we write a C++ snippet that will create our R-level function to generate a sample of D values from the prior draws (prdraws) given their likelihood after the data (i.e., number of success – nsucc, number of failures – nfail).

    The most important feature to mention here is the use of some new and improved extensions which effectively provide an equivalent, performant mirror of R’s sample() function at the C++-level. Note that you need the RcppArmadillo 0.4.500.0 or newer for this version of sample().

    The return value of this function is a length D vector of draws from the posterior distribution given the draws from the prior distribution where the likelihood is used as a filtering weight.

    # include <RcppArmadilloExtensions/sample.h>
    # include <RcppArmadilloExtensions/fixprob.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    using namespace Rcpp ;
    // [[Rcpp::export()]]
    NumericVector samplePost (const NumericVector prdraws,
                              const int D,
                              const int nsucc,
                              const int nfail) {
        int N = prdraws.size();
        NumericVector wts(N);
        for (int n = 0 ; n < N ; n++) {
            wts(n) = pow(prdraws(n), nsucc) * pow(1 - prdraws(n), nfail);
        RcppArmadillo::FixProb(wts, N, true);
        NumericVector podraws = RcppArmadillo::sample(prdraws, D, true, wts);

    To use the samplePost() function, we create the R representation of the data as follows.

    nS <- c(1, 2) # successes
    nF <- c(7, 74) # failures

    As a simple example, consider drawing a posterior sample of size 30 for the “defeated case” from discrete prior distribution with equal weight on the values of .125 (the MLE), .127, and .8. We see there is a mixture of .125 and .127 values, but no .8 values. values of .8 were simply to unlikely (given the likelihood) to be resampled from the prior.

    table(samplePost(c(.125, .127, .8), 30, nS[1], nF[1]))
    0.125 0.127 
        9    21 

    Again making use of the plyr package, we construct samples of size 20,000 for both and under each of the 5 prior distribution samples. These posterior draws are stored in the data frame dfPost.

    MC2 <- 20000
    f1 <- function(X) {
        draws <- X$draws
        t1 <- samplePost(draws, MC2, nS[1], nF[1])
        t2 <- samplePost(draws, MC2, nS[2], nF[2])
        return(data.frame(theta1 = t1, theta2 = t2))
    dfPost <- ddply(dfPriors, "id", f1)
      id    theta1    theta2
    1  1 0.3067334 0.0130865
    2  1 0.1421879 0.0420830
    3  1 0.3218130 0.0634511
    4  1 0.0739756 0.0363466
    5  1 0.1065267 0.0460336
    6  1 0.0961749 0.0440790
    [1] 100000      3

    Summarizing Posterior Inferences

    Here, we are visualizing the posterior draws for the quantity of interest — the difference in probabilities of revolution. These posterior draws are grouped according to the prior distribution used. A test of whether revolution is more likely given a foreign threat is operationalized by the probability that is positive. This probability for each distribution is shown in white. For all choices of the prior here, the probability that “foreign threat matters” exceeds .90.

    The full posterior distribution of is shown for each of the five priors in blue. A solid, white vertical band indicates “no effect”. In all cases. the majority of the mass is clearly to the right of this band.

    Recall that the priors are, themselves, over the individual revolution probabilities, and . The general shape of each of these prior distributions of the parameter is shown in a grey box by the white line. For example, is actually a uniform distribution over the parameter space, . On the other hand, has most of its mass at the two tails.

    plot of chunk unnamed-chunk-9

    At least across these specifications of the prior distributions on , the conclusion that “foreign threats matter” finds a good deal of support. What is interesting about this application is that despite these distributions over the difference in probabilities, the p-value associated with Fisher’s Exact Test for 2 x 2 tables is just .262.

    October 23, 2014 12:00 AM

    October 20, 2014

    Modern Toolmaking

    For faster R on a mac, use veclib

    Update: The links to all my github gists on blogger are broken, and I can't figure out how to fix them.  If you know how to insert gitub gists on a dynamic blogger template, please let me known.

    In the meantime, here are instructions with links to the code:
    First of all, use homebrew to compile openblas.  It's easy!  Second of all, you can also use homebrew to install R! (But maybe stick with the CRAN version unless you really want to compile your own R binary)

    To use openblas with R, follow these instructions:

    To use veclib with R, follow these intructions:


    Inspired by this post, I decided to try using OpenBLAS for R on my mac.  However, it turns out there's a simpler option, using the vecLib BLAS library, which is provided by Apple as part of the accelerate framework.

    If you are using R 2.15, follow these instructions to change your BLAS from the default to vecLib:

    However, as noted in r-sig-mac, these instructions do not work for R 3.0.  You have to directly link to the accelerate framework's version of vecLib:

    Finally, test your new blas using this script:

    On my system (a retina macbook pro), the default BLAS takes 141 seconds and vecLib takes 43 seconds, which is a significant speedup.  If you plan to use vecLib, note the following warning from the R development team "Although fast, it is not under our control and may possibly deliver inaccurate results."

    So far, I have not encountered any issues using vecLib, but it's only been a few hours :-).

    UPDATE: you can also install OpenBLAS on a mac:

    If you do this, make sure to change the directories to point to the correct location on your system  (e.g. change /users/zach/source to whatever directory you clone the git repo into).  On my system, the benchmark script takes ~41 seconds when using openBLAS, which is a small but significant speedup.

    by Zachary Deane-Mayer ( at October 20, 2014 04:24 PM

    October 01, 2014

    RCpp Gallery

    Implementing an EM Algorithm for Probit Regressions

    Users new to the Rcpp family of functionality are often impressed with the performance gains that can be realized, but struggle to see how to approach their own computational problems. Many of the most impressive performance gains are demonstrated with seemingly advanced statistical methods, advanced C++–related constructs, or both. Even when users are able to understand how various demonstrated features operate in isolation, examples may implement too many at once to seem accessible.

    The point of this Gallery article is to offer an example application that performs well (thanks to the Rcpp family) but has reduced statistical and programming overhead for some users. In addition, rather than simply presenting the final product, the development process is explicitly documented and narrated.

    Motivating Example: Probit Regression

    As an example, we will consider estimating the parameters the standard Probit regression model given by

    where and are length vectors and the presence of an “intercept” term is absorbed into if desired.

    The analyst only has access to a censored version of , namely where the subscript denotes the th observation.

    As is common, the censoring is assumed to generate if and otherwise. When we assume , the problem is just the Probit regression model loved by all.

    To make this concrete, consider a model of voter turnout using the dataset provided by the Zelig R package.

       race age educate income vote
    1 white  60      14 3.3458    1
    2 white  51      10 1.8561    0
    3 white  24      12 0.6304    0
    4 white  38       8 3.4183    1
    5 white  25      12 2.7852    1
    6 white  67      12 2.3866    1
    [1] 2000    5

    Our goal will be to estimate the parameters associated with the variables income, educate, and age. Since there is nothing special about this dataset, standard methods work perfectly well.

    fit0 <- glm(vote ~ income + educate + age,
                data = turnout,
                family = binomial(link = "probit")
    Call:  glm(formula = vote ~ income + educate + age, family = binomial(link = "probit"), 
        data = turnout)
    (Intercept)       income      educate          age  
        -1.6824       0.0994       0.1067       0.0169  
    Degrees of Freedom: 1999 Total (i.e. Null);  1996 Residual
    Null Deviance:	    2270 
    Residual Deviance: 2030 	AIC: 2040

    Using fit0 as our baseline, the question is how can we recover these estimates with an Rcpp-based approach. One answer is implement the EM-algorithm in C++ snippets that can be processed into R-level functions; that’s what we will do. (Think of this as a Probit regression analog to the linear regression example — but with fewer features.)

    EM Algorithm: Intuition

    For those unfamiliar with the EM algorithm, consider the Wikipedia article and a denser set of Swarthmore lecture notes.

    The intuition behind this approach begins by noticing that if mother nature revealed the values, we would simply have a linear regression problem and focus on

    where the meaning of the matrix notation is assumed.

    Because mother nature is not so kind, we have to impute the values. For a given guess of , due to our distributional assumptions about we know that


    where .

    By iterating through these two steps we can eventually recover the desired parameter estimates:

    1. impute/augment values
    2. estimate given the data augmentation

    Our Implementations

    To demonstrate implementation of the EM algorithm for a Probit regression model using Rcpp-provided functionality we consider a series of steps.

    These are:

    1. Attempt 1: Main Structure
    2. Attempt 2: EM with Mistaken Augmentation
    3. Attempt 3: EM with Correct Augmentation
    4. Attempt 4: EM with Correct Augmentation in Parallel

    These steps are not chosen because each produces useful output (from the perspective of parameters estimation), but because they mirror milestones in a development process that benefits new users: only small changes are made at a time.

    To begin, we prepare our R-level data for passage to our eventual C++-based functions.

    mY <- matrix(turnout$vote)
    mX <- cbind(1,

    Attempt 1: Main Structure

    The first milestone will be to mock up a function em1 that is exported to create an R-level function of the same name. The key features here are that we have defined the function to - accept arguments corresponding to likely inputs - create containers for the to-be-computed values, - outline the main loop of the code for the EM iterations, and - return various values of interest in a list

    Users new to the Rcpp process will benefit from return List objects in the beginning. They allow you rapidly return new and different values to the R-level for inspection.

    # include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    using namespace Rcpp ;
    // [[Rcpp::export()]]
    List em1 (const arma::mat y,
              const arma::mat X,
              const int maxit = 10
              ) {
        // inputs
        const int N = y.n_rows ;
        const int K = X.n_cols ;
        // containers
        arma::mat beta(K, 1) ;
        beta.fill(0.0) ; // initialize betas to 1
        arma::mat eystar(N, 1) ;
        eystar.fill(0) ;
        // algorithm
        for (int it = 0 ; it < maxit ; it++) { // EM iterations
            // NEXT STEP: implement algorithm
        // returns
        List ret ;
        ret["N"] = N ;
        ret["K"] = K ;
        ret["beta"] = beta ;
        ret["eystar"] = eystar ;
        return(ret) ;

    We know that this code does not produce estimates of anything. Indeed, that is by design. Neither the beta nor eystar elements of the returned list are ever updated after they are initialized to 0.

    However, we can see that much of the administrative work for a working implementation is complete.

    fit1 <- em1(y = mY,
                X = mX,
                maxit = 20
    [1,]    0
    [2,]    0
    [3,]    0
    [4,]    0
    [1,]    0
    [2,]    0
    [3,]    0
    [4,]    0
    [5,]    0
    [6,]    0

    Having verified that input data structures and output data structures are “working” as expected, we turn to updating the values.

    Attempt 2: EM with Mistaken Augmentation

    Updates to the values are different depending on whether or . Rather than worrying about correctly imputing the unobserved propensities, we will use dummy values of 1 and -1 as placeholders. Instead, the focus is on building on the necessary conditional structure of the code and looping through the update step for every observation.

    Additionally, at the end of each imputation step (the E in EM) we update the estimate with the least squares estimate (the M in EM).

    # include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    using namespace Rcpp ;
    // [[Rcpp::export()]]
    List em2 (const arma::mat y,
              const arma::mat X,
              const int maxit = 10
              ) {
        // inputs
        const int N = y.n_rows ;
        const int K = X.n_cols ;
        // containers
        arma::mat beta(K, 1) ;
        beta.fill(0.0) ; // initialize betas to 0
        arma::mat eystar(N, 1) ;
        eystar.fill(0) ;
        // algorithm
        for (int it = 0 ; it < maxit ; it++) {
            arma::mat mu = X * beta ;
            // augmentation step
            for (int n = 0 ; n < N ; n++) {
                if (y(n, 0) == 1) { // y = 1
                    // NEXT STEP: fix augmentation
                    eystar(n, 0) = 1 ;
                if (y(n, 0) == 0) { // y = 0
                    // NEXT STEP: fix augmentation
                    eystar(n, 0) = -1 ;
            // maximization step
            beta = (X.t() * X).i() * X.t() * eystar ;
        // returns
        List ret ;
        ret["N"] = N ;
        ret["K"] = K ;
        ret["beta"] = beta ;
        ret["eystar"] = eystar ;
        return(ret) ;

    This code, like that in Attempt 1, is syntactically fine. But, as we know, the update step is very wrong. However, we can see that the updates are happening as we’d expect and we see non-zero returns for the beta element and the eystar element.

    fit2 <- em2(y = mY,
                X = mX,
                maxit = 20
    [1,] -0.816273
    [2,]  0.046065
    [3,]  0.059481
    [4,]  0.009085
    [1,]    1
    [2,]   -1
    [3,]   -1
    [4,]    1
    [5,]    1
    [6,]    1

    Attempt 3: EM with Correct Augmentation

    With the final logical structure of the code built out, we will now correct the data augmentation. Specifically, we replace the assignment of 1 and -1 with the expectation of the unobservable values . Rather than muddy our EM function (em3()) with further arithmetic, we sample call the C++ level functions f() and g() which were included prior to our definition of em3().

    But, since these are just utility functions needed internally by em3(), they are not tagged to be exported (via // [[Rcpp::export()]]) to the R level.

    As it stands, this is a correct implementation (although there is room for improvement).

    # include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    using namespace Rcpp ;
    double f (double mu) {
        double val = ((R::dnorm(-mu, 0, 1, false)) /
                      (1 - R::pnorm(-mu, 0, 1, true, false))
                      ) ;
        return(val) ;
    double g (double mu) {
        double val = ((R::dnorm(-mu, 0, 1, false)) /
                      (R::pnorm(-mu, 0, 1, true, false))
                      ) ;
        return(val) ;
    // [[Rcpp::export()]]
    List em3 (const arma::mat y,
              const arma::mat X,
              const int maxit = 10
              ) {
        // inputs
        const int N = y.n_rows ;
        const int K = X.n_cols ;
        // containers
        arma::mat beta(K, 1) ;
        beta.fill(0.0) ; // initialize betas to 0
        arma::mat eystar(N, 1) ;
        eystar.fill(0) ;
        // algorithm
        for (int it = 0 ; it < maxit ; it++) {
            arma::mat mu = X * beta ;
            // augmentation step
            // NEXT STEP: parallelize augmentation step
            for (int n = 0 ; n < N ; n++) {
                if (y(n, 0) == 1) { // y = 1
                    eystar(n, 0) = mu(n, 0) + f(mu(n, 0)) ;
                if (y(n, 0) == 0) { // y = 0
                    eystar(n, 0) = mu(n, 0) - g(mu(n, 0)) ;
            // maximization step
            beta = (X.t() * X).i() * X.t() * eystar ;
        // returns
        List ret ;
        ret["N"] = N ;
        ret["K"] = K ;
        ret["beta"] = beta ;
        ret["eystar"] = eystar ;
        return(ret) ;
    fit3 <- em3(y = mY,
                X = mX,
                maxit = 100
    [1,]  1.3910
    [2,] -0.6599
    [3,] -0.7743
    [4,]  0.8563
    [5,]  0.9160
    [6,]  1.2677

    Second, notice that this output is identical to the parameter estimates (the object fit0) from our R level call to the glm() function.

    [1,] -1.68241
    [2,]  0.09936
    [3,]  0.10667
    [4,]  0.01692
    Call:  glm(formula = vote ~ income + educate + age, family = binomial(link = "probit"), 
        data = turnout)
    (Intercept)       income      educate          age  
        -1.6824       0.0994       0.1067       0.0169  
    Degrees of Freedom: 1999 Total (i.e. Null);  1996 Residual
    Null Deviance:	    2270 
    Residual Deviance: 2030 	AIC: 2040

    Attempt 4: EM with Correct Augmentation in Parallel

    With a functional implementation complete as em3(), we know turn to the second order concern: performance. The time required to evaluate our function can be reduced from the perspective of a user sitting at a computer with idle cores.

    Although the small size of these data don’t necessitate parallelization, the E step is a natural candidate for being parallelized. Here, the parallelization relies on OpenMP. See here for other examples of combining Rcpp and OpenMP or here for a different approach.

    Sys.setenv("PKG_CXXFLAGS" = "-fopenmp")
    Sys.setenv("PKG_LIBS" = "-fopenmp")

    Aside from some additional compiler flags, the changes to our new implementation in em4() are minimal. They are:

    • include the additional header
    • mark the for loop for parallelization with a #pragma
    # include <RcppArmadillo.h>
    # include <omp.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    using namespace Rcpp ;
    double f (double mu) {
        double val = ((R::dnorm(-mu, 0, 1, false)) /
                      (1 - R::pnorm(-mu, 0, 1, true, false))
                      ) ;
        return(val) ;
    double g (double mu) {
        double val = ((R::dnorm(-mu, 0, 1, false)) /
                      (R::pnorm(-mu, 0, 1, true, false))
                      ) ;
        return(val) ;
    // [[Rcpp::export()]]
    List em4 (const arma::mat y,
              const arma::mat X,
              const int maxit = 10,
              const int nthr = 1
              ) {
        // inputs
        const int N = y.n_rows ;
        const int K = X.n_cols ;
        omp_set_num_threads(nthr) ;
        // containers
        arma::mat beta(K, 1) ;
        beta.fill(0.0) ; // initialize betas to 0
        arma::mat eystar(N, 1) ;
        eystar.fill(0) ;
        // algorithm
        for (int it = 0 ; it < maxit ; it++) {
            arma::mat mu = X * beta ;
            // augmentation step
    #pragma omp parallel for
            for (int n = 0 ; n < N ; n++) {
                if (y(n, 0) == 1) { // y = 1
                    eystar(n, 0) = mu(n, 0) + f(mu(n, 0)) ;
                if (y(n, 0) == 0) { // y = 0
                    eystar(n, 0) = mu(n, 0) - g(mu(n, 0)) ;
            // maximization step
            beta = (X.t() * X).i() * X.t() * eystar ;
        // returns
        List ret ;
        ret["N"] = N ;
        ret["K"] = K ;
        ret["beta"] = beta ;
        ret["eystar"] = eystar ;
        return(ret) ;

    This change should not (and does not) result in any change to the calculations being done. However, if our algorithm involved random number generation, great care would need to be taken to ensure our results were reproducible.

    fit4 <- em4(y = mY,
                X = mX,
                maxit = 100
    identical(fit4$beta, fit3$beta)
    [1] TRUE

    Finally, we can confirm that our parallelization was “successful”. Again, because there is really no need to parallelize this code, performance gains are modest. But, that it indeed runs faster is clear.

    microbenchmark(seq = (em3(y = mY,
                              X = mX,
                              maxit = 100
                   par = (em4(y = mY,
                              X = mX,
                              maxit = 100,
                              nthr = 4
                   times = 20
    Unit: milliseconds
     expr   min    lq  mean median    uq   max neval cld
      seq 32.94 33.01 33.04  33.03 33.07 33.25    20   b
      par 11.16 11.20 11.35  11.26 11.29 13.16    20  a 


    The purpose of this lengthy gallery post is neither to demonstrate new functionality nor the computational feasibility of cutting-edge algorithms. Rather, it is to explicitly walk through a development process similar that which new users can benefit from using while using a very common statistical problem, Probit regression.

    October 01, 2014 12:00 AM

    September 12, 2014

    R you ready?

    Using colorized PNG pictograms in R base plots

    Today I stumbled across a figure in an explanation on multiple factor analysis which contained pictograms.



    Figure 1 from Abdi & Valentin (2007), p. 8.

    I wanted to reproduce a similar figure in R using pictograms and additionally color them e.g. by group membership . I have almost no knowledge about image processing, so I tried out several methods of how to achieve what I want. The first thing I did was read in an PNG file and look at the data structure. The package png allows to read in PNG files. Note that all of the below may not work on Windows machines, as it does not support semi-transparency (see ?readPNG).

    img <- readPNG(system.file("img", "Rlogo.png", package="png"))
    ## [1] "array"
    ## [1]  76 100   4

    The object is a numerical array with four layers (red, green, blue, alpha; short RGBA). Let’s have a look at the first layer (red) and replace all non-zero entries by a one and the zeros by a dot. This will show us the pattern of non-zero values and we already see the contours.

    l4 <- img[,,1]
    l4[l4 > 0] <- 1
    l4[l4 == 0] <- "."
    d <- apply(l4, 1, function(x) {
     cat(paste0(x, collapse=""), "\n") 

    To display the image in R one way is to raster the image (i.e. the RGBA layers are collapsed into a layer of single HEX value) and print it using rasterImage.


    rimg <- as.raster(img) # raster multilayer object
    r <- nrow(rimg) / ncol(rimg) # image ratio
    plot(c(0,1), c(0,r), type = "n", xlab = "", ylab = "", asp=1)
    rasterImage(rimg, 0, 0, 1, r) 


    Let’s have a look at a small part the rastered image object. It is a matrix of HEX values.

    rimg[40:50, 1:6]
    ## [1,] "#C4C5C202" "#858981E8" "#838881FF" "#888D86FF" "#8D918AFF" "#8F938CFF"
    ## [2,] "#00000000" "#848881A0" "#80847CFF" "#858A83FF" "#898E87FF" "#8D918BFF"
    ## [3,] "#00000000" "#8B8E884C" "#7D817AFF" "#82867EFF" "#868B84FF" "#8A8E88FF"
    ## [4,] "#00000000" "#9FA29D04" "#7E827BE6" "#7E817AFF" "#838780FF" "#878C85FF"
    ## [5,] "#00000000" "#00000000" "#81857D7C" "#797E75FF" "#7F827BFF" "#838781FF"
    ## [6,] "#00000000" "#00000000" "#898C8510" "#787D75EE" "#797E76FF" "#7F837BFF"
    ## [7,] "#00000000" "#00000000" "#00000000" "#7F837C7B" "#747971FF" "#797E76FF"
    ## [8,] "#00000000" "#00000000" "#00000000" "#999C9608" "#767C73DB" "#747971FF"
    ## [9,] "#00000000" "#00000000" "#00000000" "#00000000" "#80847D40" "#71766EFD"
    ## [10,] "#00000000" "#00000000" "#00000000" "#00000000" "#00000000" "#787D7589"
    ## [11,] "#00000000" "#00000000" "#00000000" "#00000000" "#00000000" "#999C9604"

    And print this small part.

    plot(c(0,1), c(0,.6), type = "n", xlab = "", ylab = "", asp=1)
    rasterImage(rimg[40:50, 1:6], 0, 0, 1, .6) 


    Now we have an idea of how the image object and the rastered object look like from the inside. Let’s start to modify the images to suit our needs.

    In order to change the color of the pictograms, my first idea was to convert the graphics to greyscale and remap the values to a color ramp of may choice. To convert to greyscale there are tons of methods around (see e.g. here). I just pick one of them I found on SO by chance. With R=Red, G=Green and B=Blue we have

    brightness = sqrt(0.299 * R^2 + 0.587 * G^2 + 0.114 * B^2)

    This approach modifies the PNG files after they have been coerced into a raster object.

    # function to calculate brightness values
    brightness <- function(hex) {
      v <- col2rgb(hex)
      sqrt(0.299 * v[1]^2 + 0.587 * v[2]^2 + 0.114 * v[3]^2) /255
    # given a color ramp, map brightness to ramp also taking into account 
    # the alpha level. The defaul color ramp is grey
    img_to_colorramp <- function(img, ramp=grey) {
      cv <- as.vector(img)
      b <- sapply(cv, brightness)
      g <- ramp(b)
      a <- substr(cv, 8,9)     # get alpha values
      ga <- paste0(g, a)       # add alpha values to new colors
      img.grey <- matrix(ga, nrow(img), ncol(img), byrow=TRUE)  
    # read png and modify
    img <- readPNG(system.file("img", "Rlogo.png", package="png"))
    img <- as.raster(img)           # raster multilayer object
    r <- nrow(img) / ncol(img)      # image ratio
    s <- 3.5                        # size
    plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
    rasterImage(img, 0, 0, 0+s/r, 0+s)  # original
    img2 <- img_to_colorramp(img)       # modify using grey scale
    rasterImage(img2, 5, 0, 5+s/r, 0+s)


    Great, it works! Now Let’s go and try out some other color palettes using colorRamp to create a color ramp.

    plot(c(0,10),c(0,8.5), type = "n", xlab = "", ylab = "", asp=1)
    img1 <- img_to_colorramp(img)
    rasterImage(img1, 0, 5, 0+s/r, 5+s)
    reds <- function(x) 
      rgb(colorRamp(c("darkred", "white"))(x), maxColorValue = 255)
    img2 <- img_to_colorramp(img, reds)
    rasterImage(img2, 5, 5, 5+s/r, 5+s)
    greens <- function(x) 
      rgb(colorRamp(c("darkgreen", "white"))(x), maxColorValue = 255)
    img3 <- img_to_colorramp(img, greens)
    rasterImage(img3, 0, 0, 0+s/r, 0+s)
    single_color <- function(...) "#0000BB"
    img4 <- img_to_colorramp(img, single_color)
    rasterImage(img4, 5, 0, 5+s/r, 0+s)


    Okay, that basically does the job. Now we will apply it to the wine pictograms.
    Let’s use this wine glass from Wikimedia Commons. It’s quite big so I uploaded a reduced size version to imgur . We will use it for our purposes.

    # load file from web
    f <- tempfile()
    download.file("", f)
    img <- readPNG(f)
    img <- as.raster(img)
    r <- nrow(img) / ncol(img)
    s <- 1
    # let's create a function that returns a ramp function to save typing
    ramp <- function(colors) 
      function(x) rgb(colorRamp(colors)(x), maxColorValue = 255)
    # create dataframe with coordinates and colors
    x <- data.frame(x=rnorm(16, c(2,2,4,4)), 
                    y=rnorm(16, c(1,3)), 
                    colors=c("black", "darkred", "garkgreen", "darkblue"))
    plot(c(1,6), c(0,5), type="n", xlab="", ylab="", asp=1)
    for (i in 1L:nrow(x)) {
      colorramp <- ramp(c(x[i,3], "white"))
      img2 <- img_to_colorramp(img, colorramp)
      rasterImage(img2, x[i,1], x[i,2], x[i,1]+s/r, x[i,2]+s)


    Another approach would be to modifying the RGB layers before rastering to HEX values.

    img <- readPNG(system.file("img", "Rlogo.png", package="png"))
    img2 <- img
    img[,,1] <- 0    # remove Red component
    img[,,2] <- 0    # remove Green component
    img[,,3] <- 1    # Set Blue to max
    img <- as.raster(img)
    r <- nrow(img) / ncol(img)  # size ratio
    s <- 3.5   # size
    plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
    rasterImage(img, 0, 0, 0+s/r, 0+s)
    img2[,,1] <- 1   # Red to max
    img2[,,2] <- 0
    img2[,,3] <- 0
    rasterImage(as.raster(img2), 5, 0, 5+s/r, 0+s)


    To just colorize the image, we could weight each layer.

    # wrap weighting into function
    weight_layers <- function(img, w) {
      for (i in seq_along(w))
        img[,,i] <- img[,,i] * w[i]
    plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
    img <- readPNG(system.file("img", "Rlogo.png", package="png"))
    img2 <- weight_layers(img, c(.2, 1,.2))
    rasterImage(img2, 0, 0, 0+s/r, 0+s)
    img3 <- weight_layers(img, c(1,0,0))
    rasterImage(img3, 5, 0, 5+s/r, 0+s)


    After playing around and hard-coding the modifications I started to search and found the EBimage package which has a lot of features for image processing that make ones life (in this case only a bit) easier.

    f <- system.file("img", "Rlogo.png", package="png")
    img <- readImage(f) 
    img2 <- img
    img[,,2] = 0      # zero out green layer
    img[,,3] = 0      # zero out blue layer
    img <- as.raster(img)
    img2[,,1] = 0
    img2[,,3] = 0
    img2 <- as.raster(img2)
    r <- nrow(img) / ncol(img)
    s <- 3.5
    plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
    rasterImage(img, 0, 0, 0+s/r, 0+s)
    rasterImage(img2, 5, 0, 5+s/r, 0+s)


    EBImage is a good choice and fairly easy to handle. Now let’s again print the pictograms.

    f <- tempfile(fileext=".png")
    download.file("", f)
    img <- readImage(f)
    # will replace whole image layers by one value
    # only makes sense if there is a alpha layer that 
    # gives the contours
    mod_color <- function(img, col) {
      v <- col2rgb(col) / 255
      img = channel(img, 'rgb')
      img[,,1] = v[1]   # Red
      img[,,2] = v[2]   # Green
      img[,,3] = v[3]   # Blue
    r <- nrow(img) / ncol(img)  # get image ratio
    s <- 1                      # size
    # create random data
    x <- data.frame(x=rnorm(16, c(2,2,4,4)), 
                    y=rnorm(16, c(1,3)), 
    # plot pictograms 
    plot(c(1,6), c(0,5), type="n", xlab="", ylab="", asp=1)
    for (i in 1L:nrow(x)) {
      img2 <- mod_color(img, x[i, 3])
      rasterImage(img2, x[i,1], x[i,2], x[i,1]+s*r, x[i,2]+s)


    Note, that above I did not bother to center each pictogram to position it correctly. This still needs to be done. Anyway, that’s it! Mission completed.


    Abdi, H., & Valentin, D. (2007). Multiple factor analysis (MFA). In N. Salkind (Ed.), Encyclopedia of Measurement and Statistics (pp. 1–14). Thousand Oaks, CA: Sage Publications. Retrieved from

    by markheckmann at September 12, 2014 09:19 AM

    February 17, 2014

    Seth Falcon

    Have Your SHA and Bcrypt Too


    I've been putting off sharing this idea because I've heard the rumors about what happens to folks who aren't security experts when they post about security on the internet. If this blog is replaced with cat photos and rainbows, you'll know what happened.

    The Sad Truth

    It's 2014 and chances are you have accounts on websites that are not properly handling user passwords. I did no research to produce the following list of ways passwords are mishandled in decreasing order of frequency:

    1. Site uses a fast hashing algorithm, typically SHA1(salt + plain-password).
    2. Site doesn't salt password hashes
    3. Site stores raw passwords

    We know that sites should be generating secure random salts and using an established slow hashing algorithm (bcrypt, scrypt, or PBKDF2). Why are sites not doing this?

    While security issues deserve a top spot on any site's priority list, new features often trump addressing legacy security concerns. The immediacy of the risk is hard to quantify and it's easy to fall prey to a "nothing bad has happened yet, why should we change now" attitude. It's easy for other bugs, features, or performance issues to win out when measured by immediate impact. Fixing security or other "legacy" issues is the Right Thing To Do and often you will see no measurable benefit from the investment. It's like having insurance. You don't need it until you do.

    Specific to the improper storage of user password data is the issue of the impact to a site imposed by upgrading. There are two common approaches to upgrading password storage. You can switch cold turkey to the improved algorithms and force password resets on all of your users. Alternatively, you can migrate incrementally such that new users and any user who changes their password gets the increased security.

    The cold turkey approach is not a great user experience and sites might choose to delay an upgrade to avoid admitting to a weak security implementation and disrupting their site by forcing password resets.

    The incremental approach is more appealing, but the security benefit is drastically diminished for any site with a substantial set of existing users.

    Given the above migration choices, perhaps it's (slightly) less surprising that businesses choose to prioritize other work ahead of fixing poorly stored user password data.

    The Idea

    What if you could upgrade a site so that both new and existing users immediately benefited from the increased security, but without the disruption of password resets? It turns out that you can and it isn't very hard.

    Consider a user table with columns:


    Where the hashed_pass column is computed using a weak fast algorithm, for example SHA1(salt + plain_pass).

    The core of the idea is to apply a proper algorithm on top of the data we already have. I'll use bcrypt to make the discussion concrete. Add columns to the user table as follows:


    Process the existing user table by computing bcrypt(salt2 + hashed_pass) and storing the result in the hashed_pass column (overwriting the less secure value); save the new salt value to salt2 and set hash_type to bycrpt+sha1.

    To verify a user where hash_type is bcrypt+sha1, compute bcrypt(salt2 + SHA1(salt + plain_pass)) and compare to the hashed_pass value. Note that bcrypt implementations encode the salt as a prefix of the hashed value so you could avoid the salt2 column, but it makes the idea easier to explain to have it there.

    You can take this approach further and have any user that logs in (as well as new users) upgrade to a "clean" bcrypt only algorithm since you can now support different verification algorithms using hash_type. With the proper application code changes in place, the upgrade can be done live.

    This scheme will also work for sites storing non-salted password hashes as well as those storing plain text passwords (THE HORROR).

    Less Sadness, Maybe

    Perhaps this approach makes implementing a password storage security upgrade more palatable and more likely to be prioritized. And if there's a horrible flaw in this approach, maybe you'll let me know without turning this blog into a tangle of cat photos and rainbows.

    February 17, 2014 07:08 PM

    December 26, 2013

    Seth Falcon

    A Rebar Plugin for Locking Deps: Reproducible Erlang Project Builds For Fun and Profit

    What's this lock-deps of which you speak?

    If you use rebar to generate an OTP release project and want to have reproducible builds, you need the rebar_lock_deps_plugin plugin. The plugin provides a lock-deps command that will generate a rebar.config.lock file containing the complete flattened set of project dependencies each pegged to a git SHA. The lock file acts similarly to Bundler's Gemfile.lock file and allows for reproducible builds (*).

    Without lock-deps you might rely on the discipline of using a tag for all of your application's deps. This is insufficient if any dep depends on something not specified as a tag. It can also be a problem if a third party dep doesn't provide a tag. Generating a rebar.config.lock file solves these issues. Moreover, using lock-deps can simplify the work of putting together a release consisting of many of your own repos. If you treat the master branch as shippable, then rather than tagging each subproject and updating rebar.config throughout your project's dependency chain, you can run get-deps (without the lock file), compile, and re-lock at the latest versions throughout your project repositories.

    The reproducibility of builds when using lock-deps depends on the SHAs captured in rebar.config.lock. The plugin works by scanning the cloned repos in your project's deps directory and extracting the current commit SHA. This works great until a repository's history is rewritten with a force push. If you really want reproducible builds, you need to not nuke your SHAs and you'll need to fork all third party repos to ensure that someone else doesn't screw you over in this fashion either. If you make a habit of only depending on third party repos using a tag, assume that upstream maintainers are not completely bat shit crazy, and don't force push your master branch, then you'll probably be fine.

    Getting Started

    Install the plugin in your project by adding the following to your rebar.config file:

    %% Plugin dependency
    {deps, [
        {rebar_lock_deps_plugin, ".*",
         {git, "git://", {branch, "master"}}}
    %% Plugin usage
    {plugins, [rebar_lock_deps_plugin]}.

    To test it out do:

    rebar get-deps
    # the plugin has to be compiled so you can use it
    rebar compile
    rebar lock-deps

    If you'd like to take a look at a project that uses the plugin, take a look at CHEF's erchef project.

    Bonus features

    If you are building an OTP release project using rebar generate then you can use rebar_lock_deps_plugin to enhance your build experience in three easy steps.

    1. Use rebar bump-rel-version version=$BUMP to automate the process of editing rel/reltool.config to update the release version. The argument $BUMP can be major, minor, or patch (default) to increment the specified part of a semver X.Y.Z version. If $BUMP is any other value, it is used as the new version verbatim. Note that this function rewrites rel/reltool.config using ~p. I check-in the reformatted version and maintain the formatting when editing. This way, the general case of a version bump via bump-rel-version results in a minimal diff.

    2. Autogenerate a change summary commit message for all project deps. Assuming you've generated a new lock file and bumped the release version, use rebar commit-release to commit the changes to rebar.config.lock and rel/reltool.config with a commit message that summarizes the changes made to each dependency between the previously locked version and the newly locked version. You can get a preview of the commit message via rebar log-changed-deps.

    3. Finally, create an annotated tag for your new release with rebar tag-release which will read the current version from rel/reltool.config and create an annotated tag named with the version.

    The dependencies, they are ordered

    Up to version 2.0.1 of rebar_lock_deps_plugin, the dependencies in the generated lock file were ordered alphabetically. This was a side-effect of using filelib:wildcard/1 to list the dependencies in the top-level deps directory. In most cases, the order of the full dependency set does not matter. However, if some of the code in your project uses parse transforms, then it will be important for the parse transform to be compiled and on the code path before attempting to compile code that uses the parse transform.

    This issue was recently discovered by a colleague who ran into build issues using the lock file for a project that had recently integrated lager for logging. He came up with the idea of maintaining the order of deps as they appear in the various rebar.config files along with a prototype patch proving out the idea. As of rebar_lock_deps_plugin 3.0.0, the lock-deps command will (mostly) maintain the relative order of dependencies as found in the rebar.config files.

    The "mostly" is that when a dep is shared across two subprojects, it will appear in the expected order for the first subproject (based on the ordering of the two subprojects). The deps for the second subproject will not be in strict rebar.config order, but the resulting order should address any compile-time dependencies and be relatively stable (only changing when project deps alter their deps with larger impact when shared deps are introduced or removed).

    Digression: fun with dependencies

    There are times, as a programmer, when a real-world problem looks like a text book exercise (or an interview whiteboard question). Just the other day at work we had to design some manhole covers, but I digress.

    Fixing the order of the dependencies in the generated lock file is (nearly) the same as finding an install order for a set of projects with inter-dependencies. I had some fun coding up the text book solution even though the approach doesn't handle the constraint of respecting the order provided by the rebar.config files. Onward with the digression.

    We have a set of "packages" where some packages depend on others and we want to determine an install order such that a package's dependencies are always installed before the package. The set of packages and the relation "depends on" form a directed acyclic graph or DAG. The topological sort of a DAG produces an install order for such a graph. The ordering is not unique. For example, with a single package C depending on A and B, valid install orders are [A, B, C] and [B, A, C].

    To setup the problem, we load all of the project dependency information into a proplist mapping each package to a list of its dependencies extracted from the package's rebar.config file.

    read_all_deps(Config, Dir) ->
        TopDeps = rebar_config:get(Config, deps, []),
        Acc = [{top, dep_names(TopDeps)}],
        DepDirs = filelib:wildcard(filename:join(Dir, "*")),
        Acc ++ [
         {filename:basename(D), dep_names(extract_deps(D))}
         || D <- DepDirs ].

    Erlang's standard library provides the digraph and digraph_utils modules for constructing and operating on directed graphs. The digraph_utils module includes a topsort/1 function which we can make use of for our "exercise". The docs say:

    Returns a topological ordering of the vertices of the digraph Digraph if such an ordering exists, false otherwise. For each vertex in the returned list, there are no out-neighbours that occur earlier in the list.

    To figure out which way to point the edges when building our graph, consider two packages A and B with A depending on B. We know we want to end up with an install order of [B, A]. Rereading the topsort/1 docs, we must want an edge B => A. With that, we can build our DAG and obtain an install order with the topological sort:

    load_digraph(Config, Dir) ->
        AllDeps = read_all_deps(Config, Dir),
        G = digraph:new(),
        Nodes = all_nodes(AllDeps),
        [ digraph:add_vertex(G, N) || N <- Nodes ],
        %% If A depends on B, then we add an edge A <= B
          [ digraph:add_edge(G, Dep, Item)
            || Dep <- DepList ]
          || {Item, DepList} <- AllDeps, Item =/= top ],
    %% extract a sorted unique list of all deps
    all_nodes(AllDeps) ->
        lists:usort(lists:foldl(fun({top, L}, Acc) ->
                                        L ++ Acc;
                                   ({K, L}, Acc) ->
                                        [K|L] ++ Acc
                                end, [], AllDeps)).

    The digraph module manages graphs using ETS giving it a convenient API, though one that feels un-erlang-y in its reliance on side-effects.

    The above gives an install order, but doesn't take into account the relative order of deps as specified in the rebar.config files. The solution implemented in the plugin is a bit less fancy, recursing over the deps and maintaining the desired ordering. The only tricky bit being that shared deps are ignored until the end and the entire linearized list is de-duped which required a . Here's the code:

    order_deps(AllDeps) ->
        Top = proplists:get_value(top, AllDeps),
        order_deps(lists:reverse(Top), AllDeps, []).
    order_deps([], _AllDeps, Acc) ->
    order_deps([Item|Rest], AllDeps, Acc) ->
        ItemDeps = proplists:get_value(Item, AllDeps),
        order_deps(lists:reverse(ItemDeps) ++ Rest, AllDeps, [Item | Acc]).
    de_dup(AccIn) ->
        WithIndex = lists:zip(AccIn, lists:seq(1, length(AccIn))),
        UWithIndex = lists:usort(fun({A, _}, {B, _}) ->
                                         A =< B
                                 end, WithIndex),
        Ans0 = lists:sort(fun({_, I1}, {_, I2}) ->
                                  I1 =< I2
                          end, UWithIndex),
        [ V || {V, _} <- Ans0 ].

    Conclusion and the end of this post

    The great thing about posting to your blog is, you don't have to have a proper conclusion if you don't want to.

    December 26, 2013 04:20 PM

    December 09, 2013

    Leandro Penz

    Probabilistic bug hunting

    Probabilistic bug hunting

    Have you ever run into a bug that, no matter how careful you are trying to reproduce it, it only happens sometimes? And then, you think you've got it, and finally solved it - and tested a couple of times without any manifestation. How do you know that you have tested enough? Are you sure you were not "lucky" in your tests?

    In this article we will see how to answer those questions and the math behind it without going into too much detail. This is a pragmatic guide.

    The Bug

    The following program is supposed to generate two random 8-bit integer and print them on stdout:

      #include <stdio.h>
      #include <fcntl.h>
      /* Returns -1 if error, other number if ok. */
      int get_random_chars(char *r1, char*r2)
      	int f = open("/dev/urandom", O_RDONLY);
      	if (f < 0)
      		return -1;
      	if (read(f, r1, sizeof(*r1)) < 0)
      		return -1;
      	if (read(f, r2, sizeof(*r2)) < 0)
      		return -1;
      	return *r1 & *r2;
      int main(void)
      	char r1;
      	char r2;
      	int ret;
      	ret = get_random_chars(&r1, &r2);
      	if (ret < 0)
      		fprintf(stderr, "error");
      		printf("%d %d\n", r1, r2);
      	return ret < 0;

    On my architecture (Linux on IA-32) it has a bug that makes it print "error" instead of the numbers sometimes.

    The Model

    Every time we run the program, the bug can either show up or not. It has a non-deterministic behaviour that requires statistical analysis.

    We will model a single program run as a Bernoulli trial, with success defined as "seeing the bug", as that is the event we are interested in. We have the following parameters when using this model:

    • \(n\): the number of tests made;
    • \(k\): the number of times the bug was observed in the \(n\) tests;
    • \(p\): the unknown (and, most of the time, unknowable) probability of seeing the bug.

    As a Bernoulli trial, the number of errors \(k\) of running the program \(n\) times follows a binomial distribution \(k \sim B(n,p)\). We will use this model to estimate \(p\) and to confirm the hypotheses that the bug no longer exists, after fixing the bug in whichever way we can.

    By using this model we are implicitly assuming that all our tests are performed independently and identically. In order words: if the bug happens more ofter in one environment, we either test always in that environment or never; if the bug gets more and more frequent the longer the computer is running, we reset the computer after each trial. If we don't do that, we are effectively estimating the value of \(p\) with trials from different experiments, while in truth each experiment has its own \(p\). We will find a single value anyway, but it has no meaning and can lead us to wrong conclusions.

    Physical analogy

    Another way of thinking about the model and the strategy is by creating a physical analogy with a box that has an unknown number of green and red balls:

    • Bernoulli trial: taking a single ball out of the box and looking at its color - if it is red, we have observed the bug, otherwise we haven't. We then put the ball back in the box.
    • \(n\): the total number of trials we have performed.
    • \(k\): the total number of red balls seen.
    • \(p\): the total number of red balls in the box divided by the total number of green balls in the box.

    Some things become clearer when we think about this analogy:

    • If we open the box and count the balls, we can know \(p\), in contrast with our original problem.
    • Without opening the box, we can estimate \(p\) by repeating the trial. As \(n\) increases, our estimate for \(p\) improves. Mathematically: \[p = \lim_{n\to\infty}\frac{k}{n}\]
    • Performing the trials in different conditions is like taking balls out of several different boxes. The results tell us nothing about any single box.

    Estimating \(p\)

    Before we try fixing anything, we have to know more about the bug, starting by the probability \(p\) of reproducing it. We can estimate this probability by dividing the number of times we see the bug \(k\) by the number of times we tested for it \(n\). Let's try that with our sample bug:

      $ ./hasbug
      67 -68
      $ ./hasbug
      79 -101
      $ ./hasbug

    We know from the source code that \(p=25%\), but let's pretend that we don't, as will be the case with practically every non-deterministic bug. We tested 3 times, so \(k=1, n=3 \Rightarrow p \sim 33%\), right? It would be better if we tested more, but how much more, and exactly what would be better?

    \(p\) precision

    Let's go back to our box analogy: imagine that there are 4 balls in the box, one red and three green. That means that \(p = 1/4\). What are the possible results when we test three times?

    Red balls Green balls \(p\) estimate
    0 3 0%
    1 2 33%
    2 1 66%
    3 0 100%

    The less we test, the smaller our precision is. Roughly, \(p\) precision will be at most \(1/n\) - in this case, 33%. That's the step of values we can find for \(p\), and the minimal value for it.

    Testing more improves the precision of our estimate.

    \(p\) likelihood

    Let's now approach the problem from another angle: if \(p = 1/4\), what are the odds of seeing one error in four tests? Let's name the 4 balls as 0-red, 1-green, 2-green and 3-green:

    The table above has all the possible results for getting 4 balls out of the box. That's \(4^4=256\) rows, generated by this python script. The same script counts the number of red balls in each row, and outputs the following table:

    k rows %
    0 81 31.64%
    1 108 42.19%
    2 54 21.09%
    3 12 4.69%
    4 1 0.39%

    That means that, for \(p=1/4\), we see 1 red ball and 3 green balls only 42% of the time when getting out 4 balls.

    What if \(p = 1/3\) - one red ball and two green balls? We would get the following table:

    k rows %
    0 16 19.75%
    1 32 39.51%
    2 24 29.63%
    3 8 9.88%
    4 1 1.23%

    What about \(p = 1/2\)?

    k rows %
    0 1 6.25%
    1 4 25.00%
    2 6 37.50%
    3 4 25.00%
    4 1 6.25%

    So, let's assume that you've seen the bug once in 4 trials. What is the value of \(p\)? You know that can happen 42% of the time if \(p=1/4\), but you also know it can happen 39% of the time if \(p=1/3\), and 25% of the time if \(p=1/2\). Which one is it?

    The graph bellow shows the discrete likelihood for all \(p\) percentual values for getting 1 red and 3 green balls:

    The fact is that, given the data, the estimate for \(p\) follows a beta distribution \(Beta(k+1, n-k+1) = Beta(2, 4)\) (1) The graph below shows the probability distribution density of \(p\):

    The R script used to generate the first plot is here, the one used for the second plot is here.

    Increasing \(n\), narrowing down the interval

    What happens when we test more? We obviously increase our precision, as it is at most \(1/n\), as we said before - there is no way to estimate that \(p=1/3\) when we only test twice. But there is also another effect: the distribution for \(p\) gets taller and narrower around the observed ratio \(k/n\):

    Investigation framework

    So, which value will we use for \(p\)?

    • The smaller the value of \(p\), the more we have to test to reach a given confidence in the bug solution.
    • We must, then, choose the probability of error that we want to tolerate, and take the smallest value of \(p\) that we can.

      A usual value for the probability of error is 5% (2.5% on each side).
    • That means that we take the value of \(p\) that leaves 2.5% of the area of the density curve out on the left side. Let's call this value \(p_{min}\).
    • That way, if the observed \(k/n\) remains somewhat constant, \(p_{min}\) will raise, converging to the "real" \(p\) value.
    • As \(p_{min}\) raises, the amount of testing we have to do after fixing the bug decreases.

    By using this framework we have direct, visual and tangible incentives to test more. We can objectively measure the potential contribution of each test.

    In order to calculate \(p_{min}\) with the mentioned properties, we have to solve the following equation:

    \[\sum_{k=0}^{k}{n\choose{k}}p_{min} ^k(1-p_{min})^{n-k}=\frac{\alpha}{2} \]

    \(alpha\) here is twice the error we want to tolerate: 5% for an error of 2.5%.

    That's not a trivial equation to solve for \(p_{min}\). Fortunately, that's the formula for the confidence interval of the binomial distribution, and there are a lot of sites that can calculate it:

    Is the bug fixed?

    So, you have tested a lot and calculated \(p_{min}\). The next step is fixing the bug.

    After fixing the bug, you will want to test again, in order to confirm that the bug is fixed. How much testing is enough testing?

    Let's say that \(t\) is the number of times we test the bug after it is fixed. Then, if our fix is not effective and the bug still presents itself with a probability greater than the \(p_{min}\) that we calculated, the probability of not seeing the bug after \(t\) tests is:

    \[\alpha = (1-p_{min})^t \]

    Here, \(\alpha\) is also the probability of making a type I error, while \(1 - \alpha\) is the statistical significance of our tests.

    We now have two options:

    • arbitrarily determining a standard statistical significance and testing enough times to assert it.
    • test as much as we can and report the achieved statistical significance.

    Both options are valid. The first one is not always feasible, as the cost of each trial can be high in time and/or other kind of resources.

    The standard statistical significance in the industry is 5%, we recommend either that or less.

    Formally, this is very similar to a statistical hypothesis testing.

    Back to the Bug

    Testing 20 times

    This file has the results found after running our program 5000 times. We must never throw out data, but let's pretend that we have tested our program only 20 times. The observed \(k/n\) ration and the calculated \(p_{min}\) evolved as shown in the following graph:

    After those 20 tests, our \(p_{min}\) is about 12%.

    Suppose that we fix the bug and test it again. The following graph shows the statistical significance corresponding to the number of tests we do:

    In words: we have to test 24 times after fixing the bug to reach 95% statistical significance, and 35 to reach 99%.

    Now, what happens if we test more before fixing the bug?

    Testing 5000 times

    Let's now use all the results and assume that we tested 5000 times before fixing the bug. The graph bellow shows \(k/n\) and \(p_{min}\):

    After those 5000 tests, our \(p_{min}\) is about 23% - much closer to the real \(p\).

    The following graph shows the statistical significance corresponding to the number of tests we do after fixing the bug:

    We can see in that graph that after about 11 tests we reach 95%, and after about 16 we get to 99%. As we have tested more before fixing the bug, we found a higher \(p_{min}\), and that allowed us to test less after fixing the bug.

    Optimal testing

    We have seen that we decrease \(t\) as we increase \(n\), as that can potentially increases our lower estimate for \(p\). Of course, that value can decrease as we test, but that means that we "got lucky" in the first trials and we are getting to know the bug better - the estimate is approaching the real value in a non-deterministic way, after all.

    But, how much should we test before fixing the bug? Which value is an ideal value for \(n\)?

    To define an optimal value for \(n\), we will minimize the sum \(n+t\). This objective gives us the benefit of minimizing the total amount of testing without compromising our guarantees. Minimizing the testing can be fundamental if each test costs significant time and/or resources.

    The graph bellow shows us the evolution of the value of \(t\) and \(t+n\) using the data we generated for our bug:

    We can see clearly that there are some low values of \(n\) and \(t\) that give us the guarantees we need. Those values are \(n = 15\) and \(t = 24\), which gives us \(t+n = 39\).

    While you can use this technique to minimize the total number of tests performed (even more so when testing is expensive), testing more is always a good thing, as it always improves our guarantee, be it in \(n\) by providing us with a better \(p\) or in \(t\) by increasing the statistical significance of the conclusion that the bug is fixed. So, before fixing the bug, test until you see the bug at least once, and then at least the amount specified by this technique - but also test more if you can, there is no upper bound, specially after fixing the bug. You can then report a higher confidence in the solution.


    When a programmer finds a bug that behaves in a non-deterministic way, he knows he should test enough to know more about the bug, and then even more after fixing it. In this article we have presented a framework that provides criteria to define numerically how much testing is "enough" and "even more." The same technique also provides a method to objectively measure the guarantee that the amount of testing performed provides, when it is not possible to test "enough."

    We have also provided a real example (even though the bug itself is artificial) where the framework is applied.

    As usual, the source code of this page (R scripts, etc) can be found and downloaded in

    December 09, 2013 12:00 AM

    December 01, 2013

    Gregor Gorjanc

    Read line by line of a file in R

    Are you using R for data manipulation for later use with other programs, i.e., a workflow something like this:
    1. read data sets from a disk,
    2. modify the data, and
    3. write it back to a disk.
    All fine, but of data set is really big, then you will soon stumble on memory issues. If data processing is simple and you can read only chunks, say only line by line, then the following might be useful:

    ## File
    file <- "myfile.txt"
    ## Create connection
    con <- file(description=file, open="r")
    ## Hopefully you know the number of lines from some other source or
    com <- paste("wc -l ", file, " | awk '{ print $1 }'", sep="")
    n <- system(command=com, intern=TRUE)
    ## Loop over a file connection
    for(i in 1:n) {
    tmp <- scan(file=con, nlines=1, quiet=TRUE)
    ## do something on a line of data
    Created by Pretty R at

    by Gregor Gorjanc ( at December 01, 2013 10:55 PM

    November 20, 2013

    R you ready?

    Sending data from client to server and back using shiny

    post-logoAfter some time of using shiny I got to the point where I needed to send some arbitrary data from the client to the server, process it with R and return some other data to the client. As a client/server programming newbie this was a challenge for me as I did not want to dive too deep into the world of web programming. I wanted to get the job done using shiny and preferably as little JS/PHP etc. scripting as possible.

    It turns out that the task is quite simple as shiny comes with some currently undocumented functions under the hood that will make this task quite easy. You can find some more information on these functions here.

    As mentioned above, I am a web programming newbie. So this post may be helpful for people with little web programming experience (just a few lines of JavaScript are needed) and who want to see a simple way of how to get the job done.

    Sending data from client to server

    Sending the data from the client to the server is accomplished by the JS function Shiny.onInputChange. This function takes a JS object and sends it to the shiny server. On the server side the object will be accessible as an R object under the name which is given as the second argument to the Shiny.onInputChange function. Let’s start by sending a random number to the server. The name of the object on the server side will be mydata.

    Let’s create the shiny user interface file (ui.R). I will add a colored div, another element for verbatim text output called results and add the JavaScript code to send the data. The workhorse line is Shiny.onInputChange(“mydata”, number);. The JS code is included by passing it as a string to the tags$script function.

    # ui.R
    shinyUI( bootstrapPage(
      # a div named mydiv
      tags$div(id="mydiv", style="width: 50px; height :50px;
               left: 100px; top: 100px;
               background-color: gray; position: absolute"),
      # a shiny element to display unformatted text
      # javascript code to send data to shiny server
        document.getElementById("mydiv").onclick = function() {
          var number = Math.random();
          Shiny.onInputChange("mydata", number);

    Now, on the server side, we can simply access the data that was sent by addressing it the usual way via the input object (i.e. input$mydata. The code below will make the verbatimTextOutput element results show the value that was initially passed to the server.

    # server.R
    shinyServer(function(input, output, session) {
        output$results = renderPrint({

    You can copy the above files from here or run the code directly. When you run the code you will find that the random value in the upper box is updated if you click on the div.


    What we have achieved so far is to pass some data to the server, access it and pass it back to a display on the client side. For the last part however, we have used a standard shiny element to send back the data to the client.

    Sending data from server to client

    Now let’s add a component to send custom data from the server back to the client. This task has two parts. On the client side we need to define a handler function. This is a function that will receive the data from the server and perform some task with it. In other words, the function will handle the received data. To register a handler the function Shiny.addCustomMessageHandler is used. I will name our handler function myCallbackHandler. Our handler function will use the received data and execute some JS code. In our case it will change the color of our div called mydiv according to the color value that is passed from the server to the handler. Let’s add the JS code below to the ui.R file.

    # ui.R
    # handler to receive data from server
            function(color) {
              document.getElementById("mydiv").style.backgroundColor = color;

    Let’s move to the server side. I want the server to send the data to the handler function whenever the div is clicked, i.e. when the value of input$mydata changes. The sending of the data to the client is accomplished by an R function called sendCustomMessage which can be found in the session object. The function is passed the name of the client side handler function and the R object we want to pass to the function. Here, I create a random hex color value string that gets sent to a client handler function myCallbackHandler. The line sending the data to the client is contained in an observer. The observer includes the reactive object input$mydata, so the server will send someting to the client side handler function whenever the values of input$mydata changes. And it changes each time we click on the div. Let’s add the code below to the server.R file.

    # server.R
    # observes if value of mydata sent from the client changes.  if yes
    # generate a new random color string and send it back to the client
    # handler function called 'myCallbackHandler'
        color = rgb(runif(1), runif(1), runif(1))
        session$sendCustomMessage(type = "myCallbackHandler", color)

    You can copy the above files from here or run the code directly. When you run the code you will see that the div changes color when you click on it.


    That’s it. We have passed custom data from the client to the server and back. The following graphics sums up the functions that were used.


    Passing more complex objects

    The two functions also do a good job passing more complex JS or R objects. If you modify your code to send a JS object to shiny, it will be converted into an R list object on the server side. Let’s replace the JS object we send to the server (in ui.R) with following lines. On the server side, we will get a list.

    document.getElementById("mydiv").onclick = function() {
      var obj = {one: [1,2,3,4],
                 two: ["a", "b", "c"]};
      Shiny.onInputChange("mydata", obj);

    Note that now however the shiny server will only execute the function once (on loading), not each time the click event is fired. The reason is, that now the input data is static, i.e. the JS object we send via onInputChange does not change. To reduce workload on the server side, the code in the observer will only be executed if the reactive value under observation (i.e. the value of input$mydata) changes. As this is not the case anymore as the value we pass is static, the observer that sends back the color information to the client to change the color of the div is not executed a second time.

    The conversion also works nicely the other way round. We can pass an R list object to the sendCustomMessage function and on the client side it will appear as a JS object. So we are free to pass almost any type of data we need to.

    Putting the JS code in a separate file

    To keep things simple I included the JS code directly into the ui.R file using tags$script. This does not look very nice and you may want to put the JS code in a separate file instead. For this purpose I will create a JS file called mycode.js and include all the above JS code in it. Additionally, this file has another modification: All the code is wrapped into some JS/jQuery code ($(document).ready(function() { })that will make sure the JS code is run after the DOM (that is all the HTML elements) is loaded. Before, I simply placed the JS code below the HTML elements to make sure they are loaded, but I guess this is no good practice.

    // mycode.js
    $(document).ready(function() {
      document.getElementById("mydiv").onclick = function() {
        var number = Math.random();
        Shiny.onInputChange("mydata", number);
        function(color) {
          document.getElementById("mydiv").style.backgroundColor = color;

    To include the JS file shiny offers the includeScript function to include JS files. The server.R file has not changed, the ui.R file now looks like this.

    # server.R
    shinyUI( bootstrapPage(
      # include the js code
      # a div named mydiv
               style="width: 50px; height :50px; left: 100px; top: 100px;
               background-color: gray; position: absolute"),
      # an element for unformatted text

    You can copy the above files from here or run the gist directly from within R.


    The above examples are purely artifical as it will not make much sense to let the server generate a random color value and send it back to the client. JS might just do all this on the client side without any need for client/server communiation at all. The examples are just for demonstration purposes to outline the mechanisms you may use for sending custom data to the server or client using the functions supplied by the marvellous shiny package. Winston Chang (one of the RStudio and shiny guys) has some more examples in his testapp repo. Have a look at the message-handler-inline and the message-handler-jsfile folders.


    by markheckmann at November 20, 2013 06:26 PM

    August 13, 2013

    Gregor Gorjanc

    Setup up the inverse of additive relationship matrix in R

    Additive genetic covariance between individuals is one of the key concepts in (quantitative) genetics. When doing the prediction of additive genetic values for pedigree members, we need the inverse of the so called numerator relationship matrix (NRM) or simply A. Matrix A has off-diagonal entries equal to numerator of Wright's relationship coefficient and diagonal elements equal to 1 + inbreeding coefficient. I have blogged before about setting up such inverse in R using routine from the ASReml-R program or importing the inverse from the CFC program. However, this is not the only way to "skin this cat" in R. I am aware of the following attempts to provide this feature in R for various things (the list is probably incomplete and I would grateful if you point me to other implementations):
    • pedigree R package has function makeA() and makeAinv() with obvious meanings; there is also calcG() if you have a lot of marker data instead of pedigree information; there are also some other very handy functions calcInbreeding(), orderPed(), trimPed(), etc.
    • pedigreemm R package does not have direct implementation to get A inverse, but has all the needed ingredients, which makes the package even more interesting
    • MCMCglmm R package has function inverseA() which works with pedigree or phlyo objects; there are also handy functions such as prunePed(), rbv()sm2asreml(), etc.
    • kinship and kinship2 R packages have function kinship() to setup kinship matrix, which is equal to the half of A; there are also nice functions for plotting pedigrees etc. (see also here)
    • see also a series of R scripts for relationship matrices 
    As I described before, the interesting thing is that setting up inverse of A is easier and cheaper than setting up A and inverting it. This is very important for large applications. This is an old result using the following matrix theory. We can decompose symmetric positive definite matrix as A = LU = LL' (Cholesky decomposition) or as A = LDU = LDL' (Generalized Cholesky decomposition), where L (U) is lower (upper) triangular, and D is diagonal matrix. Note that L and U in previous two equations are not the same thing (L from Cholesky is not equal to L from Generalized Cholesky decomposition)! Sorry for sloppy notation. In order to confuse you even more note that Henderson usually wrote A = TDT'. We can even do A = LSSU, where S diagonal is equal to the square root of D diagonal. This can get us back to A = LU = LL' as LSSU = LSSL' = LSS'L' = LS(LS)' = L'L (be ware of sloppy notation)! The inverse rule says that inv(A) = inv(LDU) = inv(U) inv(D) inv(L) = inv(L)' inv(D) inv(L) = inv(L)' inv(S) inv(S) inv(L). I thank to Martin Maechler for pointing out to the last (obviously) bit to me. In Henderson's notation this would be inv(A) = inv(T)' inv(D) inv(T) = inv(T)' inv(S) inv(S) inv(T) Uf ... The important bit is that with NRM (aka A) inv(L) has nice simple structure - it shows the directed graph of additive genetic values in pedigree, while inv(D) tells us about the precision (inverse variance) of additive genetic values given the additive genetic values of parents and therefore depends on knowledge of parents and their inbreeding (the more they are inbred less variation can we expect in their progeny). Both inv(L) and inv(D) are easier to setup.

    Packages MCMCglmm and pedigree give us inv(A) directly (we can also get inv(D) in MCMCglmm), but pedigreemm enables us to play around with the above matrix algebra and graph theory. First we need a small example pedigree. Bellow is an example with 10 members and there is also some inbreeding and some individuals have both, one, or no parents known. It is hard to see inbreeding directly from the table, but we will improve that later (see also here).

    ped <- data.frame( id=c(  1,   2,   3,   4,   5,   6,   7,   8,   9,  10),
    fid=c( NA, NA, 2, 2, 4, 2, 5, 5, NA, 8),
    mid=c( NA, NA, 1, NA, 3, 3, 6, 6, NA, 9))

    Now we will create an object of a pedigree class and show the A = U'U stuff:

    ## install.packages(pkgs="pedigreemm")
    ped2 <- with(ped, pedigree(sire=fid, dam=mid, label=id))
    U <-
    A &lt
    ;- crossprod(U)
    round(U, digits=2)
    ## 10 x 10 sparse Matrix of class "dtCMatrix"
    ## [1,] 1 . 0.50 . 0.25 0.25 0.25 0.25 . 0.12
    ## [2,] . 1 0.50 0.50 0.50 0.75 0.62 0.62 . 0.31
    ## [3,] . . 0.71 . 0.35 0.35 0.35 0.35 . 0.18
    ## [4,] . . . 0.87 0.43 . 0.22 0.22 . 0.11
    ## [5,] . . . . 0.71 . 0.35 0.35 . 0.18
    ## [6,] . . . . . 0.71 0.35 0.35 . 0.18
    ## [7,] . . . . . . 0.64 . . .
    ## [8,] . . . . . . . 0.64 . 0.32
    ## [9,] . . . . . . . . 1 0.50
    ## [10,] . . . . . . . . . 0.66

    ## To check
    U - chol(A)

    round(A, digits=2)
    ## 10 x 10 sparse Matrix of class "dsCMatrix"
    ## [1,] 1.00 . 0.50 . 0.25 0.25 0.25 0.25 . 0.12
    ## [2,] . 1.00 0.50 0.50 0.50 0.75 0.62 0.62 . 0.31
    ## [3,] 0.50 0.50 1.00 0.25 0.62 0.75 0.69 0.69 . 0.34
    ## [4,] . 0.50 0.25 1.00 0.62 0.38 0.50 0.50 . 0.25
    ## [5,] 0.25 0.50 0.62 0.62 1.12 0.56 0.84 0.84 . 0.42
    ## [6,] 0.25 0.75 0.75 0.38 0.56 1.25 0.91 0.91 . 0.45
    ## [7,] 0.25 0.62 0.69 0.50 0.84 0.91 1.28 0.88 . 0.44
    ## [8,] 0.25 0.62 0.69 0.50 0.84 0.91 0.88 1.28 . 0.64
    ## [9,] . . . . . . . . 1.0 0.50
    ## [10,] 0.12 0.31 0.34 0.25 0.42 0.45 0.44 0.64 0.5

    ote tha
    m package uses Matrix classes in order to store only what we need to store, e.g., matrix U is triangular (t in "dtCMatrix") and matrix A is symmetric (s in "dsCMatrix"). To show the generalized Cholesky A = LDU (or using Henderson notation A = TDT') we use gchol() from the bdsmatrix R package. Matrix T shows the "flow" of genes in pedigree.

    ## install.packages(pkgs="bdsmatrix")
    &lt;- gchol(as.matrix(A))
    D &lt;- diag(tmp)
    (T <- as(as.matrix(tmp), "dtCMatrix"))
    ## 10 x 10 sparse Matrix of class "dtCMatrix"
    ## [1,] 1.000 . . . . . . . . .
    ## [2,] . 1.0000 . . . . . . . .
    ## [3,] 0.500 0.5000 1.00 . . . . . . .
    ## [4,] . 0.5000 . 1.000 . . . . . .
    ## [5,] 0.250 0.5000 0.50 0.500 1.00 . . . . .
    ## [6,] 0.250 0.7500 0.50 . . 1.00 . . . .
    ## [7,] 0.250 0.6250 0.50 0.250 0.50 0.50 1 . . .
    ## [8,] 0.250 0.6250 0.50 0.250 0.50 0.50 . 1.0 . .
    ## [9,] . . . . . . . . 1.0 .
    ## [10,] 0.125 0.3125 0.25 0.125 0.25 0.25 . 0.5 0.5 1

    ## To chec
    - T %*% diag(sqrt(D))
    L - t(U)
    Now the A inverse part (inv(A) = inv(T)' inv(D) inv(T) = inv(T)' inv(S) inv(S) inv(T) using Henderson's notation, note that ). The nice thing is that pedigreemm authors provided functions to get inv(T) and D.

    (TInv <- as(ped2, "sparseMatrix"))
    ## 10 x 10 sparse Matrix of class "dtCMatrix" (unitriangular)
    ## 1 1.0 . . . . . . . . .
    ## 2 . 1.0 . . . . . . . .
    ## 3 -0.5 -0.5 1.0 . . . . . . .
    ## 4 . -0.5 . 1.0 . . . . . .
    ## 5 . . -0.5 -0.5 1.0 . . . . .
    ## 6 . -0.5 -0.5 . . 1.0 . . . .
    ## 7 . . . . -0.5 -0.5 1 . . .
    ## 8 . . . . -0.5 -0.5 . 1.0 . .
    ## 9 . . . . . . . . 1.0 .
    ## 10 . . . . . . . -0.5 -0.5 1
    round(DInv <- Diagonal(x=1/Dmat(ped2)), digits=2)
    ## 10 x 10 diagonal matrix of class "ddiMatrix"
    ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    ## [1,] 1 . . . . . . . . .
    ## [2,] . 1 . . . . . . . .
    ## [3,] . . 2 . . . . . . .
    ## [4,] . . . 1.33 . . . . . .
    ## [5,] . . . . 2 . . . . .
    ## [6,] . . . . . 2 . . . .
    ## [7,] . . . . . . 2.46 . . .
    ## [8,] . . . . . . . 2.46 . .
    ## [9,] . . . . . . . . 1 .
    ## [10,] . . . . . . . . . 2.33
    round(t(TInv) %*% DInv %*% TInv, digits=2)
    ## 10 x 10 sparse Matrix of class "dgCMatrix"
    ## .
    und(crossprod(sqrt(DInv) %*% TInv), digits=2)
    ## 10 x 10 sparse Matrix of class "dsCMatrix"
    ##  [1,]  1.5  0.50 -1.0  .     .     .     .     .     .     .   
    ## [2,] 0.5 2.33 -0.5 -0.67 . -1.00 . . . .
    ## [3,] -1.0 -0.50 3.0 0.50 -1.00 -1.00 . . . .
    ## [4,] . -0.67 0.5 1.83 -1.00 . . . . .
    ## [5,] . . -1.0 -1.00 3.23 1.23 -1.23 -1.23 . .
    ## [6,] . -1.00 -1.0 . 1.23 3.23 -1.23 -1.23 . .
    ## [7,] . . . . -1.23 -1.23 2.46 . . .
    ## [8,] . . . . -1.23 -1.23 . 3.04 0.58 -1.16
    ## [9,] . . . . . . . 0.58 1.58 -1.16
    ## [10,] . . . . . . . -1.16 -1.16

    # T
    o c
    ) - crossprod(sqrt(DInv) %*% TInv)

    The second method (using crossprod) is preferred as it leads directly to symmetric matrix (dsCMatrix), which stores only upper or lower triangle. And make sure you do not do crossprod(TInv %*% sqrt(DInv)) as it is the wrong order of matrices.

    As promised we will display (plot) pedigree by use of conversion functions of matrix objects to graph objects using the following code. Two examples are provided using the graph and igraph packages. The former does a very good job on this example, but otherwise igraph seems to have much nicer support for editing etc.

    ## source("")
    ## biocLite(pkgs=c("graph", "Rgraphviz"))
    g <- as(t(TInv), "graph")

    ## install.packages(pkgs="igraph")
    i &l
    t;- igraph.from.graphNEL(graphNEL=g)
    i)$label <- 1:10
    (i, layout=layout.kamada.kawai)
    ## tkplot(i)

    by Gregor Gorjanc ( at August 13, 2013 02:28 PM

    July 02, 2013

    Gregor Gorjanc

    Parse arguments of an R script

    R can be used also as a scripting tool. We just need to add shebang in the first line of a file (script):


    and then the R code should follow.

    Often we want to pass arguments to such a script, which can be collected in the script by the commandArgs() function. Then we need to parse the arguments and conditional on them do something. I came with a rather general way of parsing these arguments using simply these few lines:

    ## Collect arguments
    args <- commandArgs(TRUE)
    ## Default setting when no arguments passed
    if(length(args) < 1) {
    args <- c("--help")
    ## Help section
    if("--help" %in% args) {
    The R Script
    --arg1=someValue - numeric, blah blah
    --arg2=someValue - character, blah blah
    --arg3=someValue - logical, blah blah
    --help - print this text
    ./test.R --arg1=1 --arg2="
    output.txt" --arg3=TRUE \n\n")
    ## Parse arguments (we expect the form --arg=value)
    parseArgs <- function(x) strsplit(sub("^--", "", x), "=")
    argsDF <-"rbind", parseArgs(args)))
    argsL <- as.list(as.character(argsDF$V2))
    names(argsL) <- argsDF$V1
    ## Arg1 default
    if(is.null(args$arg1)) {
    ## do something
    ## Arg2 default
    if(is.null(args$arg2)) {
    ## do something
    ## Arg3 default
    if(is.null(args$arg3)) {
    ## do something
    ## ... your code here ...
    Created by Pretty R at

    It is some work, but I find it pretty neat and use it for quite a while now. I do wonder what others have come up for this task. I hope I did not miss some very general solution.

    by Gregor Gorjanc ( at July 02, 2013 04:55 PM

    March 24, 2013

    Romain Francois



    This blog is moving to The new one is powered by wordpress and gets a subdomain of

    See you there

    by romain francois at March 24, 2013 03:52 PM

    March 17, 2013

    Modern Toolmaking

    caretEnsemble Classification example

    Here's a quick demo of how to fit a binary classification model with caretEnsemble.  Please note that I haven't spent as much time debugging caretEnsemble for classification models, so there's probably more bugs than my last post.  Also note that multi class models are not yet supported.

    Right now, this code fails for me if I try a model like a nnet or an SVM for stacking, so there's clearly bugs to fix.

    The greedy model relies 100% on the gbm, which makes sense as the gbm has an AUC of 1 on the training set.  The linear model uses all of the models, and achieves an AUC of .5.  This is a little weird, as the gbm, rf, SVN, and knn all achieve an AUC of close to 1.0 on the training set, and I would have expected the linear model to focus on these predictions. I'm not sure if this is a bug, or a failure of my stacking model.

    by Zachary Deane-Mayer ( at March 17, 2013 04:04 AM

    March 13, 2013

    Modern Toolmaking

    New package for ensembling R models

    I've written a new R package called caretEnsemble for creating ensembles of caret models in R.  It currently works well for regression models, and I've written some preliminary support for binary classification models.

    At this point, I've got 2 different algorithms for combining models:

    1. Greedy stepwise ensembles (returns a weight for each model)
    2. Stacks of caret models

    (You can also manually specify weights for a greedy ensemble)

    The greedy algorithm is based on the work of Caruana et al., 2004, and inspired by the medley package here on github.  The stacking algorithm simply builds a second caret model on top of the existing models (using their predictions as input), and employs all of the flexibility of the caret package.

    All the models in the ensemble must use the same training/test folds.  Both algorithms use the out-of-sample predictions to find the weights and train the stack. Here's a brief script demonstrating how to use the package:

    Please feel free to submit any comments here or on github.  I'd also be happy to include any patches you feel like submitting.  In particular, I could use some help writing support for multi-class models, writing more tests, and fixing bugs.

    by Zachary Deane-Mayer ( at March 13, 2013 02:36 PM

    February 18, 2013

    Romain Francois

    Improving the graph gallery

    I'm trying to make improvements to the R Graph Gallery, I'm looking for suggestions from users of the website.

    I've started a question on the website's facebook page. Please take a few seconds to vote to existing improvements possibilities and perhaps offer some of your own ideas.


    by romain francois at February 18, 2013 07:37 AM

    February 04, 2013

    Romain Francois

    bibtex 0.3-5


    The version 0.3-5 of the bibtex package is on CRAN. This fixes a corner case issue about empty bib files thanks to Kurt Hornik.

    by romain francois at February 04, 2013 10:59 AM

    November 05, 2012

    Romain Francois

    OOP with Rcpp modules

    The purpose of Rcpp modules has always been to make it easy to expose C++ functions and classes to R. Up to now, Rcpp modules did not have a way to declare inheritance between C++ classes. This is now fixed in the development version, and the next version of Rcpp will have a simple mechanism to declare inheritance.

    Consider this simple example, we have a base class Shape with two virtual methods (area and contains) and two classes Circle and Rectangle) each deriving from Shape and representing a specific shape.

    The classes might look like this:

    And we can expose these classes to R using the following module declarative code:

    It is worth noticing that:

    • The area and contains methods are exposed as part of the base Shape class
    • Classes Rectangle and Circle simply declare that they derive from Shape using the derives notation.

    R code that uses these classes looks like this:


    by romain francois at November 05, 2012 03:24 PM

    August 01, 2012

    R you ready?

    Creating a text grob that automatically adjusts to viewport size

    Source: flickr, user saikofish, pic lou (

    I recently wanted to construe a dashboard widget that contains some text and other elements using the grid graphics system. The size available for the widget will vary. When the sizes for the elements of the grobs in the widget are specified as Normalised Parent Coordinates the size adjustments happen automatically. Text does not automatically adjust though. The size of the text which is calculated as fontsize times the character expansion factor (cex) remains the same when the viewport size changes. For my widget this would require to adjust the fontsize or cex settings for each case seperately. While this is not really an obstacle, I asked myself how a grob that will adjust its text size automatically when being resized can be construed. Here I jot down my results in the hope that you may find this useful.

    First I will create a new grob class called resizingTextGrob that is supposed to resize automatically.

    resizingTextGrob &lt;- function(...)
      grob(tg=textGrob(...), cl=&quot;resizingTextGrob&quot;)

    The grob created by the function contains nothing more than a textGrob. In order for the grob class to print something we need to specify the drawDetails method for our class which will do the drawing. The drawDetails method is called automatically when drawing a grob using grob.draw.

    drawDetails.resizingTextGrob &lt;- function(x, recording=TRUE)

    Up to now this will produce the same results as a plain textGrob.

    g &lt;- resizingTextGrob(label=&quot;test 1&quot;)
    grid.text(&quot;test 2&quot;, y=.4)

    Now, before doing the drawing we want to calculate the size of the viewport and adjust the fontsize accordingly. To do this we can take the approach to push a new viewport with an adjusted fontsize before the drawing occurs. To perfom the calculations and and push the viewport we specify a preDrawDetails method. This method is automatically called before any drawing occures. It gives us the chance to do some modifications, like e.g. pushing a viewport.

    For this purpose first the available height is calculated. Than the fontsize is rescaled according to the available width. The rescaled fontsize is used for the new viewport. Now for a fully developed class we will want to include these parameters in the grob constructor of course. Or we might define a proportion factor argument by which to shrink the text instead. Anyway, to keep things simple this is not done here.

    preDrawDetails.resizingTextGrob &lt;- function(x)
      h &lt;- convertHeight(unit(1, &quot;snpc&quot;), &quot;mm&quot;, valueOnly=TRUE)
      fs &lt;- rescale(h, to=c(18, 7), from=c(120, 20))
      pushViewport(viewport(gp = gpar(fontsize = fs)))

    To clean up after the drawing the created viewport is popped. This is done in the postDrawDetails which is automatically called after the drawDetails method.

    postDrawDetails.resizingTextGrob &lt;- function(x)

    Now the output will depend on the size of the current viewport. When resizing the device the text size will adjust.

    g &lt;- resizingTextGrob(label=&quot;test 1&quot;)
    grid.text(&quot;test 2&quot;, y=.4)

    Let’s compare the standard textGrob with the new class. For this purpose let’s draw a small clock and display it using different device sizes.

    a &lt;- seq(2*pi, 2*pi/ 12, length=12) + pi/3
    x &lt;- cos(a) / 2
    y &lt;- sin(a) / 2
    segs &lt;- segmentsGrob(x*.2 + .5, y*.2+.5, x*.3 + .5, y*.3 + .5)
    # the standard approach
    tgs.1 &lt;- textGrob(1:12, x*.4 + .5, y*.4 + .5)
    # the new grob class
    tgs.2 &lt;- resizingTextGrob(1:12, x*.4 + .5, y*.4 + .5)
    grid.arrange(grobTree(segs, tgs.1), grobTree(segs, tgs.2))

    What it looks like at the beginning.

    What it looks like when the device is resized.

    Note how the size of the text of the lower clock adjusts to the device size. The text size in the upper graphs remains the same and becomes too big for the clock while it changes for the lower ones.

    BTW: The definitive guide for  the grid graphics model is the book R graphics by Paul Murrell.

    by markheckmann at August 01, 2012 02:06 PM