# Planet R

## May 26, 2016

### Removed CRANberries

#### Package intcox (with last version 0.9.3) was removed from CRAN

Previous versions (as known to CRANberries) which should be available via the Archive link are:

2013-02-25 0.9.3

#### New package heatmaply with initial version 0.3.2

Package: heatmaply
Type: Package
Title: Interactive Heat Maps Using 'plotly'
Version: 0.3.2
Date: 2016-05-25
Authors@R: c(person("Tal", "Galili", role = c("aut", "cre", "cph"), email = "tal.galili@gmail.com", comment = "http://www.r-statistics.com"))
Description: Create interactive heatmaps that are usable from the R console, in the 'RStudio' viewer pane, in 'R Markdown' documents, and in 'Shiny' apps. Hover the mouse pointer over a cell to show details or drag a rectangle to zoom. A heatmap is a popular graphical method for visualizing high-dimensional data, in which a table of numbers is encoded as a grid of colored cells. The rows and columns of the matrix are ordered to highlight patterns and are often accompanied by dendrograms. Heatmaps are used in many fields for visualizing observations, correlations, missing value patterns, and more. Interactive heatmaps allow the inspection of specific values by hovering the mouse over a cell, as well as zooming into a region of the heatmap by dragging a rectangle around the relevant area. This work is based on the 'ggplot2' and 'plotly.js' engines. It produces heatmaps similar to those of 'd3heatmap', with the advantage of speed ('plotly.js' is able to handle larger matrices), and the ability to zoom from the dendrogram panes.
Depends: R (>= 3.0.0), plotly (>= 3.6.0), viridis
Imports: ggplot2, dendextend, magrittr (>= 1.0.1), reshape2, scales, utils, stats
Suggests: knitr, rmarkdown, gplots
VignetteBuilder: knitr
URL: https://cran.r-project.org/package=heatmaply, https://github.com/talgalili/heatmaply/, http://www.r-statistics.com/tag/heatmaply/
BugReports: https://github.com/talgalili/heatmaply/issues
LazyData: TRUE
RoxygenNote: 5.0.1
NeedsCompilation: no
Packaged: 2016-05-26 07:40:04 UTC; junior
Author: Tal Galili [aut, cre, cph] (http://www.r-statistics.com)
Maintainer: Tal Galili <tal.galili@gmail.com>
Repository: CRAN
Date/Publication: 2016-05-26 17:50:25

#### New package fakeR with initial version 1.0

Package: fakeR
Type: Package
Title: Simulates Data from a Data Frame of Different Variable Types
Version: 1.0
Date: 2016-05-25
Authors@R: c(person("Lily", "Zhang", email = "lilyhzhang1029@gmail.com", role = c("aut", "cre")), person("Dustin","Tingley", email = "dtingley@gov.harvard.edu", role = c("aut")))
Description: Generates fake data from a dataset of different variable types. The package contains the functions simulate_dataset and simulate_dataset_ts to simulate time-independent and time-dependent data. It randomly samples character and factor variables from contingency tables and numeric and ordered factors from a multivariate normal distribution. It currently supports the simulation of stationary and zero-inflated count time series.
Imports: mvtnorm, polycor, pscl, VGAM, stats
Suggests: knitr, rmarkdown, testthat
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2016-05-26 06:59:29 UTC; lilyzhang
Author: Lily Zhang [aut, cre], Dustin Tingley [aut]
Maintainer: Lily Zhang <lilyhzhang1029@gmail.com>
Repository: CRAN
Date/Publication: 2016-05-26 17:54:23

## May 25, 2016

### CRANberries

#### New package mvtboost with initial version 0.5.0

Package: mvtboost
Type: Package
Title: Tree Boosting for Multivariate Outcomes
Version: 0.5.0
Date: 2016-05-25
Author: Patrick Miller [aut, cre]
Maintainer: Patrick Miller <patrick.mil10@gmail.com>
Description: Fits a multivariate model of decision trees for multiple, continuous outcome variables. A model for each outcome variable is fit separately, selecting predictors that explain covariance in the outcomes. Built on top of 'gbm', which fits an ensemble of decision trees to univariate outcomes.
URL: https://github.com/patr1ckm/mvtboost
BugReports: https://github.com/patr1ckm/mvtboost/issues
Depends: R (>= 3.0.0)
Suggests: testthat, plyr, MASS, parallel, lars, ggplot2, knitr, rmarkdown
Imports: gbm, RColorBrewer, stats, graphics, grDevices, utils
VignetteBuilder: knitr
RoxygenNote: 5.0.1
NeedsCompilation: no
Packaged: 2016-05-25 14:41:06 UTC; pmille13
Repository: CRAN
Date/Publication: 2016-05-25 18:08:26

## May 15, 2016

### Dirk Eddelbuettel

#### Rcpp 0.12.5: Yet another one

The fifth update in the 0.12.* series of Rcpp arrived on the CRAN network for GNU R a few hours ago, and was just pushed to Debian. This 0.12.5 release follows the 0.12.0 release from late July, the 0.12.1 release in September, the 0.12.2 release in November, the 0.12.3 release in January, and the 0.12.4 release in March --- making it the ninth release at the steady bi-monthly release frequency. This release is once again more of a maintenance release, addressing a number of small bugs, nuisances or documentation issues without adding any major new features.

Rcpp has become the most popular way of enhancing GNU R with C or C++ code. As of today, 662 packages on CRAN depend on Rcpp for making analytical code go faster and further. That is up by almost fifty packages from the last release in late March!

And as with the last few releases, we have new first-time contributors. Sergio Marques helped to enable compilation on Alpine Linux (with its smaller libc variant). Qin Wenfeng helped adapt for Windows builds under R 3.3.0 and the long-awaited new toolchain. Ben Goodrich fixed a (possibly ancient) Rcpp Modules bug he encountered when working with rstan. Another (recurrent) contributor, Dan Dillon, cleaned up an issue with Nullable and strings. Rcpp Core team members Kevin and JJ took care of a small build nuisance on Windows, and I added a new helper function, updated the skeleton generator and (finally) formally deprecated loadRcppModules(), for which loadModule() has been preferred since around R 2.15 or so. More details and links are below.

#### Changes in Rcpp version 0.12.5 (2016-05-14)

• Changes in Rcpp API:

• The checks for different C library implementations now also check for Musl used by Alpine Linux (Sergio Marques in PR #449).

• Rcpp::Nullable works better with Rcpp::String (Dan Dillon in PR #453).

• Changes in Rcpp Attributes:

• R 3.3.0 Windows with Rtools 3.3 is now supported (Qin Wenfeng in PR #451).

• Correct handling of dependent file paths on Windows (use winslash = "/").

• Changes in Rcpp Modules:

• An apparent race condition in Module loading seen with R 3.3.0 was fixed (Ben Goodrich in #461 fixing #458).

• The (older) loadRcppModules() is now deprecated in favour of loadModule() introduced around R 2.15.1 and Rcpp 0.9.11 (PR #470).

• Changes in Rcpp support functions:

• The Rcpp.package.skeleton() function was again updated in order to create a DESCRIPTION file which passes R CMD check without notes, warnings, or errors under R-release and R-devel (PR #471).

• A new function compilerCheck can test for minimal g++ versions (PR #474).

Thanks to CRANberries, you can also look at a diff to the previous release. As always, even fuller details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads page, the browseable doxygen docs and zip files of doxygen output for the standard formats. A local directory has source and documentation too. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

## May 13, 2016

### Journal of the Royal Statistical Society: Series A

#### Probability and Statistics by Example 1: Basic Probability and Statistics

Y. Suhov and M. Kelbert, 2014. Cambridge: Cambridge University Press. 470 pp., $84.99. ISBN 978-1-107-60358-5.

#### A Certain Uncertainty: Nature's Random Ways

M. P. Silverman, 2014. Cambridge: Cambridge University Press. xvi + 618 pp., €175.00. ISBN 978-1-107-03281-1.

## May 12, 2016

### Bioconductor Project Working Papers

#### Interpretable High-Dimensional Inference Via Score Maximization with an Application in Neuroimaging

In the fields of neuroimaging and genetics, a key goal is testing the association of a single outcome with a very high-dimensional imaging or genetic variable. Oftentimes summary measures of the high-dimensional variable are created to sequentially test and localize the association with the outcome. In some cases, the results for summary measures are significant, but subsequent tests used to localize differences are underpowered and do not identify regions associated with the outcome. We propose a generalization of Rao's score test based on maximizing the score statistic in a linear subspace of the parameter space. If the test rejects the null, then we provide methods to localize signal in the high-dimensional space by projecting the scores to the subspace where the score test was performed. This allows inference in the high-dimensional space to be performed on the same degrees of freedom as the score test, effectively reducing the number of comparisons. We illustrate the method by analyzing a subset of the Alzheimer's Disease Neuroimaging Initiative dataset. Results suggest cortical thinning of the frontal and temporal lobes may be a useful biological marker of Alzheimer's risk. Simulation results demonstrate that the test has competitive power relative to other commonly used tests.
## May 10, 2016

### Bioconductor Project Working Papers

#### An Efficient Basket Trial Design

The landscape for early phase cancer clinical trials is changing dramatically due to the advent of targeted therapy. Increasingly, new drugs are designed to work against a target such as the presence of a specific tumor mutation. Since typically only a small proportion of cancer patients will possess the mutational target, but the mutation is present in many different cancers, a new class of basket trials is emerging, whereby the drug is tested simultaneously in different baskets, i.e., sub-groups of different tumor types. Investigators not only desire to test whether the drug works, but also to determine which types of tumors are sensitive to the drug. A natural strategy is to conduct parallel trials, with the drug's effectiveness being tested separately, using, for example, the popular Simon two-stage design independently in each basket. The work presented is motivated by the premise that the efficiency of this strategy can be improved by assessing the homogeneity of the baskets' response rates at an interim analysis and aggregating the baskets in the second stage if the results suggest the drug might be effective in all or most baskets. Via simulations we assess the relative efficiencies of the two strategies. Since the operating characteristics depend on how many tumor types are sensitive to the drug, there is no uniformly efficient strategy. However, our investigation demonstrates substantial efficiencies are possible if the drug works in most or all baskets, at the cost of modest losses of power if the drug works in only a single basket.

## May 08, 2016

### Dirk Eddelbuettel

#### Rblpapi 0.3.4

A new release of Rblpapi is now on CRAN. It provides a direct interface between R and the Bloomberg Terminal via the C++ API provided by Bloomberg Labs (but note that a valid Bloomberg license and installation is required).
This marks the fifth release since the package first appeared on CRAN last year. Continued thanks to all contributors for code, suggestions or bug reports. This release contains a lot of internal fixes by Whit, John and myself and should prove to be more resilient to 'odd' representations of data coming back. The NEWS.Rd extract has more details:

#### Changes in Rblpapi version 0.3.4 (2016-05-08)

• On startup, the API versions of both the headers and the runtime are displayed (PR #161 and #165).

• Documentation about extended futures roll notation was added to the bdh manual page.

• Additional examples for overrides were added to bdh (PR #158).

• Internal code changes make retrieval of data in 'unusual' variable types more robust (PRs #157 and #153).

• General improvements and fixes to documentation (PR #156).

• The bdp function now also supports an option verbose (PR #149).

• The internal header Rblpapi_types.h was renamed from a lower-cased variant to conform with Rcpp Attributes best practices (PR #145).

Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the Rblpapi page. Questions, comments etc should go to the issue tickets system at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

## May 07, 2016

### Dirk Eddelbuettel

#### BH 1.60.0-2

A new minor release of BH is now on CRAN. BH provides a large part of the Boost C++ libraries as a set of template headers for use by R, possibly with Rcpp as well as other packages. This release uses the same Boost 1.60.0 version as the last release, but adds three more libraries: bimap, flyweight and icl. A brief summary of changes from the NEWS file is below.

#### Changes in version 1.60.0-2 (2016-05-06)

• Added Boost bimap via GH pull request #24 by Jim Hester.

• Added Boost icl via GH pull request #27 by Jay Hesselbert.
• Added Boost flyweight as requested in GH ticket #26.

Courtesy of CRANberries, there is also a diffstat report for the most recent release. Comments and suggestions are welcome via the mailing list or the issue tracker at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

## May 06, 2016

### Dirk Eddelbuettel

#### RcppArmadillo 0.6.700.6.0

A second Armadillo release, 6.700.6, came out in the 6.700 series, and we uploaded RcppArmadillo 0.6.700.6.0 to CRAN and Debian. This followed the usual thorough reverse-dependency checking of the by now 220 packages using it.

This release is a little unusual in that it contains both upstream bugfixes in the same series (see below) and two nice bug fixes from the RcppArmadillo side. Both were squashed by George G. Vega Yon via two focused pull requests. The first ensures that we can now use ARMA_64BIT_WORD (provided C++11 is turned on too), allowing for much bigger Armadillo objects. And the second plugs a small leak in the sparse matrix converter I had added a while back. Nice work, all told!

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. Changes in this release are as follows:

#### Changes in RcppArmadillo version 0.6.700.6.0 (2016-05-05)

• Upgraded to Armadillo 6.700.6 (Catabolic Amalgamator Deluxe)

• fix for handling empty matrices by kron()

• fix for clang warning in advanced matrix constructors

• fix for false deprecated warning in trunc_log() and trunc_exp()

• fix for gcc-6.1 warning about misleading indentation

• corrected documentation for the solve() function

• Added support for int64_t (ARMA_64BIT_WORD) when required during compilation time. (PR #90 by George G. Vega Yon, fixing #88)

• Fixed bug in SpMat exporter (PR #91 by George G. Vega Yon, fixing #89 and #72)

Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

## May 03, 2016

### Bioconductor Project Working Papers

#### hpcNMF: A high-performance toolbox for non-negative matrix factorization

Non-negative matrix factorization (NMF) is a widely used machine learning algorithm for dimension reduction of large-scale data. It has found successful applications in a variety of fields such as computational biology, neuroscience, natural language processing, information retrieval, image processing and speech recognition. In bioinformatics, for example, it has been used to extract patterns and profiles from genomic and text-mining data as well as in protein sequence and structure analysis. While the scientific performance of NMF is very promising in dealing with high dimensional data sets and complex data structures, its computational cost is high and sometimes could be critical for delivering analysis results in a timely manner. In this paper, we describe a high-performance C++ toolbox for NMF, called hpcNMF, that is designed for use on desktop computers and distributed computer clusters. Algorithms based on different statistical models and cost functions as well as various metrics for model selection and evaluating goodness-of-fit are implemented in the toolbox. hpcNMF is platform independent and does not require the use of any special libraries. It is compatible with Windows, Linux and Mac operating systems, and a message-passing interface is required for hpcNMF to be deployed on computer clusters to leverage the power of parallelized computing.
We illustrate the utility of this toolbox using several real examples encompassing a broad range of applications.

## April 29, 2016

### Bioconductor Project Working Papers

#### STOCHASTIC OPTIMIZATION OF ADAPTIVE ENRICHMENT DESIGNS FOR TWO SUBPOPULATIONS

An adaptive enrichment design is a randomized trial that allows enrollment criteria to be modified at interim analyses, based on preset decision rules. When there is prior uncertainty regarding treatment effect heterogeneity, these trials can provide improved power for detecting treatment effects in subpopulations. An obstacle to using these designs is that there is no general approach to determine what decision rules and other design parameters will lead to good performance for a given research problem. To address this, we present a simulated annealing approach for optimizing the parameters of an adaptive enrichment design for a given scientific application. Optimization is done with respect to either expected sample size or expected trial duration, and subject to constraints on power and Type I error rate. We use this optimization framework to compare the performance of two types of multiple testing procedures. We also compare against conventional choices for design parameters that approximate O'Brien-Fleming boundaries and Pocock boundaries. We find that optimized designs can be substantially more efficient than simpler designs using Pocock or O'Brien-Fleming boundaries. Much of this added benefit comes from optimizing the decision rules concerning when to stop a subpopulation's enrollment, or the entire trial, due to futility.

## April 04, 2016

### Journal of Statistical Software

#### Notification : Mon, 04 Apr 2016 08:12:26 +0000

## March 28, 2016

### Statistical Modelling

#### Using a latent variable model with non-constant factor loadings to examine PM2.5 constituents related to secondary inorganic aerosols

Factor analysis is a commonly used method of modelling correlated multivariate exposure data.
Typically, the measurement model is assumed to have constant factor loadings. However, from our preliminary analyses of the Environmental Protection Agency's (EPA's) PM2.5 fine speciation data, we have observed that the factor loadings for four constituents change considerably in stratified analyses. Since invariance of factor loadings is a prerequisite for valid comparison of the underlying latent variables, we propose a factor model that includes non-constant factor loadings that change over time and space, using P-splines penalized with the generalized cross-validation (GCV) criterion. The model is implemented using the Expectation-Maximization (EM) algorithm, and we select the multiple spline smoothing parameters by minimizing the GCV criterion with Newton's method during each iteration of the EM algorithm. The algorithm is applied to a one-factor model that includes four constituents. Through bootstrap confidence bands, we find that the factor loading for total nitrate changes across seasons and geographic regions.

#### Generalized multiple indicators, multiple causes measurement error models

Generalized Multiple Indicators, Multiple Causes Measurement Error Models (G-MIMIC ME) can be used to study the effects of an unobservable latent variable on a set of outcomes when the causes of the latent variables are unobserved. The errors associated with the unobserved causal variables can be due either to biased recall or to day-to-day variability. Another potential source of error, the Berkson error, is due to individual variations that arise from the assignment of group data to individual subjects.
In this article, we accomplish the following: (a) extend the classical linear MIMIC models to allow both Berkson and classical measurement errors where the distributions of the outcome variables belong in the exponential family, (b) develop likelihood based estimation methods using the MC-EM algorithm and (c) estimate the variance of the classical measurement error associated with the approximation of the amount of radiation dose received by atomic bomb survivors at the time of their exposure. The G-MIMIC ME model is applied to study the effect of genetic damage, a latent construct based on exposure to radiation, and the effect of radiation dose on physical indicators of genetic damage.

#### Longitudinal functional models with structured penalties

This article addresses estimation in regression models for longitudinally collected functional covariates (time-varying predictor curves) with a longitudinal scalar outcome. The framework consists of estimating a time-varying coefficient function that is modelled as a linear combination of time-invariant functions with time-varying coefficients. The model uses extrinsic information to inform the structure of the penalty, while the estimation procedure exploits the equivalence between penalized least squares estimation and a linear mixed model representation. The process is empirically evaluated with several simulations and it is applied to analyze the neurocognitive impairment of human immunodeficiency virus (HIV) patients and its association with longitudinally-collected magnetic resonance spectroscopy (MRS) curves.
## February 28, 2016

### Journal of the Royal Statistical Society: Series C

#### Bayesian two-stage dose finding for cytostatic agents via model adaptation

#### A general angular regression model for the analysis of data on animal movement in ecology

#### Two-stage model for time varying effects of zero-inflated count longitudinal covariates with applications in health behaviour research

#### Dependence modelling with regular vine copula models: a case-study for car crash simulation data

## February 09, 2016

### Statistical Modelling

#### Editorial

## February 02, 2016

### Rcpp Gallery

#### SIMD Map-Reduction with RcppNT2

##### Introduction

The Numerical Template Toolbox (NT2) is a collection of header-only C++ libraries that make it possible to explicitly request the use of SIMD instructions when possible, while falling back to regular scalar operations when not. NT2 itself is powered by Boost, alongside two proposed Boost libraries – Boost.Dispatch, which provides a mechanism for efficient tag-based dispatch for functions, and Boost.SIMD, which provides a framework for the implementation of algorithms that take advantage of SIMD instructions. RcppNT2 wraps and exposes these libraries for use with R. If you haven’t already, read the RcppNT2 introduction article to get acquainted with the RcppNT2 package.

##### Map Reduce

MapReduce is the (infamous) buzzword that describes a class of problems that can be solved by splitting an algorithm into a map (transform) step and a reduction step. Although this scheme is typically adopted to help solve problems ‘at scale’ (e.g., with a large number of communicating machines), it is also a useful abstraction for many problems in the SIMD universe. Take, for example, the dot product. This can be expressed in R code simply as:

```
sum(lhs * rhs)
```

Here, we ‘map’ our vectors by multiplying them together element-wise, and ‘reduce’ the result through summation.
Of course, behind the scenes, R is doing a bit more than it has to – it’s computing a new vector, lhs * rhs, which is of the same length as lhs, and then collapsing (reducing) that vector by adding each element up. It would be great if we could skip that large temporary vector allocation. RcppNT2 provides a function, simdMapReduce(), that makes expressing these kinds of problems very easy. To make use of simdMapReduce(), you need to write a class that provides a number of templated methods:

• U init() — returns the initial (scalar) data state.

• T map(const T&... ts) — transforms the values.

• T combine(const T& lhs, const T& rhs) — describes how results should be combined.

• U reduce(const T& t) — describes how a SIMD pack should be reduced.

You’ll notice that we play a little fast-and-loose with the terms, but it should still be relatively clear what each method accomplishes. With this infrastructure, the dot product could be implemented like so: If you can ignore the C++ templates, it should hopefully be fairly clear what’s going on here. We transform elements by multiplying them together, and we combine + reduce by adding them up. (Unfortunately, although the combine() and reduce() functions are effectively doing the same thing, they need to be expressed separately, as the reduce() function is effectively our bridge from SIMD land to scalar land.)

Now, let’s show how our map-reducer can be called. Let’s also export a version that accepts IntegerVector, just to show that our class is generic enough to accept other integral types as well. And let’s execute it from R, just to convince ourselves that it works. Great! Of course, a large number of problems can be expressed with a ‘plus’, or ‘sum’, reduction, so RcppNT2 also provides a helper for that, so that you only need to implement the ‘map’ step.
We can do this by writing a class that inherits from the PlusReducer class: And, let’s convince ourselves it works: And let’s use a quick microbenchmark to see if we’ve truly gained anything here:

```
                test replications elapsed relative
2          n1 %*% n2          100   0.412    5.282
3    simdDot(n1, n2)          100   0.085    1.090
4  simdDotV2(n1, n2)          100   0.078    1.000
1       sum(n1 * n2)          100   0.307    3.936

                test replications elapsed relative
2          i1 %*% i2          100   1.303   42.032
3 simdDotInt(i1, i2)          100   0.031    1.000
1       sum(i1 * i2)          100   0.571   18.419
```

You might be surprised how profound the speed improvements accrued by using SIMD instructions are. How does this happen? Behind the scenes, simdMapReduce() is handling a number of things for us:

1. Iteration over the sequences uses SIMD packs when possible, and scalars when not,

2. Optimized SIMD instructions are used to transform and combine packs of values,

3. Intermediate results are held in a SIMD register, rather than materializing a whole vector,

4. The SIMD register and scalar buffer are not combined until the very final step.

In the ‘double’ case, we can pack 2 values into a SIMD pack; in the ‘int’ case, we can pack 4 values (assuming 32bit ‘int’ and 128bit SSE registers, which is the common case on Intel processors at the time of this post). Assuming that it takes the same number of clock cycles to execute a SIMD instruction as it does for the scalar equivalent, this should translate into ~2x and ~4x speedups – and that’s not even accounting for gains in efficient register use, cache efficiency, and the ability to avoid the large temporary allocation! That said, we are playing it a little fast and loose in the ‘int’ case: with larger numbers, we could easily overflow; depending on the type of data expected, it may be more appropriate to accumulate values into a different data type.

In short – if you’re implementing an algorithm, or part of an algorithm, that can be expressed as:

• sum(<transformation of variables>)

then simdMapReduce() is worth looking at.
This article provides just a taste of how RcppNT2 can be used. If you’re interested in learning more, please check out the RcppNT2 website.

## February 01, 2016

### Rcpp Gallery

#### Using RcppNT2 to Compute the Variance

##### Introduction

The Numerical Template Toolbox (NT2) is a collection of header-only C++ libraries that make it possible to explicitly request the use of SIMD instructions when possible, while falling back to regular scalar operations when not. NT2 itself is powered by Boost, alongside two proposed Boost libraries – Boost.Dispatch, which provides a mechanism for efficient tag-based dispatch for functions, and Boost.SIMD, which provides a framework for the implementation of algorithms that take advantage of SIMD instructions. RcppNT2 wraps and exposes these libraries for use with R. If you haven’t already, read the RcppNT2 introduction article to get acquainted with the RcppNT2 package.

##### Computing the Variance

As you may or may not know, there are a number of algorithms for computing the variance, each making different tradeoffs in algorithmic complexity versus numerical stability. We’ll focus on implementing a two-pass algorithm, whereby we compute the mean in a first pass, and then the sum of squares in a second pass. First, let’s look at the R code one might write to compute the variance.

```
[1] 0.833939 0.833939
```

We can decompose the operation into a few distinct steps:

1. Compute the mean for our vector ‘x’,

2. Compute the squared deviations from the mean,

3. Sum the deviations about the mean,

4. Divide the summation by the length minus one.

Naively, we could imagine writing a ‘simdTransform()’ for step 2, and a ‘simdReduce()’ for step 3. However, this is a bit inefficient, as the transform would require allocating a whole new vector with the same length as our initial vector. When neither ‘simdTransform()’ nor ‘simdReduce()’ seems to be a good fit, we can fall back to ‘simdFor()’.
We can pass a stateful functor to handle accumulation of the transformed results. Let’s write a class that encapsulates this ‘sum of squares’ operation. Now that we have our accumulator class defined, we can use it to compute the variance. We’ll call our function ‘simdVar()’, and export it using Rcpp attributes in the usual way. Let’s confirm that this works as expected…

```
[1] 9.166667 9.166667
```

And let’s benchmark, to see how performance compares.

```
           expr    min      lq     mean median      uq    max
     var(small) 11.784 14.3395 16.37862 15.096 15.7225 40.346
 simdVar(small)  1.506  1.7045  2.06541  1.947  2.1055 10.935

           expr      min       lq      mean    median       uq      max
     var(large) 3046.597 3194.231 3278.7417 3301.6205 3323.581 3809.090
 simdVar(large)  579.784  594.887  607.0411  607.9845  614.386  712.038
```

As we can see, taking advantage of SIMD instructions has improved the runtime quite drastically. However, we should note that this is not an entirely fair comparison with R’s implementation of the variance. In particular, we do not have a nice mechanism for handling missing values; if your data does have any NA or NaN values, they will simply be propagated (and not necessarily in the same way that R propagates missingness). If you’re interested, a separate example is provided as part of the RcppNT2 package, and you can browse the standalone source file here.

This article provides just a taste of how RcppNT2 can be used. If you’re interested in learning more, please check out the RcppNT2 website.

#### Using RcppNT2 to Compute the Sum

##### Introduction

The Numerical Template Toolbox (NT2) is a collection of header-only C++ libraries that make it possible to explicitly request the use of SIMD instructions when possible, while falling back to regular scalar operations when not.
NT2 itself is powered by Boost, alongside two proposed Boost libraries – Boost.Dispatch, which provides a mechanism for efficient tag-based dispatch for functions, and Boost.SIMD, which provides a framework for the implementation of algorithms that take advantage of SIMD instructions. RcppNT2 wraps and exposes these libraries for use with R. If you haven’t already, read the RcppNT2 introduction article to get acquainted with the RcppNT2 package.

##### Computing the Sum

First, let’s review how we might use std::accumulate() to sum a vector of numbers. We explicitly pass in the std::plus<double>() functor, just to make it clear that the std::accumulate() algorithm expects a binary functor when accumulating values. Now, let’s rewrite this to take advantage of RcppNT2. There are two main steps required to take advantage of RcppNT2 at a high level:

1. Write a functor, with a templated call operator, with the implementation written in a ‘Boost.SIMD-aware’ way;

2. Provide the functor as an argument to the appropriate SIMD algorithm.

Let’s follow these steps in implementing our SIMD sum. As you can see, it’s quite simple to take advantage of Boost.SIMD. For very simple operations such as this, RcppNT2 provides a number of pre-defined functors, which can be accessed in the RcppParallel::functor namespace. The following is an equivalent way of defining the above function:

Behind the scenes of simdReduce(), Boost.SIMD will apply your templated functor to ‘packs’ of values when appropriate, and scalar values when not. In other words, there are effectively two kinds of template specializations being generated behind the scenes: one with T = double, and one with T = boost::simd::pack<double>. The use of the packed representation is what allows Boost.SIMD to ensure vectorized instructions are used and generated.
Boost.SIMD provides a host of functions and operator overloads that ensure that optimized instructions are used when possible over a packed object, while falling back to ‘default’ operations for scalar values when not. Now, let’s compare the performance of these two implementations.

| expr | min | lq | mean | median | uq | max |
|---|---|---|---|---|---|---|
| vectorSum(v) | 870.894 | 887.1535 | 978.7471 | 987.1810 | 989.1060 | 1792.215 |
| vectorSumSimd(v) | 270.985 | 283.2315 | 297.8062 | 287.6565 | 298.7115 | 608.373 |

Perhaps surprisingly, the RcppNT2 solution is much faster – the gains are similar to what we might have seen when computing the sum in parallel. However, we’re still just using a single core; we’re just taking advantage of vectorized instructions provided by the CPU. In this particular case, on Intel CPUs, Boost.SIMD will ensure that we are using the addpd instruction, which is documented in the Intel Software Developer’s Manual [PDF]. Note that, for the naive serial sum, the compiler would likely generate similarly efficient code when the -ffast-math optimization flag is set. By default, the compiler is somewhat ‘pessimistic’ about the set of optimizations it can perform around floating point arithmetic. This is because it must respect the IEEE floating point standard, and this means respecting the fact that, for example, floating point operations are not associative:

[1] 1.110223e-16

Surprisingly, the above computation does not evaluate to zero! In practice, you’re likely safe to take advantage of the -ffast-math optimizations, or Boost.SIMD, in your own work. However, be sure to test and verify! This article provides just a taste of how RcppNT2 can be used. If you’re interested in learning more, please check out the RcppNT2 website.

#### Introduction to RcppNT2

Modern CPU processors are built with new, extended instruction sets that optimize for certain operations. A class of these allow for vectorized operations, called Single Instruction / Multiple Data (SIMD) instructions.
Although modern compilers will use these instructions when possible, they are often unable to reason about whether or not a particular block of code can be executed using SIMD instructions. The Numerical Template Toolbox (NT2) is a collection of header-only C++ libraries that make it possible to explicitly request the use of SIMD instructions when possible, while falling back to regular scalar operations when not. NT2 itself is powered by Boost, alongside two proposed Boost libraries – Boost.Dispatch, which provides a mechanism for efficient tag-based dispatch for functions, and Boost.SIMD, which provides a framework for the implementation of algorithms that take advantage of SIMD instructions. RcppNT2 wraps and exposes these libraries for use with R.

The primary abstraction that Boost.SIMD uses under the hood is the boost::simd::pack<> data structure. This item represents a small, contiguous pack of arithmetic values (e.g. doubles), and comes with a host of functions that facilitate the use of SIMD operations on those objects when possible. Although you don’t need to know the details to use the high-level functionality provided by Boost.SIMD, it’s useful for understanding what happens behind the scenes.

Here’s a quick example of how we might compute the sum of elements in a vector, using NT2. Behind the scenes, simdReduce() takes care of iteration over the provided sequence, and ensures that we use optimized SIMD instructions over packs of numbers when possible, and scalar instructions when not. By passing a templated functor, simdReduce() can automatically choose the correct template specialization depending on whether it’s working with a pack or not. In other words, two template specializations will be generated in this case: one with T = double, and another with T = boost::simd::pack<double>. Let’s confirm that this produces the correct output, and run a small benchmark.
[1] TRUE

| expr | min | lq | mean | median | uq | max |
|---|---|---|---|---|---|---|
| sum(data) | 894.451 | 943.4145 | 1033.5598 | 1020.5000 | 1071.327 | 1429.533 |
| simd_sum(data) | 280.585 | 293.6315 | 316.6797 | 307.8795 | 314.429 | 574.050 |

We get a noticeable gain by taking advantage of SIMD instructions here. However, it’s worth noting that we don’t handle NA and NaN with the same granularity as R.

## Learning More

This article provides just a taste of how RcppNT2 can be used. If you’re interested in learning more, please check out the RcppNT2 website.

## January 29, 2016

### Journal of the Royal Statistical Society: Series B

#### Joint estimation of multiple graphical models from high dimensional time series

#### Bootstrapping the portmanteau tests in weak auto‐regressive moving average models

#### Making a non‐parametric density estimator more attractive, and more accurate, by data perturbation

#### Sequential selection procedures and false discovery rate control

## December 27, 2015

### Alstatr

#### R and Python: Gradient Descent

One of the problems often dealt with in Statistics is the minimization of an objective function. Unlike linear models, there is no analytical solution for models that are nonlinear in the parameters, such as logistic regression, neural networks, and nonlinear regression models (like the Michaelis-Menten model). In such situations, we have to use mathematical programming, or optimization. One popular optimization algorithm is gradient descent, which we're going to illustrate here. To start with, let's consider a simple function whose minimizer has a closed form, $$f(\beta) \triangleq \beta^4 - 3\beta^3 + 2.$$ We want to minimize this function with respect to $\beta$.
The quick solution to this, as calculus taught us, is to compute the first derivative of the function, that is $$\frac{\text{d}f(\beta)}{\text{d}\beta}=4\beta^3-9\beta^2.$$ Setting this to 0 to obtain the stationary point gives us \begin{align} \frac{\text{d}f(\beta)}{\text{d}\beta}&\overset{\text{set}}{=}0\nonumber\\ 4\hat{\beta}^3-9\hat{\beta}^2&=0\nonumber\\ 4\hat{\beta}^3&=9\hat{\beta}^2\nonumber\\ 4\hat{\beta}&=9\nonumber\\ \hat{\beta}&=\frac{9}{4}, \end{align} dividing through by $\hat{\beta}^2$ (the other stationary point, $\hat{\beta}=0$, is not a minimum). The following plot shows the minimum of the function at $\hat{\beta}=\frac{9}{4}$ (red line in the plot below). [R Script]

Now let's consider minimizing this problem using gradient descent with the following algorithm:

1. Initialize $\mathbf{x}_{r}, r=0$
2. while $\lVert \mathbf{x}_{r}-\mathbf{x}_{r+1}\rVert > \nu$
3. $\mathbf{x}_{r+1}\leftarrow \mathbf{x}_{r} - \gamma\nabla f(\mathbf{x}_r)$
4. $r\leftarrow r + 1$
5. end while
6. return $\mathbf{x}_{r}$ and $r$

where $\nabla f(\mathbf{x}_r)$ is the gradient of the cost function, $\gamma$ is the learning-rate parameter of the algorithm, and $\nu$ is the precision parameter. For the function above, let the initial guess be $\hat{\beta}_0=4$ and $\gamma=.001$ with $\nu=.00001$. Then $\nabla f(\hat{\beta}_0)=112$, so that $\hat{\beta}_1=\hat{\beta}_0-.001(112)=3.888.$ And $|\hat{\beta}_1 - \hat{\beta}_0| = 0.112> \nu$. Repeat the process until at some $r$, $|\hat{\beta}_{r}-\hat{\beta}_{r+1}| \ngtr \nu$. It will turn out that 350 iterations are needed to satisfy the desired inequality, the plot of which is in the following figure, with estimated minimum $\hat{\beta}_{350}=2.250483\approx\frac{9}{4}$. [R Script with Plot] [Python Script]

Obviously the convergence is slow, and we can adjust this by tuning the learning-rate parameter; for example, if we increase it to $\gamma=.01$ (change gamma to .01 in the codes above), the algorithm will converge at the 42nd iteration. To support that claim, see the steps of its gradient in the plot below.
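The descent loop described above translates directly into a short Python sketch (pure standard library; the function and variable names are mine):

```python
def grad(beta):
    # f(beta) = beta^4 - 3*beta^3 + 2, so f'(beta) = 4*beta^3 - 9*beta^2
    return 4 * beta ** 3 - 9 * beta ** 2

def gradient_descent(beta0, gamma=0.001, nu=0.00001, max_iter=100000):
    beta, r = beta0, 0
    while r < max_iter:
        beta_new = beta - gamma * grad(beta)  # step 3 of the algorithm
        r += 1
        if abs(beta_new - beta) <= nu:        # stop once the update is tiny
            return beta_new, r
        beta = beta_new
    return beta, r

beta_hat, r = gradient_descent(4.0)  # converges near 9/4 = 2.25
```

With the post's settings this lands near $\hat{\beta}\approx 2.2505$ after roughly 350 iterations, and the first update reproduces the hand computation above ($4 - .001 \cdot 112 = 3.888$).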
If we try to change the starting value from 4 to .1 (change beta_new to .1) with $\gamma=.01$, the algorithm converges at the 173rd iteration with estimate $\hat{\beta}_{173}=2.249962\approx\frac{9}{4}$ (see the plot below). Now let's consider another function, known as the Rosenbrock function, defined as $$f(\mathbf{w})\triangleq(1 - w_1) ^ 2 + 100 (w_2 - w_1^2)^2.$$ The gradient is \begin{align} \nabla f(\mathbf{w})&=[-2(1 - w_1) - 400(w_2 - w_1^2) w_1]\mathbf{i}+200(w_2-w_1^2)\mathbf{j}\nonumber\\ &=\left[\begin{array}{c} -2(1 - w_1) - 400(w_2 - w_1^2) w_1\\ 200(w_2-w_1^2) \end{array}\right]. \end{align} Let the initial guess be $\hat{\mathbf{w}}_0=\left[\begin{array}{c}-1.8\\-.8\end{array}\right]$, $\gamma=.0002$, and $\nu=.00001$. Then $\nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -2914.4\\-808.0\end{array}\right]$, so that $$\nonumber \hat{\mathbf{w}}_1=\hat{\mathbf{w}}_0-\gamma\nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -1.21712 \\-0.63840\end{array}\right].$$ And $\lVert\hat{\mathbf{w}}_0-\hat{\mathbf{w}}_1\rVert=0.6048666>\nu$. Repeat the process until at some $r$, $\lVert\hat{\mathbf{w}}_r-\hat{\mathbf{w}}_{r+1}\rVert\ngtr \nu$. It will turn out that 23,374 iterations are needed for the desired inequality, with estimate $\hat{\mathbf{w}}_{23375}=\left[\begin{array}{c} 0.9464841 \\0.8956111\end{array}\right]$; the contour plot is depicted in the figure below. [R Script with Contour Plot] [Python Script]

Notice that I did not use ggplot for the contour plot; this is because the plot needs to be updated 23,374 times just to accommodate the arrows for the trajectory of the gradient vectors, and ggplot is just slow. Finally, we can also visualize the gradient points on the surface, as shown in the following figure. [R Script]

In a future blog post, I hope to apply this algorithm to statistical models like linear/nonlinear regression models for a simple illustration.
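The same loop extends to the two-parameter Rosenbrock case (again a sketch, with names of my own choosing):

```python
def rosen_grad(w1, w2):
    # Gradient of f(w) = (1 - w1)^2 + 100*(w2 - w1^2)^2
    return (-2 * (1 - w1) - 400 * (w2 - w1 ** 2) * w1,
            200 * (w2 - w1 ** 2))

gamma, nu = 0.0002, 0.00001
w1, w2 = -1.8, -0.8                      # initial guess w_0
for r in range(1, 200001):
    g1, g2 = rosen_grad(w1, w2)
    n1, n2 = w1 - gamma * g1, w2 - gamma * g2
    step = ((n1 - w1) ** 2 + (n2 - w2) ** 2) ** 0.5
    w1, w2 = n1, n2
    if step <= nu:                        # precision criterion met
        break
```

The first update reproduces the hand computation above ($\hat{\mathbf{w}}_1 \approx (-1.21712, -0.63840)$), and the loop stops near $(0.946, 0.896)$, short of the true minimum at $(1, 1)$: the step-size criterion triggers while the gradient is still very small along the valley floor.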
#### R: Principal Component Analysis on Imaging

Ever wonder what mathematics is behind face recognition on most gadgets like digital cameras and smartphones? Well, for the most part it has something to do with statistics. One statistical tool capable of powering such a feature is Principal Component Analysis (PCA). In this post, however, we will not do face recognition (sorry to disappoint you), as we reserve that for a future post while I'm still doing research on it. Instead, we go through its basic concept and use it for data reduction on the spectral bands of an image using R.

### Let's view it mathematically

Consider a line $L$ in parametric form, described as the set of all vectors $k\cdot\mathbf{u}+\mathbf{v}$ parameterized by $k\in \mathbb{R}$, where $\mathbf{v}$ is a vector orthogonal to a normalized vector $\mathbf{u}$. Below is the graphical equivalent of the statement: So given a point $\mathbf{x}=[x_1,x_2]^T$, the orthogonal projection of this point on the line $L$ is given by $(\mathbf{u}^T\mathbf{x})\mathbf{u}+\mathbf{v}$. Graphically, $Proj$ is the projection of the point $\mathbf{x}$ on the line, where its position is defined by the scalar $\mathbf{u}^{T}\mathbf{x}$. Therefore, if we let $\mathbf{X}=[X_1, X_2]^T$ be a random vector, then the random variable $Y=\mathbf{u}^T\mathbf{X}$ describes the variability of the data in the direction of the normalized vector $\mathbf{u}$, so that $Y$ is a linear combination of the $X_i, i=1,2$. Principal component analysis identifies linear combinations of the original variables $\mathbf{X}$ that contain most of the information, in the sense of variability, contained in the data. The general assumption is that useful information is proportional to the variability. PCA is used for data dimensionality reduction and for interpretation of data. (Ref 1.
Bajorski, 2012)

To better understand this, consider a two-dimensional data set; below is a plot of it along with two lines ($L_1$ and $L_2$) that are orthogonal to each other: If we project the points orthogonally onto both lines, we have the following. If the normalized vector $\mathbf{u}_1$ defines the direction of $L_1$, then the variability of the points on $L_1$ is described by the random variable $Y_1=\mathbf{u}_1^T\mathbf{X}$. Likewise, if $\mathbf{u}_2$ is a normalized vector that defines the direction of $L_2$, then the variability of the points on this line is described by the random variable $Y_2=\mathbf{u}_2^T\mathbf{X}$. The first principal component is the one with maximum variability. In this case, we can see that $Y_2$ is more variable than $Y_1$, since the points projected on $L_2$ are more dispersed than on $L_1$. In practice, the linear combinations $Y_i = \mathbf{u}_i^T\mathbf{X}, i=1,2,\cdots,p$ are maximized sequentially, so that $Y_1$ is the linear combination of the first principal component, $Y_2$ is the linear combination of the second principal component, and so on. Further, the estimate of the direction vector $\mathbf{u}$ is simply the normalized eigenvector $\mathbf{e}$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the original variable $\mathbf{X}$, and the variability explained by the principal component is the corresponding eigenvalue $\lambda$. For more details on the theory of PCA, refer to (Bajorski, 2012) in Reference 1 below.

As promised, we will do dimensionality reduction using PCA. We will use the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data from (Bajorski, 2012); you can also use AVIRIS data for other locations, which can be downloaded here. However, since in most cases AVIRIS data contains thousands of bands, for simplicity we will stick with the data given in (Bajorski, 2012), as it has been cleaned and reduced to only 152 bands.

### What are spectral bands?

In imaging, spectral bands refer to the third dimension of the image, usually denoted as $\lambda$.
For example, an RGB image contains red, green, and blue bands, as shown below, along with the first two dimensions $x$ and $y$ that define the resolution of the image. These are a few of the bands that are visible to our eyes; there are other bands that are not visible to us, like infrared, and many others in the electromagnetic spectrum. That is why in most cases AVIRIS data contains a huge number of bands, each capturing different characteristics of the image. Below is a proper description of the data.

### Data

The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) is a sensor collecting spectral radiance in the range of wavelengths from 400 to 2500 nm. It has been flown on various aircraft platforms, and many images of the Earth’s surface are available. A 100 by 100 pixel AVIRIS image of an urban area in Rochester, NY, near the Lake Ontario shoreline is shown below. The scene has a wide range of natural and man-made material including a mixture of commercial/warehouse and residential neighborhoods, which adds a wide range of spectral diversity. Prior to processing, invalid bands (due to atmospheric water absorption) were removed, reducing the overall dimensionality to 152 bands. This image has been used in Bajorski et al. (2004) and Bajorski (2011a, 2011b). The first 152 values in the AVIRIS data represent the spectral radiance values (a spectral curve) for the top left pixel. This is followed by the spectral curves of the pixels in the first row, followed by the next row, and so on. (Ref. 1 Bajorski, 2012)

To load the data, run the following code: The code above uses the EBImage package, which can be installed as described in my previous post.

### Why do we need to reduce the dimension of the data?

Before we jump into our analysis, you may ask: why? Well, sometimes it's just difficult to do analysis on high-dimensional data, especially when interpreting it. This is because some dimensions aren't significant (redundancy, for example), which only adds to the difficulty of the analysis.
So in order to deal with this, we remove those nuisance dimensions and work with the significant ones. To perform PCA in R, we use the function princomp, as seen below: The structure of princomp consists of the list shown above; we describe selected outputs here, and others can be found in the documentation of the function by executing ?princomp.

• sdev - standard deviations, the square roots of the eigenvalues $\lambda$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, dat.mat;
• loadings - eigenvectors $\mathbf{e}$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, dat.mat;
• scores - the principal component scores.

Recall that the objective of PCA is to find a linear combination $Y=\mathbf{u}^T\mathbf{X}$ that will maximize the variance $Var(Y)$. From the output, the estimate of the components of $\mathbf{u}$ is given by the entries of the loadings, a matrix of eigenvectors whose columns correspond to the eigenvectors of the sequence of principal components; that is, if the first principal component is given by $Y_1=\mathbf{u}_1^T\mathbf{X}$, then the estimate of $\mathbf{u}_1$, which is $\mathbf{e}_1$ (an eigenvector), is the set of coefficients obtained from the first column of the loadings. The explained variability of the first principal component is the square of the first standard deviation in sdev, the explained variability of the second principal component is the square of the second standard deviation in sdev, and so on.

Now let's interpret the loadings (coefficients) of the first three principal components. Below is the plot of these. Based on the above, the coefficients of the first principal component (PC1) are almost all negative. On closer inspection, the variability in this principal component is mainly explained by the weighted average of the radiance of spectral bands 35 to 100. Analogously, PC2 mainly represents the variability of the weighted average of the radiance of spectral bands 1 to 34.
Further, the fluctuation of the coefficients of PC3 makes it difficult to tell which bands contribute most to its variability. Aside from examining the loadings, another way to see the impact of the PCs is through the impact plot, where the impact curves $\sqrt{\lambda_j}\mathbf{e}_j$ are plotted; I want you to explore that. Moving on, let's investigate the percent of variability in $X_i$ explained by the $j$th principal component; the formula for this is $$\nonumber \frac{\lambda_j\cdot e_{ij}^2}{s_{ii}},$$ where $s_{ii}$ is the estimated variance of $X_i$. Below is the percent of explained variability in $X_i$ for the first three principal components, including the cumulative percent variability (the sum over PC1, PC2, and PC3). For the variability of the first 33 bands, PC2 takes on about 90 percent of the explained variability, as seen in the above plot, and it still contributes substantially to bands 102 to 152. On the other hand, from bands 37 to 100, PC1 explains almost all of the variability, while PC2 and PC3 explain only 0 to 1 percent. The sum of the percentages of explained variability of these principal components is indicated by the orange line in the above plot, which is the cumulative percent variability. To wrap up this section, here is the percentage of the explained variability of the first 10 PCs.

Table 1: Variability Explained by the First Ten Principal Components for the AVIRIS data.

| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
|---|---|---|---|---|---|---|---|---|---|
| 82.057 | 17.176 | 0.320 | 0.182 | 0.094 | 0.065 | 0.037 | 0.029 | 0.014 | 0.005 |

The variability above was obtained by noting that the variability explained by a principal component is simply the eigenvalue (the square of the sdev) of the variance-covariance matrix $\mathbf{\Sigma}$ of the original variable $\mathbf{X}$; hence the percentage of variability explained by the $j$th PC is equal to its corresponding eigenvalue $\lambda_j$ divided by the overall variability, which is the sum of the eigenvalues, $\sum_{j=1}^{p}\lambda_j$, as we see in the following code.

### Stopping Rules

Given the list of percentages of variability explained by the PCs in Table 1, how many principal components should we retain to best represent the variability of the original data? To answer that, we introduce the following stopping rules that will guide us in deciding the number of PCs:

1. Scree plot;
2. Simple fair-share;
3. Broken-stick; and,
4. Relative broken-stick.

The scree plot is the plot of the variability of the PCs, that is, the plot of the eigenvalues, on which we look for an elbow, or sudden drop, in the eigenvalues. For our example, we retain the first two principal components based on the location of the elbow. If the eigenvalues differ by orders of magnitude, however, it is recommended to use a logarithmic scale, which is illustrated below. Unfortunately, sometimes that won't work either; as we can see here, it's just difficult to determine where the elbow is. The succeeding discussions of the last three stopping rules are based on (Bajorski, 2012). The simple fair-share stopping rule identifies the largest $k$ such that $\lambda_k$ is larger than its fair share, that is, larger than $(\lambda_1+\lambda_2+\cdots+\lambda_p)/p$. To illustrate this, consider the following: Thus, we stop at the second principal component. If one was concerned that the above method produces too many principal components, a broken-stick rule could be used.
The rule identifies the largest $k$ such that $\lambda_j/(\lambda_1+\lambda_2+\cdots +\lambda_p)>a_j$ for all $j\leq k$, where $$\nonumber a_j = \frac{1}{p}\sum_{i=j}^{p}\frac{1}{i},\quad j =1,\cdots, p.$$ Let's try it: the result coincides with the first two stopping rules. The drawback of the simple fair-share and broken-stick rules is that they do not work well when the eigenvalues differ by orders of magnitude. In such a case, we use the relative broken-stick rule, where we analyze $\lambda_j$ as the first eigenvalue in the set $\lambda_j\geq \lambda_{j+1}\geq\cdots\geq\lambda_{p}$, where $j < p$. The dimensionality $k$ is chosen as the largest value such that $\lambda_j/(\lambda_j+\cdots +\lambda_p)>b_j$ for all $j\leq k$, where $$\nonumber b_j = \frac{1}{p-j+1}\sum_{i=1}^{p-j+1}\frac{1}{i}.$$ Applying this to the data, according to the numerical output, the first 34 principal components are enough to represent the variability of the original data.

### Principal Component Scores

The principal component scores are the new data set obtained from the linear combinations $Y_j=\mathbf{e}_j^T(\mathbf{x}-\bar{\mathbf{x}}), j = 1,\cdots, p$. If we use the first three stopping rules, then below are the scores (as images) for PC1 and PC2. If we rely on the relative broken-stick rule instead, we retain the first 34 PCs, and below are the corresponding scores (as images).

### Residual Analysis

Of course, when doing PCA there are errors to be considered, unless one returns all the PCs; but that would not make sense, because why apply PCA if you still keep all the dimensions? An overview of the errors in PCA, without going through the theory, is that the overall error is simply the excluded variability explained by the $k$th to $p$th principal components, $k>j$.
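The fair-share and broken-stick checks are easy to sketch in Python. Here I reuse the Table 1 percentages and, as a simplifying assumption of my own, pad the remaining 142 eigenvalue shares with zeros (their true values are tiny but nonzero):

```python
def fair_share_k(eig):
    # Largest k such that lambda_k exceeds the average eigenvalue
    # (lambda_1 + ... + lambda_p) / p; eig is sorted in decreasing order.
    avg = sum(eig) / len(eig)
    k = 0
    for lam in eig:
        if lam > avg:
            k += 1
        else:
            break
    return k

def broken_stick_k(eig):
    # Largest k with lambda_j / sum(lambda) > a_j for all j <= k,
    # where a_j = (1/p) * sum_{i=j}^{p} 1/i.
    p, total = len(eig), sum(eig)
    k = 0
    for j in range(1, p + 1):
        a_j = sum(1.0 / i for i in range(j, p + 1)) / p
        if eig[j - 1] / total > a_j:
            k = j
        else:
            break
    return k

# Percent variability of the first ten PCs (Table 1), zero-padded to p = 152.
pcts = [82.057, 17.176, 0.320, 0.182, 0.094, 0.065, 0.037, 0.029, 0.014, 0.005]
eig = pcts + [0.0] * (152 - len(pcts))
print(fair_share_k(eig), broken_stick_k(eig))  # both rules retain 2 PCs
```

Note that the rules only use eigenvalue ratios, so percentages work in place of the raw eigenvalues; the zero padding matters because the broken-stick thresholds $a_j$ depend on the full dimensionality $p = 152$.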
### Reference

#### R: k-Means Clustering on Imaging

Enough with the theory we recently published; let's take a break and have fun with an application of statistics used in data mining and machine learning: k-means clustering.

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. (Wikipedia, Ref 1.)

We will apply this method to an image, wherein we group the pixels into k different clusters. Below is the image that we are going to use (Colorful Bird From Wall321). We will utilize the following packages for input and output:

1. jpeg - Read and write JPEG images; and,
2. ggplot2 - An implementation of the Grammar of Graphics.

### Download and Read the Image

Let's get started by downloading the image to our workspace, and telling R that our data is a JPEG file.

### Cleaning the Data

Extract the necessary information from the image and organize it for our computation: The image is represented by a large array of pixels with dimension rows by columns by channels -- red, green, and blue, or RGB.

### Plotting

Plot the original image using the following codes:

### Clustering

Apply k-means clustering on the image, and plot the clustered colours. Possible clusterings of the pixels for different values of k are shown in Table 1: Different k-Means Clusterings (original, and k = 6, 5, 4, 3, 2). I suggest you try it!

### Reference

1. K-means clustering. Wikipedia. Retrieved September 11, 2014.

## December 16, 2015

### Alstatr

#### R and Python: Theory of Linear Least Squares

In my previous article, we talked about implementations of linear regression models in R, Python and SAS. On the theoretical side, however, I only briefly mentioned the estimation procedure for the parameter $\boldsymbol{\beta}$.
So to help us understand how software does the estimation procedure, we'll look at the mathematics behind it. We will also perform the estimation manually in R and in Python; that is, we're not going to use any special packages, which will help us appreciate the theory.

### Linear Least Squares

Consider the linear regression model, $y_i=f_i(\mathbf{x}|\boldsymbol{\beta})+\varepsilon_i,\quad\mathbf{x}_i=\left[ \begin{array}{cccc} 1&x_{11}&\cdots&x_{1p} \end{array}\right],\quad\boldsymbol{\beta}=\left[\begin{array}{c}\beta_0\\\beta_1\\\vdots\\\beta_p\end{array}\right],$ where $y_i$ is the response or dependent variable at the $i$th case, $i=1,\cdots, N$. $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the deterministic part of the model, which depends on both the parameters $\boldsymbol{\beta}\in\mathbb{R}^{p+1}$ and the predictor variable $\mathbf{x}_i$, which in matrix form, say $\mathbf{X}$, is represented as follows: $\mathbf{X}=\left[ \begin{array}{cccccc} 1&x_{11}&\cdots&x_{1p}\\ 1&x_{21}&\cdots&x_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ 1&x_{N1}&\cdots&x_{Np}\\ \end{array} \right].$ $\varepsilon_i$ is the error term at the $i$th case, which we assume to be Gaussian distributed with mean 0 and variance $\sigma^2$, so that $\mathbb{E}y_i=f_i(\mathbf{x}|\boldsymbol{\beta}),$ i.e. $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the expectation function. The uncertainty around the response variable is also modelled by a Gaussian distribution. Specifically, if $Y=f(\mathbf{x}|\boldsymbol{\beta})+\varepsilon$ and $y\in Y$ such that $y>0$, then \begin{align*} \mathbb{P}[Y\leq y]&=\mathbb{P}[f(x|\beta)+\varepsilon\leq y]\\ &=\mathbb{P}[\varepsilon\leq y-f(\mathbf{x}|\boldsymbol{\beta})]=\mathbb{P}\left[\frac{\varepsilon}{\sigma}\leq \frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]\\ &=\Phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right], \end{align*} where $\Phi$ denotes the Gaussian distribution function, with density denoted by $\phi$ below.
Hence $Y\sim\mathcal{N}(f(\mathbf{x}|\boldsymbol{\beta}),\sigma^2)$. That is, \begin{align*} \frac{\operatorname{d}}{\operatorname{d}y}\Phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]&=\phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]\frac{1}{\sigma}=\mathbb{P}[y|f(\mathbf{x}|\boldsymbol{\beta}),\sigma^2]\\ &=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}. \end{align*} If the data are independent and identically distributed, then the likelihood function of $\mathbf{y}$ is \begin{align*} \mathcal{L}[\boldsymbol{\beta}|\mathbf{y},\mathbf{X},\sigma]&=\mathbb{P}[\mathbf{y}|\mathbf{X},\boldsymbol{\beta},\sigma]=\prod_{i=1}^N\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left[\frac{y_i-f_i(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}\\ &=\frac{1}{(2\pi)^{\frac{N}{2}}\sigma^N}\exp\left\{-\frac{1}{2}\sum_{i=1}^N\left[\frac{y_i-f_i(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}\\ \log\mathcal{L}[\boldsymbol{\beta}|\mathbf{y},\mathbf{X},\sigma]&=-\frac{N}{2}\log2\pi-N\log\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2. \end{align*} Because the likelihood function tells us about the plausibility of the parameter $\boldsymbol{\beta}$ in explaining the sample data, we want to find the best estimate of $\boldsymbol{\beta}$ that likely generated the sample. Thus our goal is to maximize the likelihood function, which is equivalent to maximizing the log-likelihood with respect to $\boldsymbol{\beta}$. And that's simply done by taking the partial derivative with respect to $\boldsymbol{\beta}$. The first two terms on the right-hand side of the equation above can therefore be disregarded, since they do not depend on $\boldsymbol{\beta}$.
Also, the location of the maximum log-likelihood with respect to $\boldsymbol{\beta}$ is not affected by arbitrary positive scalar multiplication, so the factor $\frac{1}{2\sigma^2}$ can be omitted. And we are left with the following equation, $$\label{eq:1} -\sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2.$$ One last thing: instead of maximizing the log-likelihood function, we can minimize its negative. Hence we are interested in minimizing the negative of Equation (\ref{eq:1}), which is $$\label{eq:2} \sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2,$$ popularly known as the residual sum of squares (RSS). So RSS is a consequence of maximizing the log-likelihood under the Gaussian assumption on the uncertainty around the response variable $y$. For models with two parameters, say $\beta_0$ and $\beta_1$, the RSS can be visualized as in my previous article. Differentiation with respect to the $(p+1)$-dimensional parameter $\boldsymbol{\beta}$ is manageable in the context of linear algebra, so Equation (\ref{eq:2}) is equivalent to \begin{align*} \lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert^2&=\langle\mathbf{y}-\mathbf{X}\boldsymbol{\beta},\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rangle=\mathbf{y}^{\text{T}}\mathbf{y}-\mathbf{y}^{\text{T}}\mathbf{X}\boldsymbol{\beta}-(\mathbf{X}\boldsymbol{\beta})^{\text{T}}\mathbf{y}+(\mathbf{X}\boldsymbol{\beta})^{\text{T}}\mathbf{X}\boldsymbol{\beta}\\ &=\mathbf{y}^{\text{T}}\mathbf{y}-\mathbf{y}^{\text{T}}\mathbf{X}\boldsymbol{\beta}-\boldsymbol{\beta}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{y}+\boldsymbol{\beta}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{X}\boldsymbol{\beta} \end{align*} And the derivative with respect to the parameter is \begin{align*} \frac{\operatorname{\partial}}{\operatorname{\partial}\boldsymbol{\beta}}\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert^2&=-2\mathbf{X}^{\text{T}}\mathbf{y}+2\mathbf{X}^{\text{T}}\mathbf{X}\boldsymbol{\beta}
\end{align*} Setting the above equation to the zero vector to obtain the critical point, we have \begin{align} \frac{\operatorname{\partial}}{\operatorname{\partial}\boldsymbol{\beta}}\lVert\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}\rVert^2&\overset{\text{set}}{=}\mathbf{0}\nonumber\\ -\mathbf{X}^{\text{T}}\mathbf{y}+\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=\mathbf{0}\nonumber\\ \mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=\mathbf{X}^{\text{T}}\mathbf{y}\label{eq:norm} \end{align} Equation (\ref{eq:norm}) is called the normal equation. If $\mathbf{X}$ is full rank, then we can compute the inverse of $\mathbf{X}^{\text{T}}\mathbf{X}$: \begin{align} \mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=\mathbf{X}^{\text{T}}\mathbf{y}\nonumber\\ (\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}\nonumber\\ \hat{\boldsymbol{\beta}}&=(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}.\label{eq:betahat} \end{align} That's it, since both $\mathbf{X}$ and $\mathbf{y}$ are known.

### Prediction

If $\mathbf{X}$ is full rank and spans the subspace $V\subseteq\mathbb{R}^N$, with $\mathbb{E}\mathbf{y}=\mathbf{X}\boldsymbol{\beta}\in V$, then the predicted values of $\mathbf{y}$ are given by $$\label{eq:pred} \hat{\mathbf{y}}=\mathbb{E}\mathbf{y}=\mathbf{P}_{V}\mathbf{y}=\mathbf{X}(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y},$$ where $\mathbf{P}_{V}$ is the projection matrix onto the space $V$. For a proof of the projection matrix in Equation (\ref{eq:pred}), please refer to reference (1) below. Notice that this is equivalent to $$\label{eq:yhbh} \hat{\mathbf{y}}=\mathbb{E}\mathbf{y}=\mathbf{X}\hat{\boldsymbol{\beta}}.$$

### Computation

Let's fire up R and Python and see how we can apply the equations we derived. For purposes of illustration, we're going to simulate data from a Gaussian distributed population.
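Since the post's original scripts are not reproduced here, the following is my own numpy sketch of the normal-equation estimate $\hat{\boldsymbol{\beta}}=(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}$; the seed and sample size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: two predictors, true coefficients 3.5 and 2.8,
# Gaussian noise with variance 7 (mirroring the setup described in the text).
n = 500
X = np.column_stack([rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([3.5, 2.8]) + rng.normal(scale=7 ** 0.5, size=n)

# Normal equation: solve (X'X) beta = X'y rather than forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values: X @ beta_hat, i.e. the projection of y onto col(X).
y_hat = X @ beta_hat
```

Using np.linalg.solve avoids explicitly inverting $\mathbf{X}^{\text{T}}\mathbf{X}$; it computes the same estimate as the textbook $(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}$ form but is numerically better behaved.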
To do so, consider the following code (presented in the original post as separate R Script and Python Script tabs). Here we have two predictors x1 and x2, and our response variable y is generated by the parameters $\beta_1=3.5$ and $\beta_2=2.8$, plus Gaussian noise with variance 7. While we set the same random seeds in both R and Python, we should not expect the random values generated by the two languages to be identical; instead, both samples are independent and identically distributed (iid). For visualization, I will use Python Plotly; you can also translate it to R Plotly. Now let's estimate the parameter $\boldsymbol{\beta}$, which above we set to $\beta_1=3.5$ and $\beta_2=2.8$. We will use Equation (\ref{eq:betahat}) for estimation (again, see the R Script and Python Script tabs in the original post). That's a good estimate, and again, just a reminder: the estimates in R and in Python differ because we have different random samples; the important thing is that both are iid. To proceed, we'll do prediction using Equation (\ref{eq:pred}). In the resulting output, the first column is the data y and the second column is the prediction due to Equation (\ref{eq:pred}). Thus, if we expand the prediction into an expectation plane, we get the plot below. You have to rotate the plot, by the way, to see the plane; I still can't figure out how to change the default view in Plotly. Anyway, at this point we could proceed to compute other statistics, such as the variance of the error, but I will leave that for you to explore. Our aim here is just to gain an understanding of what is happening inside the internals of our software when we estimate the parameters of a linear regression model.

### Reference

1. Arnold, Steven F. (1981). The Theory of Linear Models and Multivariate Analysis. Wiley.
2.
OLS in Matrix Form

## May 12, 2015

### Chris Lawrence

#### That'll leave a mark

Here’s a phrase you never want to see in print (in a legal decision, no less) pertaining to your academic research: “The IRB process, however, was improperly engaged by the Dartmouth researcher and ignored completely by the Stanford researchers.” Whole thing here; it’s a doozy.

## April 14, 2015

### R you ready?

#### Beautiful plots while simulating loss in two-part procrustes problem

Today I was working on a two-part procrustes problem and wanted to find out why my minimization algorithm sometimes does not converge properly or renders unexpected results. The loss function to be minimized is

$\displaystyle L(\mathbf{Q},c) = \| c \mathbf{A_1Q} - \mathbf{B_1} \|^2 + \| \mathbf{A_2Q} - \mathbf{B_2} \|^2 \rightarrow min$

with $\| \cdot \|$ denoting the Frobenius norm, $c$ an unknown scalar, and $\mathbf{Q}$ an unknown rotation matrix, i.e. $\mathbf{Q}^T\mathbf{Q}=\mathbf{I}$. $\;\mathbf{A_1}, \mathbf{A_2}, \mathbf{B_1}$, and $\mathbf{B_2}$ are four real-valued matrices. The minimum for $c$ is easily found by setting the partial derivative of $L(\mathbf{Q},c)$ w.r.t. $c$ equal to zero:

$\displaystyle c = \frac {tr \; \mathbf{Q}^T \mathbf{A_1}^T \mathbf{B_1}} { \| \mathbf{A_1} \|^2 }$

By plugging $c$ into the loss function $L(\mathbf{Q},c)$ we get a new loss function $L(\mathbf{Q})$ that only depends on $\mathbf{Q}$. This is the starting situation. When trying to find out why the algorithm to minimize $L(\mathbf{Q})$ did not work as expected, I got stuck. So I decided to conduct a small simulation and generate random rotation matrices to study the relation between the parameter $c$ and the value of the loss function $L(\mathbf{Q})$. Before looking at the results for the entire two-part procrustes problem from above, let’s visualize the results for the first part of the loss function only, i.e.
$\displaystyle L(\mathbf{Q},c) = \| c \mathbf{A_1Q} - \mathbf{B_1} \|^2 \rightarrow min$

Here, $c$ has the same minimum as for the whole formula above. For the simulation I used

$\mathbf{A_1}= \begin{pmatrix} 0.0 & 0.4 & -0.5 \\ -0.4 & -0.8 & -0.5 \\ -0.1 & -0.5 & 0.2 \\ \end{pmatrix} \mkern18mu \qquad \text{and} \qquad \mkern36mu \mathbf{B_1}= \begin{pmatrix} -0.1 & -0.8 & -0.1 \\ 0.3 & 0.2 & -0.9 \\ 0.1 & -0.3 & -0.5 \\ \end{pmatrix}$

as input matrices. Generating many random rotation matrices $\mathbf{Q}$ and plotting $c$ against the value of the loss function yields the following plot. This is a well-behaved relation: for each scaling parameter $c$ the loss is identical. Now let’s look at the full two-part loss function. As input matrices I used

$\displaystyle A1= \begin{pmatrix} 0.0 & 0.4 & -0.5 \\ -0.4 & -0.8 & -0.5 \\ -0.1 & -0.5 & 0.2 \\ \end{pmatrix} \mkern18mu , \mkern36mu B1= \begin{pmatrix} -0.1 & -0.8 & -0.1 \\ 0.3 & 0.2 & -0.9 \\ 0.1 & -0.3 & -0.5 \\ \end{pmatrix}$

$A2= \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ \end{pmatrix} \mkern18mu , \mkern36mu B2= \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ \end{pmatrix}$

and the following R code.

```r
# trace function
tr <- function(X) sum(diag(X))

# random matrix type 1
rmat_1 <- function(n=3, p=3, min=-1, max=1){
  matrix(runif(n*p, min, max), ncol=p)
}

# random matrix type 2, sparse
rmat_2 <- function(p=3) {
  diag(p)[, sample(1:p, p)]
}

# generate random rotation matrix Q. Based on Q find
# optimal scaling factor c and calculate loss function value
#
one_sample <- function(n=2, p=2)
{
  Q <- mixAK::rRotationMatrix(n=1, dim=p) %*%  # random rotation matrix det(Q) = 1
       diag(sample(c(-1,1), p, rep=T))         # additional reflections, so det(Q) in {-1,1}
  s <- tr( t(Q) %*% t(A1) %*% B1 ) / norm(A1, "F")^2  # scaling factor c
  rss <- norm(s*A1 %*% Q - B1, "F")^2 +        # get residual sum of squares
         norm(A2 %*% Q - B2, "F")^2
  c(s=s, rss=rss)
}

# find c and rss for many random rotation matrices
#
set.seed(10)  # nice case for 3 x 3
n <- 3
p <- 3
A1 <- round(rmat_1(n, p), 1)
B1 <- round(rmat_1(n, p), 1)
A2 <- rmat_2(p)
B2 <- rmat_2(p)

x <- plyr::rdply(40000, one_sample(3,3))
plot(x$s, x$rss, pch=16, cex=.4, xlab="c", ylab="L(Q)", col="#00000010")
```

This time the result turns out to be very different and … beautiful :) Here, we no longer have a one-to-one relation between the scaling parameter and the loss function. I do not quite know what to make of this yet. But for now I am happy that it has aesthetic value. Below you find some more beautiful graphics with different matrices as inputs. Cheers!

## February 24, 2015

### Douglas Bates

#### RCall: Running an embedded R in Julia

I have used R (and S before it) for a couple of decades. In the last few years most of my coding has been in Julia, a language for technical computing that can provide remarkable performance for a dynamically typed language via Just-In-Time (JIT) compilation of functions and via multiple dispatch. Nonetheless there are facilities in R that I would like to have access to from Julia. I created the RCall package for Julia to do exactly that. This IJulia notebook provides an introduction to RCall. This is not a novel idea by any means. Julia already has PyCall and JavaCall packages that provide access to Python and to Java. These packages are used extensively and are much more sophisticated than RCall, at present. Many other languages have facilities to run an embedded instance of R.
In fact, Python has several such interfaces. The things I plan to do using RCall are to access datasets from R and R packages, to fit models that are not currently implemented in Julia, and to use R graphics, especially the ggplot2 and lattice packages. Unfortunately I am not currently able to start a graphics device from the embedded R, but I expect that to be fixed soon. I can tell you the most remarkable aspect of RCall, although it may not mean much if you haven't tried to do this kind of thing: it is written entirely in Julia. There is absolutely no "glue" code written in a compiled language like C or C++. As I said, this may not mean much to you unless you have tried to do something like this, in which case it is astonishing.

## January 16, 2015

### Modern Toolmaking

#### caretEnsemble

My package caretEnsemble, for making ensembles of caret models, is now on CRAN. Check it out, and let me know what you think! (Submit bug reports and feature requests to the issue tracker.)

## January 15, 2015

### Gregor Gorjanc

#### cpumemlog: Monitor CPU and RAM usage of a process (and its children)

Long time no see ... Today I pushed the cpumemlog script to GitHub https://github.com/gregorgorjanc/cpumemlog. Read more about this useful utility at the GitHub site.

## December 15, 2014

### R you ready?

#### QQ-plots in R vs. SPSS – A look at the differences

We teach two software packages, R and SPSS, in Quantitative Methods 101 for psychology freshmen at Bremen University (Germany). Sometimes confusion arises when the software packages produce different results. This may be due to specifics in the implementation of a method or, as in most cases, to different default settings. One of these situations occurs when the QQ-plot is introduced. Below we see two QQ-plots, produced by SPSS and R, respectively.
The data used in the plots were generated by:

```r
set.seed(0)
x <- sample(0:9, 100, rep=T)
```

SPSS: (plot shown in the original post)

R:

```r
qqnorm(x, datax=T)  # uses Blom's method by default
qqline(x, datax=T)
```

There are some obvious differences:

1. The most obvious one is that the R plot seems to contain more data points than the SPSS plot. Actually, this is not the case. Some data points are plotted on top of each other in SPSS, while they are spread out vertically in the R plot. The reason for this difference is that SPSS uses a different approach to assigning probabilities to the values. We will explore the two approaches below.
2. The scaling of the y-axis differs. R uses quantiles from the standard normal distribution. SPSS by default rescales these values using the mean and standard deviation from the original data. This allows one to directly compare the original and theoretical values. This is a simple linear transformation and will not be explained any further here.
3. The QQ-lines are not identical. R uses the 1st and 3rd quartiles from both distributions to draw the line. This is different in SPSS, where a line is drawn through identical values on both axes. We will explore the differences below.

# QQ-plots from scratch

To get a better understanding of the differences we will build the R- and SPSS-flavored QQ-plots from scratch.

## R type

In order to calculate theoretical quantiles corresponding to the observed values, we first need to find a way to assign a probability to each value of the original data. A lot of different approaches exist for this purpose (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012b). They usually build on the ranks of the observed data points to calculate corresponding p-values, i.e. the plotting positions for each point. The qqnorm function uses two formulae for this purpose, depending on the number of observations $n$ (Blom’s method, see ?qqnorm; Blom, 1958).
With $r$ being the rank, for $n > 10$ it will use the formula $p = (r - 1/2) / n$, and for $n \leq 10$ the formula $p = (r - 3/8) / (n + 1/4)$, to determine the probability value $p$ for each observation (see the help files for the functions qqnorm and ppoints). For simplicity, we will only implement the $n > 10$ case here.

```r
n <- length(x)        # number of observations
r <- order(order(x))  # order of values, i.e. ranks without averaged ties
p <- (r - 1/2) / n    # assign probabilities to ranks using Blom's method
y <- qnorm(p)         # theoretical standard normal quantiles for p values
plot(x, y)            # plot empirical against theoretical values
```

Before we take a look at the code, note that our plot is identical to the plot generated by qqnorm above, except that the QQ-line is missing. The main point that makes the difference between R and SPSS is found in the command order(order(x)). This command calculates ranks for the observations using ordinal ranking. This means that all observations get different ranks and no average ranks are calculated for ties, i.e. for observations with equal values. Another approach would be to apply fractional ranking and calculate average values for ties. This is what the function rank does. The following code shows the difference between the two approaches to assigning ranks.

```r
v <- c(1,1,2,3,3)
order(order(v))  # ordinal ranking used by R
## [1] 1 2 3 4 5
rank(v)          # fractional ranking used by SPSS
## [1] 1.5 1.5 3.0 4.5 4.5
```

R uses ordinal ranking and SPSS uses fractional ranking by default to assign ranks to values. Thus, the positions do not overlap in R, as each ordered observation is assigned a different rank and therefore a different p-value. We will pick up the second approach again later, when we reproduce the SPSS-flavored plot in R.[1] The second difference between the plots concerned the scaling of the y-axis and was already clarified above. The last point to understand is how the QQ-line is drawn in R.
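Before moving on to the QQ-line, a quick sanity check (my addition, not part of the original post): the hand-rolled Blom plotting positions for the $n > 10$ case can be compared against ppoints, the helper that qqnorm uses internally. Indexing the sorted plotting positions by the ordinal ranks should reproduce them exactly:

```r
set.seed(0)
x <- sample(0:9, 100, replace = TRUE)  # same data as above

n <- length(x)
r <- order(order(x))       # ordinal ranks, as used by qqnorm
p_manual <- (r - 1/2) / n  # Blom's formula for n > 10
p_qqnorm <- ppoints(n)[r]  # ppoints() returns the sorted plotting positions

all.equal(p_manual, p_qqnorm)  # TRUE
```

For $n \leq 10$, ppoints switches to the $(r - 3/8)/(n + 1/4)$ formula, so the same check would need the other constant.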
Looking at the probs argument of qqline reveals that it uses the 1st and 3rd quartiles of the original data and of the theoretical distribution to determine the reference points for the line. We will draw the line between the quartiles in red and overlay it with the line produced by qqline to see if our code is correct.

```r
plot(x, y)                     # plot empirical against theoretical values
ps <- c(.25, .75)              # reference probabilities
a <- quantile(x, ps)           # empirical quantiles
b <- qnorm(ps)                 # theoretical quantiles
lines(a, b, lwd=4, col="red")  # our QQ line in red
qqline(x, datax=T)             # R QQ line
```

The reason for different lines in R and SPSS is that several approaches to fitting a straight line exist (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012a). Each approach has different advantages. The method used by R is more robust when we expect values to diverge from normality in the tails, and we are primarily interested in the normality of the middle range of our data. In other words, the method of fitting an adequate QQ-line depends on the purpose of the plot. An explanation of the rationale of the R approach can e.g. be found here.

## SPSS type

The default SPSS approach also uses Blom’s method to assign probabilities to ranks (you may choose other methods in SPSS) and differs from the one above in the following aspects:

• a) As already mentioned, SPSS uses ranks with averaged ties (fractional ranking), not the plain order ranks (ordinal ranking) as in R, to derive the corresponding probabilities for each data point. The rest of the code is identical to the one above, though I am not sure if SPSS distinguishes between the $n \leq 10$ and $n > 10$ cases.
• b) The theoretical quantiles are scaled to match the estimated mean and standard deviation of the original data.
• c) The QQ-line goes through all quantiles with identical values on the x and y axes.
```r
n <- length(x)            # number of observations
r <- rank(x)              # a) ranks using fractional ranking (averaging ties)
p <- (r - 1/2) / n        # assign probabilities to ranks using Blom's method
y <- qnorm(p)             # theoretical standard normal quantiles for p values
y <- y * sd(x) + mean(x)  # b) transform SND quantiles to mean and sd from original data
plot(x, y)                # plot empirical against theoretical values
```

Lastly, let us add the line. As the scaling of both axes is the same, the line goes through the origin with a slope of $1$.

```r
abline(0, 1)  # c) intercept 0, slope 1: line through the origin
```

The comparison to the SPSS output shows that they are (visually) identical.

# Function for SPSS-type QQ-plot

The whole point of this demonstration was to pinpoint and explain the differences between a QQ-plot generated in R and in SPSS, so it will no longer be a reason for confusion. Note, however, that SPSS offers a whole range of options to generate the plot. For example, you can select the method to assign probabilities to ranks and decide how to treat ties. The plots above used the default settings (Blom’s method and averaging across ties). Personally I like the SPSS version. That is why I implemented the function qqnorm_spss in the ryouready package that accompanies the course. The formulae for the different methods to assign probabilities to ranks can be found in Castillo-Gutiérrez et al. (2012b). The implementation is a preliminary version that has not yet been thoroughly tested. You can find the code here. Please report any bugs or suggestions for improvements (which are very welcome) in the github issues section.

```r
library(devtools)
install_github("markheckmann/ryouready")  # install from github repo
library(ryouready)                        # load package
library(ggplot2)

qq <- qqnorm_spss(x, method=1, ties.method="average")  # Blom's method with averaged ties
plot(qq)    # generate QQ-plot
ggplot(qq)  # use ggplot2 to generate QQ-plot
```

# Literature

1. Technical sidenote: Internally, qqnorm uses the function ppoints to generate the p-values.
Type stats:::qqnorm.default into the console to have a look at the code.

## October 20, 2014

### Modern Toolmaking

#### For faster R on a mac, use veclib

## Update:

The links to all my github gists on blogger are broken, and I can't figure out how to fix them. If you know how to insert github gists on a dynamic blogger template, please let me know. In the meantime, here are instructions with links to the code: First of all, use homebrew to compile openblas. It's easy! Second of all, you can also use homebrew to install R! (But maybe stick with the CRAN version unless you really want to compile your own R binary.) To use openblas with R, follow these instructions: https://gist.github.com/zachmayer/e591cf868b3a381a01d6#file-openblas-sh To use veclib with R, follow these instructions: https://gist.github.com/zachmayer/e591cf868b3a381a01d6#file-veclib-sh

## OLD POST:

Inspired by this post, I decided to try using OpenBLAS for R on my mac. However, it turns out there's a simpler option: the vecLib BLAS library, which is provided by Apple as part of the Accelerate framework. If you are using R 2.15, follow these instructions to change your BLAS from the default to vecLib. However, as noted on r-sig-mac, these instructions do not work for R 3.0; you have to directly link to the Accelerate framework's version of vecLib. Finally, test your new BLAS using this script. On my system (a retina macbook pro), the default BLAS takes 141 seconds and vecLib takes 43 seconds, which is a significant speedup. If you plan to use vecLib, note the following warning from the R development team: "Although fast, it is not under our control and may possibly deliver inaccurate results." So far, I have not encountered any issues using vecLib, but it's only been a few hours :-). UPDATE: you can also install OpenBLAS on a mac. If you do this, make sure to change the directories to point to the correct location on your system (e.g.
change /users/zach/source to whatever directory you clone the git repo into). On my system, the benchmark script takes ~41 seconds when using openBLAS, which is a small but significant speedup.

## September 19, 2014

### Chris Lawrence

#### What could a federal UK look like?

Assuming that the “no” vote prevails in the Scottish independence referendum, the next question for the United Kingdom is to consider constitutional reform to implement a quasi-federal system and resolve the West Lothian question once and for all. In some ways, it may also provide an opportunity to resolve the stalled reform of the upper house as well. Here’s the rough outline of a proposal that might work.

• Devolve identical powers to England, Northern Ireland, Scotland, and Wales, with the proviso that local self-rule can be suspended if necessary by the federal legislature (by a supermajority).
• The existing House of Commons becomes the House of Commons for England, which (along with the Sovereign) shall comprise the English Parliament. This parliament would function much as the existing devolved legislatures in Scotland and Wales; the consociational structure of the Northern Ireland Assembly (requiring double majorities) would not be replicated.
• The House of Lords is abolished, and replaced with a directly-elected Senate of the United Kingdom. The Senate will have authority to legislate on the non-devolved powers (in American parlance, “delegated” powers) such as foreign and European Union affairs, trade and commerce, national defense, and on matters involving Crown dependencies and territories, the authority to legislate on devolved matters in the event self-government is suspended in a constituent country, and dilatory powers including a qualified veto (requiring a supermajority) over the legislation proposed by a constituent country’s parliament.
The latter power would effectively replace the review powers of the existing House of Lords; it would function much as the Council of Revision in Madison’s original plan for the U.S. Constitution. As the Senate will have relatively limited powers, it need not be as large as the existing Lords or Commons. To ensure the countries other than England have a meaningful voice, given that nearly 85% of the UK’s population is in England, two-thirds of the seats would be allocated proportionally based on population and one-third allocated equally to the four constituent countries. This would still result in a chamber with a large English majority (around 64.4%) but nonetheless would ensure the other three countries would have meaningful representation as well.

## September 12, 2014

### R you ready?

#### Using colorized PNG pictograms in R base plots

Today I stumbled across a figure in an explanation of multiple factor analysis which contained pictograms.

Figure 1 from Abdi & Valentin (2007), p. 8.

I wanted to reproduce a similar figure in R using pictograms and additionally color them, e.g. by group membership. I have almost no knowledge about image processing, so I tried out several methods to achieve what I want. The first thing I did was read in a PNG file and look at the data structure. The package png allows reading in PNG files. Note that all of the below may not work on Windows machines, as Windows does not support semi-transparency (see ?readPNG).

```r
library(png)
img <- readPNG(system.file("img", "Rlogo.png", package="png"))
class(img)
## [1] "array"
dim(img)
## [1] 76 100 4
```

The object is a numerical array with four layers (red, green, blue, alpha; RGBA for short). Let’s have a look at the first layer (red) and replace all non-zero entries by a one and the zeros by a dot. This will show us the pattern of non-zero values, and we already see the contours.

```r
l4 <- img[,,1]
l4[l4 > 0] <- 1
l4[l4 == 0] <- "."
d <- apply(l4, 1, function(x) {
  cat(paste0(x, collapse=""), "\n")
})
```

To display the image in R, one way is to raster the image (i.e. collapse the RGBA layers into a single layer of HEX values) and print it using rasterImage.

```r
rimg <- as.raster(img)        # raster multilayer object
r <- nrow(rimg) / ncol(rimg)  # image ratio
plot(c(0,1), c(0,r), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(rimg, 0, 0, 1, r)
```

Let’s have a look at a small part of the rastered image object. It is a matrix of HEX values.

```r
rimg[40:50, 1:6]
##  [1,] "#C4C5C202" "#858981E8" "#838881FF" "#888D86FF" "#8D918AFF" "#8F938CFF"
##  [2,] "#00000000" "#848881A0" "#80847CFF" "#858A83FF" "#898E87FF" "#8D918BFF"
##  [3,] "#00000000" "#8B8E884C" "#7D817AFF" "#82867EFF" "#868B84FF" "#8A8E88FF"
##  [4,] "#00000000" "#9FA29D04" "#7E827BE6" "#7E817AFF" "#838780FF" "#878C85FF"
##  [5,] "#00000000" "#00000000" "#81857D7C" "#797E75FF" "#7F827BFF" "#838781FF"
##  [6,] "#00000000" "#00000000" "#898C8510" "#787D75EE" "#797E76FF" "#7F837BFF"
##  [7,] "#00000000" "#00000000" "#00000000" "#7F837C7B" "#747971FF" "#797E76FF"
##  [8,] "#00000000" "#00000000" "#00000000" "#999C9608" "#767C73DB" "#747971FF"
##  [9,] "#00000000" "#00000000" "#00000000" "#00000000" "#80847D40" "#71766EFD"
## [10,] "#00000000" "#00000000" "#00000000" "#00000000" "#00000000" "#787D7589"
## [11,] "#00000000" "#00000000" "#00000000" "#00000000" "#00000000" "#999C9604"
```

And print this small part.

```r
plot(c(0,1), c(0,.6), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(rimg[40:50, 1:6], 0, 0, 1, .6)
```

Now we have an idea of what the image object and the rastered object look like from the inside. Let’s start to modify the images to suit our needs. In order to change the color of the pictograms, my first idea was to convert the graphics to greyscale and remap the values to a color ramp of my choice. To convert to greyscale there are tons of methods around (see e.g. here). I just picked one of them that I found on SO by chance.
With R=Red, G=Green and B=Blue we have

```r
brightness = sqrt(0.299 * R^2 + 0.587 * G^2 + 0.114 * B^2)
```

This approach modifies the PNG files after they have been coerced into a raster object.

```r
# function to calculate brightness values
brightness <- function(hex) {
  v <- col2rgb(hex)
  sqrt(0.299 * v[1]^2 + 0.587 * v[2]^2 + 0.114 * v[3]^2) / 255
}

# given a color ramp, map brightness to ramp also taking into account
# the alpha level. The default color ramp is grey
#
img_to_colorramp <- function(img, ramp=grey) {
  cv <- as.vector(img)
  b <- sapply(cv, brightness)
  g <- ramp(b)
  a <- substr(cv, 8, 9)  # get alpha values
  ga <- paste0(g, a)     # add alpha values to new colors
  img.grey <- matrix(ga, nrow(img), ncol(img), byrow=TRUE)
}

# read png and modify
img <- readPNG(system.file("img", "Rlogo.png", package="png"))
img <- as.raster(img)       # raster multilayer object
r <- nrow(img) / ncol(img)  # image ratio
s <- 3.5                    # size

plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(img, 0, 0, 0+s/r, 0+s)  # original
img2 <- img_to_colorramp(img)       # modify using grey scale
rasterImage(img2, 5, 0, 5+s/r, 0+s)
```

Great, it works! Now let’s try out some other color palettes, using colorRamp to create a color ramp.

```r
plot(c(0,10), c(0,8.5), type = "n", xlab = "", ylab = "", asp=1)

img1 <- img_to_colorramp(img)
rasterImage(img1, 0, 5, 0+s/r, 5+s)

reds <- function(x) rgb(colorRamp(c("darkred", "white"))(x), maxColorValue = 255)
img2 <- img_to_colorramp(img, reds)
rasterImage(img2, 5, 5, 5+s/r, 5+s)

greens <- function(x) rgb(colorRamp(c("darkgreen", "white"))(x), maxColorValue = 255)
img3 <- img_to_colorramp(img, greens)
rasterImage(img3, 0, 0, 0+s/r, 0+s)

single_color <- function(...) "#0000BB"
img4 <- img_to_colorramp(img, single_color)
rasterImage(img4, 5, 0, 5+s/r, 0+s)
```

Okay, that basically does the job. Now we will apply it to the wine pictograms. Let’s use this wine glass from Wikimedia Commons. It’s quite big, so I uploaded a reduced-size version to imgur.
We will use it for our purposes.

```r
# load file from web
f <- tempfile()
download.file("http://i.imgur.com/A14ntCt.png", f)
img <- readPNG(f)
img <- as.raster(img)
r <- nrow(img) / ncol(img)
s <- 1

# let's create a function that returns a ramp function to save typing
ramp <- function(colors)
  function(x) rgb(colorRamp(colors)(x), maxColorValue = 255)

# create dataframe with coordinates and colors
set.seed(1)
x <- data.frame(x=rnorm(16, c(2,2,4,4)),
                y=rnorm(16, c(1,3)),
                colors=c("black", "darkred", "darkgreen", "darkblue"))

plot(c(1,6), c(0,5), type="n", xlab="", ylab="", asp=1)
for (i in 1L:nrow(x)) {
  colorramp <- ramp(c(x[i,3], "white"))
  img2 <- img_to_colorramp(img, colorramp)
  rasterImage(img2, x[i,1], x[i,2], x[i,1]+s/r, x[i,2]+s)
}
```

Another approach would be to modify the RGB layers before rastering to HEX values.

```r
img <- readPNG(system.file("img", "Rlogo.png", package="png"))
img2 <- img

img[,,1] <- 0  # remove Red component
img[,,2] <- 0  # remove Green component
img[,,3] <- 1  # set Blue to max
img <- as.raster(img)
r <- nrow(img) / ncol(img)  # size ratio
s <- 3.5                    # size
plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(img, 0, 0, 0+s/r, 0+s)

img2[,,1] <- 1  # Red to max
img2[,,2] <- 0
img2[,,3] <- 0
rasterImage(as.raster(img2), 5, 0, 5+s/r, 0+s)
```

To just colorize the image, we could weight each layer.

```r
# wrap weighting into function
weight_layers <- function(img, w) {
  for (i in seq_along(w))
    img[,,i] <- img[,,i] * w[i]
  img
}

plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
img <- readPNG(system.file("img", "Rlogo.png", package="png"))

img2 <- weight_layers(img, c(.2, 1, .2))
rasterImage(img2, 0, 0, 0+s/r, 0+s)

img3 <- weight_layers(img, c(1, 0, 0))
rasterImage(img3, 5, 0, 5+s/r, 0+s)
```

After playing around and hard-coding the modifications, I started to search and found the EBImage package, which has a lot of features for image processing that make one's life (in this case only a bit) easier.
```r
library(EBImage)
f <- system.file("img", "Rlogo.png", package="png")
img <- readImage(f)
img2 <- img

img[,,2] = 0  # zero out green layer
img[,,3] = 0  # zero out blue layer
img <- as.raster(img)

img2[,,1] = 0
img2[,,3] = 0
img2 <- as.raster(img2)

r <- nrow(img) / ncol(img)
s <- 3.5
plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(img, 0, 0, 0+s/r, 0+s)
rasterImage(img2, 5, 0, 5+s/r, 0+s)
```

EBImage is a good choice and fairly easy to handle. Now let’s again print the pictograms.

```r
f <- tempfile(fileext=".png")
download.file("http://i.imgur.com/A14ntCt.png", f)
img <- readImage(f)

# will replace whole image layers by one value
# only makes sense if there is an alpha layer that
# gives the contours
#
mod_color <- function(img, col) {
  v <- col2rgb(col) / 255
  img = channel(img, 'rgb')
  img[,,1] = v[1]  # Red
  img[,,2] = v[2]  # Green
  img[,,3] = v[3]  # Blue
  as.raster(img)
}

r <- nrow(img) / ncol(img)  # get image ratio
s <- 1                      # size

# create random data
set.seed(1)
x <- data.frame(x=rnorm(16, c(2,2,4,4)),
                y=rnorm(16, c(1,3)),
                colors=1:4)

# plot pictograms
plot(c(1,6), c(0,5), type="n", xlab="", ylab="", asp=1)
for (i in 1L:nrow(x)) {
  img2 <- mod_color(img, x[i, 3])
  rasterImage(img2, x[i,1], x[i,2], x[i,1]+s*r, x[i,2]+s)
}
```

Note that above I did not bother to center each pictogram to position it correctly. This still needs to be done. Anyway, that’s it! Mission completed.

### Literature

Abdi, H., & Valentin, D. (2007). Multiple factor analysis (MFA). In N. Salkind (Ed.), Encyclopedia of Measurement and Statistics (pp. 1–14). Thousand Oaks, CA: Sage Publications. Retrieved from https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf

## June 18, 2014

### Chris Lawrence

#### Soccer queries answered

Kevin Drum asks a bunch of questions about soccer:

1. Outside the penalty area there’s a hemisphere about 20 yards wide. I can’t recall ever seeing it used for anything. What’s it for?
2.
On several occasions, I’ve noticed that if the ball goes out of bounds at the end of stoppage time, the referee doesn’t whistle the match over. Instead, he waits for the throw-in, and then immediately whistles the match over. What’s the point of this?
3. Speaking of stoppage time, how has it managed to last through the years? I know, I know: tradition. But seriously. Having a timekeeper who stops the clock for goals, free kicks, etc. has lots of upside and no downside. Right? It wouldn’t change the game in any way, it would just make timekeeping more accurate, more consistent, and more transparent for the fans and players. Why keep up the current pretense?
4. What’s the best way to get a better sense of what’s a foul and what’s a legal tackle? Obviously you can’t tell from the players’ reactions, since they all writhe around like landed fish if they so much as trip over their own shoelaces. Reading the rules provides the basics, but doesn’t really help a newbie very much. Maybe a video that shows a lot of different tackles and explains why each one is legal, not legal, bookable, etc.?

The first one’s easy: there’s a general rule that no defensive player can be within 10 yards of the spot of a direct free kick. A penalty kick (which is a type of direct free kick) takes place in the 18-yard box, and no players other than the player taking the kick and the goalkeeper are allowed in the box. However, owing to geometry, the 18-yard box and the 10-yard exclusion zone don’t fully coincide, hence the penalty arc. (That’s also why there are two tiny hash-marks on the goal line and side line 10 yards from the corner flag. And why referees now have a can of shaving cream to mark the 10 yards for other free kicks, one of the few MLS innovations that has been a good idea.)

Second one’s also easy: the half and the game cannot end while the ball is out of play.

Third one’s harder.
First, keeping time inexactly forestalls the silly premature celebrations that are common in most US sports. You’d never see the Stanford-Cal play happen in a soccer game.

Second, it allows some slippage for short delays and doesn’t require exact timekeeping; granted, this was more valuable before instant replays and fourth officials, but most US sports require a lot of administrative record-keeping by ancillary officials. A soccer game can be played with one official (and often is, particularly at the amateur level) without having to change timing rules;* in developing countries in particular this lowers the barriers to entry for the sport (along with the low equipment requirements) without changing the nature of the game appreciably.

Perhaps most importantly, if the clock were allowed to stop regularly it would create an excuse for commercial timeouts and advertising breaks, which would interrupt the flow of the game and potentially reduce the advantages of better-conditioned and more skilled athletes. (MLS tried this, along with other exciting American ideas like “no tied games,” and it was as appealing to actual soccer fans as ketchup on filet mignon would be to a foodie; perhaps more importantly, it didn’t make any non-soccer fans watch.)

Fourth, the key distinction is usually whether there was an obvious attempt to play the ball; in addition, in the modern game, even some attempts to play the ball are considered inherently dangerous (tackling from behind, many sliding tackles, etc.) and therefore are fouls even if they are successful in getting more ball than human.
* To call offside, you’d also probably need what in my day we called a “linesman.”

## May 07, 2014

### Chris Lawrence

#### The mission and vision thing

Probably the worst-kept non-secret is that the next stage of the institutional evolution of my current employer is some ill-defined concept of “university status,” which mostly involves the establishment of some to-be-determined master’s degree programs. In the context of the University System of Georgia, it means a small jump from the “state college” prestige tier (a motley collection of schools that largely started out as two-year community colleges and transfer institutions) to the “state university” tier (which is where most of the ex-normal schools hang out these days). What is yet to be determined is how that transition will affect the broader institution that will be the University of Middle Georgia.* People on high are said to be working on these things; in any event, here are my assorted random thoughts on what might be reasonable things to pursue:

• Marketing and positioning: Unlike the situation facing many of the other USG institutions, the population of the two anchor counties of our core service area (Bibb and Houston) is growing, and Houston County in particular has a statewide reputation for the quality of its public school system. Rather than conceding that the most prepared students from these schools will go to Athens or Atlanta or Valdosta, we should strongly market our institutional advantages over these more “prestigious” institutions, particularly in terms of the student experience in the first two years and the core curriculum: we have no large lecture courses, no teaching assistants, no lengthy bus rides to and from class every day, and the vast majority of the core is taught by full-time faculty with terminal degrees. Not to mention costs to students are much lower, particularly in the case of students who do not qualify for need-based aid.
Even if we were to “lose” these students as transfers to the top-tier institutions after 1–4 semesters, we’d still benefit from the tuition and fees they bring in, and we would not be penalized in the upcoming state performance funding formula. Dual enrollment in Warner Robins in particular is an opportunity to showcase our institution as a real alternative for better-prepared students rather than a safety school.

• Comprehensive offerings at the bachelor’s level: As a state university, we will need to offer a comprehensive range of options for bachelor’s students to attract and retain students, both traditional and nontraditional. In particular, B.S. degrees in political science and sociology with emphasis in applied empirical skills would meet public and private employer demand for workers who have research skills and the ability to collect, manage, understand, and use data appropriately. There are other gaps in the liberal arts and sciences as well that need to be addressed to become a truly comprehensive state university.

• Create incentives to boost the residential population: The college currently has a heavy debt burden inherited from the overbuilding of dorms at the Cochran campus. We need to identify ways to encourage students to live in Cochran, which may require public-private partnerships to try to build a “college town” atmosphere in the community near campus. We also need to work with wireless providers like Sprint and T-Mobile to ensure that students from the “big city” can fully use their cell phones and tablets in Cochran and Eastman without roaming fees or changing wireless providers.

• Tie the institution more closely to the communities we serve: This includes both physical ties and psychological ties.
The Macon campus in particular has poor physical links to the city itself for students who might walk or ride bicycles; extending the existing bike/walking trail from Wesleyan to the Macon campus should be a priority, as should pedestrian access and bike facilities along Columbus Road. Access to the Warner Robins campus is somewhat better but still could be improved. More generally, the institution is perceived as an afterthought or alternative of last resort in the community. Improving this situation and perception among community leaders and political figures may require a physical presence in or near downtown Macon, perhaps in partnership with the GCSU Graduate Center.

* There is no official name-in-waiting, but given that our former interim president seemed to believe he could will this name into existence by repeating it enough, I’ll stick with it. The straw poll of faculty trivia night suggests that it’s the least bad option available, which inevitably means the regents will choose something else instead (if the last name change is anything to go by).

## February 17, 2014

### Seth Falcon

#### Have Your SHA and Bcrypt Too

## Fear

I've been putting off sharing this idea because I've heard the rumors about what happens to folks who aren't security experts when they post about security on the internet. If this blog is replaced with cat photos and rainbows, you'll know what happened.

## The Sad Truth

It's 2014 and chances are you have accounts on websites that are not properly handling user passwords. I did no research to produce the following list of ways passwords are mishandled, in decreasing order of frequency:

1. Site uses a fast hashing algorithm, typically SHA1(salt + plain-password).
2. Site doesn't salt password hashes.
3. Site stores raw passwords.

We know that sites should be generating secure random salts and using an established slow hashing algorithm (bcrypt, scrypt, or PBKDF2). Why are sites not doing this?
While security issues deserve a top spot on any site's priority list, new features often trump addressing legacy security concerns. The immediacy of the risk is hard to quantify and it's easy to fall prey to a "nothing bad has happened yet, why should we change now" attitude. It's easy for other bugs, features, or performance issues to win out when measured by immediate impact. Fixing security or other "legacy" issues is the Right Thing To Do, and often you will see no measurable benefit from the investment. It's like having insurance. You don't need it until you do.

Specific to the improper storage of user password data is the issue of the impact to a site imposed by upgrading. There are two common approaches to upgrading password storage. You can switch cold turkey to the improved algorithms and force password resets on all of your users. Alternatively, you can migrate incrementally such that new users and any user who changes their password get the increased security.

The cold turkey approach is not a great user experience, and sites might choose to delay an upgrade to avoid admitting to a weak security implementation and disrupting their site by forcing password resets. The incremental approach is more appealing, but the security benefit is drastically diminished for any site with a substantial set of existing users.

Given the above migration choices, perhaps it's (slightly) less surprising that businesses choose to prioritize other work ahead of fixing poorly stored user password data.

## The Idea

What if you could upgrade a site so that both new and existing users immediately benefited from the increased security, but without the disruption of password resets? It turns out that you can, and it isn't very hard.

Consider a user table with columns:

    userid
    salt
    hashed_pass

where the hashed_pass column is computed using a weak fast algorithm, for example SHA1(salt + plain_pass). The core of the idea is to apply a proper algorithm on top of the data we already have.
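As a rough sketch of this layering idea (Python with hypothetical names; since bcrypt is a third-party dependency, the standard library's PBKDF2 stands in here for the slow hash so the example is self-contained — this is not the author's code):

```python
import hashlib
import os

def weak_hash(salt: bytes, password: bytes) -> bytes:
    """The legacy scheme: SHA1(salt + plain_password)."""
    return hashlib.sha1(salt + password).hexdigest().encode()

def slow_hash(salt2: bytes, data: bytes) -> bytes:
    """Stand-in for bcrypt: a deliberately slow key derivation."""
    return hashlib.pbkdf2_hmac("sha256", data, salt2, 100_000)

def upgrade_row(hashed_pass: bytes):
    """Upgrade an existing row without knowing the plain password."""
    salt2 = os.urandom(16)
    return slow_hash(salt2, hashed_pass), salt2, "slow+sha1"

def verify(plain: bytes, salt: bytes, salt2: bytes, hashed_pass: bytes) -> bool:
    """Verify a user whose hash_type is the layered scheme."""
    return slow_hash(salt2, weak_hash(salt, plain)) == hashed_pass

# An existing weak row:
salt = os.urandom(8)
old = weak_hash(salt, b"hunter2")
# Live upgrade of that row, then verification at next login:
new_hash, salt2, hash_type = upgrade_row(old)
assert verify(b"hunter2", salt, salt2, new_hash)
assert not verify(b"wrong", salt, salt2, new_hash)
```

A real migration would use an actual bcrypt library and record hash_type per row, as the post describes.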
I'll use bcrypt to make the discussion concrete. Add columns to the user table as follows:

    userid
    salt
    hashed_pass
    hash_type
    salt2

Process the existing user table by computing bcrypt(salt2 + hashed_pass) and storing the result in the hashed_pass column (overwriting the less secure value); save the new salt value to salt2 and set hash_type to bcrypt+sha1.

To verify a user where hash_type is bcrypt+sha1, compute bcrypt(salt2 + SHA1(salt + plain_pass)) and compare to the hashed_pass value. Note that bcrypt implementations encode the salt as a prefix of the hashed value, so you could avoid the salt2 column, but having it there makes the idea easier to explain.

You can take this approach further and have any user who logs in (as well as new users) upgrade to a "clean" bcrypt-only algorithm, since you can now support different verification algorithms using hash_type. With the proper application code changes in place, the upgrade can be done live.

This scheme will also work for sites storing non-salted password hashes as well as those storing plain text passwords (THE HORROR).

## Less Sadness, Maybe

Perhaps this approach makes implementing a password storage security upgrade more palatable and more likely to be prioritized. And if there's a horrible flaw in this approach, maybe you'll let me know without turning this blog into a tangle of cat photos and rainbows.

## December 26, 2013

### Seth Falcon

#### A Rebar Plugin for Locking Deps: Reproducible Erlang Project Builds For Fun and Profit

## What's this lock-deps of which you speak?

If you use rebar to generate an OTP release project and want to have reproducible builds, you need the rebar_lock_deps_plugin plugin. The plugin provides a lock-deps command that will generate a rebar.config.lock file containing the complete flattened set of project dependencies, each pegged to a git SHA. The lock file acts similarly to Bundler's Gemfile.lock file and allows for reproducible builds (*).
Without lock-deps you might rely on the discipline of using a tag for all of your application's deps. This is insufficient if any dep depends on something not specified as a tag. It can also be a problem if a third party dep doesn't provide a tag. Generating a rebar.config.lock file solves these issues. Moreover, using lock-deps can simplify the work of putting together a release consisting of many of your own repos. If you treat the master branch as shippable, then rather than tagging each subproject and updating rebar.config throughout your project's dependency chain, you can run get-deps (without the lock file), compile, and re-lock at the latest versions throughout your project repositories.

The reproducibility of builds when using lock-deps depends on the SHAs captured in rebar.config.lock. The plugin works by scanning the cloned repos in your project's deps directory and extracting the current commit SHA. This works great until a repository's history is rewritten with a force push. If you really want reproducible builds, you need to not nuke your SHAs and you'll need to fork all third party repos to ensure that someone else doesn't screw you over in this fashion either. If you make a habit of only depending on third party repos using a tag, assume that upstream maintainers are not completely bat shit crazy, and don't force push your master branch, then you'll probably be fine.

## Getting Started

Install the plugin in your project by adding the following to your rebar.config file:

```erlang
%% Plugin dependency
{deps, [
    {rebar_lock_deps_plugin, ".*",
     {git, "git://github.com/seth/rebar_lock_deps_plugin.git", {branch, "master"}}}
]}.

%% Plugin usage
{plugins, [rebar_lock_deps_plugin]}.
```

To test it out do:

```
rebar get-deps
# the plugin has to be compiled so you can use it
rebar compile
rebar lock-deps
```

If you'd like to take a look at a project that uses the plugin, take a look at Chef's erchef project.
## Bonus features

If you are building an OTP release project using rebar generate then you can use rebar_lock_deps_plugin to enhance your build experience in three easy steps.

1. Use rebar bump-rel-version version=$BUMP to automate the process of editing rel/reltool.config to update the release version. The argument $BUMP can be major, minor, or patch (default) to increment the specified part of a semver X.Y.Z version. If $BUMP is any other value, it is used as the new version verbatim. Note that this function rewrites rel/reltool.config using ~p. I check in the reformatted version and maintain the formatting when editing. This way, the general case of a version bump via bump-rel-version results in a minimal diff.

2. Autogenerate a change summary commit message for all project deps. Assuming you've generated a new lock file and bumped the release version, use rebar commit-release to commit the changes to rebar.config.lock and rel/reltool.config with a commit message that summarizes the changes made to each dependency between the previously locked version and the newly locked version. You can get a preview of the commit message via rebar log-changed-deps.

3. Finally, create an annotated tag for your new release with rebar tag-release which will read the current version from rel/reltool.config and create an annotated tag named with the version.

## The dependencies, they are ordered

Up to version 2.0.1 of rebar_lock_deps_plugin, the dependencies in the generated lock file were ordered alphabetically. This was a side-effect of using filelib:wildcard/1 to list the dependencies in the top-level deps directory. In most cases, the order of the full dependency set does not matter. However, if some of the code in your project uses parse transforms, then it will be important for the parse transform to be compiled and on the code path before attempting to compile code that uses the parse transform.

This issue was recently discovered by a colleague who ran into build issues using the lock file for a project that had recently integrated lager for logging. He came up with the idea of maintaining the order of deps as they appear in the various rebar.config files along with a prototype patch proving out the idea. As of rebar_lock_deps_plugin 3.0.0, the lock-deps command will (mostly) maintain the relative order of dependencies as found in the rebar.config files.

The "mostly" is that when a dep is shared across two subprojects, it will appear in the expected order for the first subproject (based on the ordering of the two subprojects). The deps for the second subproject will not be in strict rebar.config order, but the resulting order should address any compile-time dependencies and be relatively stable (only changing when project deps alter their deps with larger impact when shared deps are introduced or removed).

## Digression: fun with dependencies

There are times, as a programmer, when a real-world problem looks like a textbook exercise (or an interview whiteboard question). Just the other day at work we had to design some manhole covers, but I digress.

Fixing the order of the dependencies in the generated lock file is (nearly) the same as finding an install order for a set of projects with inter-dependencies. I had some fun coding up the textbook solution even though the approach doesn't handle the constraint of respecting the order provided by the rebar.config files. Onward with the digression.

We have a set of "packages" where some packages depend on others and we want to determine an install order such that a package's dependencies are always installed before the package. The set of packages and the relation "depends on" form a directed acyclic graph or DAG. The topological sort of a DAG produces an install order for such a graph. The ordering is not unique. For example, with a single package C depending on A and B, valid install orders are [A, B, C] and [B, A, C].
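The same toy example can be checked in a few lines of Python (a sketch using the standard library's graphlib module, unrelated to the plugin itself):

```python
from graphlib import TopologicalSorter

# Map each package to its dependencies (predecessors in the DAG).
# With C depending on A and B, any valid install order puts A and B
# before C; the ordering is not unique.
deps = {"C": {"A", "B"}, "A": set(), "B": set()}

order = list(TopologicalSorter(deps).static_order())
assert order.index("A") < order.index("C")
assert order.index("B") < order.index("C")
print(order)  # one of the valid install orders
```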

To set up the problem, we load all of the project dependency information into a proplist mapping each package to a list of its dependencies extracted from the package's rebar.config file.

```erlang
read_all_deps(Config, Dir) ->
    TopDeps = rebar_config:get(Config, deps, []),
    Acc = [{top, dep_names(TopDeps)}],
    DepDirs = filelib:wildcard(filename:join(Dir, "*")),
    Acc ++ [ {filename:basename(D), dep_names(extract_deps(D))}
             || D <- DepDirs ].
```


Erlang's standard library provides the digraph and digraph_utils modules for constructing and operating on directed graphs. The digraph_utils module includes a topsort/1 function which we can make use of for our "exercise". The docs say:

> Returns a topological ordering of the vertices of the digraph Digraph if such an ordering exists, false otherwise. For each vertex in the returned list, there are no out-neighbours that occur earlier in the list.

To figure out which way to point the edges when building our graph, consider two packages A and B with A depending on B. We know we want to end up with an install order of [B, A]. Rereading the topsort/1 docs, we must want an edge B => A. With that, we can build our DAG and obtain an install order with the topological sort:

```erlang
load_digraph(Config, Dir) ->
    AllDeps = read_all_deps(Config, Dir),
    G = digraph:new(),
    Nodes = all_nodes(AllDeps),
    [ digraph:add_vertex(G, N) || N <- Nodes ],
    %% If A depends on B, then we add an edge A <= B
    [ [ digraph:add_edge(G, Dep, Item) || Dep <- DepList ]
      || {Item, DepList} <- AllDeps, Item =/= top ],
    digraph_utils:topsort(G).

%% extract a sorted unique list of all deps
all_nodes(AllDeps) ->
    lists:usort(lists:foldl(fun({top, L}, Acc) ->
                                    L ++ Acc;
                               ({K, L}, Acc) ->
                                    [K|L] ++ Acc
                            end, [], AllDeps)).
```


The digraph module manages graphs using ETS giving it a convenient API, though one that feels un-erlang-y in its reliance on side-effects.

The above gives an install order, but doesn't take into account the relative order of deps as specified in the rebar.config files. The solution implemented in the plugin is a bit less fancy, recursing over the deps and maintaining the desired ordering. The only tricky bit is that shared deps are ignored until the end, and then the entire linearized list is de-duped in an order-preserving way. Here's the code:

```erlang
order_deps(AllDeps) ->
    Top = proplists:get_value(top, AllDeps),
    order_deps(lists:reverse(Top), AllDeps, []).

order_deps([], _AllDeps, Acc) ->
    de_dup(Acc);
order_deps([Item|Rest], AllDeps, Acc) ->
    ItemDeps = proplists:get_value(Item, AllDeps),
    order_deps(lists:reverse(ItemDeps) ++ Rest, AllDeps, [Item | Acc]).

de_dup(AccIn) ->
    WithIndex = lists:zip(AccIn, lists:seq(1, length(AccIn))),
    UWithIndex = lists:usort(fun({A, _}, {B, _}) ->
                                     A =< B
                             end, WithIndex),
    Ans0 = lists:sort(fun({_, I1}, {_, I2}) ->
                              I1 =< I2
                      end, UWithIndex),
    [ V || {V, _} <- Ans0 ].
```


## Conclusion and the end of this post

The great thing about posting to your blog is, you don't have to have a proper conclusion if you don't want to.

# Probabilistic bug hunting

Have you ever run into a bug that, no matter how carefully you try to reproduce it, only happens sometimes? And then you think you've got it, and finally solved it - and tested a couple of times without any manifestation. How do you know that you have tested enough? Are you sure you were not "lucky" in your tests?

In this article we will see how to answer those questions and the math behind it without going into too much detail. This is a pragmatic guide.

## The Bug

The following program is supposed to generate two random 8-bit integers and print them on stdout:


```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns -1 if error, other number if ok. */
int get_random_chars(char *r1, char *r2)
{
    int f = open("/dev/urandom", O_RDONLY);

    if (f < 0)
        return -1;
    if (read(f, r1, sizeof(*r1)) < 0)
        return -1;
    if (read(f, r2, sizeof(*r2)) < 0)
        return -1;
    close(f);

    return *r1 & *r2;
}

int main(void)
{
    char r1;
    char r2;
    int ret;

    ret = get_random_chars(&r1, &r2);

    if (ret < 0)
        fprintf(stderr, "error");
    else
        printf("%d %d\n", r1, r2);

    return ret < 0;
}
```



On my architecture (Linux on IA-32) it has a bug that makes it print "error" instead of the numbers sometimes.

## The Model

Every time we run the program, the bug can either show up or not. It has a non-deterministic behaviour that requires statistical analysis.

We will model a single program run as a Bernoulli trial, with success defined as "seeing the bug", as that is the event we are interested in. We have the following parameters when using this model:

• $$n$$: the number of tests made;
• $$k$$: the number of times the bug was observed in the $$n$$ tests;
• $$p$$: the unknown (and, most of the time, unknowable) probability of seeing the bug.

With each run modeled as a Bernoulli trial, the number of failures $$k$$ in $$n$$ runs of the program follows a binomial distribution $$k \sim B(n,p)$$. We will use this model to estimate $$p$$ and to confirm the hypothesis that the bug no longer exists, after fixing the bug in whichever way we can.

By using this model we are implicitly assuming that all our tests are performed independently and identically. In other words: if the bug happens more often in one environment, we either always test in that environment or never; if the bug gets more and more frequent the longer the computer is running, we reset the computer after each trial. If we don't do that, we are effectively estimating the value of $$p$$ with trials from different experiments, while in truth each experiment has its own $$p$$. We will find a single value anyway, but it has no meaning and can lead us to wrong conclusions.

### Physical analogy

Another way of thinking about the model and the strategy is by creating a physical analogy with a box that has an unknown number of green and red balls:

• Bernoulli trial: taking a single ball out of the box and looking at its color - if it is red, we have observed the bug, otherwise we haven't. We then put the ball back in the box.
• $$n$$: the total number of trials we have performed.
• $$k$$: the total number of red balls seen.
• $$p$$: the total number of red balls in the box divided by the total number of balls in the box.

• If we open the box and count the balls, we can know $$p$$, in contrast with our original problem.
• Without opening the box, we can estimate $$p$$ by repeating the trial. As $$n$$ increases, our estimate for $$p$$ improves. Mathematically: $p = \lim_{n\to\infty}\frac{k}{n}$
• Performing the trials in different conditions is like taking balls out of several different boxes. The results tell us nothing about any single box.
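The limiting behaviour above is easy to see in a quick simulation (a sketch in Python; the 25% used here mirrors the example bug's failure rate, and the exact estimates depend on the seed):

```python
import random

random.seed(42)
p_real = 0.25  # probability of "seeing the bug" on one trial

def trial() -> bool:
    """One Bernoulli trial: True means we observed the bug."""
    return random.random() < p_real

# The estimate k/n approaches p_real as n grows.
for n in (10, 100, 10_000):
    k = sum(trial() for _ in range(n))
    print(n, k / n)
```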

## Estimating $$p$$

Before we try fixing anything, we have to know more about the bug, starting with the probability $$p$$ of reproducing it. We can estimate this probability by dividing the number of times we see the bug $$k$$ by the number of times we tested for it $$n$$. Let's try that with our sample bug:

```
$ ./hasbug
67 -68
$ ./hasbug
79 -101
$ ./hasbug
error
```

We know from the source code that $$p=25\%$$, but let's pretend that we don't, as will be the case with practically every non-deterministic bug. We tested 3 times, so $$k=1, n=3 \Rightarrow p \sim 33\%$$, right? It would be better if we tested more, but how much more, and exactly what would be better?

### $$p$$ precision

Let's go back to our box analogy: imagine that there are 4 balls in the box, one red and three green. That means that $$p = 1/4$$. What are the possible results when we test three times?

| Red balls | Green balls | $$p$$ estimate |
|-----------|-------------|----------------|
| 0 | 3 | 0% |
| 1 | 2 | 33% |
| 2 | 1 | 66% |
| 3 | 0 | 100% |

The less we test, the smaller our precision is. Roughly, $$p$$ precision will be at most $$1/n$$ - in this case, 33%. That's the step of values we can find for $$p$$, and the minimal value for it. Testing more improves the precision of our estimate.

### $$p$$ likelihood

Let's now approach the problem from another angle: if $$p = 1/4$$, what are the odds of seeing one error in four tests? Let's name the 4 balls as 0-red, 1-green, 2-green and 3-green. Enumerating all the possible results of getting 4 balls out of the box gives $$4^4=256$$ rows, generated by this python script. The same script counts the number of red balls in each row, and outputs the following table:

| k | rows | % |
|---|------|---|
| 0 | 81 | 31.64% |
| 1 | 108 | 42.19% |
| 2 | 54 | 21.09% |
| 3 | 12 | 4.69% |
| 4 | 1 | 0.39% |

That means that, for $$p=1/4$$, we see 1 red ball and 3 green balls only 42% of the time when getting out 4 balls.

What if $$p = 1/3$$ - one red ball and two green balls? We would get the following table:

| k | rows | % |
|---|------|---|
| 0 | 16 | 19.75% |
| 1 | 32 | 39.51% |
| 2 | 24 | 29.63% |
| 3 | 8 | 9.88% |
| 4 | 1 | 1.23% |

What about $$p = 1/2$$?

| k | rows | % |
|---|------|---|
| 0 | 1 | 6.25% |
| 1 | 4 | 25.00% |
| 2 | 6 | 37.50% |
| 3 | 4 | 25.00% |
| 4 | 1 | 6.25% |

So, let's assume that you've seen the bug once in 4 trials. What is the value of $$p$$? You know that can happen 42% of the time if $$p=1/4$$, but you also know it can happen 39% of the time if $$p=1/3$$, and 25% of the time if $$p=1/2$$. Which one is it?
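The enumeration behind these tables can be reproduced in a few lines (a sketch along the lines of the python script mentioned above, for the $$p=1/4$$ box):

```python
from itertools import product

# Balls 0-red, 1-green, 2-green, 3-green; ball 0 is the only red one.
balls = [0, 1, 2, 3]
red = {0}

# All 4^4 = 256 ways of drawing 4 balls with replacement.
rows = list(product(balls, repeat=4))
counts = {k: 0 for k in range(5)}
for row in rows:
    k = sum(1 for b in row if b in red)
    counts[k] += 1

for k, c in counts.items():
    print(k, c, f"{100 * c / len(rows):.2f}%")
# k=1 gives 108 of 256 rows, i.e. the 42.19% in the table above
```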
The graph below shows the discrete likelihood, for each possible value of $$p$$, of getting 1 red and 3 green balls. The fact is that, given the data, the estimate for $$p$$ follows a beta distribution $$Beta(k+1, n-k+1) = Beta(2, 4)$$ (1). The graph below shows the probability distribution density of $$p$$.

The R script used to generate the first plot is here, the one used for the second plot is here.

### Increasing $$n$$, narrowing down the interval

What happens when we test more? We obviously increase our precision, as it is at most $$1/n$$, as we said before - there is no way to estimate that $$p=1/3$$ when we only test twice. But there is also another effect: the distribution for $$p$$ gets taller and narrower around the observed ratio $$k/n$$.

### Investigation framework

So, which value will we use for $$p$$?

• The smaller the value of $$p$$, the more we have to test to reach a given confidence in the bug solution.
• We must, then, choose the probability of error that we want to tolerate, and take the smallest value of $$p$$ that we can. A usual value for the probability of error is 5% (2.5% on each side).
• That means that we take the value of $$p$$ that leaves 2.5% of the area of the density curve out on the left side. Let's call this value $$p_{min}$$.
• That way, if the observed $$k/n$$ remains somewhat constant, $$p_{min}$$ will rise, converging to the "real" $$p$$ value.
• As $$p_{min}$$ rises, the amount of testing we have to do after fixing the bug decreases.

By using this framework we have direct, visual and tangible incentives to test more. We can objectively measure the potential contribution of each test.

In order to calculate $$p_{min}$$ with the mentioned properties, we have to solve the following equation: $\sum_{i=k}^{n}{n\choose{i}}p_{min}^i(1-p_{min})^{n-i}=\frac{\alpha}{2}$ $$\alpha$$ here is twice the error we want to tolerate: 5% for an error of 2.5%. That's not a trivial equation to solve for $$p_{min}$$.
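It can, however, be solved numerically with simple bisection (a sketch, not the author's code; the bound computed is the Clopper-Pearson style lower limit):

```python
from math import comb

def upper_tail(p: float, k: int, n: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def p_min(k: int, n: int, alpha: float = 0.05) -> float:
    """Lower bound on p: the value where seeing k or more failures
    in n runs has probability only alpha/2."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(100):  # bisection; upper_tail is increasing in p
        mid = (lo + hi) / 2
        if upper_tail(mid, k, n) < alpha / 2:
            lo = mid
        else:
            hi = mid
    return lo

print(p_min(5, 20))  # lower bound for 5 failures seen in 20 runs
```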
Fortunately, that's the formula for the confidence interval of the binomial distribution, and there are a lot of sites that can calculate it.

## Is the bug fixed?

So, you have tested a lot and calculated $$p_{min}$$. The next step is fixing the bug. After fixing the bug, you will want to test again, in order to confirm that the bug is fixed. How much testing is enough testing?

Let's say that $$t$$ is the number of times we test the bug after it is fixed. Then, if our fix is not effective and the bug still presents itself with a probability greater than the $$p_{min}$$ that we calculated, the probability of not seeing the bug after $$t$$ tests is: $\alpha = (1-p_{min})^t$ Here, $$\alpha$$ is also the probability of making a type I error, while $$1 - \alpha$$ is the statistical significance of our tests.

We now have two options:

• arbitrarily determine a standard statistical significance and test enough times to assert it.
• test as much as we can and report the achieved statistical significance.

Both options are valid. The first one is not always feasible, as the cost of each trial can be high in time and/or other kinds of resources. The standard statistical significance in the industry is 5%; we recommend either that or less. Formally, this is very similar to statistical hypothesis testing.

## Back to the Bug

### Testing 20 times

This file has the results found after running our program 5000 times. We must never throw out data, but let's pretend that we have tested our program only 20 times. The observed $$k/n$$ ratio and the calculated $$p_{min}$$ evolved as shown in the following graph. After those 20 tests, our $$p_{min}$$ is about 12%.

Suppose that we fix the bug and test it again. The following graph shows the statistical significance corresponding to the number of tests we do. In words: we have to test 24 times after fixing the bug to reach 95% statistical significance, and 35 to reach 99%.

Now, what happens if we test more before fixing the bug?
### Testing 5000 times

Let's now use all the results and assume that we tested 5000 times before fixing the bug. The graph below shows $$k/n$$ and $$p_{min}$$: after those 5000 tests, our $$p_{min}$$ is about 23% - much closer to the real $$p$$.

The following graph shows the statistical significance corresponding to the number of tests we do after fixing the bug. We can see in that graph that after about 11 tests we reach 95%, and after about 16 we get to 99%. As we have tested more before fixing the bug, we found a higher $$p_{min}$$, and that allowed us to test less after fixing the bug.

## Optimal testing

We have seen that we decrease $$t$$ as we increase $$n$$, as that can potentially increase our lower estimate for $$p$$. Of course, that value can decrease as we test, but that means that we "got lucky" in the first trials and we are getting to know the bug better - the estimate is approaching the real value in a non-deterministic way, after all.

But how much should we test before fixing the bug? What is an ideal value for $$n$$? To define an optimal value for $$n$$, we will minimize the sum $$n+t$$. This objective gives us the benefit of minimizing the total amount of testing without compromising our guarantees. Minimizing the testing can be fundamental if each test costs significant time and/or resources.

The graph below shows us the evolution of the value of $$t$$ and $$t+n$$ using the data we generated for our bug. We can see clearly that there are some low values of $$n$$ and $$t$$ that give us the guarantees we need. Those values are $$n = 15$$ and $$t = 24$$, which gives us $$t+n = 39$$.

While you can use this technique to minimize the total number of tests performed (even more so when testing is expensive), testing more is always a good thing, as it always improves our guarantee, be it in $$n$$ by providing us with a better $$p$$ or in $$t$$ by increasing the statistical significance of the conclusion that the bug is fixed.
So, before fixing the bug, test until you see the bug at least once, and then at least the amount specified by this technique - but also test more if you can; there is no upper bound, especially after fixing the bug. You can then report a higher confidence in the solution.

## Conclusions

When a programmer finds a bug that behaves in a non-deterministic way, he knows he should test enough to know more about the bug, and then even more after fixing it. In this article we have presented a framework that provides criteria to define numerically how much testing is "enough" and "even more." The same technique also provides a method to objectively measure the guarantee that the amount of testing performed provides, when it is not possible to test "enough."

We have also provided a real example (even though the bug itself is artificial) where the framework is applied.

As usual, the source code of this page (R scripts, etc.) can be found and downloaded at https://github.com/lpenz/lpenz.github.io

## December 01, 2013

### Gregor Gorjanc

#### Read line by line of a file in R

Are you using R for data manipulation for later use with other programs, i.e., a workflow something like this:

1. read data sets from a disk,
2. modify the data, and
3. write it back to a disk.

All fine, but if the data set is really big, then you will soon stumble on memory issues. If data processing is simple and you can read the data in chunks, say line by line, then the following might be useful:

```r
## File
file <- "myfile.txt"

## Create connection
con <- file(description=file, open="r")

## Hopefully you know the number of lines from some other source or
com <- paste("wc -l ", file, " | awk '{ print $1 }'", sep="")
n <- as.numeric(system(command=com, intern=TRUE))

## Loop over a file connection
for(i in 1:n) {
  tmp <- scan(file=con, nlines=1, quiet=TRUE)
  ## do something on a line of data
}

close(con)
```

## November 20, 2013

#### Sending data from client to server and back using shiny

After some time of using shiny I got to the point where I needed to send some arbitrary data from the client to the server, process it with R and return some other data to the client. As a client/server programming newbie this was a challenge for me as I did not want to dive too deep into the world of web programming. I wanted to get the job done using shiny and preferably as little JS/PHP etc. scripting as possible.

It turns out that the task is quite simple as shiny comes with some currently undocumented functions under the hood that will make this task quite easy. You can find some more information on these functions here.

As mentioned above, I am a web programming newbie. So this post may be helpful for people with little web programming experience (just a few lines of JavaScript are needed) and who want to see a simple way of how to get the job done.

## Sending data from client to server

Sending the data from the client to the server is accomplished by the JS function Shiny.onInputChange. This function takes a JS object and sends it to the shiny server. On the server side the object will be accessible as an R object under the name which is given as the second argument to the Shiny.onInputChange function. Let’s start by sending a random number to the server. The name of the object on the server side will be mydata.

Let's create the shiny user interface file (ui.R). I will add a colored div, another element for verbatim text output called results, and the JavaScript code to send the data. The workhorse line is Shiny.onInputChange("mydata", number);. The JS code is included by passing it as a string to the tags$script function.

# ui.R

shinyUI( bootstrapPage(

# a div named mydiv
tags$div(id="mydiv", style="width: 50px; height: 50px;
left: 100px; top: 100px;
background-color: gray; position: absolute"),

# a shiny element to display unformatted text
verbatimTextOutput("results"),

# javascript code to send data to shiny server
tags$script('
document.getElementById("mydiv").onclick = function() {
var number = Math.random();
Shiny.onInputChange("mydata", number);
};
')

))

Now, on the server side, we can simply access the data that was sent by addressing it the usual way via the input object (i.e. input$mydata). The code below will make the verbatimTextOutput element results show the value that was initially passed to the server.

# server.R

shinyServer(function(input, output, session) {

output$results = renderPrint({
input$mydata
})

})


You can copy the above files from here or run the code directly. When you run the code you will find that the random value in the upper box is updated if you click on the div.

library(shiny)
runGist("https://gist.github.com/markheckmann/7554422")


What we have achieved so far is to pass some data to the server, access it and pass it back to a display on the client side. For the last part however, we have used a standard shiny element to send back the data to the client.

## Sending data from server to client

Now let’s add a component to send custom data from the server back to the client. This task has two parts. On the client side we need to define a handler function. This is a function that will receive the data from the server and perform some task with it. In other words, the function will handle the received data. To register a handler the function Shiny.addCustomMessageHandler is used. I will name our handler function myCallbackHandler. Our handler function will use the received data and execute some JS code. In our case it will change the color of our div called mydiv according to the color value that is passed from the server to the handler. Let’s add the JS code below to the ui.R file.

# ui.R

# handler to receive data from server
tags$script('
Shiny.addCustomMessageHandler("myCallbackHandler",
function(color) {
document.getElementById("mydiv").style.backgroundColor = color;
});
')

Let's move to the server side. I want the server to send the data to the handler function whenever the div is clicked, i.e. whenever the value of input$mydata changes. Sending the data to the client is accomplished by an R function called sendCustomMessage, which can be found in the session object. The function is passed the name of the client-side handler function and the R object we want to send. Here, I create a random hex color string that gets sent to the client handler function myCallbackHandler. The line sending the data to the client is contained in an observer. The observer includes the reactive object input$mydata, so the server will send something to the client-side handler function whenever the value of input$mydata changes. And it changes each time we click on the div. Let's add the code below to the server.R file.

# server.R

# observes if value of mydata sent from the client changes.  if yes
# generate a new random color string and send it back to the client
# handler function called 'myCallbackHandler'
observe({
input$mydata
color = rgb(runif(1), runif(1), runif(1))
session$sendCustomMessage(type = "myCallbackHandler", color)
})


You can copy the above files from here or run the code directly. When you run the code you will see that the div changes color when you click on it.

runGist("https://gist.github.com/markheckmann/7554458")


That’s it. We have passed custom data from the client to the server and back. The following graphics sums up the functions that were used.

## Passing more complex objects

The two functions also do a good job passing more complex JS or R objects. If you modify your code to send a JS object to shiny, it will be converted into an R list object on the server side. Let's replace the JS object we send to the server (in ui.R) with the following lines. On the server side, we will get a list.

document.getElementById("mydiv").onclick = function() {
var obj = {one: [1,2,3,4],
two: ["a", "b", "c"]};
Shiny.onInputChange("mydata", obj);
};


Note that now, however, the shiny server will only execute the function once (on loading), not each time the click event is fired. The reason is that the input data is now static, i.e. the JS object we send via onInputChange does not change. To reduce workload on the server side, the code in the observer is only executed if the reactive value under observation (i.e. the value of input$mydata) changes. As this is no longer the case because the value we pass is static, the observer that sends the color information back to the client to change the color of the div is not executed a second time.

The conversion also works nicely the other way round. We can pass an R list object to the sendCustomMessage function and on the client side it will appear as a JS object. So we are free to pass almost any type of data we need to.

## Putting the JS code in a separate file

To keep things simple I included the JS code directly in the ui.R file using tags$script. This does not look very nice and you may want to put the JS code in a separate file instead. For this purpose I will create a JS file called mycode.js and include all the above JS code in it. Additionally, this file has another modification: all the code is wrapped in some JS/jQuery code ($(document).ready(function() { })) that will make sure the JS code is run after the DOM (that is, all the HTML elements) is loaded. Before, I simply placed the JS code below the HTML elements to make sure they are loaded, but I guess this is not good practice.

// mycode.js

$(document).ready(function() {

document.getElementById("mydiv").onclick = function() {
var number = Math.random();
Shiny.onInputChange("mydata", number);
};

Shiny.addCustomMessageHandler("myCallbackHandler",
function(color) {
document.getElementById("mydiv").style.backgroundColor = color;
}
);

});


To include the JS file, shiny offers the includeScript function. The server.R file has not changed; the ui.R file now looks like this.

# ui.R

library(shiny)

shinyUI( bootstrapPage(

# include the js code
includeScript("mycode.js"),

# a div named mydiv

## July 02, 2013

### Gregor Gorjanc

#### Parse arguments of an R script

R can also be used as a scripting tool. We just need to add a shebang in the first line of a file (script):

#!/usr/bin/Rscript

and then the R code should follow.

Often we want to pass arguments to such a script, which can be collected in the script by the commandArgs() function. Then we need to parse the arguments and, conditional on them, do something. I came up with a rather general way of parsing these arguments using just these few lines:

## Collect arguments
args <- commandArgs(TRUE)

## Default setting when no arguments passed
if(length(args) < 1) {
  args <- c("--help")
}

## Help section
if("--help" %in% args) {
  cat("
      The R Script

      Arguments:
      --arg1=someValue   - numeric, blah blah
      --arg2=someValue   - character, blah blah
      --arg3=someValue   - logical, blah blah
      --help             - print this text

      Example:
      ./test.R --arg1=1 --arg2=\"output.txt\" --arg3=TRUE \n\n")

  q(save="no")
}

## Parse arguments (we expect the form --arg=value)
parseArgs <- function(x) strsplit(sub("^--", "", x), "=")
argsDF <- as.data.frame(do.call("rbind", parseArgs(args)))
argsL <- as.list(as.character(argsDF$V2))
names(argsL) <- argsDF$V1

## Arg1 default
if(is.null(argsL$arg1)) {
  ## do something
}

## Arg2 default
if(is.null(argsL$arg2)) {
  ## do something
}

## Arg3 default
if(is.null(argsL$arg3)) {
  ## do something
}

## ... your code here ...
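For illustration, the parsing lines above can be exercised interactively (the argument values here are made up):

```r
## Same parsing approach as in the script above
parseArgs <- function(x) strsplit(sub("^--", "", x), "=")

args <- c("--arg1=1", "--arg2=output.txt", "--arg3=TRUE")
argsDF <- as.data.frame(do.call("rbind", parseArgs(args)))
argsL <- as.list(as.character(argsDF$V2))
names(argsL) <- argsDF$V1

## All values arrive as character strings, so convert as needed,
## e.g. as.numeric(argsL$arg1) or as.logical(argsL$arg3)
argsL$arg2
```

Note that every value arrives as a character string, which is why the defaults section in the script would typically also coerce types.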

It is some work, but I find it pretty neat and have used it for quite a while now. I do wonder what others have come up with for this task. I hope I did not miss some very general solution.

## March 24, 2013

### Romain Francois

#### Moving

This blog is moving to blog.r-enthusiasts.com. The new one is powered by wordpress and gets a subdomain of r-enthusiasts.com.

See you there

## March 17, 2013

### Modern Toolmaking

#### caretEnsemble Classification example

Here's a quick demo of how to fit a binary classification model with caretEnsemble.  Please note that I haven't spent as much time debugging caretEnsemble for classification models, so there are probably more bugs than in my last post.  Also note that multi-class models are not yet supported.

Right now, this code fails for me if I try a model like a nnet or an SVM for stacking, so there's clearly bugs to fix.

The greedy model relies 100% on the gbm, which makes sense as the gbm has an AUC of 1 on the training set.  The linear model uses all of the models, and achieves an AUC of .5.  This is a little weird, as the gbm, rf, SVM, and knn all achieve an AUC of close to 1.0 on the training set, and I would have expected the linear model to focus on these predictions. I'm not sure if this is a bug, or a failure of my stacking model.

## March 13, 2013

### Modern Toolmaking

#### New package for ensembling R models

I've written a new R package called caretEnsemble for creating ensembles of caret models in R.  It currently works well for regression models, and I've written some preliminary support for binary classification models.

At this point, I've got 2 different algorithms for combining models:

1. Greedy stepwise ensembles (returns a weight for each model)
2. Stacks of caret models

(You can also manually specify weights for a greedy ensemble)

The greedy algorithm is based on the work of Caruana et al., 2004, and inspired by the medley package here on github.  The stacking algorithm simply builds a second caret model on top of the existing models (using their predictions as input), and employs all of the flexibility of the caret package.
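The greedy forward-selection idea can be sketched in a few lines of base R. This is an illustrative toy version under my own assumptions, not caretEnsemble's actual implementation, and `greedyWeights` is a made-up name: starting from an empty ensemble, repeatedly add (with replacement) the model whose inclusion most reduces the ensemble's RMSE; each model's weight is then its selection frequency.

```r
## Toy greedy ensemble selection (after Caruana et al., 2004).
## 'preds' is a matrix of out-of-sample predictions, one column per model;
## 'y' is the observed outcome. Models are selected with replacement and
## the returned weights are selection frequencies.
greedyWeights <- function(preds, y, iter = 100) {
  rmse <- function(p) sqrt(mean((p - y)^2))
  count <- setNames(rep(0, ncol(preds)), colnames(preds))
  ensemble <- rep(0, length(y))
  for (i in seq_len(iter)) {
    ## RMSE of the running ensemble mean if model j were added at step i
    score <- apply(preds, 2, function(p) rmse((ensemble * (i - 1) + p) / i))
    best <- which.min(score)
    count[best] <- count[best] + 1
    ensemble <- (ensemble * (i - 1) + preds[, best]) / i
  }
  count / iter
}

## A model that matches y perfectly should take all the weight
set.seed(1)
y <- rnorm(50)
preds <- cbind(good = y, bad = y + rnorm(50, sd = 2))
greedyWeights(preds, y)
```

This also shows why a greedy ensemble can end up relying 100% on a single model when that model dominates on the training folds.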

All the models in the ensemble must use the same training/test folds.  Both algorithms use the out-of-sample predictions to find the weights and train the stack. Here's a brief script demonstrating how to use the package:

Please feel free to submit any comments here or on github.  I'd also be happy to include any patches you feel like submitting.  In particular, I could use some help writing support for multi-class models, writing more tests, and fixing bugs.

## February 18, 2013

### Romain Francois

#### Improving the graph gallery

I'm trying to make improvements to the R Graph Gallery, and I'm looking for suggestions from users of the website.

I've started a question on the website's facebook page. Please take a few seconds to vote on existing improvement possibilities and perhaps offer some of your own ideas.

## February 04, 2013

### Romain Francois

#### bibtex 0.3-5

Version 0.3-5 of the bibtex package is on CRAN. This fixes a corner-case issue with empty bib files, thanks to Kurt Hornik.