Planet R

July 24, 2016

CRANberries

New package qat with initial version 0.74

Package: qat
Type: Package
Title: Quality Assurance Toolkit
Version: 0.74
Date: 2016-07-21
Author: Andre Duesterhus
Maintainer: Andre Duesterhus <andre.duesterhus@uni-hamburg.de>
Encoding: latin1
Description: Functions for a scientific quality assurance of meteorological data.
Depends: R (>= 2.6.1)
Imports: ncdf4, gplots, XML, gdata, moments, boot, fields
License: GPL-2
URL: http://www.aduesterhus.net/qat
NeedsCompilation: no
Packaged: 2016-07-22 17:03:50 UTC; andue
X-CRAN-Comment: Archived on 2016-01-26 as check problems were not corrected despite reminders.
Repository: CRAN
Date/Publication: 2016-07-24 01:26:59

More information about qat at CRAN

July 24, 2016 01:14 AM

New package magick with initial version 0.1

Package: magick
Type: Package
Title: Advanced Image-Processing in R
Version: 0.1
Author: Jeroen Ooms
Maintainer: Jeroen Ooms <jeroen.ooms@stat.ucla.edu>
Description: Bindings to ImageMagick: the most comprehensive open-source image processing library available. Supports many common formats (png, jpeg, tiff, pdf, etc) and manipulations (rotate, scale, crop, trim, flip, blur, etc). All operations are vectorized via the Magick++ STL meaning they operate either on a single frame or a series of frames for working with layers, collages, or animation. In RStudio images are automatically previewed when printed to the console, resulting in an interactive editing environment.
License: MIT + file LICENSE
URL: https://github.com/jeroenooms/magick#readme
BugReports: https://github.com/jeroenooms/magick/issues
SystemRequirements: ImageMagick++: ImageMagick-c++-devel (rpm) or libmagick++-dev (deb)
Imports: Rcpp (>= 0.12), curl
LinkingTo: Rcpp
Suggests: rsvg, webp
RoxygenNote: 5.0.1
NeedsCompilation: yes
Packaged: 2016-07-22 22:53:02 UTC; jeroen
Repository: CRAN
Date/Publication: 2016-07-24 01:24:35
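For a quick first taste, a minimal sketch is below. It assumes the image_read(), image_rotate(), image_scale() and image_write() helpers as documented for the package, and uses the built-in ImageMagick "logo:" sample image so no input file is needed:

```r
library(magick)

# Read the built-in ImageMagick sample image, apply a couple of the
# manipulations mentioned in the Description, and write the result out.
img <- image_read("logo:")
img <- image_rotate(img, 90)
img <- image_scale(img, "200")   # resize to 200 pixels wide
image_write(img, file.path(tempdir(), "output.png"))
```

In RStudio, printing img at the console previews the image, as noted above.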

More information about magick at CRAN

July 24, 2016 01:14 AM

New package RobustGaSP with initial version 0.5

Package: RobustGaSP
Type: Package
Title: Robust Gaussian Stochastic Process Emulation
Version: 0.5
Date/Publication: 2016-07-24 01:23:33
Authors@R: c(person(given="Mengyang",family="Gu",role=c("aut","cre"),email="mg211@stat.duke.edu"), person(given="Jesus",family="Palomo", role=c("aut")), person(given="James",family="Berger", role="aut"))
Maintainer: Mengyang Gu <mg211@stat.duke.edu>
Author: Mengyang Gu [aut, cre], Jesus Palomo [aut], James Berger [aut]
Description: Robust parameter estimation and prediction of Gaussian stochastic process emulators. Important functions : rgasp(), predict.rgasp().
License: GPL (>= 2)
Depends: methods
Imports: Rcpp (>= 0.12.3)
Suggests: nloptr (>= 1.0.4)
LinkingTo: Rcpp, RcppEigen
NeedsCompilation: yes
Repository: CRAN
Packaged: 2016-07-22 21:48:32 UTC; gumengyang
RoxygenNote: 5.0.1
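A minimal usage sketch of the two functions highlighted in the Description, on a toy one-dimensional function; the argument names (design, response, and the testing input passed to predict()) follow the package documentation and should be treated as assumptions:

```r
library(RobustGaSP)

# Toy 1-d computer-model output: emulate f(x) = sin(2*pi*x) from 10 runs.
design   <- matrix(seq(0, 1, length.out = 10), ncol = 1)
response <- sin(2 * pi * design)

# Fit the robust GaSP emulator ...
model <- rgasp(design = design, response = response)

# ... and predict at 50 new inputs via predict.rgasp().
test_input <- matrix(seq(0, 1, length.out = 50), ncol = 1)
pred <- predict(model, test_input)
head(pred$mean)
```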

More information about RobustGaSP at CRAN

July 24, 2016 01:14 AM

July 23, 2016

CRANberries

New package clubSandwich with initial version 0.2.1

Package: clubSandwich
Title: Cluster-Robust (Sandwich) Variance Estimators with Small-Sample Corrections
Version: 0.2.1
Authors@R: person("James", "Pustejovsky", email = "jepusto@gmail.com", role = c("aut", "cre"))
Description: Provides several cluster-robust variance estimators (i.e., sandwich estimators) for ordinary and weighted least squares linear regression models. Several adjustments are incorporated to improve small-sample performance. The package includes functions for estimating the variance-covariance matrix and for testing single- and multiple-contrast hypotheses based on Wald test statistics. Tests of single regression coefficients use Satterthwaite or saddle-point corrections. Tests of multiple-contrast hypotheses use an approximation to Hotelling's T-squared distribution. Methods are provided for a variety of fitted models, including lm(), plm() (from package 'plm'), gls() and lme() (from 'nlme'), robu() (from 'robumeta'), and rma.uni() and rma.mv() (from 'metafor').
Depends: R (>= 3.0.0)
License: GPL-3
VignetteBuilder: knitr
LazyData: true
Imports: stats
Suggests: Formula, knitr, car, metafor, robumeta, nlme, mlmRev, AER, plm, testthat, rmarkdown
RoxygenNote: 5.0.1
NeedsCompilation: no
Packaged: 2016-07-23 12:57:42 UTC; jep2963
Author: James Pustejovsky [aut, cre]
Maintainer: James Pustejovsky <jepusto@gmail.com>
Repository: CRAN
Date/Publication: 2016-07-23 21:07:09
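As a flavour of the interface, here is a minimal sketch using the package's vcovCR() and coef_test() functions on a toy clustered data set (the data and clustering variable are invented for illustration):

```r
library(clubSandwich)

# Toy data: 10 clusters of 5 observations each.
set.seed(1)
dat <- data.frame(cluster = rep(1:10, each = 5), x = rnorm(50))
dat$y <- 1 + 0.5 * dat$x + rnorm(50)

fit <- lm(y ~ x, data = dat)

# CR2 is the bias-reduced, small-sample corrected estimator.
V <- vcovCR(fit, cluster = dat$cluster, type = "CR2")

# t-tests of single coefficients with Satterthwaite degrees of freedom.
coef_test(fit, vcov = V, test = "Satterthwaite")
```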

More information about clubSandwich at CRAN

July 23, 2016 07:14 PM

Journal of Statistical Software

Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones

The growth in the use of computationally intensive statistical procedures, especially with big data, has necessitated the use of parallel computation on diverse platforms such as multicore, GPUs, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to "embarrassingly parallel" (EP) algorithms, especially on non-shared memory platforms. This paper develops a broadly applicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations.
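The core idea (split the data into chunks, run the estimator independently on each chunk, then average the chunk-level estimates) can be sketched in a few lines of base R. This is an illustrative reconstruction of the principle, not the paper's implementation:

```r
# Software-alchemy sketch: average per-chunk slope estimates instead of
# fitting one regression on all the data.
set.seed(42)
n <- 10000
x <- rnorm(n)
y <- 2 * x + rnorm(n)
dat <- data.frame(x, y)

# Split rows into 4 chunks and fit lm() independently on each: this step
# is embarrassingly parallel, with no interprocess communication.
chunks <- split(dat, rep(1:4, length.out = n))
chunk_slopes <- sapply(chunks, function(d) coef(lm(y ~ x, data = d))[["x"]])

ep_estimate   <- mean(chunk_slopes)                  # averaged EP estimate
full_estimate <- coef(lm(y ~ x, data = dat))[["x"]]  # conventional fit

# For large n the two estimates are statistically equivalent.
abs(ep_estimate - full_estimate)
```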

by Norman Matloff at July 23, 2016 12:00 AM

July 22, 2016

Dirk Eddelbuettel

RcppCCTZ 0.0.5

Version 0.0.5 of RcppCCTZ arrived on CRAN a couple of days ago. It reflects an upstream fix made a few weeks ago. CRAN tests revealed that g++-6 was tripping over one missing #define; this was added upstream and I subsequently synchronized with upstream. At the same time the set of examples was extended (see below).

Somehow useR! 2016 got in the way, and while working on the then-incomplete examples during my travels I forgot to release this until CRAN reminded me that their tests still failed. I promptly prepared the 0.0.5 release but somehow failed to update the NEWS files etc. They are correct in the repo but not in the shipped package. Oh well.

CCTZ is a C++ library for translating between absolute and civil times using the rules of a time zone. In fact, it is two libraries: one for dealing with civil time (human-readable dates and times), and one for converting between absolute and civil times via time zones. It requires only a proper C++11 compiler and the standard IANA time zone database which standard Unix, Linux, OS X, ... computers tend to have in /usr/share/zoneinfo. RcppCCTZ connects this library to R by relying on Rcpp.

Two good examples are now included, and shown here. The first one tabulates the time difference between New York and London (at a weekly level for compactness):

R> example(tzDiff)

tzDiffR> # simple call: difference now
tzDiffR> tzDiff("America/New_York", "Europe/London", Sys.time())
[1] 5

tzDiffR> # tabulate difference for every week of the year
tzDiffR> table(sapply(0:52, function(d) tzDiff("America/New_York", "Europe/London",
tzDiff+                                       as.POSIXct(as.Date("2016-01-01") + d*7))))

 4  5 
 3 50 
R> 

Because the two continents happen to spring forward and fall back between regular and daylight savings times on different dates, there are, respectively, two- and one-week periods where the difference is one hour less than usual.

A second example shifts the time to a different time zone:

R> example(toTz)

toTzR> toTz(Sys.time(), "America/New_York", "Europe/London")
[1] "2016-07-14 10:28:39.91740 CDT"
R> 

Note that because we return a POSIXct object, it is printed by R with the default (local) TZ attribute (for "America/Chicago" in my case). A more direct example asks what time it is in my time zone when it is midnight in Tokyo:

R> toTz(ISOdatetime(2016,7,15,0,0,0), "Japan", "America/Chicago")
[1] "2016-07-14 15:00:00 CDT"
R>

More changes will come in 0.0.6 as soon as I find time to translate the nice time_tool (command-line) example into an R function.

Changes in this version are summarized here:

Changes in version 0.0.5 (2016-07-09)

  • New utility example functions toTz() and tzDiff()

  • Synchronized with small upstream change for additional #ifdef for compiler differentiation

We also have a diff to the previous version thanks to CRANberries. More details are at the RcppCCTZ page; code, issue tickets etc at the GitHub repository.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

July 22, 2016 03:07 AM

July 19, 2016

Dirk Eddelbuettel

Rcpp 0.12.6: Rolling on

The sixth update in the 0.12.* series of Rcpp has arrived on the CRAN network for GNU R a few hours ago, and was just pushed to Debian. This 0.12.6 release follows the 0.12.0 release from late July, the 0.12.1 release in September, the 0.12.2 release in November, the 0.12.3 release in January, the 0.12.4 release in March, and the 0.12.5 release in May --- making it the seventh release at the steady bi-monthly release frequency. Just like the previous release, this one is once again more of a refining maintenance release which addresses small bugs, nuisances or documentation issues without adding any major new features. That said, some nice features (such as caching support for sourceCpp() and friends) were added.

Rcpp has become the most popular way of enhancing GNU R with C or C++ code. As of today, 703 packages on CRAN depend on Rcpp for making analytical code go faster and further. That is up by about forty packages from the last release in May!

Similar to the previous releases, we have contributions from first-time committers. Artem Klevtsov made na_omit run faster on vectors without NA values. Otherwise, we had many contributions from "regulars" like Kirill Mueller, James "coatless" Balamuta and Dan Dillon as well as from fellow Rcpp Core contributors. Some noteworthy highlights are encoding and string fixes, generally more robust builds, a new iterator-based approach for vectorized programming, the aforementioned caching for sourceCpp(), and several documentation enhancements. More details are below.

Changes in Rcpp version 0.12.6 (2016-07-18)

  • Changes in Rcpp API:

    • The long long data type is used only if it is available, to avoid compiler warnings (Kirill Müller in #488).

    • The compiler is made aware that stop() never returns, to improve code path analysis (Kirill Müller in #487 addressing issue #486).

    • String replacement was corrected (Qiang in #479 following mailing list bug report by Masaki Tsuda)

    • Allow for UTF-8 encoding in error messages via RCPP_USING_UTF8_ERROR_STRING macro (Qin Wenfeng in #493)

    • The R function Rf_warningcall is now provided as well (as usual without leading Rf_) (#497 fixing #495)

  • Changes in Rcpp Sugar:

    • Const-ness of min and max functions has been corrected. (Dan Dillon in PR #478 fixing issue #477).

    • Ambiguities for matrix/vector and scalar operations have been fixed (Dan Dillon in PR #476 fixing issue #475).

    • New algorithm header using iterator-based approach for vectorized functions (Dan in PR #481 revisiting PR #428 and addressing issue #426, with further work by Kirill in PR #488 and Nathan in #503 fixing issue #502).

    • The na_omit() function is now faster for vectors without NA values (Artem Klevtsov in PR #492)

  • Changes in Rcpp Attributes:

    • Add cacheDir argument to sourceCpp() to enable caching of shared libraries across R sessions (JJ in #504).

    • Code generation now deals correctly with packages containing a dot in their name (Qiang in #501 fixing #500).

  • Changes in Rcpp Documentation:

    • A section on default parameters was added to the Rcpp FAQ vignette (James Balamuta in #505 fixing #418).

    • The Rcpp-attributes vignette is now mentioned more prominently in question one of the Rcpp FAQ vignette.

    • The Rcpp Quick Reference vignette received a facelift with new sections on Rcpp attributes and plugins being added (James Balamuta in #509 fixing #484).

    • The bib file was updated with respect to the recent JSS publication for RProtoBuf.
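The sourceCpp() caching added in this release can be exercised as follows. This is a small sketch (the cache location is arbitrary) and requires a working C++ toolchain:

```r
library(Rcpp)

# With cacheDir set, the compiled shared library is reused across R
# sessions instead of being rebuilt every time the code is sourced.
cache <- file.path(tempdir(), "rcpp-cache")
dir.create(cache, showWarnings = FALSE)

sourceCpp(code = '
  #include <Rcpp.h>
  // [[Rcpp::export]]
  double cppSum(Rcpp::NumericVector x) {
    return Rcpp::sum(x);
  }
', cacheDir = cache)

cppSum(c(1, 2, 3))   # 6
```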

Thanks to CRANberries, you can also look at a diff to the previous release. As always, even fuller details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads page, the browseable doxygen docs and zip files of doxygen output for the standard formats. A local directory has source and documentation too. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

July 19, 2016 12:49 PM

July 18, 2016

R you ready?

Populating data frame cells with more than one value

Data frames are lists

Most R users will know that data frames are lists. You can easily verify that a data frame is a list by typing

d <- data.frame(id=1:2, name=c("Jon", "Mark"))
d
 id name
1 1 Jon
2 2 Mark
is.list(d)
[1] TRUE

However, data frames are lists with some special properties. For example, all entries in the list must have the same length (here 2), etc. You can find a nice description of the differences between lists and data frames here. Accessing the first column of d, we find that it contains a vector (a factor, in the case of column name). Note that [[ ]] is an operator to select a list element. As data frames are lists, it works here as well.

is.vector(d[[1]])
[1] TRUE
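As an aside, the single-bracket operator behaves differently: it subsets the list (i.e., the data frame) and returns another data frame, while [[ ]] extracts the column itself. A quick sketch:

```r
d <- data.frame(id = 1:2, name = c("Jon", "Mark"))

# [[ ]] extracts the column itself, a plain vector ...
is.vector(d[["id"]])    # TRUE

# ... while [ ] subsets the data frame and returns a smaller data frame.
is.data.frame(d["id"])  # TRUE
```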

Data frame columns can contain lists

For a long time I was unaware of the fact that data frames may also contain lists as columns instead of vectors. For example, let’s assume Jon’s children are Mary and James, and Mark’s children are called Greta and Sally. Their names are stored in a list with two elements. We can add them to the data frame like this:

d$children <-  list(c("Mary", "James"), c("Greta", "Sally"))
d
 id name children
1 1 Jon Mary, James
2 2 Mark Greta, Sally

A single data frame entry in column children now contains more than one value. Given that the column is a list, not a vector, we cannot proceed as usual when modifying an entry of the column. For example, to change Jon’s children, we cannot do

> d[1 , "children"] <- c("Mary", "James", "Thomas")

Error in `[<-.data.frame`(`*tmp*`, 1, "children", value = c("Mary", "James", :
replacement has 3 rows, data has 1

Taking into account the list structure of the column, we can type the following to change the values in a single cell.

d[1 , "children"][[1]] <- list(c("Mary", "James", "Thomas"))

# or also

d$children[1] <- list(c("Mary", "James", "Thomas"))
d
 id name children
1 1 Jon Mary, James, Thomas
2 2 Mark Greta, Sally

You can also create a data frame having a list as a column using the data.frame function, but with a little tweak. The list column has to be wrapped inside the function I. This will protect it from several conversions taking place in data.frame (see the ?I documentation).

d <- data.frame(id = 1:2,
                   name = c("Jon", "Mark"),
                   children = I(list(c("Mary", "James"),
                                     c("Greta", "Sally")))
                )

This is an interesting feature, which gives me a deeper understanding of what a data frame is. But when exactly would I want to use it? I have not encountered the need to use it very often yet (though of course there may be plenty of situations where it makes sense). But today I had a case where this feature seemed particularly useful.

Converting lists and data frames to JSON

I had two separate types of information: one stored in a data frame and the other one in a list. Referring to the example above, I had

d <- data.frame(id=1:2, name=c("Jon", "Mark"))
d
 id name
1 1 Jon
2 2 Mark

and

ch <- list(c("Mary", "James"), c("Greta", "Sally"))
ch
[[1]]
[1] "Mary" "James"

[[2]]
[1] "Greta" "Sally"

I needed to return an array of JSON objects which look like this.

[
  {
    "id": 1,
    "name": "Jon",
    "children": ["Mary", "James"]
  }, 
  {
    "id": 2,
    "name": "Mark",
    "children": ["Greta", "Sally"]
  }
]

Working with the superb jsonlite package to convert R to JSON, I could do the following to get the result above.

library(jsonlite)

l <- split(d, seq(nrow(d))) # convert data frame rows to list
l <- unname(l)              # remove list names
for (i in seq_along(l))     # add element from ch to list
    l[[i]] <- c(l[[i]], children=ch[i])

toJSON(l, pretty=T, auto_unbox = T) # convert to JSON

The results are correct, but getting there involved quite a number of tedious steps. These can be avoided by directly placing the list into a column of the data frame. Then jsonlite::toJSON takes care of the rest.

d$children <- list(c("Mary", "James"), c("Greta", "Sally"))
toJSON(d, pretty=T, auto_unbox = T)
[
  {
    "id": 1,
    "name": "Jon",
    "children": ["Mary", "James"]
  },
  {
    "id": 2,
    "name": "Mark",
    "children": ["Greta", "Sally"]
  }
]

Nice :) What we do here is basically create the same nested list structure as above, only now it is disguised as a data frame. However, this approach is much more convenient.
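For completeness, the conversion also works in the other direction: jsonlite::fromJSON turns such an array of objects back into a data frame, recreating the list column (sketched here with children vectors of unequal length, which forces a list column):

```r
library(jsonlite)

json <- '[
  {"id": 1, "name": "Jon",  "children": ["Mary", "James", "Thomas"]},
  {"id": 2, "name": "Mark", "children": ["Greta", "Sally"]}
]'

# fromJSON simplifies the array of objects to a data frame; the
# unequal-length "children" arrays come back as a list column.
d2 <- fromJSON(json)
is.list(d2$children)   # TRUE
```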


by markheckmann at July 18, 2016 05:01 PM

July 12, 2016

Dirk Eddelbuettel

RProtoBuf 0.4.4, and new JSS paper

A new release 0.4.4 of RProtoBuf is now on CRAN, and corresponds to the source archive for the Journal of Statistical Software paper about RProtoBuf as JSS vol71 issue 02. The paper is also included as a pre-print in the updated package.

RProtoBuf provides R bindings for the Google Protocol Buffers ("Protobuf") data encoding library used and released by Google, and deployed as a language and operating-system agnostic protocol by numerous projects.
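A basic round trip looks like this, using the tutorial.Person message type from the addressbook.proto file that ships with the package:

```r
library(RProtoBuf)   # attaches the bundled addressbook.proto descriptors

# Create a message, serialize it to a raw vector, and read it back.
p <- new(tutorial.Person, id = 1, name = "Murray")
raw_bytes <- p$serialize(NULL)
q <- read(tutorial.Person, raw_bytes)
q$name   # "Murray"
```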

This release brings small cleanups as well as build-system updates for the updated R 3.3.0 infrastructure based on g++ 4.9.*.

Changes in RProtoBuf version 0.4.4 (2016-07-10)

  • New vignette based on our brand-new JSS publication (v71 i02)

  • Some documentation enhancements were made, as well as other minor cleanups to file modes and operations

  • Unit-test vignette no longer writes to /tmp per CRAN request

  • The new Windows toolchain (based on g++ 4.9.*) is supported

CRANberries also provides a diff to the previous release. The RProtoBuf page has an older package vignette, a 'quick' overview vignette, a unit test summary vignette, and the pre-print for the JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

July 12, 2016 01:43 AM

July 11, 2016

Journal of Statistical Software

JMFit: A SAS Macro for Joint Models of Longitudinal and Survival Data

Joint models for longitudinal and survival data now have a long history of being used in clinical trials or other studies in which the goal is to assess a treatment effect while accounting for a longitudinal biomarker such as patient-reported outcomes or immune responses. Although software has been developed for fitting the joint model, no software packages are currently available for simultaneously fitting the joint model and assessing the fit of the longitudinal component and the survival component of the model separately as well as the contribution of the longitudinal data to the fit of the survival model. To fulfill this need, we develop a SAS macro, called JMFit. JMFit implements a variety of popular joint models and provides several model assessment measures including the decomposition of AIC and BIC as well as ∆AIC and ∆BIC recently developed in Zhang, Chen, Ibrahim, Boye, Wang, and Shen (2014). Examples with real and simulated data are provided to illustrate the use of JMFit.

by Danjie Zhang, Ming-Hui Chen, Joseph G. Ibrahim, Mark E. Boye, Wei Shen at July 11, 2016 03:23 PM

RProtoBuf: Efficient Cross-Language Data Serialization in R

Modern data collection and analysis pipelines often involve a sophisticated mix of applications written in general purpose and specialized programming languages. Many formats commonly used to import and export data between different programs or systems, such as CSV or JSON, are verbose, inefficient, not type-safe, or tied to a specific programming language. Protocol Buffers are a popular method of serializing structured data between applications - while remaining independent of programming languages or operating systems. They offer a unique combination of features, performance, and maturity that seems particularly well suited for data-driven applications and numerical computing. The RProtoBuf package provides a complete interface to Protocol Buffers from the R environment for statistical computing. This paper outlines the general class of data serialization requirements for statistical computing, describes the implementation of the RProtoBuf package, and illustrates its use with example applications in large-scale data collection pipelines and web services.

by Dirk Eddelbuettel, Murray Stokely, Jeroen Ooms at July 11, 2016 03:23 PM

Dirk Eddelbuettel

Rcpp now used by over 700 CRAN packages

Earlier this morning, Rcpp reached another milestone: 701 packages on CRAN now depend on it (as measured by Depends, Imports and LinkingTo declarations). The graph on the left depicts the growth of Rcpp usage over time.

Rcpp cleared 300 packages in November 2014. It passed 400 packages in June of last year (when I only tweeted about it), 500 packages in late October and 600 packages exactly four months ago in March. The chart extends to the very beginning via manually compiled data from CRANberries and checked with crandb. The next part uses manually saved entries, and the final and largest part of the data set was generated semi-automatically via a short script appending updates to a small file-based backend. A list of packages using Rcpp is kept on this page.

Also displayed in the graph is the relative proportion of CRAN packages using Rcpp. The four percent hurdle was cleared just before useR! 2014 where I showed a similar graph (as two distinct graphs) in my invited talk. We passed five percent in December of 2014, six percent last July, seven percent just before Christmas and now criss-cross eight percent, or a little less than one in twelve R packages.

700 user packages is a really large and humbling number. This places quite some responsibility on us in the Rcpp team as we continue to try our best to keep Rcpp as performant and reliable as it has been.

So with that a very big Thank You! to all users and contributors of Rcpp for help, suggestions, bug reports, documentation or, of course, code.

July 11, 2016 11:41 AM

July 07, 2016

Bioconductor Project Working Papers

Practical Targeted Learning from Large Data Sets by Survey Sampling

We address the practical construction of asymptotic confidence intervals for smooth (i.e., pathwise differentiable), real-valued statistical parameters by targeted learning from independent and identically distributed data in contexts where sample size is so large that it poses computational challenges. We observe some summary measure of all data and select a sub-sample from the complete data set by Poisson rejective sampling with unequal inclusion probabilities based on the summary measures. Targeted learning is carried out from the easier to handle sub-sample. We derive a central limit theorem for the targeted minimum loss estimator (TMLE) which enables the construction of the confidence intervals. The inclusion probabilities can be optimized to reduce the asymptotic variance of the TMLE. We illustrate the procedure with two examples where the parameters of interest are variable importance measures of an exposure (binary or continuous) on an outcome. We also conduct a simulation study and comment on its results.
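As a small illustration of the sub-sampling step (not the authors' code), Poisson rejective sampling draws independent Bernoulli inclusion indicators with unit-specific probabilities and redraws until the realized sample size equals the target:

```r
set.seed(2016)
N <- 1000                    # full data set size
summary_measure <- runif(N)  # stand-in for the observed summary measures

# Unequal inclusion probabilities proportional to the summary measure,
# scaled so the expected sample size is n.
n  <- 100
pi <- n * summary_measure / sum(summary_measure)

# Poisson rejective sampling: redraw the independent Bernoulli
# indicators until exactly n units are included.
repeat {
  keep <- rbinom(N, 1, pmin(pi, 1)) == 1
  if (sum(keep) == n) break
}
sub_sample <- which(keep)
length(sub_sample)   # 100
```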

by Patrice Bertail et al. at July 07, 2016 11:08 PM

Estimation of long-term area-average PM2.5 concentrations for area-level health analyses

Introduction: There is increasing evidence of an association between individual long-term PM2.5 exposure and human health. Mortality and morbidity data collected at the area-level are valuable resources for investigating corresponding population-level health effects. However, PM2.5 monitoring data are available for limited periods of time and locations, and are not adequate for estimating area-level concentrations. We developed a general approach to estimate county-average concentrations representative of population exposures for 1980-2010 in the continental U.S.

Methods: We predicted annual average PM2.5 concentrations at about 70,000 census tract centroids, using a point prediction model previously developed for estimating annual average PM2.5 concentrations in the continental U.S. for 1980-2010. We then averaged these predicted PM2.5 concentrations in all counties weighted by census tract population. In sensitivity analyses, we compared the resulting estimates to four alternative county average estimates using MSE-based R2 in order to capture both systematic and random differences in estimates. These estimates included crude aggregates of regulatory monitoring data, averages of predictions at residential addresses in Southern California, and two sets of averages of census tract centroid predictions unweighted by population and interpolated from predictions at 25-km national grid coordinates.
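The population-weighted aggregation step amounts to a per-county weighted mean; a toy base-R sketch with invented numbers:

```r
# Toy tract-level predictions: county, predicted PM2.5, tract population.
tracts <- data.frame(
  county = c("A", "A", "B", "B", "B"),
  pm25   = c(12.0, 10.0, 9.0, 8.0, 7.0),
  pop    = c(3000, 1000, 2000, 2000, 6000)
)

# County average weighted by census tract population.
county_avg <- sapply(split(tracts, tracts$county),
                     function(d) weighted.mean(d$pm25, d$pop))
county_avg   # A = 11.5, B = 7.6
```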

Results: The county-average mean PM2.5 was 14.40 (standard deviation=3.94) µg/m3 in 1980 and decreased to 12.24 (3.24), 10.42 (3.30), and 8.06 (2.06) µg/m3 in 1990, 2000, and 2010, respectively. These estimates were moderately related to crude averages in 2000 and 2010 when monitoring data were available (R2= 0.70-0.82) and almost identical to the unweighted averages in all four decennial years. County averages were also consistent with the county averages derived from residential estimates in Southern California (0.95-0.96). We found grid-based estimates of county-average PM2.5 were more consistent with our estimates when we also included monitoring data (0.95-0.98) than grid-only estimates (0.91-0.96); both had slightly lower concentrations than census tract-based estimates.

Conclusions: Our approach to estimating population representative area-level PM2.5 concentrations is consistent with averages across residences. These exposure estimates will allow us to assess health impacts of ambient PM2.5 concentration in datasets with area-level health data.

by Sun-Young Kim et al. at July 07, 2016 09:39 PM

June 29, 2016

Bioconductor Project Working Papers

Scalable Collaborative Targeted Learning for Large Scale and High-dimensional Data

The collaborative double robust targeted maximum likelihood estimator (C-TMLE) is an extension of targeted minimum loss-based estimators (TMLE) that pursues an optimal strategy for estimation of the nuisance parameter. The original implementation of the C-TMLE algorithm uses a greedy forward stepwise selection procedure to construct a nested sequence of candidate nuisance parameter estimators. Cross-validation is then used to select the candidate that minimizes bias in the estimate of the target parameter, rather than basing selection on the fit of the nuisance parameter model. C-TMLE has exhibited superior relative performance in analyses of sparse data, but the time complexity of the algorithm is $\mathcal{O}(p^2)$, where $p$ is the number of covariates available for inclusion in the model. Despite a criterion that allows for early termination, the greedy algorithm does not scale to large scale and high dimensional data.
This article introduces two scalable versions of C-TMLE. Each relies on an easily computed data adaptive pre-ordering of the variables. The time complexity of these scalable algorithms is $\mathcal{O}(p)$, and an early data adaptive stopping rule further reduces computation time without sacrificing statistical performance. We also introduce SL-CTMLE, an approach that uses super learning to select the best variable ordering from a set of ordering strategies. Simulation studies illustrate the performance of the scalable C-TMLEs relative to the original C-TMLE, the augmented inverse probability of treatment weighted estimator (A-IPTW), the inverse probability of treatment weighting (IPTW) estimator, and standard TMLE using an external non-collaborative estimator of the treatment mechanism. Scalable C-TMLEs were also applied to three real-world health insurance claims datasets to estimate an average treatment effect. High-dimensional covariates were generated from the claims data based on high-dimensional propensity score (hdPS) screening. All C-TMLEs provided similar estimates and mean squared errors. Scalable C-TMLE analyses ran ten times faster than the original C-TMLE in larger datasets, making C-TMLE a feasible option for the analysis of large scale high dimensional data.

by Cheng Ju et al. at June 29, 2016 06:04 PM

June 28, 2016

Bioconductor Project Working Papers

Propensity score prediction for electronic healthcare dataset using Super Learner and High-dimensional Propensity Score Method

The optimal learner for prediction modeling varies depending on the underlying data-generating distribution. To select the best algorithm for a given set of data we must therefore use cross-validation to compare several candidate algorithms. Super Learner (SL) is an ensemble learning algorithm that uses cross-validation to select among a "library" of candidate algorithms. The SL is not restricted to a single prediction algorithm, but uses the strengths of a variety of learning algorithms to adapt to different datasets.
While the SL has been shown to perform well in a number of settings, it has not been evaluated in large electronic healthcare datasets that are common in recent pharmacoepidemiology and medical research. In this article, we applied the SL on electronic healthcare datasets and evaluated the performance of the SL in its ability to predict treatment assignment (i.e., the propensity score) using three electronic healthcare datasets. We considered a library of algorithms that consisted of both nonparametric and parametric models. We also considered a novel strategy for prediction modeling that combines the SL with the high-dimensional Propensity Score (hdPS) variable selection algorithm. While the goal of the propensity score model is to control for confounding by balancing covariates across treatment groups, in this study we were interested in the predictive performance of these modeling strategies.
Prediction performance was assessed across three datasets using three metrics: likelihood, area under the curve, and time complexity. The results showed: 1. The hdPS often outperformed algorithms that used only user-specified baseline variables. 2. The best individual algorithm was highly dependent on the data set; an ensemble, data-adaptive method like the SL can adapt to this variability. 3. The SL that utilized the hdPS methodology outperformed all other algorithms considered in this study. 4. Moreover, in our study, the results showed the reliability of the SL: though the SL was optimized with respect to the negative likelihood, it performed best with respect to area under the curve (AUC) in all three data sets.
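The cross-validation selector at the heart of the SL (the "discrete" super learner) can be sketched in base R: score each candidate algorithm on held-out folds and keep the one with the smallest cross-validated risk. This is an illustrative sketch, not the SuperLearner package:

```r
set.seed(7)
n <- 200
x <- runif(n)
y <- rbinom(n, 1, plogis(-1 + 3 * x))   # binary "treatment" indicator
dat <- data.frame(x, y)

# Two candidate learners: intercept-only vs. logistic regression on x.
learners <- list(
  null  = function(train) glm(y ~ 1, family = binomial, data = train),
  logit = function(train) glm(y ~ x, family = binomial, data = train)
)

# 5-fold cross-validated negative log-likelihood for each learner.
folds <- sample(rep(1:5, length.out = n))
cv_risk <- sapply(learners, function(fit_fun) {
  nll <- 0
  for (k in 1:5) {
    fit <- fit_fun(dat[folds != k, ])
    p   <- predict(fit, newdata = dat[folds == k, ], type = "response")
    yk  <- dat$y[folds == k]
    nll <- nll - sum(yk * log(p) + (1 - yk) * log(1 - p))
  }
  nll
})

# The cross-validation selector picks the learner with smallest risk.
names(which.min(cv_risk))
```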

by Cheng Ju et al. at June 28, 2016 11:12 PM

June 27, 2016

Statistical Modelling

June 25, 2016

Rcpp Gallery

Introducing Rcpp::algorithm

Introduction

A while back, I saw a post on StackOverflow where the user was trying to use Rcpp::sugar::sum() on an RcppParallel::RVector. Obviously, this does not work, as Rcpp sugar pertains to Rcpp types but not to RcppParallel, which cannot rely on the SEXP-based representation if it is to allow multi-threaded execution. It raised the question “Why doesn’t something more generic exist to provide functions with R semantics that can be used on arbitrary data structures?” As a result, I set out to create a set of such functions in Rcpp::algorithm which follow the pattern of std::algorithm.

Rcpp::algorithm

Currently Rcpp::algorithm contains only a few simple functions. If these are found to be useful, more will be added. Examples of using the currently implemented iterator-based functions are below.

sum, sum_nona, prod, and prod_nona

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
double sum_of_matrix_row(NumericMatrix m, int row) {
    NumericMatrix::Row r = m.row(row);

    return algorithm::sum(r.begin(), r.end());
}

min, max, and mean

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
double mean_of_matrix_row(NumericMatrix m, int row) {
    NumericMatrix::Row r = m.row(row);

    return algorithm::mean(r.begin(), r.end());
}

log, exp, and sqrt

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector log_of_matrix_row(NumericMatrix m, int row) {
    NumericMatrix::Row r = m.row(row);

    NumericVector retval(m.cols());
    algorithm::log(r.begin(), r.end(), retval.begin());

    return retval;
}

Additional Benefits

Through the coding of these simple “algorithms”, a few needs arose.

First, the ability to deduce the appropriate C numeric type given an Rcpp iterator was necessary. This gave birth to the Rcpp::algorithm::helpers::decays_to_ctype and Rcpp::algorithm::helpers::ctype type traits. Given a type, these allow you to determine whether it can be cast to a C numeric type and which type that would be.

Second, the need arose for more information about R types. This gave birth to the Rcpp::algorithm::helpers::rtype traits. These are defined as follows:

template< typename T >
struct rtype_helper {};

template<>
struct rtype_helper< double > {
    typedef double type;
    static RCPP_CONSTEXPR int RTYPE = REALSXP;
    static inline double NA() { return NA_REAL; }
    static inline RCPP_CONSTEXPR double ZERO() { return 0.0; }
    static inline RCPP_CONSTEXPR double ONE() { return 1.0; }
};

template<>
struct rtype_helper< int > {
    typedef int type;
    static RCPP_CONSTEXPR int RTYPE = INTSXP;
    static inline int NA() { return NA_INTEGER; }
    static inline RCPP_CONSTEXPR int ZERO() { return 0; }
    static inline RCPP_CONSTEXPR int ONE() { return 1; }
};

template< typename T >
struct rtype {
    typedef typename rtype_helper< typename ctype< T >::type >::type type;
    typedef rtype_helper< typename ctype< T >::type > helper_type;
    static RCPP_CONSTEXPR int RTYPE = helper_type::RTYPE;
    static inline T NA() { return helper_type::NA(); }
    static inline RCPP_CONSTEXPR T ZERO() { return helper_type::ZERO(); }
    static inline RCPP_CONSTEXPR T ONE() { return helper_type::ONE(); }
};

These additional benefits may actually prove more useful than the algorithms themselves. Only time will tell.

Wrapping Up

There are now some simple iterator-based algorithms that can be used with any structure that supports iterators. They apply the same semantics as the analogous Rcpp::sugar functions, but give us more flexibility in their usage. If you find these to be useful, feel free to request more.

June 25, 2016 12:00 AM

Custom Templated as and wrap Functions within Rcpp.

Introduction

Consider a need to interface with a data type that is not presently supported by Rcpp. The data type might come from a new library, or from within one of our own applications. In either case, Rcpp has no knowledge of the new data type, so the automatic conversion behind the seamless R to C++ integration cannot be applied. The issue is two-fold, as we need to consider both directions:

  1. Converting from R to C++ using Rcpp::as<T>(obj)
  2. Converting from C++ to R using Rcpp::wrap(obj)

Luckily, there is a wonderful Rcpp vignette called Extending Rcpp that addresses custom objects. However, the details listed there are more abstract than one would like. So, I am going to try to take you through the steps with a bit of commentary. Please note that the approach used here relies on templates and partial specialization, and will result in some nice automagic at the end.

The overview of the discussion will focus on:

  • Stage 1 - Forward Declarations
  • Stage 2 - Including the Rcpp Header
  • Stage 3 - Implementation of Forward Declarations
  • Stage 4 - Testing Functionality
  • Stage 5 - All together

Explanation of Stages

Stage 1 - Forward Declarations

In the first stage, we must declare our intent to the features we wish to use prior to engaging Rcpp.h. To do so, we will load a different header file and add some definitions to the Rcpp::traits namespace.

Principally, when we start writing the file, the first header that we must load is RcppCommon.h and not the usual Rcpp.h! If we do not place the forward declarations prior to including Rcpp.h, we will be unable to register our extension appropriately.

Then, we must add the different plugin markup for sourceCpp() to set the appropriate flags during the compilation of the code. After the plugins, we include the actual headers that we want to use; in this document, we will focus on Boost headers for the concrete example. Lastly, we must add two special Rcpp function declarations, Rcpp::as<T>(obj) and Rcpp::wrap(obj), within the Rcpp::traits namespace. To enable multiple types, we create an Exporter class instead of a more direct template <> ClassName as( SEXP ) specialization.

// -------------- Stage 1: Forward Declarations with `RcppCommon.h`

#include <RcppCommon.h>

// Flags for C++ compiler: include Boost headers, use the C++11 standard

// [[Rcpp::depends(BH)]]
// [[Rcpp::plugins("cpp11")]]

// Third party library includes that provide the template class of ublas
#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <boost/numeric/ublas/matrix.hpp>

// Provide Forward Declarations
namespace Rcpp {

    namespace traits{
  
        // Setup non-intrusive extension via template specialization for
        // 'ublas' class boost::numeric::ublas
    
        // Support for wrap
        template <typename T> SEXP wrap(const boost::numeric::ublas::vector<T> & obj);
    
        // Support for as<T>
        template <typename T> class Exporter< boost::numeric::ublas::vector<T> >;
  
    }
}

Stage 2 - Include the Rcpp.h header

It might seem frivolous to have a stage just to declare include order, but if Rcpp.h is included before the forward declarations then Rcpp::traits is not updated and we enter the abyss. Template programming can be delicate; respecting this include order is one of many small details one must get right.

Thus:

// -------------- Stage 2: Including Rcpp.h

// ------ Place <Rcpp.h> AFTER the Forward Declaration!!!!

#include <Rcpp.h>

// ------ Place Implementations of Forward Declarations AFTER <Rcpp.h>!

Stage 3 - Implementing the Declarations

Now, we must actually implement the forward declarations. The only implementation that is slightly problematic is as<>, since wrap() is straightforward.

wrap()

To implement wrap() we must appeal to a built-in type conversion index within Rcpp which is called Rcpp::traits::r_sexptype_traits<T>::rtype. From this, we are able to obtain an int containing the RTYPE and then construct an Rcpp::Vector. For the construction of a matrix, the same ideas hold true.

as()

For as<>(), we need to consider the template type that will be passed in. We also set up a typedef directly underneath the Exporter class definition to easily define an OUT object to be used within the get() method. Beyond that, we use the same trick to move back and forth between a C++ template type T and an R type (implemented as one of several SEXP types).

In order to accomplish the as<>, or the direct port from R to C++, I had to do something dirty: I copied the vector contents. The code that governs this output is given within the get() of the Exporter class. You may wish to spend some time looking into changing the assignment using pointers perhaps. I am not very well versed with ublas so I did not see an easy approach to resolve the pointer pass.

// -------------- Stage 3: Implementing the Declarations

// Define template specializations for as<> and wrap
namespace Rcpp {

    namespace traits{
  
        // Defined wrap case
        template <typename T> SEXP wrap(const boost::numeric::ublas::vector<T> & obj){
            const int RTYPE = Rcpp::traits::r_sexptype_traits<T>::rtype ;
      
            return Rcpp::Vector< RTYPE >(obj.begin(), obj.end());
        };
    
    
        // Defined as< > case
        template<typename T> class Exporter< boost::numeric::ublas::vector<T> > {
            typedef typename boost::numeric::ublas::vector<T> OUT ;
      
            // Convert the type to a valid rtype. 
            const static int RTYPE = Rcpp::traits::r_sexptype_traits< T >::rtype ;
            Rcpp::Vector<RTYPE> vec;
      
            public:
            Exporter(SEXP x) : vec(x) {
                if (TYPEOF(x) != RTYPE)
                    throw std::invalid_argument("Wrong R type for mapped 1D array");
            }
            OUT get() {
        
                // Need to figure out a way to perhaps do a pointer pass?
                OUT x(vec.size());
        
                std::copy(vec.begin(), vec.end(), x.begin()); // have to copy data
        
                return x;
            }
        };
    }
}

Stage 4 - Testing

Okay, let’s see if what we worked on paid off (spoiler: it did!). To check, we should look at two different areas:

  1. Trace diagnostics within the function and;
  2. An automagic test.

Both of which are given below. Note that I’ve opted to shorten the ublas setup to just be:

// -------------- Stage 4: Testing

// Here we define a shortcut to the Boost ublas class to enable multiple ublas
// types via a template.
// ublas::vector<T> => ublas::vector<double>, ... , ublas::vector<int>
namespace ublas = ::boost::numeric::ublas;

Trace Diagnostics

// [[Rcpp::export]]
void containment_test(Rcpp::NumericVector x1) {
  
    Rcpp::Rcout << "Converting from Rcpp::NumericVector to ublas::vector<double>" << std::endl;

    // initialize the vector to all zero
    ublas::vector<double> x = Rcpp::as< ublas::vector<double> >(x1); 
  
    Rcpp::Rcout << "Running output test with ublas::vector<double>" << std::endl;
  
    for (unsigned i = 0; i < x.size (); ++ i)
        Rcpp::Rcout  << x(i) << std::endl;
  
    Rcpp::Rcout << "Converting from ublas::vector<double> to Rcpp::NumericVector" << std::endl;
  
    Rcpp::NumericVector test = Rcpp::wrap(x);
  
    Rcpp::Rcout << "Running output test with Rcpp::NumericVector" << std::endl;
  
    for (unsigned i = 0; i < test.size (); ++ i)
        Rcpp::Rcout  << test(i) << std::endl;
  
}

Test Call:

containment_test(c(1,2,3,4))

Results:

Converting from Rcpp::NumericVector to ublas::vector<double>
Running output test with ublas::vector<double>
1
2
3
4
Converting from ublas::vector<double> to Rcpp::NumericVector
Running output test with Rcpp::NumericVector
1
2
3
4

This test performed as expected. Onto the next test!

Automagic test

// [[Rcpp::export]]
ublas::vector<double> automagic_ublas_rcpp(ublas::vector<double> x1) {
    return x1;
}

Test Call:

automagic_ublas_rcpp(c(1,2,3.2,1.2))

Results:

automagic_ublas_rcpp(c(1,2,3.2,1.2))
[1] 1.0 2.0 3.2 1.2

Success!

Stage 5 - All together

Here is the combination of the above code chunks given by stage. If you copy and paste this into your .cpp file, then everything should work. If not, let me know.

// -------------- Stage 1: Forward Declarations with `RcppCommon.h`

#include <RcppCommon.h>

// Flags for C++ compiler: include Boost headers, use the C++11 standard

// [[Rcpp::depends(BH)]]
// [[Rcpp::plugins("cpp11")]]

// Third party library includes that provide the template class of ublas
#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <boost/numeric/ublas/matrix.hpp>

// Provide Forward Declarations
namespace Rcpp {

    namespace traits{
  
        // Setup non-intrusive extension via template specialization for
        // 'ublas' class boost::numeric::ublas
    
        // Support for wrap
        template <typename T> SEXP wrap(const boost::numeric::ublas::vector<T> & obj);
    
        // Support for as<T>
        template <typename T> class Exporter< boost::numeric::ublas::vector<T> >;
  
    }
}

// -------------- Stage 2: Including Rcpp.h

// ------ Place <Rcpp.h> AFTER the Forward Declaration!!!!

#include <Rcpp.h>

// ------ Place Implementations of Forward Declarations AFTER <Rcpp.h>!

// -------------- Stage 3: Implementing the Declarations

// Define template specializations for as<> and wrap
namespace Rcpp {

    namespace traits{
  
        // Defined wrap case
        template <typename T> SEXP wrap(const boost::numeric::ublas::vector<T> & obj){
            const int RTYPE = Rcpp::traits::r_sexptype_traits<T>::rtype ;
      
            return Rcpp::Vector< RTYPE >(obj.begin(), obj.end());
        };
    
    
        // Defined as< > case
        template<typename T> class Exporter< boost::numeric::ublas::vector<T> > {
            typedef typename boost::numeric::ublas::vector<T> OUT ;
      
            // Convert the type to a valid rtype. 
            const static int RTYPE = Rcpp::traits::r_sexptype_traits< T >::rtype ;
            Rcpp::Vector<RTYPE> vec;
      
            public:
            Exporter(SEXP x) : vec(x) {
                if (TYPEOF(x) != RTYPE)
                    throw std::invalid_argument("Wrong R type for mapped 1D array");
            }
            OUT get() {
        
                // Need to figure out a way to perhaps do a pointer pass?
                OUT x(vec.size());
        
                std::copy(vec.begin(), vec.end(), x.begin()); // have to copy data
        
                return x;
            }
        };
    }
}

// -------------- Stage 4: Testing

// Here we define a shortcut to the Boost ublas class to enable multiple ublas
// types via a template.
// ublas::vector<T> => ublas::vector<double>, ... , ublas::vector<int>
namespace ublas = ::boost::numeric::ublas;


// [[Rcpp::export]]
void containment_test(Rcpp::NumericVector x1) {
  
    Rcpp::Rcout << "Converting from Rcpp::NumericVector to ublas::vector<double>" << std::endl;

    // initialize the vector to all zero
    ublas::vector<double> x = Rcpp::as< ublas::vector<double> >(x1); 
  
    Rcpp::Rcout << "Running output test with ublas::vector<double>" << std::endl;
  
    for (unsigned i = 0; i < x.size (); ++ i)
        Rcpp::Rcout  << x(i) << std::endl;
  
    Rcpp::Rcout << "Converting from ublas::vector<double> to Rcpp::NumericVector" << std::endl;
  
    Rcpp::NumericVector test = Rcpp::wrap(x);
  
    Rcpp::Rcout << "Running output test with Rcpp::NumericVector" << std::endl;
  
    for (unsigned i = 0; i < test.size (); ++ i)
        Rcpp::Rcout  << test(i) << std::endl;
  
}


// [[Rcpp::export]]
ublas::vector<double> automagic_ublas_rcpp(ublas::vector<double> x1) {
    return x1;
}

Closing Remarks

Whew… That was a lot. Hopefully, the above provided enough information for you to extend this post’s content past 1D vectors, to perhaps a ublas::matrix and so on. In addition, you now have the autoconvert magic of Rcpp for ublas::vector<double>! All one needs to do is specify either a parameter or the return type of a function to be ublas::vector<double> – and voilà, automagic conversion!

June 25, 2016 12:00 AM

June 23, 2016

Rcpp Gallery

Working with Rcpp::StringVector

Vectors are fundamental containers in R. This makes them equally important in Rcpp. Vectors can be useful for storing multiple elements of a common class (e.g., integer, numeric, character). In Rcpp, vectors come in the form of NumericVector, CharacterVector, LogicalVector, StringVector and more. Look in the header file Rcpp/include/Rcpp/vector/instantiation.h for more types. Here we explore how to work with Rcpp::StringVector as a way to manage non-numeric data.

We typically interface with Rcpp by creating functions. There are several ways to include Rcpp functions in R. The examples here can be copied and pasted into a text file named ‘source.cpp’ and compiled with the command Rcpp::sourceCpp("source.cpp") made from the R command prompt.

Initialization

Here we create a simple function which initializes an empty vector of three elements in size and returns it.

#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector basic_function() {
    Rcpp::StringVector myvector(3);
    return myvector;
}

We can call this function from R as follows.

x <- basic_function()
x
[1] "" "" ""

The first two lines are pretty much mandatory, and you should copy and paste them into all your code until you understand them. The first line tells the program to use Rcpp. The second line exports this function for use, as opposed to keeping it as an internal function that is unavailable to users. Some people like to include using namespace Rcpp; to load the namespace. I prefer to use the scope operator (::) when calling functions. This is a matter of style and is therefore somewhat arbitrary. Whatever your perspective on this, it’s best to maintain consistency so that your code will be easier for others to understand.

We see that we’ve returned a vector of length three. We can also see that the default value is a string which contains nothing (“”). This is not a vector of NAs (missing data), even though NAs are supported by Rcpp::StringVector.

Accessing elements

The individual elements of a StringVector can be easily accessed. Here we’ll create an Rcpp function that accepts an Rcpp::StringVector as an argument. We’ll print the elements from within Rcpp. Then we’ll return the vector to R.

#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector basic_function( Rcpp::StringVector myvector ) {

    for( int i=0; i < myvector.size(); i++ ){
        Rcpp::Rcout << "i is: " << i << ", the element value is: " << myvector(i);
        Rcpp::Rcout << "\n";
    }
  
    return myvector;
}

After we’ve compiled it we can call it from R.

x1 <- c("apple", "banana", "orange")
x1
[1] "apple"  "banana" "orange"
x2 <- basic_function(x1)
i is: 0, the element value is: apple
i is: 1, the element value is: banana
i is: 2, the element value is: orange
x2
[1] "apple"  "banana" "orange"

We see that the R vector contains the elements “apple”, “banana” and “orange.” Within Rcpp we print each element to standard out with Rcpp::Rcout << statements. And we see that these values are returned to the vector x2.

We’ve also introduced the method .size(), which returns the number of elements in an object. This brings up an important difference between C++ and R. Many function names in R contain periods; for example, the function name write.table() delimits words with a period. However, in C++ the period indicates a method, which means that C++ object names can’t include a period. Camel case or underscores are good alternatives.

There are at least two other important issues to learn from the above example. First, in R we typically access elements with square brackets. While some C++ objects are also accessed with square brackets, the Rcpp::StringVector is accessed with either parentheses or square brackets. In the case of the Rcpp::StringVector these appear interchangeable. However, be very careful: they are different in other containers. A second, but very important, difference between R and C++ is that R vectors are 1-based, meaning the first element is accessed with a 1, while C++ vectors are zero-based, meaning the first element is accessed with a zero. This creates an opportunity for off-by-one errors. If you notice that the number of elements you’ve passed to C++ and back is off by one, this would be a good thing to check.

Elements of elements

In C++, a std::string can be seen as a vector of chars. Individual elements of an Rcpp::StringVector behave similarly, so accessing each element of a StringVector is similar to working with a std::string. Here we access each character of the second element of our StringVector.

#include <Rcpp.h>
// [[Rcpp::export]]
void basic_function(Rcpp::StringVector x) {
    for(int i=0; i < x[1].size(); i++){
        Rcpp::Rcout << "i is: " << i << ", element is: ";
        Rcpp::Rcout << x[1][i];
        Rcpp::Rcout << "\n";
    }
}

And call the code from R.

x1 <- c("apple", "banana", "orange")
x2 <- basic_function(x1)
i is: 0, element is: b
i is: 1, element is: a
i is: 2, element is: n
i is: 3, element is: a
i is: 4, element is: n
i is: 5, element is: a

We see that we’ve accessed and printed the individual characters of the second element of the vector. We accomplish this by using one set of square brackets to access element 1 (the second element) of the vector, and a second set of square brackets to access each character of that element.

Modifying elements

The modification of elements is fairly straightforward. We use the index (beginning at zero) to modify the vector elements.

#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector basic_function(Rcpp::StringVector x) {
    Rcpp::StringVector myvector = x;
    myvector[1] = "watermelon";
    myvector(2) = "kumquat";
    return myvector;
}
x1 <- c("apple", "banana", "orange")
x2 <- basic_function(x1)
x2
[1] "apple"      "watermelon" "kumquat"   

We’ve successfully changed the second element from ‘banana’ to ‘watermelon’ and the third element from ‘orange’ to ‘kumquat.’ This also illustrates that Rcpp::StringVectors are flexible in their use of both square and round brackets. Trying that with standard library containers will usually result in an error.

In the above example we’ve passed an Rcpp::StringVector to a function and returned a new Rcpp::StringVector. Copying the container in this manner may seem the intuitive way to work with it. But if efficient use of memory is desired, it is important to realize that a pointer to the underlying data is passed to the Rcpp function. This means we can create a function which returns void and modifies the elements we’re interested in, without the overhead of copying the container.

#include <Rcpp.h>
// [[Rcpp::export]]
void basic_function(Rcpp::StringVector x) {
    x(1) = "watermelon";
}
x1 <- c("apple", "banana", "orange")
basic_function(x1)
x1
[1] "apple"      "watermelon" "orange"    

Erasing elements

If we want to remove an element from a StringVector we can use the .erase() method.

#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector basic_function(Rcpp::StringVector x) {
    Rcpp::StringVector myvector = x;
    myvector.erase(1);
    return myvector;
}

And see our changes with R code.

x1 <- c("apple", "banana", "orange")
x2 <- basic_function(x1)
x2
[1] "apple"  "orange"

We see that we’ve erased the second element from the array.

Growing and shrinking Rcpp::StringVectors

If you have an Rcpp::StringVector and you want to add elements, you can use the method .push_back(). While Rcpp has push functionality, it does not appear to have pop functionality. However, using techniques learned above, we can use object.erase(object.size() - 1) to attain similar functionality. Here I illustrate their use to remove an element and then add two elements.

#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector basic_function(Rcpp::StringVector x) {
    x.erase( x.size() - 1 ); // C++ vectors are zero based so remember the -1!
  
    for(int i=0; i < x.size(); i++){
        Rcpp::Rcout << "i is: " << i << ", the element value is: " << x(i);
        Rcpp::Rcout << "\n";
    }
  
    x.push_back("avocado");
    x.push_back("kumquat");
    return x;
}

And implement our example in R.

x1 <- c("apple", "banana", "orange")
x2 <- basic_function(x1)
i is: 0, the element value is: apple
i is: 1, the element value is: banana
x2
[1] "apple"   "banana"  "avocado" "kumquat"

From the Rcpp output we see that we’ve removed the last element from the vector. We also see that we’ve added two elements to the ‘back’ of the vector.

If we want to add to the front of our vector we can accomplish that as well. There does not appear to be ‘push_front’ or ‘pop_front’ methods, but we have the tools necessary to accomplish these tasks. We use the erase and insert methods to push and pop to the front of our Rcpp::StringVector.

#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector basic_function(Rcpp::StringVector x) {
    x.erase(0);
    x.erase(0);
  
    x.insert(0, "avocado");
    x.insert(0, "kumquat");
    return x;
}

And implement our example in R.

x1 <- c("apple", "banana", "orange")
x2 <- basic_function(x1)
x2
[1] "kumquat" "avocado" "orange" 

In general, growing and shrinking data structures comes with a performance cost. And if you’re interested in Rcpp, you’re probably interested in performance. You’ll typically be better off setting a container size and sticking with it. But there are times when growing and shrinking your container can be really helpful. My recommendation is to use this functionality sparingly.

Missing data

In R we handle missing data with ‘NAs.’ In C++ the concept of missing data does not exist. Instead, some sort of placeholder, such as -999, has to be used. The Rcpp containers do support missing data to help make the interface between R and C++ easy. We can see this by continuing our existing example, but use it to set the second element as missing.

#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector basic_function(Rcpp::StringVector myvector) {
    myvector(1) = NA_STRING;
    return myvector;
}
x1 <- c("apple", "banana", "orange")
x2 <- basic_function(x1)
x2
[1] "apple"  NA       "orange"

Finding other methods

The Rcpp header files contain valuable information about objects defined in Rcpp. However, they’re rather technical and may not be very approachable to the novice. (This document is an attempt to help users bridge that gap between a novice and someone who reads headers.) If you don’t know where the header files are, you can use .libPaths() to list the locations of your libraries. In one of these locations you should find a directory called ‘Rcpp.’ Within this directory you should find a directory named ‘include’ which is where the headers are. For example, the header for the String object on my system is at: Rcpp/include/Rcpp/String.h

Type conversion

Once we have our data in an Rcpp function we may want to make use of the functionality of C++ containers. This will require us to convert our Rcpp container to another C++ form. Once we’ve processed these data we may want convert them back to Rcpp (so we can return them to R). A good example is converting an element of a StringVector to a std::string.

Implicit type conversion

Conversion from an Rcpp::StringVector to a std::string is a compatible conversion. This means we can accomplish it implicitly by simply setting the value of one container to the other.

#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector StringV_to_vstrings(Rcpp::StringVector StringV){
    std::vector<std::string> vstrings(StringV.size());
    int i;
    for (i = 0; i < StringV.size(); i++){
        vstrings[i] = StringV(i);
    }

    Rcpp::StringVector StringV2(StringV.size());
    StringV2 = vstrings;
    return(StringV2);
}
x1 <- c("apple", "banana", "orange")
x2 <- StringV_to_vstrings(x1)
x2
[1] "apple"  "banana" "orange"

Note that we have to load each element of the std::vector individually. The assignment to the Rcpp::StringVector, however, is vectorized, so it works similarly to R vectors.

Explicit type conversion

In some instances we may need explicit type conversion. Rcpp provides an ‘as’ method to accomplish this.

#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector StringV_to_vstrings(Rcpp::StringVector StringV){
    std::vector<std::string> vstrings(StringV.size());
    int i;
    for (i = 0; i < StringV.size(); i++){
        vstrings[i] = Rcpp::as< std::string >(StringV(i));
    }

    Rcpp::StringVector StringV2(StringV.size());
    StringV2 = vstrings;
    return(StringV2);
}
x1 <- c("apple", "banana", "orange")
x2 <- StringV_to_vstrings(x1)
x2
[1] "apple"  "banana" "orange"

Type conversion is a lengthy topic and is frequently specific to the types which are being converted to and from. Hopefully this introduction is enough to get you started with the tools provided in Rcpp.

Attributes

R objects include attributes which help describe the object. This is another concept that is absent in C++. Again, the Rcpp objects implement attributes to help us and to maintain a behavior that is similar to R.

#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector basic_function(Rcpp::StringVector myvector) {
    std::vector< std::string > mystrings =  myvector.attr("names");
    mystrings[2] = "citrus";
    myvector.attr("names") = mystrings;
    return myvector;
}
x1 <- c("apple", "banana", "orange")
names(x1) <- c("pome", "berry", "hesperidium")
x1
       pome       berry hesperidium 
    "apple"    "banana"    "orange" 
x2 <- basic_function(x1)
x2
    pome    berry   citrus 
 "apple" "banana" "orange" 

Here we’ve stored the names of the Rcpp::StringVector in a std::vector of strings. We’ve then modified one of the elements and reset the names attribute with this changed vector. This illustrates the use of standard library containers along with those provided by Rcpp. But we need to be a little careful of what we’re doing here: if we store the values of our vector in a vector of std::string, we lose our attributes, because neither std::vector nor std::string has attributes.

#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::StringVector basic_function(Rcpp::StringVector myvector) {
    std::vector< std::string > mystrings(myvector.size()); 
    mystrings[0] = myvector(0);
    mystrings[1] = myvector(1);
    mystrings[2] = myvector(2);
    myvector = mystrings;
    return myvector;
}
x1 <- c("apple", "banana", "orange")
names(x1) <- c("pome", "berry", "hesperidium")
x2 <- basic_function(x1)
x2
[1] "apple"  "banana" "orange"

Note that while we can assign a vector of strings to an Rcpp::StringVector, we cannot do the inverse. Instead we need to assign each element to the vector of strings. And we need to remember to keep our square brackets and round brackets associated with the correct data structure.

More information

Below are some links I’ve found useful in writing this document. Hopefully you’ll find them to be gateways for your exploration of Rcpp.

Once you’ve crossed from R to C++ there are many sources of information online. One of my favorites is included below.

June 23, 2016 12:00 AM

June 08, 2016

Rcpp Gallery

Optimizing Code vs Recognizing Patterns with 3D Arrays

Intro

As is the case with the majority of posts born into existence, there was an interesting problem that arose recently on StackOverflow. Steffen, a scientist at an unnamed weather service, faced an issue with the amount of computational time required by a loop-intensive operation in R. Specifically, Steffen needed to be able to sum over different continuous regions of elements within a 3D array structure with dimensions 2500 x 2500 x 50 to obtain the number of neighbors. The issue is quite common within geographic information science and/or spatial statistics. What follows next is a modified version of the response that I gave, providing additional insight into various solutions.

Problem Statement

Consider an array of matrices that contain only 1 and 0 entries and that are spread out over time, with dimensions xdim x ydim x tdim. Within each matrix, there are specific regions on the x-y plane at a fixed time t that must be summed over to obtain the count of neighboring elements. Each region is a tile centered at a point (x, y), spanning rows x - 2 through x + 2 and columns y - 2 through y + 2 (a five-by-five window, matching the code below).

# Matrix Dimensions
xdim <- 20
ydim <- 20

# Number of Time Steps
tdim <- 5

# Generation of 3D Arrays

# Initial Neighborhoods
a <- array(0:1, dim = c(xdim, ydim, tdim))

# Result storing matrix
res <- array(0:1, dim = c(xdim, ydim, tdim))

# Calculation loop over time
for (t in 1:tdim) {
    ## Subset by specific rows
    for (x in 3:(xdim-2)) {
        ## Subset by specific columns
        for (y in 3:(ydim-2)) {
            ## Sum over each region within a time point
            res[x,y,t] <- sum(a[(x-2):(x+2), (y-2):(y+2), t])
        }
    }
}

Sample Result:

Without loss of generality, I’ve opted to downscale the problem to avoid a considerable amount of output while displaying a sample result. Thus, the sample result presented next is under the dimensions $6 \times 6 \times 2$. Therefore, the results of the neighboring clusters are:

, , 1

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    0    0    0    0
[2,]    1    1    1    1    1    1
[3,]    0    0   10   10    0    0
[4,]    1    1   15   15    1    1
[5,]    0    0    0    0    0    0
[6,]    1    1    1    1    1    1

, , 2

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    0    0    0    0
[2,]    1    1    1    1    1    1
[3,]    0    0   10   10    0    0
[4,]    1    1   15   15    1    1
[5,]    0    0    0    0    0    0
[6,]    1    1    1    1    1    1

Note: Both time points, t <- 1 and t <- 2, are the same!!! More on this interesting pattern later…

Possible Solutions

There are many ways to approach a problem like this. The first is to try to optimize the initial code base within R. Secondly (and this is one of the primary reasons for this article), one can port the computational portion of the code to C++ and perhaps add a parallelization component, with the hope that such a setup decreases the amount of time required to perform said computations. Thirdly, one can try to understand the underlying structure of the arrays by searching for patterns that can be exploited to create a cache of values, which can then be reused instead of recomputed for each time point.

Thus, there are really three different components within this post that will be addressed:

  1. Optimizing R code directly in hopes of obtaining a speedup while brute forcing the problem;
  2. Porting the R code into C++ and parallelizing the computation using more brute force;
  3. Determining different patterns in the data and exploiting them to reduce the computation.

Optimizing within R

The initial problem statement has something that all users of R dread: a loop. However, it isn’t just one loop, it’s 3! As is known, one of the key downsides to R is looping. This problem has a lot of coverage — my favorite being the straight, curly, or compiled as it sheds light on R’s parser — and is primarily one of the key reasons why Rcpp is favored in loop heavy problems.

There are a few ways we can aim to optimize just the R code:

  1. Cache subsets within the loop.
  2. Parallelize the time computation stage.

Original

To show differences between functions, I’ll opt to declare the base computational loop given above as:

cube_r <- function(a, res, xdim, ydim, tdim){
    for (t in 1:tdim) {
        for (x in 3:(xdim-2)) {
            for (y in 3:(ydim-2)) {
                res[x,y,t] <- sum(a[(x-2):(x+2), (y-2):(y+2), t])
            }
        }
    }
    res
}

Cached R

The first order of business is to implement a cached-value scheme so that we avoid repeatedly subsetting the same elements.

cube_r_cache <- function(a, res, xdim, ydim, tdim){
    for (t in 1:tdim) {
        temp_time <- a[,,t]
        for (x in 3:(xdim-2)) {
            temp_row <- temp_time[(x-2):(x+2),]
            for (y in 3:(ydim-2)) {
                res[x,y,t] <- sum(temp_row[,(y-2):(y+2)])
            }
        }
    }
    res
}

Parallelized R

Next up, let’s implement a way to parallelize the computation over time using R’s built in parallel package.

library(parallel)

cube_r_parallel <- function(a, xdim, ydim, tdim, ncores){
    ## Create a cluster
    cl <- makeCluster(ncores)
  
    ## Create a time-based computation loop
    computation_loop <- function(X, a, xdim, ydim, tdim) {
        temp_time <- a[,,X]
        res <- temp_time
        for (x in 3:(xdim-2)) {
            temp_row <- temp_time[(x-2):(x+2),]
            for (y in 3:(ydim-2)) {
                res[x,y] <- sum(temp_row[,(y-2):(y+2)])
            }
        }
        res
    }
  
    ## Obtain the results
    r_parallel <- parSapply(cl = cl,
                            X = seq_len(tdim), 
                            FUN = computation_loop, 
                            a = a, xdim = xdim, ydim = ydim, tdim = tdim, 
                            simplify = "array")
  
    ## Kill the cluster
    stopCluster(cl)
  
    ## Return
    r_parallel
}

Validating Function Output

After modifying a function’s initial behavior, it is a good idea to check whether or not the new function obtains the same values. If not, chances are there was a bug that crept into the function after the change.

r_original  <- cube_r(a, res, xdim, ydim, tdim)
r_cached    <- cube_r_cache(a, res, xdim, ydim, tdim)
r_parallel1 <- cube_r_parallel(a, xdim, ydim, tdim, ncores = 1)

all.equal(r_original, r_cached)
[1] TRUE
all.equal(r_original, r_parallel1)
[1] TRUE

Thankfully, all the modifications yielded no numerical change from the original.

Benchmarking

As this is Rcpp, benchmarks are king. Here I’ve opted to create a benchmark of the different functions after trying to optimize them within R.

library("microbenchmark")
rtimings <- microbenchmark(r_orig   = cube_r(a, res, xdim, ydim, tdim),
                           r_cached = cube_r_cache(a, res, xdim, ydim, tdim),
                           r_cores1 = cube_r_parallel(a, xdim, ydim, tdim, ncores = 1),
                           r_cores2 = cube_r_parallel(a, xdim, ydim, tdim, ncores = 2),
                           r_cores3 = cube_r_parallel(a, xdim, ydim, tdim, ncores = 3),
                           r_cores4 = cube_r_parallel(a, xdim, ydim, tdim, ncores = 4))

Output:

Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval
   r_orig   3.810462   4.616145   5.275087   4.948391   5.746963   9.232729   100
 r_cached   3.104683   3.529720   3.965664   3.753478   4.053211   5.793070   100
 r_cores1 146.083671 151.194330 155.888165 153.902970 157.469347 193.474503   100
 r_cores2 283.613516 294.775359 302.326447 299.960622 306.691685 367.890747   100
 r_cores3 429.393160 441.279938 449.346205 445.347866 452.774463 540.862980   100
 r_cores4 575.001192 588.513541 602.381051 596.365266 608.556754 713.669543   100

Not surprisingly, the cached subset function performed better than the original R function. On the other hand, what was particularly surprising was that the parallelization option within R took considerably longer as additional cores were added. This was partly due to the fact that the a array had to be replicated out across the ncores different R processes spawned. In addition, there was also a time lag between spawning the processes and winding them down. This may prove to be a fatal flaw in the construction of cube_r_parallel, as a normal user may wish to make repeated calls within the same parallelized session. However, I digress, as I feel like I’ve spent too much time describing ways to optimize the R code.

Porting into C++

To keep in the spirit of optimizing a generalized solution, we can quickly port the cached R version of the computational loop to Armadillo, a C++ linear algebra library, and use OpenMP with RcppArmadillo.

Why not use just Rcpp, though? The rationale for using RcppArmadillo over Rcpp is principally the native support for multidimensional arrays via arma::cube. In addition, it’s a bit painful in the current version of Rcpp to work with an array. Hence, I would rather face the cost of copying an object from SEXP to arma and back again than mess around with that.

Before we get started, it is very important to provide protection for systems that still lack OpenMP by default in R (**cough** OS X **cough**). By changing how R’s ~/.R/Makevars compiler flags are set, OpenMP can be used within R; discussion of this approach is left to http://thecoatlessprofessor.com/programming/openmp-in-r-on-os-x/. Still, there exists a real need to protect users of the default R ~/.R/Makevars configuration by guarding the header inclusion of OpenMP in the cpp code using preprocessor directives:

// Protect against compilers without OpenMP
#ifdef _OPENMP
#include <omp.h>
#endif

Given the above code, we have effectively provided protection from the compiler throwing an error due to OpenMP not being available on the system. Note: In the event that the system does not have OpenMP, the process will be executed serially just like always.

With this being said, let’s look at the C++ port of the R function:

#include <RcppArmadillo.h>

// Correctly setup the build environment 
// [[Rcpp::depends(RcppArmadillo)]]

// Add a flag to enable OpenMP at compile time
// [[Rcpp::plugins(openmp)]]

// Protect against compilers without OpenMP
#ifdef _OPENMP
#include <omp.h>
#endif

// Parallelized C++ equivalent
// [[Rcpp::export]]
arma::cube cube_cpp_parallel(arma::cube a, arma::cube res, int ncores = 1) {
  
    // Extract the different dimensions

    // Normal Matrix dimensions
    unsigned int xdim = res.n_rows;
  
    unsigned int ydim = res.n_cols;
  
    // Depth of Array
    unsigned int tdim = res.n_slices;
  
    // Added an omp pragma directive to parallelize the loop with ncores
    #pragma omp parallel for num_threads(ncores)
    for (unsigned int t = 0; t < tdim; t++) { // Note: Array Index in C++ starts at 0 and goes to N - 1
    
        // Pop time `t` from `a` e.g. `a[,,t]`
        arma::mat temp_mat = a.slice(t);
    
        // Begin region selection
        for (unsigned int x = 2; x < xdim-2; x++) {
      
            // Subset the rows
            arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);
      
            // Iterate over the columns with unit accumulative sum
            for (unsigned int y = 2; y <  ydim-2; y++) {
                res(x,y,t) = accu(temp_row_sub.cols(y-2,y+2));
            }
        }
    }
  
    return res;
}

A few things to note here:

  1. a, res, xdim, ydim and ncores are shared across threads.
  2. t is unique to each thread.
  3. We protect users that cannot use OpenMP!

To verify that this is an equivalent function, we opt to check the object equality:

cpp_parallel <- cube_cpp_parallel(a, res, ncores = 2)

all.equal(cpp_parallel, r_original)
[1] TRUE

Timings

Just as before, let’s check the benchmarks to see how well we did:

port_timings <- microbenchmark(r_orig    = cube_r(a, res, xdim, ydim, tdim),
                               r_cached  = cube_r_cache(a, res, xdim, ydim, tdim),
                               cpp_core1 = cube_cpp_parallel(a, res, 1),
                               cpp_core2 = cube_cpp_parallel(a, res, 2),
                               cpp_core3 = cube_cpp_parallel(a, res, 3),
                               cpp_core4 = cube_cpp_parallel(a, res, 4))

Output:

Unit: microseconds
      expr      min        lq      mean    median        uq       max neval
    r_orig 3641.936 4000.8825 4704.5218 4198.0920 4813.7600 40890.469   100
  r_cached 2835.476 3056.9185 3300.9981 3195.7130 3328.6795  4466.824   100
 cpp_core1  177.770  182.8920  202.1724  188.5620  199.7085  1076.910   100
 cpp_core2  114.871  117.5255  136.1581  120.6395  132.3940   704.835   100
 cpp_core3   83.156   87.1025  100.3414   93.1880  106.2620   207.876   100
 cpp_core4   82.987   87.2925  106.9642   97.5210  121.6275   183.434   100

Wow! The C++ version of the parallelization really did wonders when compared to the previous R implementation of parallelization. The speed ups when compared to the R implementations are about 44x vs. original and 34x vs. the optimized R loop (using 3 cores). Note, with the addition of the 4th core, the parallelization option performs poorly as the system running the benchmarks only has four cores. Thus, one of the cores is trying to keep up with the parallelization while also having to work on operating system tasks and so on.

Now, this isn’t to say that there is no cap to parallelization benefits given infinite amounts of processing power. In fact, there are two laws that govern general speedups from parallelization: Amdahl’s Law (Fixed Problem Size) and Gustafson’s Law (Scaled Speedup). The details are better left for another time on this matter.

Detecting Patterns

Previously, we made the assumption that the structure of the data within computation had no pattern. Thus, we opted to create a generalized algorithm to effectively compute each value. Within this section, we remove the assumption about no pattern being present. In this case, we opt to create a personalized solution to the problem at hand.

As a first step, notice how the array is constructed in this case with array(0:1, dims). There seems to be some sort of pattern in the distribution of 1s and 0s depending on the xdim, ydim, and tdim. If we can recognize the pattern in advance, the amount of computation required decreases. However, this may also impact the ability to generalize the algorithm to other cases outside the problem statement, which is why this part of the post appears at the end of the article.

After some trial and error using different dimensions, the different patterns become recognizable and reproducible. Most notably, we have three different cases:

  • Case 1: If xdim is even, then only the rows of a matrix alternate, each row containing all 1s or all 0s.
  • Case 2: If xdim is odd and ydim is even, then only the rows of a matrix alternate, each row containing a combination of both 1s and 0s.
  • Case 3: If xdim is odd and ydim is odd, then both the rows and the matrices over time alternate, with a combination of both 1s and 0s.

Examples of Pattern Cases

Let’s see the cases described previously in action to observe the patterns. Please note that the dimensions of the example case arrays are small and would likely yield issues within the for loop given above due to the indices being negative or zero triggering an out-of-bounds error.

Case 1:

xdim <- 2
ydim <- 3
tdim <- 2
a <- array(0:1, dim = c(xdim, ydim, tdim))

Output:

, , 1

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    1    1    1

, , 2

     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    1    1    1

Case 2:

xdim <- 3
ydim <- 4
tdim <- 2
a <- array(0:1, dim = c(xdim, ydim, tdim))

Output:

, , 1

     [,1] [,2] [,3] [,4]
[1,]    0    1    0    1
[2,]    1    0    1    0
[3,]    0    1    0    1

, , 2

     [,1] [,2] [,3] [,4]
[1,]    0    1    0    1
[2,]    1    0    1    0
[3,]    0    1    0    1

Case 3:

xdim <- 3
ydim <- 3
tdim <- 3
a <- array(0:1, dim = c(xdim, ydim, tdim))

Output:

, , 1

     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    0    1
[3,]    0    1    0

, , 2

     [,1] [,2] [,3]
[1,]    1    0    1
[2,]    0    1    0
[3,]    1    0    1

, , 3

     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    0    1
[3,]    0    1    0

Pattern Hacking

Based on the above discussion, we opt to write a bit of code that exploits this unique pattern. The language we could write this code in is either R or C++. Due to the nature of this website, we opt to proceed in the latter. Nevertheless, as an exercise, feel free to backport this code into R.

Having acknowledged that, we start off by trying to create code that can fill a matrix with either an even or an odd column vector.

Odd Vector

The odd vector is defined as having an initial value of 1 instead of 0. It is used heavily in both Case 2 and Case 3.

     [,1]
[1,]    1
[2,]    0
[3,]    1
[4,]    0
[5,]    1

Even

In comparison to the odd vector, the even vector starts at 0. This is used principally in Case 1 and then in Case 2 and Case 3 to alternate columns.

     [,1]
[1,]    0
[2,]    1
[3,]    0
[4,]    1
[5,]    0

Creating Alternating Vectors

To obtain such an alternating vector that switches between two values, we create the vector using the modulus operator while iterating through the element positions of an $n$-length vector. Specifically, we use the fact that when i is an even number, i % 2 must be 0, and when i is odd, i % 2 gives 1. Therefore, we are able to create an alternating vector that switches between two different values with the following code:

#include <RcppArmadillo.h>

// Correctly setup the build environment 
// [[Rcpp::depends(RcppArmadillo)]]

// ------- Make Alternating Column Vectors
// Creates a vector with initial value 1
// [[Rcpp::export]]
arma::vec odd_vec(unsigned int xdim) {
  
    // make a temporary vector to create alternating 0-1 effect by row.
    arma::vec temp_vec(xdim);
  
    // Alternating vector (anyone have a better solution? )
    for (unsigned int i = 0; i < xdim; i++) {
        temp_vec(i) = (i % 2 ? 0 : 1); // Ternary operator in C++, e.g. if(TRUE){1}else{0}
    }
  
    return temp_vec;
}

// Creates a vector with initial value 0
// [[Rcpp::export]]
arma::vec even_vec(unsigned int xdim){
  
    // make a temporary vector to create alternating 0-1 effect by row.
    arma::vec temp_vec(xdim);
  
    // Alternating vector (anyone have a better solution? )
    for (unsigned int i = 0; i < xdim; i++) {
        temp_vec(i) = (i % 2 ? 1 : 0); // changed
    }
  
    return temp_vec;
}

Creating the three cases of matrix

With our ability to now generate odd and even vectors by column, we now need to figure out how to create the matrices described in each case. As mentioned above, there are three cases of matrix given as:

  1. The even,
  2. The bad odd,
  3. And the ugly odd.

Using a similar technique to obtain the alternating vector, we obtain the three cases of matrices:

// --- Handle the different matrix cases 

// Case 1: xdim is even
// [[Rcpp::export]]
arma::mat make_even_matrix_case1(unsigned int xdim, unsigned int ydim) {
  
    arma::mat temp_mat(xdim,ydim);
  
    temp_mat.each_col() = even_vec(xdim);
    
    return temp_mat;
}

// Case 2: xdim is odd and ydim is odd    
// [[Rcpp::export]]
arma::mat make_odd_matrix_case2(unsigned int xdim, unsigned int ydim){
  
    arma::mat temp_mat(xdim,ydim);
  
    // Cache values
    arma::vec e_vec = even_vec(xdim);
    arma::vec o_vec = odd_vec(xdim);
  
    // Alternating column 
    for (unsigned int i = 0; i < ydim; i++) {
        temp_mat.col(i) = (i % 2 ? e_vec : o_vec); 
    }
  
    return temp_mat;
}

// Case 3: xdim is odd and ydim is even
// [[Rcpp::export]]
arma::mat make_odd_matrix_case3(unsigned int xdim, unsigned int ydim){
  
    arma::mat temp_mat(xdim,ydim);
  
    // Cache values
    arma::vec e_vec = even_vec(xdim);
    arma::vec o_vec = odd_vec(xdim);
  
    // Alternating column 
    for (unsigned int i = 0; i < ydim; i++) {
        temp_mat.col(i) = (i % 2 ? o_vec : e_vec); // slight change
    }
  
    return temp_mat;
}

Calculation Engine

Next, we need to create a computational loop that subsets the appropriate continuous areas of the matrix to figure out the number of neighbors. In comparison to the problem statement, note that this loop is without the t, as we no longer need to repeat the calculation for each time point. Instead, we compute the values once, cache the result, and then duplicate it across the 3D array (e.g. arma::cube).

// --- Calculation engine

// [[Rcpp::export]]
arma::mat calc_matrix(arma::mat temp_mat) {
  
    // Obtain matrix dimensions
    unsigned int xdim = temp_mat.n_rows;
  
    unsigned int ydim = temp_mat.n_cols;
  
    // Create the 2D result matrix with temp_mat initial values
    arma::mat res = temp_mat;
  
    // Subset the rows
    for (unsigned int x = 2; x < xdim-2; x++){ // Note: Index Shift to C++!
    
        arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);
    
        // Iterate over the columns with unit accumulative sum
        for (unsigned int y = 2; y <  ydim-2; y++){
            res(x,y) = accu(temp_row_sub.cols(y-2,y+2));
        }
    }
  
    return res;
}

Call Main Function

Whew, that was a lot of work. But, by approaching the problem this way, we have:

  1. Created reusable code snippets.
  2. Decreased the size of the main call function.
  3. Improved clarity of each operation.

Now we are ready to write the glue that combines all the different components. As a result, we will obtain the desired neighbor information.

// --- Main Engine

// Create the desired cube information
// [[Rcpp::export]]
arma::cube dim_to_cube(unsigned int xdim = 4, unsigned int ydim = 4, unsigned int tdim = 3) {
  
    // Initialize values in A
    arma::cube res(xdim,ydim,tdim);
  
    // Case 1
    if (xdim % 2 == 0) {
        res.each_slice() = calc_matrix(make_even_matrix_case1(xdim, ydim));
    } else { // Either Case 2 or Case 3
        if (ydim % 2 == 0) { // Case 3
            res.each_slice() = calc_matrix(make_odd_matrix_case3(xdim, ydim));
        } else { // Case 2
            arma::mat first_odd_mat = calc_matrix(make_odd_matrix_case2(xdim, ydim));
            arma::mat sec_odd_mat = calc_matrix(make_odd_matrix_case3(xdim, ydim));
      
            for (unsigned int t = 0; t < tdim; t++) {
                res.slice(t) = (t % 2 ? first_odd_mat  : sec_odd_mat);
            }
        }
    }
    return res;
}

Pieced together

When put together, the code forms:

#include <RcppArmadillo.h>

// Correctly setup the build environment 
// [[Rcpp::depends(RcppArmadillo)]]

// ------- Make Alternating Column Vectors
// Creates a vector with initial value 1
// [[Rcpp::export]]
arma::vec odd_vec(unsigned int xdim) {
  
    // make a temporary vector to create alternating 0-1 effect by row.
    arma::vec temp_vec(xdim);
  
    // Alternating vector (anyone have a better solution? )
    for (unsigned int i = 0; i < xdim; i++) {
        temp_vec(i) = (i % 2 ? 0 : 1); // Ternary operator in C++, e.g. if(TRUE){1}else{0}
    }
  
    return temp_vec;
}

// Creates a vector with initial value 0
// [[Rcpp::export]]
arma::vec even_vec(unsigned int xdim){
  
    // make a temporary vector to create alternating 0-1 effect by row.
    arma::vec temp_vec(xdim);
  
    // Alternating vector (anyone have a better solution? )
    for (unsigned int i = 0; i < xdim; i++) {
        temp_vec(i) = (i % 2 ? 1 : 0); // changed
    }
  
    return temp_vec;
}

// --- Handle the different matrix cases 

// Case 1: xdim is even
// [[Rcpp::export]]
arma::mat make_even_matrix_case1(unsigned int xdim, unsigned int ydim) {
  
    arma::mat temp_mat(xdim,ydim);
  
    temp_mat.each_col() = even_vec(xdim);
    
    return temp_mat;
}

// Case 2: xdim is odd and ydim is odd    
// [[Rcpp::export]]
arma::mat make_odd_matrix_case2(unsigned int xdim, unsigned int ydim){
  
    arma::mat temp_mat(xdim,ydim);
  
    // Cache values
    arma::vec e_vec = even_vec(xdim);
    arma::vec o_vec = odd_vec(xdim);
  
    // Alternating column 
    for (unsigned int i = 0; i < ydim; i++) {
        temp_mat.col(i) = (i % 2 ? e_vec : o_vec); 
    }
  
    return temp_mat;
}

// Case 3: xdim is odd and ydim is even
// [[Rcpp::export]]
arma::mat make_odd_matrix_case3(unsigned int xdim, unsigned int ydim){
  
    arma::mat temp_mat(xdim,ydim);
  
    // Cache values
    arma::vec e_vec = even_vec(xdim);
    arma::vec o_vec = odd_vec(xdim);
  
    // Alternating column 
    for (unsigned int i = 0; i < ydim; i++) {
        temp_mat.col(i) = (i % 2 ? o_vec : e_vec); // slight change
    }
  
    return temp_mat;
}

// --- Calculation engine

// [[Rcpp::export]]
arma::mat calc_matrix(arma::mat temp_mat) {
  
    // Obtain matrix dimensions
    unsigned int xdim = temp_mat.n_rows;
  
    unsigned int ydim = temp_mat.n_cols;
  
    // Create the 2D result matrix with temp_mat initial values
    arma::mat res = temp_mat;
  
    // Subset the rows
    for (unsigned int x = 2; x < xdim-2; x++){ // Note: Index Shift to C++!
    
        arma::mat temp_row_sub = temp_mat.rows(x-2, x+2);
    
        // Iterate over the columns with unit accumulative sum
        for (unsigned int y = 2; y <  ydim-2; y++){
            res(x,y) = accu(temp_row_sub.cols(y-2,y+2));
        }
    }
  
    return res;
}

// --- Main Engine

// Create the desired cube information
// [[Rcpp::export]]
arma::cube dim_to_cube(unsigned int xdim = 4, unsigned int ydim = 4, unsigned int tdim = 3) {
  
    // Initialize values in A
    arma::cube res(xdim,ydim,tdim);
  
    // Case 1
    if (xdim % 2 == 0) {
        res.each_slice() = calc_matrix(make_even_matrix_case1(xdim, ydim));
    } else { // Either Case 2 or Case 3
        if (ydim % 2 == 0) { // Case 3
            res.each_slice() = calc_matrix(make_odd_matrix_case3(xdim, ydim));
        } else { // Case 2
            arma::mat first_odd_mat = calc_matrix(make_odd_matrix_case2(xdim, ydim));
            arma::mat sec_odd_mat = calc_matrix(make_odd_matrix_case3(xdim, ydim));
      
            for (unsigned int t = 0; t < tdim; t++) {
                res.slice(t) = (t % 2 ? first_odd_mat  : sec_odd_mat);
            }
        }
    }
    return res;
}

Verification of Results

To verify, let’s quickly create similar cases and test them against the original R function:

Case 1:
xdim <- 6; ydim <- 5; tdim <- 3
a <- array(0:1,dim=c(xdim, ydim, tdim)); res <- a

all.equal(dim_to_cube(xdim, ydim, tdim), cube_r(a, res, xdim, ydim, tdim))
[1] TRUE
Case 2:
xdim <- 7; ydim <- 6; tdim <- 3
a <- array(0:1,dim=c(xdim, ydim, tdim)); res <- a

all.equal(dim_to_cube(xdim, ydim, tdim), cube_r(a, res, xdim, ydim, tdim))
[1] TRUE
Case 3:
xdim <- 7; ydim <- 7; tdim <- 3
a <- array(0:1,dim=c(xdim, ydim, tdim)); res <- a

all.equal(dim_to_cube(xdim, ydim, tdim), cube_r(a, res, xdim, ydim, tdim))
[1] TRUE

Closing Time - You don’t have to go home but you can’t stay here.

With all of these different methods now thoroughly described, let’s do one last benchmark to figure out the best of the best.

xdim <- 20
ydim <- 20
tdim <- 5
a <- array(0:1,dim=c(xdim, ydim, tdim))
res <- a

end_bench <- microbenchmark::microbenchmark(r_orig    = cube_r(a, res, xdim, ydim, tdim),
                                            r_cached  = cube_r_cache(a, res, xdim, ydim, tdim),
                                            cpp_core1 = cube_cpp_parallel(a, res, 1), 
                                            cpp_core2 = cube_cpp_parallel(a, res, 2), 
                                            cpp_core3 = cube_cpp_parallel(a, res, 3),
                                            cpp_cpat  = dim_to_cube(xdim, ydim, tdim))

Output:

Unit: microseconds
      expr      min        lq       mean    median        uq       max neval
    r_orig 3692.596 3977.0135 4350.60190 4106.0515 4857.9320  7182.160   100
  r_cached 2851.516 3100.4145 3498.17482 3234.1710 3514.4660 11165.995   100
 cpp_core1  175.402  179.6125  192.61909  188.2585  199.0685   273.855   100
 cpp_core2  112.878  115.8420  126.66236  119.7010  130.9980   302.339   100
 cpp_core3   81.540   84.8295  101.10615   89.4330  103.0835   400.662   100
  cpp_cpat   40.582   43.1645   51.00269   50.7720   55.7760    96.967   100

As can be seen, the customized approach based on the data’s pattern provided the fastest speed up. Though, by customizing to the pattern, we lost the ability to generalize the solution without being exposed to new cases. Meanwhile, the code port into C++ yielded much better results than the optimized R version. Both of these pieces of code were highly general to summing over specific portions of an array. Lastly, the parallelization within R was simply too time consuming for a one-off computation.

June 08, 2016 12:00 AM

May 13, 2016

Journal of the Royal Statistical Society: Series A

February 28, 2016

Journal of the Royal Statistical Society: Series C

January 29, 2016

Journal of the Royal Statistical Society: Series B

December 27, 2015

Alstatr

R and Python: Gradient Descent

One of the problems often dealt with in statistics is the minimization of an objective function. Contrary to linear models, there is no analytical solution for models that are nonlinear in the parameters, such as logistic regression, neural networks, and nonlinear regression models (like the Michaelis-Menten model). In such situations, we have to use mathematical programming or optimization. One popular optimization algorithm is gradient descent, which we're going to illustrate here.

To start with, let's consider a simple function with a closed-form solution given by \begin{equation} f(\beta) \triangleq \beta^4 - 3\beta^3 + 2. \end{equation} We want to minimize this function with respect to $\beta$. The quick solution, as calculus taught us, is to compute the first derivative of the function, that is \begin{equation} \frac{\text{d}f(\beta)}{\text{d}\beta}=4\beta^3-9\beta^2. \end{equation} Setting this to 0 to obtain the stationary point gives us \begin{align} \frac{\text{d}f(\beta)}{\text{d}\beta}&\overset{\text{set}}{=}0\nonumber\\ 4\hat{\beta}^3-9\hat{\beta}^2&=0\nonumber\\ 4\hat{\beta}^3&=9\hat{\beta}^2\nonumber\\ 4\hat{\beta}&=9\nonumber\\ \hat{\beta}&=\frac{9}{4}, \end{align} where the second-to-last step divides both sides by $\hat{\beta}^2$, discarding the other stationary point $\hat{\beta}=0$ (which is not a minimum). The following plot shows the minimum of the function at $\hat{\beta}=\frac{9}{4}$ (red line in the plot below).

Now let's consider minimizing this problem using gradient descent with the following algorithm:
  1. Initialize $\mathbf{x}_{r},r=0$
  2. while $\lVert \mathbf{x}_{r}-\mathbf{x}_{r+1}\rVert > \nu$
  3.         $\mathbf{x}_{r+1}\leftarrow \mathbf{x}_{r} - \gamma\nabla f(\mathbf{x}_r)$
  4.         $r\leftarrow r + 1$
  5. end while
  6. return $\mathbf{x}_{r}$ and $r$
where $\nabla f(\mathbf{x}_r)$ is the gradient of the cost function, $\gamma$ is the learning-rate parameter of the algorithm, and $\nu$ is the precision parameter. For the function above, let the initial guess be $\hat{\beta}_0=4$ and $\gamma=.001$ with $\nu=.00001$. Then $\nabla f(\hat{\beta}_0)=112$, so that \[\hat{\beta}_1=\hat{\beta}_0-.001(112)=3.888.\] And $|\hat{\beta}_1 - \hat{\beta}_0| = 0.112> \nu$. Repeat the process until at some $r$, $|\hat{\beta}_{r}-\hat{\beta}_{r+1}| \ngtr \nu$. It will turn out that 350 iterations are needed to satisfy the desired inequality, the plot of which is in the following figure with estimated minimum $\hat{\beta}_{350}=2.250483\approx\frac{9}{4}$.

Obviously the convergence is slow, and we can adjust this by tuning the learning-rate parameter; for example, if we increase it to $\gamma=.01$ (change gamma to .01 in the code above), the algorithm converges at the 42nd iteration. To support that claim, see the steps of its gradient in the plot below.

If we try to change the starting value from 4 to .1 (change beta_new to .1) with $\gamma=.01$, the algorithm converges at 173rd iteration with estimate $\hat{\beta}_{173}=2.249962\approx\frac{9}{4}$ (see the plot below).

Now let's consider another function known as Rosenbrock defined as \begin{equation} f(\mathbf{w})\triangleq(1 - w_1) ^ 2 + 100 (w_2 - w_1^2)^2. \end{equation} The gradient is \begin{align} \nabla f(\mathbf{w})&=[-2(1 - w_1) - 400(w_2 - w_1^2) w_1]\mathbf{i}+200(w_2-w_1^2)\mathbf{j}\nonumber\\ &=\left[\begin{array}{c} -2(1 - w_1) - 400(w_2 - w_1^2) w_1\\ 200(w_2-w_1^2) \end{array}\right]. \end{align} Let the initial guess be $\hat{\mathbf{w}}_0=\left[\begin{array}{c}-1.8\\-.8\end{array}\right]$, $\gamma=.0002$, and $\nu=.00001$. Then $\nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -2914.4\\-808.0\end{array}\right]$. So that \begin{equation}\nonumber \hat{\mathbf{w}}_1=\hat{\mathbf{w}}_0-\gamma\nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -1.21712 \\-0.63840\end{array}\right]. \end{equation} And $\lVert\hat{\mathbf{w}}_0-\hat{\mathbf{w}}_1\rVert=0.6048666>\nu$. Repeat the process until at some $r$, $\lVert\hat{\mathbf{w}}_r-\hat{\mathbf{w}}_{r+1}\rVert\ngtr \nu$. It will turn out that 23,374 iterations are needed for the desired inequality with estimate $\hat{\mathbf{w}}_{23375}=\left[\begin{array}{c} 0.9464841 \\0.8956111\end{array}\right]$, the contour plot is depicted in the figure below.
Notice that I did not use ggplot for the contour plot; this is because the plot needs to be updated 23,374 times just to accommodate the arrows for the trajectory of the gradient vectors, and ggplot is just slow. Finally, we can also visualize the gradient points on the surface as shown in the following figure.
In a future blog post, I hope to apply this algorithm to statistical models like linear/nonlinear regression models for a simple illustration.

by Al Asaad (noreply@blogger.com) at December 27, 2015 01:52 AM

R: Principal Component Analysis on Imaging

Ever wondered about the mathematics behind face recognition on gadgets like digital cameras and smartphones? For the most part, it has something to do with statistics. One statistical tool capable of such a feature is Principal Component Analysis (PCA). In this post, however, we will not do face recognition (sorry to disappoint you), as we reserve that for a future post while I'm still doing research on it. Instead, we go through its basic concept and use it for data reduction on the spectral bands of an image using R.

Let's view it mathematically

Consider a line $L$ in a parametric form described as a set of all vectors $k\cdot\mathbf{u}+\mathbf{v}$ parameterized by $k\in \mathbb{R}$, where $\mathbf{v}$ is a vector orthogonal to a normalized vector $\mathbf{u}$. Below is the graphical equivalent of the statement:
So if given a point $\mathbf{x}=[x_1,x_2]^T$, the orthogonal projection of this point on the line $L$ is given by $(\mathbf{u}^T\mathbf{x})\mathbf{u}+\mathbf{v}$. Graphically, we mean

$Proj$ is the projection of the point $\mathbf{x}$ on the line, whose position is defined by the scalar $\mathbf{u}^{T}\mathbf{x}$. Therefore, if we let $\mathbf{X}=[X_1, X_2]^T$ be a random vector, then the random variable $Y=\mathbf{u}^T\mathbf{X}$ describes the variability of the data in the direction of the normalized vector $\mathbf{u}$, so that $Y$ is a linear combination of the $X_i, i=1,2$. Principal component analysis identifies linear combinations of the original variables $\mathbf{X}$ that contain most of the information, in the sense of variability, contained in the data. The general assumption is that useful information is proportional to the variability. PCA is used for data dimensionality reduction and for interpretation of data. (Ref 1. Bajorski, 2012)

To better understand this, consider a two-dimensional data set; below is its plot along with two lines ($L_1$ and $L_2$) that are orthogonal to each other:
If we project the points orthogonally onto both lines, we have

So if the normalized vector $\mathbf{u}_1$ defines the direction of $L_1$, then the variability of the points on $L_1$ is described by the random variable $Y_1=\mathbf{u}_1^T\mathbf{X}$. Likewise, if $\mathbf{u}_2$ is a normalized vector that defines the direction of $L_2$, then the variability of the points on this line is described by the random variable $Y_2=\mathbf{u}_2^T\mathbf{X}$. The first principal component is the one with maximum variability. In this case, we can see that $Y_2$ is more variable than $Y_1$, since the points projected on $L_2$ are more dispersed than those on $L_1$. In practice, the linear combinations $Y_i = \mathbf{u}_i^T\mathbf{X}, i=1,2,\cdots,p$ are maximized sequentially, so that $Y_1$ is the linear combination of the first principal component, $Y_2$ is the linear combination of the second principal component, and so on. Further, the estimate of the direction vector $\mathbf{u}$ is simply the normalized eigenvector $\mathbf{e}$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the original variable $\mathbf{X}$, and the variability explained by the principal component is the corresponding eigenvalue $\lambda$. For more details on the theory of PCA, refer to (Bajorski, 2012) in Reference 1 below.

As promised, we will do dimensionality reduction using PCA. We will use the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data from (Bajorski, 2012); you can also use AVIRIS data for other locations, which can be downloaded here. However, since in most cases AVIRIS data contains thousands of bands, for simplicity we will stick with the data given in (Bajorski, 2012), as it has been cleaned, reducing it to 152 bands only.

What are spectral bands?

In imaging, spectral bands refer to the third dimension of an image, usually denoted by $\lambda$. For example, an RGB image contains red, green, and blue bands, as shown below, along with the first two dimensions $x$ and $y$ that define the resolution of the image.

These are a few of the bands that are visible to our eyes; there are other bands that are not visible to us, like infrared, and many others in the electromagnetic spectrum. That is why in most cases AVIRIS data contains a huge number of bands, each capturing different characteristics of the image. Below is a proper description of the data.

Data

The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), is a sensor collecting spectral radiance in the range of wavelengths from 400 to 2500 nm. It has been flown on various aircraft platforms, and many images of the Earth’s surface are available. A 100 by 100 pixel AVIRIS image of an urban area in Rochester, NY, near the Lake Ontario shoreline is shown below. The scene has a wide range of natural and man-made material including a mixture of commercial/warehouse and residential neighborhoods, which adds a wide range of spectral diversity. Prior to processing, invalid bands (due to atmospheric water absorption) were removed, reducing the overall dimensionality to 152 bands. This image has been used in Bajorski et al. (2004) and Bajorski (2011a, 2011b). The first 152 values in the AVIRIS Data represent the spectral radiance values (a spectral curve) for the top left pixel. This is followed by spectral curves of the pixels in the first row, followed by the next row, and so on. (Ref. 1 Bajorski, 2012)

To load the data, run the following code:

The above code uses the EBImage package, which can be installed by following my previous post.

Why do we need to reduce the dimension of the data?

Before we jump into our analysis, in case you ask why: well, sometimes it's just difficult to do analysis on high-dimensional data, especially when interpreting it. This is because there are dimensions that aren't significant (redundant ones, for example), which add to the difficulty of the analysis. So to deal with this, we remove those nuisance dimensions and work with the significant ones.

To perform PCA in R, we use the function princomp as seen below:

The structure of princomp consists of the list shown above; we will describe selected outputs. Others can be found in the function's documentation by executing ?princomp.
  • sdev - the standard deviations, the square roots of the eigenvalues $\lambda$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, dat.mat;
  • loadings - the eigenvectors $\mathbf{e}$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, dat.mat;
  • scores - the principal component scores.
Recall that the objective of PCA is to find a linear combination $Y=\mathbf{u}^T\mathbf{X}$ that maximizes the variance $Var(Y)$. From the output, the estimates of the components of $\mathbf{u}$ are the entries of the loadings, a matrix of eigenvectors whose columns correspond to the eigenvectors of the sequence of principal components. That is, if the first principal component is given by $Y_1=\mathbf{u}_1^T\mathbf{X}$, then the estimate of $\mathbf{u}_1$, which is the eigenvector $\mathbf{e}_1$, is the set of coefficients obtained from the first column of the loadings. The explained variability of the first principal component is the square of the first standard deviation sdev, the explained variability of the second principal component is the square of the second standard deviation sdev, and so on. Now let's interpret the loadings (coefficients) of the first three principal components. Below is the plot of these,
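For readers who prefer to see these objects computed from first principles, here is a small NumPy sketch (Python rather than the R used in this post) that reproduces sdev, loadings, and scores from the eigendecomposition of the sample covariance matrix; the random matrix X is only a stand-in for dat.mat:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))            # stand-in for dat.mat (100 cases, 5 variables)

S = np.cov(X, rowvar=False)              # sample variance-covariance matrix
eigval, eigvec = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]  # descending, like princomp

sdev = np.sqrt(eigval)                   # analogue of princomp's sdev
loadings = eigvec                        # columns e_1, e_2, ... are the direction vectors
scores = (X - X.mean(axis=0)) @ eigvec   # Y_j = e_j^T (x - xbar), the PC scores

# the sample variance of the j-th score column is exactly the j-th eigenvalue
print(np.allclose(scores.var(axis=0, ddof=1), eigval))
```

Note that R's princomp estimates $\mathbf{\Sigma}$ with divisor $N$ rather than $N-1$, so its sdev values differ slightly from the ddof=1 convention used here.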
Based on the plot above, the coefficients of the first principal component (PC1) are almost all negative. On closer inspection, the variability in this principal component is mainly explained by the weighted average of the radiance of spectral bands 35 to 100. Analogously, PC2 mainly represents the variability of the weighted average of the radiance of spectral bands 1 to 34. Further, the fluctuation of the coefficients of PC3 makes it difficult to tell which bands contribute most to its variability. Aside from examining the loadings, another way to see the impact of the PCs is through the impact plot, where the impact curves $\sqrt{\lambda_j}\mathbf{e}_j$ are plotted; I encourage you to explore that.

Moving on, let's investigate the percent of variability in $X_i$ explained by the $j$th principal component, given by \begin{equation}\nonumber \frac{\lambda_j\cdot e_{ij}^2}{s_{ii}}, \end{equation} where $s_{ii}$ is the estimated variance of $X_i$. Below is the percent of explained variability in $X_i$ for the first three principal components, including the cumulative percent variability (the sum over PC1, PC2, and PC3),
For the variability of the first 33 bands, PC2 takes on about 90 percent of the explained variability, as seen in the plot above, and it still contributes substantially to bands 102 to 152. On the other hand, from bands 37 to 100, PC1 explains almost all the variability, with PC2 and PC3 explaining only 0 to 1 percent. The sum of the percentages of explained variability of these principal components is indicated by the orange line in the plot above, which is the cumulative percent variability.

To wrap up this section, here is the percentage of the explained variability of the first 10 PCs.

Table 1: Variability Explained by the First Ten Principal Components for the AVIRIS data.

PC1     PC2     PC3    PC4    PC5    PC6    PC7    PC8    PC9    PC10
82.057  17.176  0.320  0.182  0.094  0.065  0.037  0.029  0.014  0.005

The above variabilities were obtained by noting that the variability explained by a principal component is simply the eigenvalue (the square of the sdev) of the variance-covariance matrix $\mathbf{\Sigma}$ of the original variable $\mathbf{X}$. Hence the percentage of variability explained by the $j$th PC is equal to its corresponding eigenvalue $\lambda_j$ divided by the overall variability, which is the sum of the eigenvalues, $\sum_{j=1}^{p}\lambda_j$, as we see in the following code,
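The code chunk referred to above is embedded in the original post; its computation amounts to the following NumPy sketch (the data here are simulated stand-ins, not the AVIRIS bands):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) * np.arange(1, 7)             # stand-in data, unequal variances

eigval = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # eigenvalues, descending order
pct = 100 * eigval / eigval.sum()                           # percent of total variability per PC

print(pct.round(3))                                         # the percentages sum to 100
```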

Stopping Rules

Given the list of percentage of variability explained by the PCs in Table 1, how many principal components should we take into account that would best represent the variability of the original data? To answer that, we introduce the following stopping rules that will guide us on deciding the number of PCs:
  1. Scree plot;
  2. Simple fair-share;
  3. Broken-stick; and,
  4. Relative broken-stick.
The scree plot is the plot of the variability of the PCs, that is, the plot of the eigenvalues, where we look for an elbow or sudden drop in the eigenvalues. Hence for our example we have
Therefore, we retain the first two principal components based on the elbow shape. However, if the eigenvalues differ by orders of magnitude, it is recommended to use the logarithmic scale, which is illustrated below,
Unfortunately, sometimes this won't work; as we can see here, it's just difficult to determine where the elbow is. The succeeding discussions of the last three stopping rules are based on (Bajorski, 2012). The simple fair-share stopping rule identifies the largest $k$ such that $\lambda_k$ is larger than its fair share, that is, larger than $(\lambda_1+\lambda_2+\cdots+\lambda_p)/p$. To illustrate this, consider the following:
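The embedded code is not reproduced here, but the rule itself is a one-liner. A Python sketch, using the Table 1 percentages as stand-in eigenvalues (the rule is scale-invariant, so percentages work just as well):

```python
import numpy as np

# the Table 1 percentages, used as stand-in eigenvalues (the rule is scale-invariant)
lam = np.array([82.057, 17.176, 0.320, 0.182, 0.094, 0.065, 0.037, 0.029, 0.014, 0.005])

fair_share = lam.sum() / lam.size                   # (lambda_1 + ... + lambda_p) / p
k = int(np.max(np.where(lam > fair_share)[0])) + 1  # largest k with lambda_k above its fair share

print(k)                                            # 2: stop at the second principal component
```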

Thus, we stop at the second principal component.

If one is concerned that the above method produces too many principal components, a broken-stick rule can be used. It identifies the largest $k$ such that $\lambda_j/(\lambda_1+\lambda_2+\cdots +\lambda_p)>a_j$ for all $j\leq k$, where \begin{equation}\nonumber a_j = \frac{1}{p}\sum_{i=j}^{p}\frac{1}{i},\quad j =1,\cdots, p. \end{equation} Let's try it,
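A sketch of the broken-stick rule in Python, on hypothetical eigenvalues (not the AVIRIS ones), just to show the mechanics of the thresholds $a_j$:

```python
import numpy as np

lam = np.array([60.0, 25.0, 8.0, 4.0, 2.0, 1.0])  # hypothetical eigenvalues
p = lam.size

# broken-stick thresholds a_j = (1/p) * sum_{i=j}^{p} 1/i
a = np.array([np.sum(1.0 / np.arange(j, p + 1)) / p for j in range(1, p + 1)])

share = lam / lam.sum()
k = 0
while k < p and share[k] > a[k]:  # largest k with lambda_j's share above a_j for all j <= k
    k += 1

print(k)                          # 2 for these eigenvalues
```

A useful sanity check on the thresholds: the $a_j$ sum to 1, as the pieces of a randomly broken stick should.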

The above result coincides with the first two stopping rules. The drawback of the simple fair-share and broken-stick rules is that they do not work well when the eigenvalues differ by orders of magnitude. In such cases we use the relative broken-stick rule, where we analyze $\lambda_j$ as the first eigenvalue in the set $\lambda_j\geq \lambda_{j+1}\geq\cdots\geq\lambda_{p}$, where $j < p$. The dimensionality $k$ is chosen as the largest value such that $\lambda_j/(\lambda_j+\cdots +\lambda_p)>b_j$ for all $j\leq k$, where \begin{equation}\nonumber b_j = \frac{1}{p-j+1}\sum_{i=1}^{p-j+1}\frac{1}{i}. \end{equation} Applying this to the data, we have
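A matching Python sketch of the relative broken-stick rule, again on hypothetical eigenvalues rather than the AVIRIS data:

```python
import numpy as np

lam = np.array([60.0, 25.0, 8.0, 4.0, 2.0, 1.0])  # hypothetical eigenvalues
p = lam.size

def b(j, p):
    # b_j = 1/(p - j + 1) * sum_{i=1}^{p-j+1} 1/i
    m = p - j + 1
    return np.sum(1.0 / np.arange(1, m + 1)) / m

k = 0
while k < p - 1:
    j = k + 1                            # 1-based index of the leading eigenvalue
    rel_share = lam[k] / lam[k:].sum()   # lambda_j / (lambda_j + ... + lambda_p)
    if rel_share > b(j, p):
        k += 1
    else:
        break

print(k)                                 # 3 for these eigenvalues
```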
According to the numerical output, the first 34 principal components are enough to represent the variability of the original data.

Principal Component Scores

The principal component scores are the new data set obtained from the linear combinations $Y_j=\mathbf{e}_j^T(\mathbf{x}-\bar{\mathbf{x}}), j = 1,\cdots, p$. If we use the first three stopping rules, below are the scores (as images) of PC1 and PC2,
If we base it on the relative broken-stick rule, we retain the first 34 PCs; below are the corresponding scores (as images).
Click on the image to zoom in.

Residual Analysis

Of course, when doing PCA there are errors to consider, unless one retains all the PCs, though that would not make sense: why apply PCA if you still keep all the dimensions? As an overview of the errors in PCA without going through the theory: if the first $j$ principal components are retained, the overall error is simply the excluded variability explained by the $k$th to $p$th principal components, $k>j$.

Reference

by Al Asaad (noreply@blogger.com) at December 27, 2015 01:52 AM

R: k-Means Clustering on Imaging

Enough with the theory we recently published; let's take a break and have fun with an application of statistics used in data mining and machine learning: k-means clustering.
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. (Wikipedia, Ref 1.)
We will apply this method to an image, grouping its pixels into $k$ different clusters. Below is the image we are going to use,
Colorful Bird From Wall321
We will utilize the following packages for input and output:
  1. jpeg - Read and write JPEG images; and,
  2. ggplot2 - An implementation of the Grammar of Graphics.

Download and Read the Image

Let's get started by downloading the image to our workspace, and tell R that our data is a JPEG file.

Cleaning the Data

Extract the necessary information from the image and organize it for our computation:

The image is represented by a large array of pixels with dimensions rows by columns by channels: red, green, and blue, or RGB.

Plotting

Plot the original image using the following code:

Clustering

Apply k-Means clustering on the image:
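The R code chunk is embedded in the original post; as a language-agnostic illustration of the same idea, here is a NumPy sketch of Lloyd's algorithm on an n-by-3 matrix of RGB values (the random pixels below are stand-ins for the actual image data):

```python
import numpy as np

rng = np.random.default_rng(42)
pixels = rng.random((1000, 3))    # stand-in for the image's (rows*cols) x 3 RGB matrix

k = 5
centers = pixels[rng.choice(len(pixels), size=k, replace=False)]  # initial means
for _ in range(100):              # Lloyd's algorithm
    # assign every pixel to its nearest mean
    d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # recompute each mean from its members (keep the old mean if a cluster empties)
    new = np.array([pixels[labels == j].mean(axis=0) if np.any(labels == j)
                    else centers[j] for j in range(k)])
    if np.allclose(new, centers):
        break
    centers = new

print(centers.shape)              # the k cluster colours, one RGB triple per row
```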

Plot the clustered colours:

Possible clusterings of the pixels for different values of k:

Table 1: Different k-Means Clustering.

Original | k = 6
k = 5    | k = 4
k = 3    | k = 2

I suggest you try it!

Reference

  1. K-means clustering. Wikipedia. Retrieved September 11, 2014.

by Al Asaad (noreply@blogger.com) at December 27, 2015 01:52 AM

December 16, 2015

Alstatr

R and Python: Theory of Linear Least Squares

In my previous article, we talked about implementations of linear regression models in R, Python and SAS. On the theoretical side, however, I only briefly mentioned the estimation procedure for the parameter $\boldsymbol{\beta}$. So, to help us understand how software does the estimation, we'll look at the mathematics behind it. We will also perform the estimation manually in R and in Python; that is, we're not going to use any special packages, which will help us appreciate the theory.

Linear Least Squares

Consider the linear regression model, \[ y_i=f_i(\mathbf{x}|\boldsymbol{\beta})+\varepsilon_i,\quad\mathbf{x}_i=\left[ \begin{array}{cccc} 1&x_{11}&\cdots&x_{1p} \end{array}\right],\quad\boldsymbol{\beta}=\left[\begin{array}{c}\beta_0\\\beta_1\\\vdots\\\beta_p\end{array}\right], \] where $y_i$ is the response or the dependent variable at the $i$th case, $i=1,\cdots, N$. The $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the deterministic part of the model that depends on both the parameters $\boldsymbol{\beta}\in\mathbb{R}^{p+1}$ and the predictor variable $\mathbf{x}_i$, which in matrix form, say $\mathbf{X}$, is represented as follows \[ \mathbf{X}=\left[ \begin{array}{cccccc} 1&x_{11}&\cdots&x_{1p}\\ 1&x_{21}&\cdots&x_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ 1&x_{N1}&\cdots&x_{Np}\\ \end{array} \right]. \] $\varepsilon_i$ is the error term at the $i$th case which we assumed to be Gaussian distributed with mean 0 and variance $\sigma^2$. So that \[ \mathbb{E}y_i=f_i(\mathbf{x}|\boldsymbol{\beta}), \] i.e. $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the expectation function. The uncertainty around the response variable is also modelled by Gaussian distribution. Specifically, if $Y=f(\mathbf{x}|\boldsymbol{\beta})+\varepsilon$ and $y\in Y$ such that $y>0$, then \begin{align*} \mathbb{P}[Y\leq y]&=\mathbb{P}[f(x|\beta)+\varepsilon\leq y]\\ &=\mathbb{P}[\varepsilon\leq y-f(\mathbf{x}|\boldsymbol{\beta})]=\mathbb{P}\left[\frac{\varepsilon}{\sigma}\leq \frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]\\ &=\Phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right], \end{align*} where $\Phi$ denotes the Gaussian distribution with density denoted by $\phi$ below. Hence $Y\sim\mathcal{N}(f(\mathbf{x}|\boldsymbol{\beta}),\sigma^2)$. 
That is, \begin{align*} \frac{\operatorname{d}}{\operatorname{d}y}\Phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]&=\phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]\frac{1}{\sigma}=\mathbb{P}[y|f(\mathbf{x}|\boldsymbol{\beta}),\sigma^2]\\ &=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}. \end{align*} If the data are independent and identically distributed, then the likelihood function of $\mathbf{y}$ is \begin{align*} \mathcal{L}[\boldsymbol{\beta}|\mathbf{y},\mathbf{X},\sigma]&=\mathbb{P}[\mathbf{y}|\mathbf{X},\boldsymbol{\beta},\sigma]=\prod_{i=1}^N\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left[\frac{y_i-f_i(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}\\ &=\frac{1}{(2\pi)^{\frac{N}{2}}\sigma^N}\exp\left\{-\frac{1}{2}\sum_{i=1}^N\left[\frac{y_i-f_i(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}\\ \log\mathcal{L}[\boldsymbol{\beta}|\mathbf{y},\mathbf{X},\sigma]&=-\frac{N}{2}\log2\pi-N\log\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2. \end{align*} Because the likelihood function tells us about the plausibility of the parameter $\boldsymbol{\beta}$ in explaining the sample data, we want to find the estimate of $\boldsymbol{\beta}$ that most likely generated the sample. Thus our goal is to maximize the likelihood function, which is equivalent to maximizing the log-likelihood with respect to $\boldsymbol{\beta}$, and that is done by taking the partial derivative with respect to $\boldsymbol{\beta}$. The first two terms on the right-hand side of the equation above can therefore be disregarded, since they do not depend on $\boldsymbol{\beta}$. Also, the location of the maximum of the log-likelihood with respect to $\boldsymbol{\beta}$ is not affected by arbitrary positive scalar multiplication, so the factor $\frac{1}{2\sigma^2}$ can be omitted.
We are left with the following equation, \begin{equation}\label{eq:1} -\sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2. \end{equation} One last thing: instead of maximizing the log-likelihood function, we can minimize the negative log-likelihood. Hence we are interested in minimizing the negative of Equation (\ref{eq:1}), which is \begin{equation}\label{eq:2} \sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2, \end{equation} popularly known as the residual sum of squares (RSS). So the RSS is a consequence of maximizing the log-likelihood under the Gaussian assumption on the uncertainty around the response variable $y$. For models with two parameters, say $\beta_0$ and $\beta_1$, the RSS can be visualized like the one in my previous article, that is
Error Surface
Performing differentiation with respect to the $(p+1)$-dimensional parameter $\boldsymbol{\beta}$ is manageable in the context of linear algebra, so Equation (\ref{eq:2}) is equivalent to \begin{align*} \lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert^2&=\langle\mathbf{y}-\mathbf{X}\boldsymbol{\beta},\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rangle=\mathbf{y}^{\text{T}}\mathbf{y}-\mathbf{y}^{\text{T}}\mathbf{X}\boldsymbol{\beta}-(\mathbf{X}\boldsymbol{\beta})^{\text{T}}\mathbf{y}+(\mathbf{X}\boldsymbol{\beta})^{\text{T}}\mathbf{X}\boldsymbol{\beta}\\ &=\mathbf{y}^{\text{T}}\mathbf{y}-\mathbf{y}^{\text{T}}\mathbf{X}\boldsymbol{\beta}-\boldsymbol{\beta}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{y}+\boldsymbol{\beta}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{X}\boldsymbol{\beta}, \end{align*} and the derivative with respect to the parameter is \begin{align*} \frac{\operatorname{\partial}}{\operatorname{\partial}\boldsymbol{\beta}}\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert^2&=-2\mathbf{X}^{\text{T}}\mathbf{y}+2\mathbf{X}^{\text{T}}\mathbf{X}\boldsymbol{\beta}. \end{align*} Taking the critical point by setting the above equation to the zero vector, we have \begin{align} \frac{\operatorname{\partial}}{\operatorname{\partial}\boldsymbol{\beta}}\lVert\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}\rVert^2&\overset{\text{set}}{=}\mathbf{0}\nonumber\\ -\mathbf{X}^{\text{T}}\mathbf{y}+\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=\mathbf{0}\nonumber\\ \mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=\mathbf{X}^{\text{T}}\mathbf{y}\label{eq:norm} \end{align} Equation (\ref{eq:norm}) is called the normal equation.
If $\mathbf{X}$ is full rank, then we can compute the inverse of $\mathbf{X}^{\text{T}}\mathbf{X}$, \begin{align} \mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=\mathbf{X}^{\text{T}}\mathbf{y}\nonumber\\ (\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}\nonumber\\ \hat{\boldsymbol{\beta}}&=(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}.\label{eq:betahat} \end{align} That's it, since both $\mathbf{X}$ and $\mathbf{y}$ are known.

Prediction

Suppose $\mathbf{X}$ is full rank and spans the subspace $V\subseteq\mathbb{R}^N$, with $\mathbb{E}\mathbf{y}=\mathbf{X}\boldsymbol{\beta}\in V$. Then the predicted values of $\mathbf{y}$ are given by \begin{equation}\label{eq:pred} \hat{\mathbf{y}}=\mathbb{E}\mathbf{y}=\mathbf{P}_{V}\mathbf{y}=\mathbf{X}(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}, \end{equation} where $\mathbf{P}_{V}$ is the projection matrix onto the space $V$. For a proof of the projection matrix in Equation (\ref{eq:pred}), please refer to reference (1) below. Notice that this is equivalent to \begin{equation}\label{eq:yhbh} \hat{\mathbf{y}}=\mathbb{E}\mathbf{y}=\mathbf{X}\hat{\boldsymbol{\beta}}. \end{equation}

Computation

Let's fire up R and Python and see how we can apply those equations we derived. For purpose of illustration, we're going to simulate data from Gaussian distributed population. To do so, consider the following codes

Here we have two predictors, x1 and x2, and our response variable y is generated with the parameters $\beta_1=3.5$ and $\beta_2=2.8$ plus Gaussian noise with variance 7. While we set the same random seeds in both R and Python, we should not expect the generated random values to be identical; rather, both sets of values are independent and identically distributed (iid). For visualization, I will use Python Plotly; you can also translate it to R Plotly.
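For readers who want a condensed version of that script, the simulation and the estimate in Equation (\ref{eq:betahat}) fit in a few NumPy lines; the variable names and the seed below are mine, so the numbers will not match the post's output exactly:

```python
import numpy as np

np.random.seed(123)
n = 100
x1 = np.random.normal(size=n)
x2 = np.random.normal(size=n)
beta = np.array([3.5, 2.8])                # true parameters
y = beta[0] * x1 + beta[1] * x2 + np.random.normal(scale=np.sqrt(7), size=n)

X = np.column_stack([x1, x2])              # design matrix (no intercept term here)
bhat = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations for beta-hat
yhat = X @ bhat                            # predictions, the projection of y onto V

print(bhat)                                # should be near (3.5, 2.8)
```

Solving the normal equations with `solve` avoids explicitly inverting $\mathbf{X}^{\text{T}}\mathbf{X}$, which is both cheaper and numerically safer.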

Now let's estimate the parameter $\boldsymbol{\beta}$, which we set above to $\beta_1=3.5$ and $\beta_2=2.8$. We will use Equation (\ref{eq:betahat}) for the estimation, so that we have

That's a good estimate. Again, just a reminder: the estimates in R and in Python differ because we have different random samples; the important thing is that both are iid. To proceed, we'll do prediction using Equation (\ref{eq:pred}). That is,

The first column above is the data y and the second column is the prediction from Equation (\ref{eq:pred}). Thus, if we expand the prediction into an expectation plane, we have

You have to rotate the plot, by the way, to see the plane; I still can't figure out how to change the default view in Plotly. Anyway, at this point we can proceed to compute other statistics, like the variance of the error, and so on, but I will leave that for you to explore. Our aim here is just to give an understanding of what happens inside the internals of our software when we estimate the parameters of linear regression models.

Reference

  1. Arnold, Steven F. (1981). The Theory of Linear Models and Multivariate Analysis. Wiley.
  2. OLS in Matrix Form

by Al Asaad (noreply@blogger.com) at December 16, 2015 11:02 AM

May 12, 2015

Chris Lawrence

That'll leave a mark

Here’s a phrase you never want to see in print (in a legal decision, no less) pertaining to your academic research: “The IRB process, however, was improperly engaged by the Dartmouth researcher and ignored completely by the Stanford researchers.”

Whole thing here; it’s a doozy.

by Chris Lawrence at May 12, 2015 12:00 AM

April 14, 2015

R you ready?

Beautiful plots while simulating loss in two-part procrustes problem

Today I was working on a two-part procrustes problem and wanted to find out why my minimization algorithm sometimes does not converge properly or renders unexpected results. The loss function to be minimized is

\displaystyle  L(\mathbf{Q},c) = \| c \mathbf{A_1Q} - \mathbf{B_1} \|^2 + \| \mathbf{A_2Q} - \mathbf{B_2} \|^2 \rightarrow min

with \| \cdot \| denoting the Frobenius norm, c an unknown scalar and \mathbf{Q} an unknown rotation matrix, i.e. \mathbf{Q}^T\mathbf{Q}=\mathbf{I}. \;\mathbf{A_1}, \mathbf{A_2}, \mathbf{B_1}, and \mathbf{B_2} are four real-valued matrices. The minimum for c is easily found by setting the partial derivative of L(\mathbf{Q},c) w.r.t. c equal to zero.

\displaystyle  c = \frac {tr \; \mathbf{Q}^T \mathbf{A_1}^T \mathbf{B_1}}  { \| \mathbf{A_1} \|^2 }

By plugging c into the loss function L(\mathbf{Q},c) we get a new loss function L(\mathbf{Q}) that only depends on \mathbf{Q}. This is the starting situation.

When trying to find out why the algorithm to minimize L(\mathbf{Q}) did not work as expected, I got stuck. So I decided to conduct a small simulation and generate random rotation matrices to study the relation between the parameter c and the value of the loss function L(\mathbf{Q}). Before looking at the results for the entire two-part procrustes problem from above, let’s visualize the results for the first part of the loss function only, i.e.

\displaystyle  L(\mathbf{Q},c) = \| c \mathbf{A_1Q} - \mathbf{B_1} \|^2 \rightarrow min

Here, c has the same minimum as for the whole formula above. For the simulation I used

\mathbf{A_1}= \begin{pmatrix}  0.0 & 0.4 & -0.5 \\  -0.4 & -0.8 & -0.5 \\  -0.1 & -0.5 & 0.2 \\  \end{pmatrix} \mkern18mu \qquad \text{and} \qquad \mkern36mu \mathbf{B_1}= \begin{pmatrix}  -0.1 & -0.8 & -0.1 \\  0.3 & 0.2 & -0.9 \\  0.1 & -0.3 & -0.5 \\  \end{pmatrix}

as input matrices. Generating many random rotation matrices \mathbf{Q} and plotting c against the value of the loss function yields the following plot.

This is a well behaved relation, for each scaling parameter c the loss is identical. Now let’s look at the full two-part loss function. As input matrices I used

\displaystyle  A1= \begin{pmatrix}  0.0 & 0.4 & -0.5 \\  -0.4 & -0.8 & -0.5 \\  -0.1 & -0.5 & 0.2 \\  \end{pmatrix} \mkern18mu , \mkern36mu B1= \begin{pmatrix}  -0.1 & -0.8 & -0.1 \\  0.3 & 0.2 & -0.9 \\  0.1 & -0.3 & -0.5 \\  \end{pmatrix}
A2= \begin{pmatrix}  0 & 0 & 1 \\  1 & 0 & 0 \\  0 & 1 & 0 \\  \end{pmatrix} \mkern18mu , \mkern36mu B2= \begin{pmatrix}  0 & 0 & 1 \\  1 & 0 & 0 \\  0 & 1 & 0 \\  \end{pmatrix}

and the following R-code.

# trace function
tr <- function(X) sum(diag(X))

# random matrix type 1
rmat_1 <- function(n=3, p=3, min=-1, max=1){
  matrix(runif(n*p, min, max), ncol=p)
}

# random matrix type 2, sparse
rmat_2 <- function(p=3) {
  diag(p)[, sample(1:p, p)]
}

# generate random rotation matrix Q. Based on Q find 
# optimal scaling factor c and calculate loss function value
#
one_sample <- function(n=2, p=2)
{
  Q <- mixAK::rRotationMatrix(n=1, dim=p) %*%         # random rotation matrix det(Q) = 1
    diag(sample(c(-1,1), p, rep=T))                   # additional reflections, so det(Q) in {-1,1}
  s <- tr( t(Q) %*% t(A1) %*% B1 ) / norm(A1, "F")^2  # scaling factor c
  rss <- norm(s*A1 %*% Q - B1, "F")^2 +               # get residual sum of squares
         norm(A2 %*% Q - B2, "F")^2 
  c(s=s, rss=rss)
}

# find c and rss for many random rotation matrices
#
set.seed(10)  # nice case for 3 x 3
n <- 3
p <- 3
A1 <- round(rmat_1(n, p), 1)
B1 <- round(rmat_1(n, p), 1)
A2 <- rmat_2(p)
B2 <- rmat_2(p)

x <- plyr::rdply(40000, one_sample(3,3)) 
plot(x$s, x$rss, pch=16, cex=.4, xlab="c", ylab="L(Q)", col="#00000010")

This time the result turns out to be very different and … beautiful :)

Here, we no longer have a one-to-one relation between the scaling parameter and the loss function. I do not quite know what to make of this yet. But for now I am happy that it has aesthetic value. Below you find some more beautiful graphics with different matrices as inputs.

Cheers!


by markheckmann at April 14, 2015 04:53 PM

February 24, 2015

Douglas Bates

RCall: Running an embedded R in Julia

I have used R (and S before it) for a couple of decades. In the last few years most of my coding has been in Julia, a language for technical computing that can provide remarkable performance for a dynamically typed language via Just-In-Time (JIT) compilation of functions and via multiple dispatch.

Nonetheless there are facilities in R that I would like to have access to from Julia. I created the RCall package for Julia to do exactly that. This IJulia notebook provides an introduction to RCall.

This is not a novel idea by any means. Julia already has PyCall and JavaCall packages that provide access to Python and to Java. These packages are used extensively and are much more sophisticated than RCall, at present. Many other languages have facilities to run an embedded instance of R. In fact, Python has several such interfaces.

The things I plan to do using RCall are to access datasets from R and R packages, to fit models that are not currently implemented in Julia, and to use R graphics, especially the ggplot2 and lattice packages. Unfortunately, I am not currently able to start a graphics device from the embedded R, but I expect that to be fixed soon.

I can tell you the most remarkable aspect of RCall although it may not mean much if you haven't tried to do this kind of thing. It is written entirely in Julia. There is absolutely no "glue" code written in a compiled language like C or C++. As I said, this may not mean much to you unless you have tried to do something like this, in which case it is astonishing.

by Douglas Bates (noreply@blogger.com) at February 24, 2015 11:05 PM

January 16, 2015

Modern Toolmaking

caretEnsemble

My package caretEnsemble, for making ensembles of caret models, is now on CRAN.

Check it out, and let me know what you think! (Submit bug reports and feature requests to the issue tracker)

by Zachary Deane-Mayer (noreply@blogger.com) at January 16, 2015 10:22 PM

January 15, 2015

Gregor Gorjanc

cpumemlog: Monitor CPU and RAM usage of a process (and its children)

Long time no see ...

Today I pushed the cpumemlog script to GitHub https://github.com/gregorgorjanc/cpumemlog. Read more about this useful utility at the GitHub site.

by Gregor Gorjanc (noreply@blogger.com) at January 15, 2015 10:16 PM

December 15, 2014

R you ready?

QQ-plots in R vs. SPSS – A look at the differences

qq_plot_spps_r

We teach two software packages, R and SPSS, in Quantitative Methods 101 for psychology freshmen at Bremen University (Germany). Sometimes confusion arises when the software packages produce different results. This may be due to specifics in the implementation of a method or, as in most cases, to different default settings. One of these situations occurs when the QQ-plot is introduced. Below we see two QQ-plots, produced by SPSS and R, respectively. The data used in the plots were generated by:

set.seed(0)
x <- sample(0:9, 100, rep=T)    

SPSS

QQ-plot in SPSS using Blom's method

R

qqnorm(x, datax=T)      # uses Blom's method by default
qqline(x, datax=T)

There are some obvious differences:

  1. The most obvious one is that the R plot seems to contain more data points than the SPSS plot. Actually, this is not the case. Some data points are plotted on top of each other in SPSS, while they are spread out vertically in the R plot. The reason for this difference is that SPSS uses a different approach to assigning probabilities to the values. We will explore the two approaches below.
  2. The scaling of the y-axis differs. R uses quantiles from the standard normal distribution. SPSS by default rescales these values using the mean and standard deviation of the original data. This makes it possible to directly compare the original and theoretical values. It is a simple linear transformation and will not be explained any further here.
  3. The QQ-lines are not identical. R uses the 1st and 3rd quartile from both distributions to draw the line. This is different in SPSS, where a line is drawn for identical values on both axes. We will explore the differences below.

QQ-plots from scratch

To get a better understanding of the difference we will build the R and SPSS-flavored QQ-plot from scratch.

R type

In order to calculate theoretical quantiles corresponding to the observed values, we first need a way to assign a probability to each value of the original data. A lot of different approaches exist for this purpose (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012b). They usually build on the ranks of the observed data points to calculate corresponding p-values, i.e. the plotting positions for each point. The qqnorm function uses two formulae for this purpose, depending on the number of observations n (Blom’s method, see ?qqnorm; Blom, 1958). With r being the rank, for n > 10 it will use the formula p = (r - 1/2) / n, and for n ≤ 10 the formula p = (r - 3/8) / (n + 1/4), to determine the probability value p for each observation (see the help files for the functions qqnorm and ppoints). For simplicity, we will only implement the n > 10 case here.

n <- length(x)          # number of observations
r <- order(order(x))    # order of values, i.e. ranks without averaged ties
p <- (r - 1/2) / n      # assign to ranks using Blom's method
y <- qnorm(p)           # theoretical standard normal quantiles for p values
plot(x, y)              # plot empirical against theoretical values

Before we take a look at the code, note that our plot is identical to the plot generated by qqnorm above, except that the QQ-line is missing. The main point that makes the difference between R and SPSS is found in the command order(order(x)). This command calculates ranks for the observations using ordinal ranking. This means that all observations get different ranks and no average ranks are calculated for ties, i.e. for observations with equal values. Another approach would be to apply fractional ranking and calculate average values for ties. This is what the function rank does. The following code shows the difference between the two approaches to assigning ranks.

v <- c(1,1,2,3,3)
order(order(v))     # ordinal ranking used by R
## [1] 1 2 3 4 5
rank(v)             # fractional ranking used by SPSS
## [1] 1.5 1.5 3.0 4.5 4.5

R uses ordinal ranking and SPSS uses fractional ranking by default to assign ranks to values. Thus, the positions do not overlap in R as each ordered observation is assigned a different rank and therefore a different p-value. We will pick up the second approach again later, when we reproduce the SPSS-flavored plot in R.1
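A tiny example makes the consequence visible: with ties, ordinal ranks yield distinct theoretical quantiles, while fractional ranks collapse them onto a single plotting position (a base R sketch):

```r
# with ties, ordinal ranks give distinct plotting positions,
# while fractional ranks collapse them onto the same position
v <- c(5, 5, 5)
qnorm((order(order(v)) - 1/2) / 3)  # three distinct theoretical quantiles
qnorm((rank(v) - 1/2) / 3)          # all three collapse to qnorm(0.5) = 0
```

This is exactly why the R plot spreads tied observations out vertically while SPSS stacks them.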

The second difference between the plots concerned the scaling of the y-axis and was already clarified above.

The last point to understand is how the QQ-line is drawn in R. Looking at the probs argument of qqline reveals that it uses the 1st and 3rd quartile of the original data and theoretical distribution to determine the reference points for the line. We will draw the line between the quartiles in red and overlay it with the line produced by qqline to see if our code is correct.

plot(x, y)                      # plot empirical against theoretical values
ps <- c(.25, .75)               # reference probabilities
a <- quantile(x, ps)            # empirical quantiles
b <- qnorm(ps)                  # theoretical quantiles
lines(a, b, lwd=4, col="red")   # our QQ line in red
qqline(x, datax=T)              # R QQ line

The reason for different lines in R and SPSS is that several approaches to fitting a straight line exist (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012a). Each approach has different advantages. The method used by R is more robust when we expect values to diverge from normality in the tails, and we are primarily interested in the normality of the middle range of our data. In other words, the method of fitting an adequate QQ-line depends on the purpose of the plot. An explanation of the rationale of the R approach can e.g. be found here.
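The same reference line can also be expressed as an intercept and slope, which lets abline() draw it across the full plotting region rather than only between the quartile points (a base R sketch using the same quartile construction as above):

```r
set.seed(0)
x <- sample(0:9, 100, replace = TRUE)
ps <- c(.25, .75)                  # reference probabilities
a <- quantile(x, ps)               # empirical quartiles
b <- qnorm(ps)                     # theoretical quartiles
slope <- diff(b) / diff(a)         # rise over run between the quartile points
intercept <- b[1] - slope * a[1]   # anchor the line at the first quartile point
y <- qnorm((order(order(x)) - 1/2) / length(x))
plot(x, y)
abline(intercept, slope, col = "red")  # the quartile line, drawn full width
```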

SPSS type

The default SPSS approach also uses Blom’s method to assign probabilities to ranks (you may choose other methods in SPSS) and differs from the one above in the following aspects:

  • a) As already mentioned, SPSS uses ranks with averaged ties (fractional ranking), not the plain order ranks (ordinal ranking) as in R, to derive the corresponding probabilities for each data point. The rest of the code is identical to the one above, though I am not sure whether SPSS distinguishes between the n ≤ 10 and n > 10 cases.
  • b) The theoretical quantiles are scaled to match the estimated mean and standard deviation of the original data.
  • c) The QQ-line goes through all quantiles with identical values on the x and y axis.
n <- length(x)                # number of observations
r <- rank(x)                  # a) ranks using fractional ranking (averaging ties)
p <- (r - 1/2) / n            # assign to ranks using Blom's method
y <- qnorm(p)                 # theoretical standard normal quantiles for p values
y <- y * sd(x) + mean(x)      # b) transform SND quantiles to mean and sd from original data
plot(x, y)                    # plot empirical against theoretical values

Lastly, let us add the line. As the scaling of both axes is the same, the line goes through the origin with a slope of 1.

abline(0, 1)                  # c) line through the origin with slope 1

The comparison to the SPSS output shows that they are (visually) identical.

Function for SPSS-type QQ-plot

The whole point of this demonstration was to pinpoint and explain the differences between a QQ-plot generated in R and SPSS, so they will no longer be a source of confusion. Note, however, that SPSS offers a whole range of options to generate the plot. For example, you can select the method to assign probabilities to ranks and decide how to treat ties. The plots above used the default settings (Blom’s method and averaging across ties). Personally, I like the SPSS version. That is why I implemented the function qqnorm_spss in the ryouready package that accompanies the course. The formulae for the different methods to assign probabilities to ranks can be found in Castillo-Gutiérrez et al. (2012b). The implementation is a preliminary version that has not yet been thoroughly tested. You can find the code here. Please report any bugs or suggestions for improvements (which are very welcome) in the GitHub issues section.

library(devtools) 
install_github("markheckmann/ryouready")                # install from github repo
library(ryouready)                                      # load package
library(ggplot2)
qq <- qqnorm_spss(x, method=1, ties.method="average")   # Blom's method with averaged ties
plot(qq)                                                # generate QQ-plot
ggplot(qq)                                              # use ggplot2 to generate QQ-plot

Literature


  1. Technical sidenote: Internally, qqnorm uses the function ppoints to generate the p-values. Type stats:::qqnorm.default at the console to have a look at the code.
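The two Blom formulae from the main text can be checked directly against ppoints, which returns the sorted plotting positions (a quick base R sketch):

```r
# n > 10: Blom's a = 1/2 variant, p = (r - 1/2) / n
stopifnot(all.equal(ppoints(100), ((1:100) - 1/2) / 100))
# n <= 10: Blom's a = 3/8 variant, p = (r - 3/8) / (n + 1/4)
stopifnot(all.equal(ppoints(5), ((1:5) - 3/8) / (5 + 1/4)))
```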

by markheckmann at December 15, 2014 08:55 AM

October 20, 2014

Modern Toolmaking

For faster R on a mac, use veclib

Update: The links to all my github gists on blogger are broken, and I can't figure out how to fix them.  If you know how to insert github gists on a dynamic blogger template, please let me know.

In the meantime, here are instructions with links to the code:
First of all, use homebrew to compile openblas.  It's easy!  Second of all, you can also use homebrew to install R! (But maybe stick with the CRAN version unless you really want to compile your own R binary)

To use openblas with R, follow these instructions:
https://gist.github.com/zachmayer/e591cf868b3a381a01d6#file-openblas-sh

To use veclib with R, follow these instructions:
https://gist.github.com/zachmayer/e591cf868b3a381a01d6#file-veclib-sh

OLD POST:

Inspired by this post, I decided to try using OpenBLAS for R on my mac.  However, it turns out there's a simpler option, using the vecLib BLAS library, which is provided by Apple as part of the accelerate framework.

If you are using R 2.15, follow these instructions to change your BLAS from the default to vecLib:


However, as noted in r-sig-mac, these instructions do not work for R 3.0.  You have to directly link to the accelerate framework's version of vecLib:


Finally, test your new blas using this script:
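The embedded script no longer renders here; as a stand-in, a minimal benchmark in the same spirit (my own sketch, not the original gist) times a large matrix crossproduct, which is dominated by BLAS and so reflects any speedup from switching libraries:

```r
# time a BLAS-heavy operation; run before and after switching BLAS to compare
set.seed(1)
n <- 1000
m <- matrix(rnorm(n * n), n, n)
elapsed <- system.time(p <- crossprod(m))["elapsed"]  # t(m) %*% m via BLAS
cat(sprintf("%dx%d crossprod: %.2f seconds\n", n, n, elapsed))
```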


On my system (a retina macbook pro), the default BLAS takes 141 seconds and vecLib takes 43 seconds, which is a significant speedup.  If you plan to use vecLib, note the following warning from the R development team "Although fast, it is not under our control and may possibly deliver inaccurate results."

So far, I have not encountered any issues using vecLib, but it's only been a few hours :-).

UPDATE: you can also install OpenBLAS on a mac:

If you do this, make sure to change the directories to point to the correct location on your system  (e.g. change /users/zach/source to whatever directory you clone the git repo into).  On my system, the benchmark script takes ~41 seconds when using openBLAS, which is a small but significant speedup.

by Zachary Deane-Mayer (noreply@blogger.com) at October 20, 2014 04:24 PM

September 19, 2014

Chris Lawrence

What could a federal UK look like?

Assuming that the “no” vote prevails in the Scottish independence referendum, the next question for the United Kingdom is to consider constitutional reform to implement a quasi-federal system and resolve the West Lothian question once and for all. In some ways, it may also provide an opportunity to resolve the stalled reform of the upper house as well. Here’s the rough outline of a proposal that might work.

  • Devolve identical powers to England, Northern Ireland, Scotland, and Wales, with the proviso that local self-rule can be suspended if necessary by the federal legislature (by a supermajority).

  • The existing House of Commons becomes the House of Commons for England, which (along with the Sovereign) shall comprise the English Parliament. This parliament would function much as the existing devolved legislatures in Scotland and Wales; the consociational structure of the Northern Ireland Assembly (requiring double majorities) would not be replicated.

  • The House of Lords is abolished, and replaced with a directly-elected Senate of the United Kingdom. The Senate will have authority to legislate on the non-devolved powers (in American parlance, “delegated” powers) such as foreign and European Union affairs, trade and commerce, national defense, and on matters involving Crown dependencies and territories, the authority to legislate on devolved matters in the event self-government is suspended in a constituent country, and dilatory powers including a qualified veto (requiring a supermajority) over the legislation proposed by a constituent country’s parliament. The latter power would effectively replace the review powers of the existing House of Lords; it would function much as the Council of Revision in Madison’s original plan for the U.S. Constitution.

    As the Senate will have relatively limited powers, it need not be as large as the existing Lords or Commons. To ensure the countries other than England have a meaningful voice, given that nearly 85% of the UK’s population is in England, two-thirds of the seats would be allocated proportionally based on population and one-third allocated equally to the four constituent countries. This would still result in a chamber with a large English majority (around 64.4%) but nonetheless would ensure the other three countries would have meaningful representation as well.

by Chris Lawrence at September 19, 2014 12:00 AM

September 12, 2014

R you ready?

Using colorized PNG pictograms in R base plots

Today I stumbled across a figure in an explanation on multiple factor analysis which contained pictograms.

 

abdi_mfa

Figure 1 from Abdi & Valentin (2007), p. 8.

I wanted to reproduce a similar figure in R using pictograms and additionally color them, e.g. by group membership. I have almost no knowledge about image processing, so I tried out several methods to achieve what I want. The first thing I did was read in a PNG file and look at the data structure. The package png allows reading in PNG files. Note that all of the below may not work on Windows machines, as Windows does not support semi-transparency (see ?readPNG).

library(png)
img <- readPNG(system.file("img", "Rlogo.png", package="png"))
class(img)
## [1] "array"
dim(img)
## [1]  76 100   4

The object is a numerical array with four layers (red, green, blue, alpha; short RGBA). Let’s have a look at the first layer (red) and replace all non-zero entries by a one and the zeros by a dot. This will show us the pattern of non-zero values and we already see the contours.

l4 <- img[,,1]
l4[l4 > 0] <- 1
l4[l4 == 0] <- "."
d <- apply(l4, 1, function(x) {
 cat(paste0(x, collapse=""), "\n") 
})

To display the image in R one way is to raster the image (i.e. the RGBA layers are collapsed into a layer of single HEX value) and print it using rasterImage.

r_logo_layer_1

rimg <- as.raster(img) # raster multilayer object
r <- nrow(rimg) / ncol(rimg) # image ratio
plot(c(0,1), c(0,r), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(rimg, 0, 0, 1, r) 

plot_1

Let’s have a look at a small part of the rastered image object. It is a matrix of HEX values.

rimg[40:50, 1:6]
## [1,] "#C4C5C202" "#858981E8" "#838881FF" "#888D86FF" "#8D918AFF" "#8F938CFF"
## [2,] "#00000000" "#848881A0" "#80847CFF" "#858A83FF" "#898E87FF" "#8D918BFF"
## [3,] "#00000000" "#8B8E884C" "#7D817AFF" "#82867EFF" "#868B84FF" "#8A8E88FF"
## [4,] "#00000000" "#9FA29D04" "#7E827BE6" "#7E817AFF" "#838780FF" "#878C85FF"
## [5,] "#00000000" "#00000000" "#81857D7C" "#797E75FF" "#7F827BFF" "#838781FF"
## [6,] "#00000000" "#00000000" "#898C8510" "#787D75EE" "#797E76FF" "#7F837BFF"
## [7,] "#00000000" "#00000000" "#00000000" "#7F837C7B" "#747971FF" "#797E76FF"
## [8,] "#00000000" "#00000000" "#00000000" "#999C9608" "#767C73DB" "#747971FF"
## [9,] "#00000000" "#00000000" "#00000000" "#00000000" "#80847D40" "#71766EFD"
## [10,] "#00000000" "#00000000" "#00000000" "#00000000" "#00000000" "#787D7589"
## [11,] "#00000000" "#00000000" "#00000000" "#00000000" "#00000000" "#999C9604"

And print this small part.

plot(c(0,1), c(0,.6), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(rimg[40:50, 1:6], 0, 0, 1, .6) 

plot_2

Now we have an idea of what the image object and the rastered object look like from the inside. Let’s start to modify the images to suit our needs.

In order to change the color of the pictograms, my first idea was to convert the graphics to greyscale and remap the values to a color ramp of my choice. To convert to greyscale there are tons of methods around (see e.g. here). I just picked one that I found on SO by chance. With R=Red, G=Green and B=Blue we have

brightness = sqrt(0.299 * R^2 + 0.587 * G^2 + 0.114 * B^2)

This approach modifies the PNG files after they have been coerced into a raster object.

# function to calculate brightness values
brightness <- function(hex) {
  v <- col2rgb(hex)
  sqrt(0.299 * v[1]^2 + 0.587 * v[2]^2 + 0.114 * v[3]^2) /255
}

# given a color ramp, map brightness to the ramp, also taking into account
# the alpha level. The default color ramp is grey
#
img_to_colorramp <- function(img, ramp=grey) {
  cv <- as.vector(img)
  b <- sapply(cv, brightness)
  g <- ramp(b)
  a <- substr(cv, 8,9)     # get alpha values
  ga <- paste0(g, a)       # add alpha values to new colors
  matrix(ga, nrow(img), ncol(img), byrow=TRUE)   # return the recolored raster
}

# read png and modify
img <- readPNG(system.file("img", "Rlogo.png", package="png"))
img <- as.raster(img)           # raster multilayer object
r <- nrow(img) / ncol(img)      # image ratio
s <- 3.5                        # size

plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)

rasterImage(img, 0, 0, 0+s/r, 0+s)  # original
img2 <- img_to_colorramp(img)       # modify using grey scale
rasterImage(img2, 5, 0, 5+s/r, 0+s)

plot_3

Great, it works! Now let’s go and try out some other color palettes, using colorRamp to create a color ramp.

plot(c(0,10),c(0,8.5), type = "n", xlab = "", ylab = "", asp=1)

img1 <- img_to_colorramp(img)
rasterImage(img1, 0, 5, 0+s/r, 5+s)

reds <- function(x) 
  rgb(colorRamp(c("darkred", "white"))(x), maxColorValue = 255)
img2 <- img_to_colorramp(img, reds)
rasterImage(img2, 5, 5, 5+s/r, 5+s)

greens <- function(x) 
  rgb(colorRamp(c("darkgreen", "white"))(x), maxColorValue = 255)
img3 <- img_to_colorramp(img, greens)
rasterImage(img3, 0, 0, 0+s/r, 0+s)

single_color <- function(...) "#0000BB"
img4 <- img_to_colorramp(img, single_color)
rasterImage(img4, 5, 0, 5+s/r, 0+s)

plot_4

Okay, that basically does the job. Now we will apply it to the wine pictograms.
Let’s use this wine glass from Wikimedia Commons. It’s quite big, so I uploaded a reduced-size version to imgur. We will use it for our purposes.

# load file from web
f <- tempfile()
download.file("http://i.imgur.com/A14ntCt.png", f)
img <- readPNG(f)
img <- as.raster(img)
r <- nrow(img) / ncol(img)
s <- 1

# let's create a function that returns a ramp function to save typing
ramp <- function(colors) 
  function(x) rgb(colorRamp(colors)(x), maxColorValue = 255)

# create dataframe with coordinates and colors
set.seed(1)
x <- data.frame(x=rnorm(16, c(2,2,4,4)), 
                y=rnorm(16, c(1,3)), 
                colors=c("black", "darkred", "darkgreen", "darkblue"))

plot(c(1,6), c(0,5), type="n", xlab="", ylab="", asp=1)
for (i in 1L:nrow(x)) {
  colorramp <- ramp(c(x[i,3], "white"))
  img2 <- img_to_colorramp(img, colorramp)
  rasterImage(img2, x[i,1], x[i,2], x[i,1]+s/r, x[i,2]+s)
}

plot_5

Another approach would be to modify the RGB layers before rastering to HEX values.

img <- readPNG(system.file("img", "Rlogo.png", package="png"))
img2 <- img
img[,,1] <- 0    # remove Red component
img[,,2] <- 0    # remove Green component
img[,,3] <- 1    # Set Blue to max
img <- as.raster(img)
r <- nrow(img) / ncol(img)  # size ratio
s <- 3.5   # size
plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(img, 0, 0, 0+s/r, 0+s)

img2[,,1] <- 1   # Red to max
img2[,,2] <- 0
img2[,,3] <- 0
rasterImage(as.raster(img2), 5, 0, 5+s/r, 0+s)

plot_6

To just colorize the image, we could weight each layer.

# wrap weighting into function
weight_layers <- function(img, w) {
  for (i in seq_along(w))
    img[,,i] <- img[,,i] * w[i]
  img
}

plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
img <- readPNG(system.file("img", "Rlogo.png", package="png"))
img2 <- weight_layers(img, c(.2, 1,.2))
rasterImage(img2, 0, 0, 0+s/r, 0+s)

img3 <- weight_layers(img, c(1,0,0))
rasterImage(img3, 5, 0, 5+s/r, 0+s)

plot_7

After playing around and hard-coding the modifications, I started to search and found the EBImage package, which has a lot of features for image processing that make one's life (in this case only a bit) easier.

library(EBImage)
f <- system.file("img", "Rlogo.png", package="png")
img <- readImage(f) 
img2 <- img

img[,,2] = 0      # zero out green layer
img[,,3] = 0      # zero out blue layer
img <- as.raster(img)

img2[,,1] = 0
img2[,,3] = 0
img2 <- as.raster(img2)

r <- nrow(img) / ncol(img)
s <- 3.5
plot(c(0,10), c(0,3.5), type = "n", xlab = "", ylab = "", asp=1)
rasterImage(img, 0, 0, 0+s/r, 0+s)
rasterImage(img2, 5, 0, 5+s/r, 0+s)

plot_8

EBImage is a good choice and fairly easy to handle. Now let’s again print the pictograms.

f <- tempfile(fileext=".png")
download.file("http://i.imgur.com/A14ntCt.png", f)
img <- readImage(f)

# will replace whole image layers by one value
# only makes sense if there is an alpha layer that
# gives the contours
#
mod_color <- function(img, col) {
  v <- col2rgb(col) / 255
  img = channel(img, 'rgb')
  img[,,1] = v[1]   # Red
  img[,,2] = v[2]   # Green
  img[,,3] = v[3]   # Blue
  as.raster(img)  
}

r <- nrow(img) / ncol(img)  # get image ratio
s <- 1                      # size

# create random data
set.seed(1)
x <- data.frame(x=rnorm(16, c(2,2,4,4)), 
                y=rnorm(16, c(1,3)), 
                colors=1:4)

# plot pictograms 
plot(c(1,6), c(0,5), type="n", xlab="", ylab="", asp=1)
for (i in 1L:nrow(x)) {
  img2 <- mod_color(img, x[i, 3])
  rasterImage(img2, x[i,1], x[i,2], x[i,1]+s/r, x[i,2]+s)
}

plot_9

Note that, above, I did not bother to center each pictogram in order to position it correctly. This still needs to be done. Anyway, that’s it! Mission completed.
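As a postscript, centering amounts to shifting each pictogram by half its width and half its height so that the data point sits in the middle (a small base R sketch; centered_corners is a hypothetical helper name, and r is the height/width ratio as used throughout):

```r
# shift a pictogram of height s and width s/r so (px, py) is its center
centered_corners <- function(px, py, s, r) {
  w <- s / r                          # width from height s and ratio r
  c(xleft = px - w/2, ybottom = py - s/2, xright = px + w/2, ytop = py + s/2)
}

# pass these four corners to rasterImage() instead of (px, py, px+s/r, py+s)
centered_corners(2, 3, s = 1, r = 2)  # width 0.5, height 1, centered on (2, 3)
```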

Literature

Abdi, H., & Valentin, D. (2007). Multiple factor analysis (MFA). In N. Salkind (Ed.), Encyclopedia of Measurement and Statistics (pp. 1–14). Thousand Oaks, CA: Sage Publications. Retrieved from https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf


by markheckmann at September 12, 2014 09:19 AM

June 18, 2014

Chris Lawrence

Soccer queries answered

Kevin Drum asks a bunch of questions about soccer:

  1. Outside the penalty area there’s a hemisphere about 20 yards wide. I can’t recall ever seeing it used for anything. What’s it for?
  2. On several occasions, I’ve noticed that if the ball goes out of bounds at the end of stoppage time, the referee doesn’t whistle the match over. Instead, he waits for the throw-in, and then immediately whistles the match over. What’s the point of this?
  3. Speaking of stoppage time, how has it managed to last through the years? I know, I know: tradition. But seriously. Having a timekeeper who stops the clock for goals, free kicks, etc. has lots of upside and no downside. Right? It wouldn’t change the game in any way, it would just make timekeeping more accurate, more consistent, and more transparent for the fans and players. Why keep up the current pretense?
  4. What’s the best way to get a better sense of what’s a foul and what’s a legal tackle? Obviously you can’t tell from the players’ reactions, since they all writhe around like landed fish if they so much as trip over their own shoelaces. Reading the rules provides the basics, but doesn’t really help a newbie very much. Maybe a video that shows a lot of different tackles and explains why each one is legal, not legal, bookable, etc.?

The first one’s easy: there’s a general rule that no defensive player can be within 10 yards of the spot of a direct free kick. A penalty kick (which is a type of direct free kick) takes place in the 18-yard box, and no players other than the player taking the kick and the goalkeeper are allowed in the box. However, owing to geometry, the 18 yard box and the 10 yard exclusion zone don’t fully coincide, hence the penalty arc. (That’s also why there are two tiny hash-marks on the goal line and side line 10 yards from the corner flag. And why now referees have a can of shaving cream to mark the 10 yards for other free kicks, one of the few MLS innovations that has been a good idea.)

Second one’s also easy: the half and the game cannot end while the ball is out of play.

Third one’s harder. First, keeping time inexactly forestalls the silly premature celebrations that are common in most US sports. You’d never see the Stanford-Cal play happen in a soccer game. Second, it allows some slippage for short delays and doesn’t require exact timekeeping; granted, this was more valuable before instant replays and fourth officials, but most US sports require a lot of administrative record-keeping by ancillary officials. A soccer game can be played with one official (and often is, particularly at the amateur level) without having to change timing rules;* in developing countries in particular this lowers the barriers to entry for the sport (along with the low equipment requirements) without changing the nature of the game appreciably. Perhaps most importantly, if the clock was allowed to stop regularly it would create an excuse for commercial timeouts and advertising breaks, which would interrupt the flow of the game and potentially reduce the advantages of better-conditioned and more skilled athletes. (MLS tried this, along with other exciting American ideas like “no tied games,” and it was as appealing to actual soccer fans as ketchup on filet mignon would be to a foodie, and perhaps more importantly didn’t make any non-soccer fans watch.)

Fourth, the key distinction is usually whether there was an obvious attempt to play the ball; in addition, in the modern game, even some attempts to play the ball are considered inherently dangerous (tackling from behind, many sliding tackles, etc.) and therefore are fouls even if they are successful in getting more ball than human.

* To call offside, you’d also probably need what in my day we called a “linesman.”

by Chris Lawrence at June 18, 2014 12:00 AM

May 07, 2014

Chris Lawrence

The mission and vision thing

Probably the worst-kept non-secret is that the next stage of the institutional evolution of my current employer is to some ill-defined concept of “university status,” which mostly involves the establishment of some to-be-determined master’s degree programs. In the context of the University System of Georgia, it means a small jump from the “state college” prestige tier (a motley collection of schools that largely started out as two-year community colleges and transfer institutions) to the “state university” tier (which is where most of the ex-normal schools hang out these days). What is yet to be determined is how that transition will affect the broader institution that will be the University of Middle Georgia.* People on high are said to be working on these things; in any event, here are my assorted random thoughts on what might be reasonable things to pursue:

  • Marketing and positioning: Unlike the situation facing many of the other USG institutions, the population of the two anchor counties of our core service area (Bibb and Houston) is growing, and Houston County in particular has a statewide reputation for the quality of its public school system. Rather than conceding that the most prepared students from these schools will go to Athens or Atlanta or Valdosta, we should strongly market our institutional advantages over these more “prestigious” institutions, particularly in terms of the student experience in the first two years and the core curriculum: we have no large lecture courses, no teaching assistants, no lengthy bus rides to and from class every day, and the vast majority of the core is taught by full-time faculty with terminal degrees. Not to mention costs to students are much lower, particularly in the case of students who do not qualify for need-based aid. Even if we were to “lose” these students as transfers to the top-tier institutions after 1–4 semesters, we’d still benefit from the tuition and fees they bring in and we would not be penalized in the upcoming state performance funding formula. Dual enrollment in Warner Robins in particular is an opportunity to showcase our institution as a real alternative for better prepared students rather than a safety school.
  • Comprehensive offerings at the bachelor’s level: As a state university, we will need to offer a comprehensive range of options for bachelor’s students to attract and retain students, both traditional and nontraditional. In particular, B.S. degrees in political science and sociology with emphasis in applied empirical skills would meet public and private employer demand for workers who have research skills and the ability to collect, manage, understand, and use data appropriately. There are other gaps in the liberal arts and sciences as well that need to be addressed to become a truly comprehensive state university.
  • Create incentives to boost the residential population: The college currently has a heavy debt burden inherited from the overbuilding of dorms at the Cochran campus. We need to identify ways to encourage students to live in Cochran, which may require public-private partnerships to try to build a “college town” atmosphere in the community near campus. We also need to work with wireless providers like Sprint and T-Mobile to ensure that students from the “big city” can fully use their cell phones and tablets in Cochran and Eastman without roaming fees or changing wireless providers.
  • Tie the institution more closely to the communities we serve: This includes both physical ties and psychological ties. The Macon campus in particular has poor physical links to the city itself for students who might walk or ride bicycles; extending the existing bike/walking trail from Wesleyan to the Macon campus should be a priority, as should pedestrian access and bike facilities along Columbus Road. Access to the Warner Robins campus is somewhat better but still could be improved. More generally, the institution is perceived as an afterthought or alternative of last resort in the community. Improving this situation and perception among community leaders and political figures may require a physical presence in or near downtown Macon, perhaps in partnership with the GCSU Graduate Center.

* There is no official name-in-waiting, but given that our former interim president seemed to believe he could will this name into existence by repeating it enough I’ll stick with it. The straw poll of faculty trivia night suggests that it’s the least bad option available, which inevitably means the regents will choose something else instead (if the last name change is anything to go by).

by Chris Lawrence at May 07, 2014 12:00 AM

February 17, 2014

Seth Falcon

Have Your SHA and Bcrypt Too

Fear

I've been putting off sharing this idea because I've heard the rumors about what happens to folks who aren't security experts when they post about security on the internet. If this blog is replaced with cat photos and rainbows, you'll know what happened.

The Sad Truth

It's 2014 and chances are you have accounts on websites that are not properly handling user passwords. I did no research to produce the following list of ways passwords are mishandled in decreasing order of frequency:

  1. Site uses a fast hashing algorithm, typically SHA1(salt + plain-password).
  2. Site doesn't salt password hashes.
  3. Site stores raw passwords.

We know that sites should be generating secure random salts and using an established slow hashing algorithm (bcrypt, scrypt, or PBKDF2). Why are sites not doing this?

While security issues deserve a top spot on any site's priority list, new features often trump addressing legacy security concerns. The immediacy of the risk is hard to quantify and it's easy to fall prey to a "nothing bad has happened yet, why should we change now" attitude. It's easy for other bugs, features, or performance issues to win out when measured by immediate impact. Fixing security or other "legacy" issues is the Right Thing To Do and often you will see no measurable benefit from the investment. It's like having insurance. You don't need it until you do.

Specific to the improper storage of user password data is the issue of the impact to a site imposed by upgrading. There are two common approaches to upgrading password storage. You can switch cold turkey to the improved algorithms and force password resets on all of your users. Alternatively, you can migrate incrementally such that new users and any user who changes their password gets the increased security.

The cold turkey approach is not a great user experience and sites might choose to delay an upgrade to avoid admitting to a weak security implementation and disrupting their site by forcing password resets.

The incremental approach is more appealing, but the security benefit is drastically diminished for any site with a substantial set of existing users.

Given the above migration choices, perhaps it's (slightly) less surprising that businesses choose to prioritize other work ahead of fixing poorly stored user password data.

The Idea

What if you could upgrade a site so that both new and existing users immediately benefited from the increased security, but without the disruption of password resets? It turns out that you can and it isn't very hard.

Consider a user table with columns:

userid
salt
hashed_pass

Where the hashed_pass column is computed using a weak fast algorithm, for example SHA1(salt + plain_pass).

The core of the idea is to apply a proper algorithm on top of the data we already have. I'll use bcrypt to make the discussion concrete. Add columns to the user table as follows:

userid
salt
hashed_pass
hash_type
salt2

Process the existing user table by computing bcrypt(salt2 + hashed_pass) and storing the result in the hashed_pass column (overwriting the less secure value); save the new salt value to salt2 and set hash_type to bcrypt+sha1.

To verify a user where hash_type is bcrypt+sha1, compute bcrypt(salt2 + SHA1(salt + plain_pass)) and compare to the hashed_pass value. Note that bcrypt implementations encode the salt as a prefix of the hashed value so you could avoid the salt2 column, but it makes the idea easier to explain to have it there.

You can take this approach further and have any user that logs in (as well as new users) upgrade to a "clean" bcrypt only algorithm since you can now support different verification algorithms using hash_type. With the proper application code changes in place, the upgrade can be done live.

This scheme will also work for sites storing non-salted password hashes as well as those storing plain text passwords (THE HORROR).
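The migrate-then-wrap scheme can be sketched in a few lines. This is a toy illustration, not production code: PBKDF2-HMAC from Python's standard library stands in for bcrypt (which needs a third-party package), but the layering is the same idea of slow_hash(salt2 + weak_hash).

```python
import hashlib
import os

ITERATIONS = 100_000  # illustrative work factor

def upgrade_row(old_hashed_pass):
    """Migrate one row: wrap the existing weak SHA1 digest in a slow hash.
    Needs only the stored hash, not the plaintext password."""
    salt2 = os.urandom(16)
    wrapped = hashlib.pbkdf2_hmac("sha256", old_hashed_pass, salt2, ITERATIONS)
    return wrapped, salt2, "pbkdf2+sha1"

def verify(plain, salt, salt2, stored, hash_type):
    """Check a login attempt against the upgraded row."""
    if hash_type == "pbkdf2+sha1":
        inner = hashlib.sha1(salt + plain).digest()  # recreate the old weak hash
        candidate = hashlib.pbkdf2_hmac("sha256", inner, salt2, ITERATIONS)
        return candidate == stored
    raise ValueError("unknown hash_type: " + hash_type)

# Simulated existing row with weak storage: SHA1(salt + plain).
salt = os.urandom(8)
plain = b"correct horse"
old_hash = hashlib.sha1(salt + plain).digest()

stored, salt2, hash_type = upgrade_row(old_hash)  # one-time migration
print(verify(plain, salt, salt2, stored, hash_type))    # True
print(verify(b"nope", salt, salt2, stored, hash_type))  # False
```

Note that the migration loop never touches plaintext, which is exactly why the upgrade can run over the whole table at once.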

Less Sadness, Maybe

Perhaps this approach makes implementing a password storage security upgrade more palatable and more likely to be prioritized. And if there's a horrible flaw in this approach, maybe you'll let me know without turning this blog into a tangle of cat photos and rainbows.

February 17, 2014 07:08 PM

December 26, 2013

Seth Falcon

A Rebar Plugin for Locking Deps: Reproducible Erlang Project Builds For Fun and Profit

What's this lock-deps of which you speak?

If you use rebar to generate an OTP release project and want to have reproducible builds, you need the rebar_lock_deps_plugin plugin. The plugin provides a lock-deps command that will generate a rebar.config.lock file containing the complete flattened set of project dependencies each pegged to a git SHA. The lock file acts similarly to Bundler's Gemfile.lock file and allows for reproducible builds (*).

Without lock-deps you might rely on the discipline of using a tag for all of your application's deps. This is insufficient if any dep depends on something not specified as a tag. It can also be a problem if a third party dep doesn't provide a tag. Generating a rebar.config.lock file solves these issues. Moreover, using lock-deps can simplify the work of putting together a release consisting of many of your own repos. If you treat the master branch as shippable, then rather than tagging each subproject and updating rebar.config throughout your project's dependency chain, you can run get-deps (without the lock file), compile, and re-lock at the latest versions throughout your project repositories.

The reproducibility of builds when using lock-deps depends on the SHAs captured in rebar.config.lock. The plugin works by scanning the cloned repos in your project's deps directory and extracting the current commit SHA. This works great until a repository's history is rewritten with a force push. If you really want reproducible builds, you need to not nuke your SHAs and you'll need to fork all third party repos to ensure that someone else doesn't screw you over in this fashion either. If you make a habit of only depending on third party repos using a tag, assume that upstream maintainers are not completely bat shit crazy, and don't force push your master branch, then you'll probably be fine.

Getting Started

Install the plugin in your project by adding the following to your rebar.config file:

%% Plugin dependency
{deps, [
    {rebar_lock_deps_plugin, ".*",
     {git, "git://github.com/seth/rebar_lock_deps_plugin.git", {branch, "master"}}}
]}.

%% Plugin usage
{plugins, [rebar_lock_deps_plugin]}.

To test it out do:

rebar get-deps
# the plugin has to be compiled so you can use it
rebar compile
rebar lock-deps

If you'd like to see a project that uses the plugin, take a look at Chef's erchef project.

Bonus features

If you are building an OTP release project using rebar generate then you can use rebar_lock_deps_plugin to enhance your build experience in three easy steps.

  1. Use rebar bump-rel-version version=$BUMP to automate the process of editing rel/reltool.config to update the release version. The argument $BUMP can be major, minor, or patch (default) to increment the specified part of a semver X.Y.Z version. If $BUMP is any other value, it is used as the new version verbatim. Note that this function rewrites rel/reltool.config using ~p. I check in the reformatted version and maintain the formatting when editing. This way, the general case of a version bump via bump-rel-version results in a minimal diff.

  2. Autogenerate a change summary commit message for all project deps. Assuming you've generated a new lock file and bumped the release version, use rebar commit-release to commit the changes to rebar.config.lock and rel/reltool.config with a commit message that summarizes the changes made to each dependency between the previously locked version and the newly locked version. You can get a preview of the commit message via rebar log-changed-deps.

  3. Finally, create an annotated tag for your new release with rebar tag-release which will read the current version from rel/reltool.config and create an annotated tag named with the version.

The dependencies, they are ordered

Up to version 2.0.1 of rebar_lock_deps_plugin, the dependencies in the generated lock file were ordered alphabetically. This was a side-effect of using filelib:wildcard/1 to list the dependencies in the top-level deps directory. In most cases, the order of the full dependency set does not matter. However, if some of the code in your project uses parse transforms, then it will be important for the parse transform to be compiled and on the code path before attempting to compile code that uses the parse transform.

This issue was recently discovered by a colleague who ran into build issues using the lock file for a project that had recently integrated lager for logging. He came up with the idea of maintaining the order of deps as they appear in the various rebar.config files along with a prototype patch proving out the idea. As of rebar_lock_deps_plugin 3.0.0, the lock-deps command will (mostly) maintain the relative order of dependencies as found in the rebar.config files.

The "mostly" is that when a dep is shared across two subprojects, it will appear in the expected order for the first subproject (based on the ordering of the two subprojects). The deps for the second subproject will not be in strict rebar.config order, but the resulting order should address any compile-time dependencies and be relatively stable (only changing when project deps alter their deps with larger impact when shared deps are introduced or removed).

Digression: fun with dependencies

There are times, as a programmer, when a real-world problem looks like a text book exercise (or an interview whiteboard question). Just the other day at work we had to design some manhole covers, but I digress.

Fixing the order of the dependencies in the generated lock file is (nearly) the same as finding an install order for a set of projects with inter-dependencies. I had some fun coding up the text book solution even though the approach doesn't handle the constraint of respecting the order provided by the rebar.config files. Onward with the digression.

We have a set of "packages" where some packages depend on others and we want to determine an install order such that a package's dependencies are always installed before the package. The set of packages and the relation "depends on" form a directed acyclic graph or DAG. The topological sort of a DAG produces an install order for such a graph. The ordering is not unique. For example, with a single package C depending on A and B, valid install orders are [A, B, C] and [B, A, C].
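As a cross-check of the idea (not part of the plugin), Python's standard library ships the same text-book solution in graphlib; here C depending on A and B is expressed by listing A and B as C's predecessors:

```python
from graphlib import TopologicalSorter

# "C depends on A and B": A and B are predecessors of C,
# so they must appear before C in any valid install order.
ts = TopologicalSorter({"C": {"A", "B"}, "A": set(), "B": set()})
order = list(ts.static_order())
print(order)  # either ['A', 'B', 'C'] or ['B', 'A', 'C'] -- both valid
```

As the text says, the ordering is not unique; only the constraint "dependencies first" is guaranteed.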

To set up the problem, we load all of the project dependency information into a proplist mapping each package to a list of its dependencies extracted from the package's rebar.config file.

read_all_deps(Config, Dir) ->
    TopDeps = rebar_config:get(Config, deps, []),
    Acc = [{top, dep_names(TopDeps)}],
    DepDirs = filelib:wildcard(filename:join(Dir, "*")),
    Acc ++ [
     {filename:basename(D), dep_names(extract_deps(D))}
     || D <- DepDirs ].

Erlang's standard library provides the digraph and digraph_utils modules for constructing and operating on directed graphs. The digraph_utils module includes a topsort/1 function which we can make use of for our "exercise". The docs say:

Returns a topological ordering of the vertices of the digraph Digraph if such an ordering exists, false otherwise. For each vertex in the returned list, there are no out-neighbours that occur earlier in the list.

To figure out which way to point the edges when building our graph, consider two packages A and B with A depending on B. We know we want to end up with an install order of [B, A]. Rereading the topsort/1 docs, we must want an edge B => A. With that, we can build our DAG and obtain an install order with the topological sort:

load_digraph(Config, Dir) ->
    AllDeps = read_all_deps(Config, Dir),
    G = digraph:new(),
    Nodes = all_nodes(AllDeps),
    [ digraph:add_vertex(G, N) || N <- Nodes ],
    %% If A depends on B, then we add an edge B => A (i.e., Dep -> Item)
    [ 
      [ digraph:add_edge(G, Dep, Item)
        || Dep <- DepList ]
      || {Item, DepList} <- AllDeps, Item =/= top ],
    digraph_utils:topsort(G).

%% extract a sorted unique list of all deps
all_nodes(AllDeps) ->
    lists:usort(lists:foldl(fun({top, L}, Acc) ->
                                    L ++ Acc;
                               ({K, L}, Acc) ->
                                    [K|L] ++ Acc
                            end, [], AllDeps)).

The digraph module manages graphs using ETS giving it a convenient API, though one that feels un-erlang-y in its reliance on side-effects.

The above gives an install order, but doesn't take into account the relative order of deps as specified in the rebar.config files. The solution implemented in the plugin is a bit less fancy, recursing over the deps and maintaining the desired ordering. The only tricky bit is that shared deps are ignored until the end, and the entire linearized list is de-duped while keeping the first occurrence of each dep. Here's the code:

order_deps(AllDeps) ->
    Top = proplists:get_value(top, AllDeps),
    order_deps(lists:reverse(Top), AllDeps, []).

order_deps([], _AllDeps, Acc) ->
    de_dup(Acc);
order_deps([Item|Rest], AllDeps, Acc) ->
    ItemDeps = proplists:get_value(Item, AllDeps),
    order_deps(lists:reverse(ItemDeps) ++ Rest, AllDeps, [Item | Acc]).

de_dup(AccIn) ->
    WithIndex = lists:zip(AccIn, lists:seq(1, length(AccIn))),
    UWithIndex = lists:usort(fun({A, _}, {B, _}) ->
                                     A =< B
                             end, WithIndex),
    Ans0 = lists:sort(fun({_, I1}, {_, I2}) ->
                              I1 =< I2
                      end, UWithIndex),
    [ V || {V, _} <- Ans0 ].
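For comparison, the same stable de-dup - keep each element's first occurrence, drop later repeats - is a one-liner in Python thanks to insertion-ordered dicts (the dep names below are made up):

```python
def de_dup(items):
    """Remove duplicates while keeping each element's first occurrence,
    mirroring the zip/usort/re-sort dance in the Erlang de_dup/1."""
    return list(dict.fromkeys(items))

print(de_dup(["lager", "jiffy", "lager", "sqerl", "jiffy"]))
# ['lager', 'jiffy', 'sqerl']
```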

Conclusion and the end of this post

The great thing about posting to your blog is, you don't have to have a proper conclusion if you don't want to.

December 26, 2013 04:20 PM

December 09, 2013

Leandro Penz

Probabilistic bug hunting

Probabilistic bug hunting

Have you ever run into a bug that, no matter how carefully you try to reproduce it, only happens sometimes? And then you think you've got it and finally solved it - and tested a couple of times without any manifestation. How do you know that you have tested enough? Are you sure you were not "lucky" in your tests?

In this article we will see how to answer those questions and the math behind it without going into too much detail. This is a pragmatic guide.

The Bug

The following program is supposed to generate two random 8-bit integers and print them on stdout:

  
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  
  /* Returns -1 if error, other number if ok. */
  int get_random_chars(char *r1, char*r2)
  {
  	int f = open("/dev/urandom", O_RDONLY);
  
  	if (f < 0)
  		return -1;
  	if (read(f, r1, sizeof(*r1)) < 0)
  		return -1;
  	if (read(f, r2, sizeof(*r2)) < 0)
  		return -1;
  	close(f);
  
  	return *r1 & *r2;
  }
  
  int main(void)
  {
  	char r1;
  	char r2;
  	int ret;
  
  	ret = get_random_chars(&r1, &r2);
  
  	if (ret < 0)
  		fprintf(stderr, "error");
  	else
  		printf("%d %d\n", r1, r2);
  
  	return ret < 0;
  }
  

On my architecture (Linux on IA-32) it has a bug that makes it print "error" instead of the numbers sometimes.

The Model

Every time we run the program, the bug can either show up or not. It has a non-deterministic behaviour that requires statistical analysis.

We will model a single program run as a Bernoulli trial, with success defined as "seeing the bug", as that is the event we are interested in. We have the following parameters when using this model:

  • \(n\): the number of tests made;
  • \(k\): the number of times the bug was observed in the \(n\) tests;
  • \(p\): the unknown (and, most of the time, unknowable) probability of seeing the bug.

As a Bernoulli trial, the number of errors \(k\) in \(n\) runs of the program follows a binomial distribution \(k \sim B(n,p)\). We will use this model to estimate \(p\) and to confirm the hypothesis that the bug no longer exists after we fix it in whichever way we can.

By using this model we are implicitly assuming that all our tests are performed independently and identically. In other words: if the bug happens more often in one environment, we either always test in that environment or never; if the bug gets more and more frequent the longer the computer is running, we reset the computer after each trial. If we don't do that, we are effectively estimating the value of \(p\) with trials from different experiments, while in truth each experiment has its own \(p\). We will find a single value anyway, but it has no meaning and can lead us to wrong conclusions.

Physical analogy

Another way of thinking about the model and the strategy is by creating a physical analogy with a box that has an unknown number of green and red balls:

  • Bernoulli trial: taking a single ball out of the box and looking at its color - if it is red, we have observed the bug, otherwise we haven't. We then put the ball back in the box.
  • \(n\): the total number of trials we have performed.
  • \(k\): the total number of red balls seen.
  • \(p\): the total number of red balls in the box divided by the total number of balls in the box.

Some things become clearer when we think about this analogy:

  • If we open the box and count the balls, we can know \(p\), in contrast with our original problem.
  • Without opening the box, we can estimate \(p\) by repeating the trial. As \(n\) increases, our estimate for \(p\) improves. Mathematically: \[p = \lim_{n\to\infty}\frac{k}{n}\]
  • Performing the trials in different conditions is like taking balls out of several different boxes. The results tell us nothing about any single box.
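The limit above can be watched in action with a quick simulation; here the true \(p = 0.25\) (matching our bug) and the fixed seed are illustrative choices:

```python
import random

random.seed(1)
p = 0.25  # true (normally unknowable) probability of seeing the bug
for n in (10, 100, 10_000):
    # each trial: "red ball" with probability p
    k = sum(random.random() < p for _ in range(n))
    print(n, k / n)  # the estimate k/n approaches p as n grows
```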

Estimating \(p\)

Before we try fixing anything, we have to know more about the bug, starting by the probability \(p\) of reproducing it. We can estimate this probability by dividing the number of times we see the bug \(k\) by the number of times we tested for it \(n\). Let's try that with our sample bug:

  $ ./hasbug
  67 -68
  $ ./hasbug
  79 -101
  $ ./hasbug
  error

We know from the source code that \(p=25%\), but let's pretend that we don't, as will be the case with practically every non-deterministic bug. We tested 3 times, so \(k=1, n=3 \Rightarrow p \sim 33%\), right? It would be better if we tested more, but how much more, and exactly what would be better?

\(p\) precision

Let's go back to our box analogy: imagine that there are 4 balls in the box, one red and three green. That means that \(p = 1/4\). What are the possible results when we test three times?

Red balls Green balls \(p\) estimate
0 3 0%
1 2 33%
2 1 66%
3 0 100%

The less we test, the worse our precision is. Roughly, the precision of \(p\) will be at most \(1/n\) - in this case, 33%. That's both the granularity of the values we can find for \(p\) and its minimal nonzero value.

Testing more improves the precision of our estimate.

\(p\) likelihood

Let's now approach the problem from another angle: if \(p = 1/4\), what are the odds of seeing one error in four tests? Let's name the 4 balls as 0-red, 1-green, 2-green and 3-green:

The table above has all the possible results for getting 4 balls out of the box. That's \(4^4=256\) rows, generated by this python script. The same script counts the number of red balls in each row, and outputs the following table:

k rows %
0 81 31.64%
1 108 42.19%
2 54 21.09%
3 12 4.69%
4 1 0.39%

That means that, for \(p=1/4\), we see 1 red ball and 3 green balls only 42% of the time when drawing 4 balls.

What if \(p = 1/3\) - one red ball and two green balls? We would get the following table:

k rows %
0 16 19.75%
1 32 39.51%
2 24 29.63%
3 8 9.88%
4 1 1.23%

What about \(p = 1/2\)?

k rows %
0 1 6.25%
1 4 25.00%
2 6 37.50%
3 4 25.00%
4 1 6.25%

So, let's assume that you've seen the bug once in 4 trials. What is the value of \(p\)? You know that can happen 42% of the time if \(p=1/4\), but you also know it can happen 39% of the time if \(p=1/3\), and 25% of the time if \(p=1/2\). Which one is it?
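The per-\(k\) percentages in the tables above are just the binomial probability mass function; a quick sketch reproduces them directly:

```python
from math import comb

def pmf(k, n, p):
    """P(seeing the bug exactly k times in n trials), X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# One red ball (bug) in four draws, for the three candidate p values:
for p in (1/4, 1/3, 1/2):
    print(f"p={p:.2f}: {pmf(1, 4, p):.2%}")
# p=0.25: 42.19%
# p=0.33: 39.51%
# p=0.50: 25.00%
```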

The graph below shows the discrete likelihood, for all percentage values of \(p\), of getting 1 red and 3 green balls:

The fact is that, given the data, the estimate for \(p\) follows a beta distribution \(Beta(k+1, n-k+1) = Beta(2, 4)\) (1). The graph below shows the probability distribution density of \(p\):

The R script used to generate the first plot is here, the one used for the second plot is here.

Increasing \(n\), narrowing down the interval

What happens when we test more? We obviously increase our precision, as it is at most \(1/n\), as we said before - there is no way to estimate that \(p=1/3\) when we only test twice. But there is also another effect: the distribution for \(p\) gets taller and narrower around the observed ratio \(k/n\):

Investigation framework

So, which value will we use for \(p\)?

  • The smaller the value of \(p\), the more we have to test to reach a given confidence in the bug solution.
  • We must, then, choose the probability of error that we want to tolerate, and take the smallest value of \(p\) that we can.

    A usual value for the probability of error is 5% (2.5% on each side).
  • That means that we take the value of \(p\) that leaves 2.5% of the area of the density curve out on the left side. Let's call this value \(p_{min}\).
  • That way, if the observed \(k/n\) remains somewhat constant, \(p_{min}\) will rise, converging to the "real" \(p\) value.
  • As \(p_{min}\) rises, the amount of testing we have to do after fixing the bug decreases.

By using this framework we have direct, visual and tangible incentives to test more. We can objectively measure the potential contribution of each test.

In order to calculate \(p_{min}\) with the mentioned properties, we have to solve the following equation:

\[\sum_{i=k}^{n}{n\choose{i}}p_{min}^i(1-p_{min})^{n-i}=\frac{\alpha}{2} \]

\(\alpha\) here is twice the error we want to tolerate: 5% for an error of 2.5% on each side.

That's not a trivial equation to solve for \(p_{min}\). Fortunately, that's the formula for the confidence interval of the binomial distribution, and there are a lot of sites that can calculate it.
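If you'd rather not rely on a website, the lower confidence bound can be found by plain bisection on the binomial tail; a sketch with no stats library assumed:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def p_min(k, n, alpha=0.05):
    """Lower confidence bound for p: solves P(X >= k) = alpha/2 by
    bisection on [0, k/n] (assumes k >= 1, so the bug was seen at least once)."""
    lo, hi = 0.0, k / n
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_sf(k, n, mid) < alpha / 2:
            lo = mid  # tail too small: true bound is larger
        else:
            hi = mid
    return lo

print(round(p_min(1, 4), 5))   # 0.00631 -- 1 bug in 4 tests tells us little
print(round(p_min(5, 20), 3))  # a much tighter bound with more tests
```

Note how few tests give a tiny \(p_{min}\): the data simply cannot rule out a very rare bug yet.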

Is the bug fixed?

So, you have tested a lot and calculated \(p_{min}\). The next step is fixing the bug.

After fixing the bug, you will want to test again, in order to confirm that the bug is fixed. How much testing is enough testing?

Let's say that \(t\) is the number of times we test the bug after it is fixed. Then, if our fix is not effective and the bug still presents itself with a probability greater than the \(p_{min}\) that we calculated, the probability of not seeing the bug after \(t\) tests is:

\[\alpha = (1-p_{min})^t \]

Here, \(\alpha\) is also the probability of making a type I error, while \(1 - \alpha\) is the statistical significance of our tests.

We now have two options:

  • arbitrarily determine a standard statistical significance and test enough times to assert it;
  • test as much as we can and report the achieved statistical significance.

Both options are valid. The first one is not always feasible, as the cost of each trial can be high in time and/or other kinds of resources.

The standard significance level in the industry is 5% (i.e., 95% statistical significance); we recommend that or better.

Formally, this is very similar to statistical hypothesis testing.

Back to the Bug

Testing 20 times

This file has the results found after running our program 5000 times. We must never throw out data, but let's pretend that we have tested our program only 20 times. The observed \(k/n\) ratio and the calculated \(p_{min}\) evolved as shown in the following graph:

After those 20 tests, our \(p_{min}\) is about 12%.

Suppose that we fix the bug and test it again. The following graph shows the statistical significance corresponding to the number of tests we do:

In words: we have to test 24 times after fixing the bug to reach 95% statistical significance, and 35 to reach 99%.

Now, what happens if we test more before fixing the bug?

Testing 5000 times

Let's now use all the results and assume that we tested 5000 times before fixing the bug. The graph below shows \(k/n\) and \(p_{min}\):

After those 5000 tests, our \(p_{min}\) is about 23% - much closer to the real \(p\).

The following graph shows the statistical significance corresponding to the number of tests we do after fixing the bug:

We can see in that graph that after about 11 tests we reach 95%, and after about 16 we get to 99%. As we have tested more before fixing the bug, we found a higher \(p_{min}\), and that allowed us to test less after fixing the bug.

Optimal testing

We have seen that we decrease \(t\) as we increase \(n\), as that potentially increases our lower estimate for \(p\). Of course, that estimate can decrease as we test, but that just means we "got lucky" in the first trials and we are getting to know the bug better - the estimate is approaching the real value in a non-deterministic way, after all.

But, how much should we test before fixing the bug? Which value is an ideal value for \(n\)?

To define an optimal value for \(n\), we will minimize the sum \(n+t\). This objective gives us the benefit of minimizing the total amount of testing without compromising our guarantees. Minimizing the testing can be fundamental if each test costs significant time and/or resources.

The graph below shows us the evolution of the value of \(t\) and \(t+n\) using the data we generated for our bug:

We can see clearly that there are some low values of \(n\) and \(t\) that give us the guarantees we need. Those values are \(n = 15\) and \(t = 24\), which gives us \(t+n = 39\).

While you can use this technique to minimize the total number of tests performed (even more so when testing is expensive), testing more is always a good thing, as it always improves our guarantees, be it in \(n\) by providing us with a better \(p\) or in \(t\) by increasing the statistical significance of the conclusion that the bug is fixed. So, before fixing the bug, test until you see the bug at least once, and then at least the amount specified by this technique - but also test more if you can; there is no upper bound, especially after fixing the bug. You can then report a higher confidence in the solution.

Conclusions

When a programmer finds a bug that behaves in a non-deterministic way, he knows he should test enough to know more about the bug, and then even more after fixing it. In this article we have presented a framework that provides criteria to define numerically how much testing is "enough" and "even more." The same technique also provides a method to objectively measure the guarantee that the amount of testing performed provides, when it is not possible to test "enough."

We have also provided a real example (even though the bug itself is artificial) where the framework is applied.

As usual, the source code of this page (R scripts, etc.) can be found and downloaded at https://github.com/lpenz/lpenz.github.io

December 09, 2013 12:00 AM

December 01, 2013

Gregor Gorjanc

Read line by line of a file in R

Are you using R for data manipulation for later use with other programs, i.e., a workflow something like this:
  1. read data sets from a disk,
  2. modify the data, and
  3. write it back to a disk.
All fine, but if the data set is really big, then you will soon stumble on memory issues. If the data processing is simple and you can read it in chunks, say line by line, then the following might be useful:

## File
file <- "myfile.txt"
 
## Create connection
con <- file(description=file, open="r")
 
## Hopefully you know the number of lines from some other source,
## or compute it like this:
com <- paste("wc -l ", file, " | awk '{ print $1 }'", sep="")
n <- as.integer(system(command=com, intern=TRUE))
 
## Loop over the file connection, one line at a time
for(i in 1:n) {
  tmp <- scan(file=con, nlines=1, quiet=TRUE)
  ## do something on a line of data
}
 
## Close connection
close(con)

by Gregor Gorjanc (noreply@blogger.com) at December 01, 2013 10:55 PM

August 13, 2013

Gregor Gorjanc

Setting up the inverse of the additive relationship matrix in R

Additive genetic covariance between individuals is one of the key concepts in (quantitative) genetics. When doing the prediction of additive genetic values for pedigree members, we need the inverse of the so-called numerator relationship matrix (NRM) or simply A. Matrix A has off-diagonal entries equal to the numerator of Wright's relationship coefficient and diagonal elements equal to 1 + inbreeding coefficient. I have blogged before about setting up such an inverse in R using a routine from the ASReml-R program or importing the inverse from the CFC program. However, this is not the only way to "skin this cat" in R. I am aware of the following attempts to provide this feature in R (the list is probably incomplete and I would be grateful if you pointed me to other implementations):
  • pedigree R package has function makeA() and makeAinv() with obvious meanings; there is also calcG() if you have a lot of marker data instead of pedigree information; there are also some other very handy functions calcInbreeding(), orderPed(), trimPed(), etc.
  • pedigreemm R package does not have direct implementation to get A inverse, but has all the needed ingredients, which makes the package even more interesting
  • MCMCglmm R package has function inverseA() which works with pedigree or phylo objects; there are also handy functions such as prunePed(), rbv(), sm2asreml(), etc.
  • kinship and kinship2 R packages have function kinship() to setup kinship matrix, which is equal to the half of A; there are also nice functions for plotting pedigrees etc. (see also here)
  • see also a series of R scripts for relationship matrices 
As I described before, the interesting thing is that setting up the inverse of A is easier and cheaper than setting up A and inverting it. This is very important for large applications. This is an old result that rests on the following matrix theory. We can decompose a symmetric positive definite matrix as A = LU = LL' (Cholesky decomposition) or as A = LDU = LDL' (generalized Cholesky decomposition), where L (U) is lower (upper) triangular and D is a diagonal matrix. Note that L and U in the previous two equations are not the same thing (L from the Cholesky is not equal to L from the generalized Cholesky decomposition)! Sorry for the sloppy notation. To confuse you even more, note that Henderson usually wrote A = TDT'. We can even write A = LSSU, where the diagonal of S is equal to the square root of the diagonal of D. This gets us back to A = LU = LL', since LSSU = LSSL' = LSS'L' = LS(LS)' (beware of the sloppy notation)!

The inverse rule says that inv(A) = inv(LDU) = inv(U) inv(D) inv(L) = inv(L)' inv(D) inv(L) = inv(L)' inv(S) inv(S) inv(L). I thank Martin Maechler for pointing out the last (obvious) bit to me. In Henderson's notation this would be inv(A) = inv(T)' inv(D) inv(T) = inv(T)' inv(S) inv(S) inv(T). Uf ... The important bit is that with the NRM (aka A), inv(L) has a nice simple structure - it shows the directed graph of additive genetic values in the pedigree - while inv(D) tells us about the precision (inverse variance) of additive genetic values given the additive genetic values of the parents, and therefore depends on knowledge of the parents and their inbreeding (the more they are inbred, the less variation we can expect in their progeny). Both inv(L) and inv(D) are easier to set up.

Packages MCMCglmm and pedigree give us inv(A) directly (we can also get inv(D) in MCMCglmm), but pedigreemm enables us to play around with the above matrix algebra and graph theory. First we need a small example pedigree. Below is an example with 10 members; there is some inbreeding, and individuals have both, one, or no parents known. It is hard to see the inbreeding directly from the table, but we will improve that later (see also here).

ped <- data.frame( id=c(  1,   2,   3,   4,   5,   6,   7,   8,   9,  10),
                  fid=c( NA,  NA,   2,   2,   4,   2,   5,   5,  NA,   8),
                  mid=c( NA,  NA,   1,  NA,   3,   3,   6,   6,  NA,   9))

Now we will create an object of a pedigree class and show the A = U'U stuff:

## install.packages(pkgs="pedigreemm")
library(package="pedigreemm")
 
ped2 <- with(ped, pedigree(sire=fid, dam=mid, label=id))
 
U <- relfactor(ped2)
A <- crossprod(U)
 
round(U, digits=2)
## 10 x 10 sparse Matrix of class "dtCMatrix"
## [1,] 1 . 0.50 . 0.25 0.25 0.25 0.25 . 0.12
## [2,] . 1 0.50 0.50 0.50 0.75 0.62 0.62 . 0.31
## [3,] . . 0.71 . 0.35 0.35 0.35 0.35 . 0.18
## [4,] . . . 0.87 0.43 . 0.22 0.22 . 0.11
## [5,] . . . . 0.71 . 0.35 0.35 . 0.18
## [6,] . . . . . 0.71 0.35 0.35 . 0.18
## [7,] . . . . . . 0.64 . . .
## [8,] . . . . . . . 0.64 . 0.32
## [9,] . . . . . . . . 1 0.50
## [10,] . . . . . . . . . 0.66

## To check
U - chol(A)

round(A, digits=2)
## 10 x 10 sparse Matrix of class "dsCMatrix"
## [1,] 1.00 . 0.50 . 0.25 0.25 0.25 0.25 . 0.12
## [2,] . 1.00 0.50 0.50 0.50 0.75 0.62 0.62 . 0.31
## [3,] 0.50 0.50 1.00 0.25 0.62 0.75 0.69 0.69 . 0.34
## [4,] . 0.50 0.25 1.00 0.62 0.38 0.50 0.50 . 0.25
## [5,] 0.25 0.50 0.62 0.62 1.12 0.56 0.84 0.84 . 0.42
## [6,] 0.25 0.75 0.75 0.38 0.56 1.25 0.91 0.91 . 0.45
## [7,] 0.25 0.62 0.69 0.50 0.84 0.91 1.28 0.88 . 0.44
## [8,] 0.25 0.62 0.69 0.50 0.84 0.91 0.88 1.28 . 0.64
## [9,] . . . . . . . . 1.0 0.50
## [10,] 0.12 0.31 0.34 0.25 0.42 0.45 0.44 0.64 0.5 1.00

Note that the pedigreemm package uses Matrix classes in order to store only what we need to store, e.g., matrix U is triangular (t in "dtCMatrix") and matrix A is symmetric (s in "dsCMatrix"). To show the generalized Cholesky A = LDU (or using Henderson notation A = TDT') we use gchol() from the bdsmatrix R package. Matrix T shows the "flow" of genes in the pedigree.

## install.packages(pkgs="bdsmatrix")
library(package="bdsmatrix")
tmp <- gchol(as.matrix(A))
D <- diag(tmp)
(T <- as(as.matrix(tmp), "dtCMatrix"))
## 10 x 10 sparse Matrix of class "dtCMatrix"
## [1,] 1.000 . . . . . . . . .
## [2,] . 1.0000 . . . . . . . .
## [3,] 0.500 0.5000 1.00 . . . . . . .
## [4,] . 0.5000 . 1.000 . . . . . .
## [5,] 0.250 0.5000 0.50 0.500 1.00 . . . . .
## [6,] 0.250 0.7500 0.50 . . 1.00 . . . .
## [7,] 0.250 0.6250 0.50 0.250 0.50 0.50 1 . . .
## [8,] 0.250 0.6250 0.50 0.250 0.50 0.50 . 1.0 . .
## [9,] . . . . . . . . 1.0 .
## [10,] 0.125 0.3125 0.25 0.125 0.25 0.25 . 0.5 0.5 1

## To check
L <- T %*% diag(sqrt(D))
L - t(U)
Now the A inverse part (inv(A) = inv(T)' inv(D) inv(T) = inv(T)' inv(S) inv(S) inv(T) using Henderson's notation). The nice thing is that the pedigreemm authors provided functions to get inv(T) and D.

(TInv <- as(ped2, "sparseMatrix"))
## 10 x 10 sparse Matrix of class "dtCMatrix" (unitriangular)
## 1 1.0 . . . . . . . . .
## 2 . 1.0 . . . . . . . .
## 3 -0.5 -0.5 1.0 . . . . . . .
## 4 . -0.5 . 1.0 . . . . . .
## 5 . . -0.5 -0.5 1.0 . . . . .
## 6 . -0.5 -0.5 . . 1.0 . . . .
## 7 . . . . -0.5 -0.5 1 . . .
## 8 . . . . -0.5 -0.5 . 1.0 . .
## 9 . . . . . . . . 1.0 .
## 10 . . . . . . . -0.5 -0.5 1
round(DInv <- Diagonal(x=1/Dmat(ped2)), digits=2)
## 10 x 10 diagonal matrix of class "ddiMatrix"
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 . . . . . . . . .
## [2,] . 1 . . . . . . . .
## [3,] . . 2 . . . . . . .
## [4,] . . . 1.33 . . . . . .
## [5,] . . . . 2 . . . . .
## [6,] . . . . . 2 . . . .
## [7,] . . . . . . 2.46 . . .
## [8,] . . . . . . . 2.46 . .
## [9,] . . . . . . . . 1 .
## [10,] . . . . . . . . . 2.33
 
round(t(TInv) %*% DInv %*% TInv, digits=2)
## 10 x 10 sparse Matrix of class "dgCMatrix"
## ...

round(crossprod(sqrt(DInv) %*% TInv), digits=2)
## 10 x 10 sparse Matrix of class "dsCMatrix"
## [1,] 1.5 0.50 -1.0 . . . . . . .
## [2,] 0.5 2.33 -0.5 -0.67 . -1.00 . . . .
## [3,] -1.0 -0.50 3.0 0.50 -1.00 -1.00 . . . .
## [4,] . -0.67 0.5 1.83 -1.00 . . . . .
## [5,] . . -1.0 -1.00 3.23 1.23 -1.23 -1.23 . .
## [6,] . -1.00 -1.0 . 1.23 3.23 -1.23 -1.23 . .
## [7,] . . . . -1.23 -1.23 2.46 . . .
## [8,] . . . . -1.23 -1.23 . 3.04 0.58 -1.16
## [9,] . . . . . . . 0.58 1.58 -1.16
## [10,] . . . . . . . -1.16 -1.16 2.33

## To check
solve(A) - crossprod(sqrt(DInv) %*% TInv)

The second method (using crossprod()) is preferred as it leads directly to a symmetric matrix (dsCMatrix), which stores only the upper or lower triangle. And make sure you do not do crossprod(TInv %*% sqrt(DInv)), as that is the wrong order of matrices.
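The order matters because crossprod(M) computes M'M, so the diagonal scaling must sit on the left. A tiny base-R check with made-up dense matrices (just to illustrate the identity, not taken from the pedigree above):

```r
## Unit lower triangular TInv and diagonal DInv, made up for illustration
TInv <- matrix(c(1, -0.5, 0, 1), nrow=2)  # [2,1] = -0.5, unit lower triangular
DInv <- diag(c(1, 2))

## crossprod(sqrt(DInv) %*% TInv) equals t(TInv) %*% DInv %*% TInv
good <- crossprod(sqrt(DInv) %*% TInv)
stopifnot(all.equal(good, t(TInv) %*% DInv %*% TInv))

## The reversed order computes TInv' DInv TInv with the scaling on the
## wrong side, which is a different matrix
bad <- crossprod(TInv %*% sqrt(DInv))
stopifnot(!isTRUE(all.equal(good, bad)))
```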

As promised, we will display (plot) the pedigree by converting the matrix objects to graph objects using the following code. Two examples are provided, using the graph and igraph packages. The former does a very good job on this example, but otherwise igraph seems to have much nicer support for editing etc.

## source("http://www.bioconductor.org/biocLite.R")
## biocLite(pkgs=c("graph", "Rgraphviz"))
library(package="graph")
library(package="Rgraphviz")
g <- as(t(TInv), "graph")
plot(g)



## install.packages(pkgs="igraph")
li
brary(package="igraph")
i &l
t;- igraph.from.graphNEL(graphNEL=g)
V(
i)$label <- 1:10
plot
(i, layout=layout.kamada.kawai)
## tkplot(i)

by Gregor Gorjanc (noreply@blogger.com) at August 13, 2013 02:28 PM

July 02, 2013

Gregor Gorjanc

Parse arguments of an R script

R can also be used as a scripting tool. We just need to add a shebang in the first line of a file (script):

#!/usr/bin/Rscript

and then the R code should follow.

Often we want to pass arguments to such a script; these can be collected in the script with the commandArgs() function. Then we need to parse the arguments and, conditional on them, do something. I came up with a rather general way of parsing these arguments using just these few lines:

## Collect arguments
args <- commandArgs(TRUE)
 
## Default setting when no arguments passed
if(length(args) < 1) {
  args <- c("--help")
}
 
## Help section
if("--help" %in% args) {
cat("
The R Script
 
Arguments:
--arg1=someValue - numeric, blah blah
--arg2=someValue - character, blah blah
--arg3=someValue - logical, blah blah
--help - print this text
 
Example:
./test.R --arg1=1 --arg2=\"output.txt\" --arg3=TRUE \n\n")
 
q(save="no")
}
 
## Parse arguments (we expect the form --arg=value)
parseArgs <- function(x) strsplit(sub("^--", "", x), "=")
argsDF <- as.data.frame(do.call("rbind", parseArgs(args)))
argsL <- as.list(as.character(argsDF$V2))
names(argsL) <- argsDF$V1
 
## Arg1 default
if(is.null(argsL$arg1)) {
## do something
}
 
## Arg2 default
if(is.null(argsL$arg2)) {
## do something
}
 
## Arg3 default
if(is.null(argsL$arg3)) {
## do something
}
 
## ... your code here ...
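To see what the parsing idiom produces, it can be run in an interactive session on a hand-made args vector (a minimal sketch; the argument names are just the ones from the help text above):

```r
## Simulate what commandArgs(TRUE) would return on the command line
## ./test.R --arg1=1 --arg2=output.txt --arg3=TRUE
args <- c("--arg1=1", "--arg2=output.txt", "--arg3=TRUE")

## Same parsing idiom as in the script
parseArgs <- function(x) strsplit(sub("^--", "", x), "=")
argsDF <- as.data.frame(do.call("rbind", parseArgs(args)))
argsL <- as.list(as.character(argsDF$V2))
names(argsL) <- argsDF$V1

## All values arrive as character strings, so convert as needed
stopifnot(argsL$arg2 == "output.txt",
          as.numeric(argsL$arg1) == 1,
          isTRUE(as.logical(argsL$arg3)))
```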

It is some work, but I find it pretty neat and have used it for quite a while now. I do wonder what others have come up with for this task. I hope I did not miss some very general solution.

by Gregor Gorjanc (noreply@blogger.com) at July 02, 2013 04:55 PM

March 24, 2013

Romain Francois

Moving


This blog is moving to blog.r-enthusiasts.com. The new one is powered by wordpress and gets a subdomain of r-enthusiasts.com.

See you there

by romain francois at March 24, 2013 03:52 PM

March 17, 2013

Modern Toolmaking

caretEnsemble Classification example

Here's a quick demo of how to fit a binary classification model with caretEnsemble. Please note that I haven't spent as much time debugging caretEnsemble for classification models, so there are probably more bugs than in my last post. Also note that multi-class models are not yet supported.

Right now, this code fails for me if I try a model like a nnet or an SVM for stacking, so there's clearly bugs to fix.

The greedy model relies 100% on the gbm, which makes sense as the gbm has an AUC of 1 on the training set. The linear model uses all of the models and achieves an AUC of 0.5. This is a little weird, as the gbm, rf, SVM, and knn all achieve an AUC close to 1.0 on the training set, and I would have expected the linear model to focus on these predictions. I'm not sure if this is a bug or a failure of my stacking model.

by Zachary Deane-Mayer (noreply@blogger.com) at March 17, 2013 04:04 AM