Planet R

November 15, 2018

CRANberries

New package UncDecomp with initial version 0.0.4

Package: UncDecomp
Title: Uncertainty Decomposition
Version: 0.0.4
Authors@R: c(person("Seonghyeon", "Kim", email = "shkim93@snu.ac.kr", role = c("aut", "cre")), person("Yongdai", "Kim", role = "aut"), person("Ilsang", "Ohn", role = "aut"))
Author: Seonghyeon Kim [aut, cre], Yongdai Kim [aut], Ilsang Ohn [aut]
Maintainer: Seonghyeon Kim <shkim93@snu.ac.kr>
Description: If a procedure consists of several stages and there are several scenarios that can be selected for each stage, uncertainty of the procedure can be decomposed by stages or scenarios. cum_uncertainty() is used to decompose uncertainty based on the cumulative uncertainty. stage_uncertainty() and scenario_uncertainty() are used to decompose uncertainty based on the second-order interaction ANOVA model. In stage_uncertainty() and scenario_uncertainty(), the uncertainty from the interaction effect of two stages is distributed equally to each stage.
Depends: R (>= 3.3.2)
License: GPL-2
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.1.0
NeedsCompilation: no
Packaged: 2018-11-06 09:37:21 UTC; ksh
Repository: CRAN
Date/Publication: 2018-11-15 17:40:12 UTC

More information about UncDecomp at CRAN

November 15, 2018 05:03 PM

New package POUMM with initial version 2.1.2

Package: POUMM
Type: Package
Title: The Phylogenetic Ornstein-Uhlenbeck Mixed Model
Version: 2.1.2
Date: 2018-11-15
Authors@R: person("Venelin", "Mitov", email = "vmitov@gmail.com", role = c("aut", "cre", "cph"))
Maintainer: Venelin Mitov <vmitov@gmail.com>
Description: The Phylogenetic Ornstein-Uhlenbeck Mixed Model (POUMM) makes it possible to estimate the phylogenetic heritability of continuous traits, to test hypotheses of neutral evolution versus stabilizing selection, to quantify the strength of stabilizing selection, to estimate measurement error and to make predictions about the evolution of a phenotype and phenotypic variation in a population. The package implements combined maximum likelihood and Bayesian inference of the univariate Phylogenetic Ornstein-Uhlenbeck Mixed Model, fast parallel likelihood calculation, maximum likelihood inference of the genotypic values at the tips, functions for summarizing and plotting traces and posterior samples, and functions for simulating univariate continuous trait evolution along a phylogenetic tree. For examples on using the package, see the package vignettes.
License: GPL (>= 3.0)
Encoding: UTF-8
LazyData: true
Depends: R (>= 3.1.0), stats, Rcpp, methods
LinkingTo: Rcpp
Imports: ape, data.table(>= 1.10.4), coda, foreach, ggplot2, lamW, adaptMCMC, utils
Suggests: testthat, Rmpfr, mvtnorm, lmtest, knitr, rmarkdown, parallel, doParallel
RoxygenNote: 6.1.1
VignetteBuilder: knitr, rmarkdown
ByteCompile: no
NeedsCompilation: yes
URL: https://venelin.github.io/POUMM/index.html, https://github.com/venelin/POUMM
BugReports: https://github.com/venelin/POUMM/issues
Packaged: 2018-11-15 16:04:47 UTC; vmitov
Author: Venelin Mitov [aut, cre, cph]
Repository: CRAN
Date/Publication: 2018-11-15 17:20:11 UTC

More information about POUMM at CRAN

November 15, 2018 05:03 PM

New package crseEventStudy with initial version 1.0

Package: crseEventStudy
Type: Package
Title: A Robust and Powerful Test of Abnormal Stock Returns in Long-Horizon Event Studies
Version: 1.0
Authors@R: c(person(given = "Siegfried", family = "Köstlmeier", role = c("aut", "cre"), email = "siegfried.koestlmeier@gmail.com"), person(given = "Seppo", family = "Pynnonen", role = c("aut"), email = "sjp@uwasa.fi"))
Maintainer: Siegfried Köstlmeier <siegfried.koestlmeier@gmail.com>
Description: Based on Dutta et al. (2018) <doi:10.1016/j.jempfin.2018.02.004>, this package provides their standardized test for abnormal returns in long-horizon event studies. The methods used improve on the major weaknesses of size, power, and robustness of long-run statistical tests described in Kothari/Warner (2007) <doi:10.1016/B978-0-444-53265-7.50015-9>. Abnormal returns are weighted by their statistical precision (i.e., standard deviation), resulting in abnormal standardized returns. This procedure efficiently captures the heteroskedasticity problem. Clustering techniques following Cameron et al. (2011) <doi:10.1198/jbes.2010.07136> are adopted for computing cross-sectional correlation robust standard errors. The statistical tests in this package therefore account for potential biases arising from returns' cross-sectional correlation, autocorrelation, and volatility clustering without power loss.
License: BSD_3_clause + file LICENSE
Imports: methods, stats, sandwich
Suggests: testthat
Depends: R (>= 3.5)
Encoding: UTF-8
LazyData: true
NeedsCompilation: no
Packaged: 2018-11-05 15:00:26 UTC; Siegfried
Author: Siegfried Köstlmeier [aut, cre], Seppo Pynnonen [aut]
Repository: CRAN
Date/Publication: 2018-11-15 17:40:08 UTC

More information about crseEventStudy at CRAN

November 15, 2018 05:02 PM

New package OsteoBioR with initial version 0.1.1

Package: OsteoBioR
Version: 0.1.1
Title: Temporal Estimation of Isotopic Values
Description: Estimates the temporal changes of isotopic values of bone and teeth data solely based on the renewal rate of different bones/teeth and given measurements. The package furthermore provides plotting and exporting functionalities.
Authors@R: c( person("Matthaeus", "Deutsch", ,"matthaeus.deutsch@inwt-statistics.de", c("aut")), person("Marcus", "Groß", email = "marcus.gross@inwt-statistics.de", role = c("aut", "cre")), person("Ricardo", "Fernandes", email = "ldv1452@gmail.com", role = c("aut")) )
License: GPL (>= 3)
Encoding: UTF-8
LazyData: true
ByteCompile: true
Depends: R (>= 3.4.0), Rcpp (>= 0.12.0)
Imports: rstan (>= 2.18.1), rstantools (>= 1.5.0), ggplot2 (>= 2.2.1), methods
LinkingTo: StanHeaders (>= 2.18.0), rstan (>= 2.18.1), BH (>= 1.66.0), Rcpp (>= 0.12.0), RcppEigen (>= 0.3.3.3.0)
Suggests: lintr, testthat
SystemRequirements: GNU make
NeedsCompilation: yes
RoxygenNote: 6.1.0
Packaged: 2018-11-06 08:56:08 UTC; mgross
Author: Matthaeus Deutsch [aut], Marcus Groß [aut, cre], Ricardo Fernandes [aut]
Maintainer: Marcus Groß <marcus.gross@inwt-statistics.de>
Repository: CRAN
Date/Publication: 2018-11-15 17:40:15 UTC

More information about OsteoBioR at CRAN

November 15, 2018 05:02 PM

Dirk Eddelbuettel

Rcpp now used by 1500 CRAN packages

1500 Rcpp packages

Right now Rcpp stands at 1500 reverse-dependencies on CRAN. The graph on the left depicts the growth of Rcpp usage (as measured by Depends, Imports and LinkingTo, but excluding Suggests) over time. What an amazing few days this has been, as we also just marked the tenth anniversary and the big one dot oh release.

Rcpp cleared 300 packages in November 2014. It passed 400 packages in June 2015 (when I only tweeted about it), 500 packages in late October 2015, 600 packages in March 2016, 700 packages in July 2016, 800 packages in October 2016, 900 packages in early January 2017,
1000 packages in April 2017, and 1250 packages in November 2017. The chart extends to the very beginning via manually compiled data from CRANberries and checked with crandb. The next part uses manually saved entries. The core (and by far largest) part of the data set was generated semi-automatically via a short script appending updates to a small file-based backend. A list of packages using Rcpp is kept on this page.

Also displayed in the graph is the relative proportion of CRAN packages using Rcpp. The four per-cent hurdle was cleared just before useR! 2014, where I showed a similar graph (as two distinct graphs) in my invited talk. We passed five percent in December 2014, six percent in July 2015, seven percent just before Christmas 2015, eight percent in the summer of 2016, nine percent in mid-December 2016, cracked ten percent in the summer of 2017, and eleven percent this year. We are currently at 11.199 percent or just over one in nine packages. There is more detail in the chart: how CRAN seems to be pushing back more and removing more aggressively (which my CRANberries tracks, but not in as much detail as it could), and how the growth of Rcpp seems to be slowing somewhat, both outright and even more so as a proportion of CRAN – just like every growth curve should, eventually. But we leave all that for another time.

1500 Rcpp packages

1500 user packages is pretty mind-boggling. We can use the progression of CRAN itself compiled by Henrik in a series of posts and emails to the main development mailing list. Not that long ago CRAN itself did not have 1500 packages, and here we are at almost 13400 with Rcpp at 11.2% and still growing (albeit slightly more slowly). Amazeballs.
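
As a quick back-of-the-envelope check of that proportion in R (using the rounded counts quoted above, so the result differs slightly from the exact 11.199 percent):

round(100 * 1500 / 13400, 2)
[1] 11.19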

This puts a whole lot of responsibility on us in the Rcpp team as we continue to keep Rcpp as performant and reliable as it has been.

And with that, and as always, a very big Thank You! to all users and contributors of Rcpp for help, suggestions, bug reports, documentation or, of course, code.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

November 15, 2018 11:44 AM

November 14, 2018

Dirk Eddelbuettel

anytime 0.3.3

A new minor clean-up release of the anytime package arrived on CRAN overnight. This is the fourteenth release, and follows the 0.3.2 release a good week ago.

anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, … format to either POSIXct or Date objects – and to do so without requiring a format string. See the anytime page, or the GitHub README.md for a few examples.

This release really adds the nice new vignette as a vignette—there was a gotcha in the 0.3.2 release—and updates some core documentation in the README.md to correctly show anydate() on input such as 20160101 (which was an improvement made starting with the 0.3.0 release).
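
For illustration, here is a minimal example of the two helpers (a sketch assuming the anytime package is installed; the printed value in the POSIXct case depends on the local timezone):

library(anytime)
anydate(20160101)                # numeric input such as 20160101 -> Date "2016-01-01"
anytime("2016-01-01 10:11:12")   # character input -> POSIXct, no format string needed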

Changes in anytime version 0.3.3 (2018-11-13)

  • Vignette build quirkyness on Windows resolved so vignette reinstated.

  • Documentation updated showing correct use of anydate (and not anytime) on input like ‘20160101’ following the 0.3.0 release heuristic change.

  • Set #define for Boost to make compilation more quiet.

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the anytime page.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

November 14, 2018 11:03 AM

November 12, 2018

Journal of Statistical Software

Flexible Self-Organizing Maps in kohonen 3.0

Self-organizing maps (SOMs) are popular tools for grouping and visualizing data in many areas of science. This paper describes recent changes in package kohonen, implementing several different forms of SOMs. These changes are primarily focused on making the package more useable for large data sets. Memory consumption has decreased dramatically, amongst others, by replacing the old interface to the underlying compiled code by a new one relying on Rcpp. The batch SOM algorithm for training has been added in both sequential and parallel forms. A final important extension of the package's repertoire is the possibility to define and use data-dependent distance functions, extremely useful in cases where standard distances like the Euclidean distance are not appropriate. Several examples of possible applications are presented.

by Ron Wehrens, Johannes Kruisselbrink at November 12, 2018 12:00 AM

MTPmle: A SAS Macro and Stata Programs for Marginalized Inference in Semi-Continuous Data

We develop a SAS macro and equivalent Stata programs that provide marginalized inference for semi-continuous data using a maximum likelihood approach. These software extensions are based on recently developed methods for marginalized two-part (MTP) models. Both the SAS and Stata extensions can fit simple MTP models for cross-sectional semi-continuous data. In addition, the SAS macro can fit random intercept models for longitudinal or clustered data, whereas the Stata programs can fit MTP models that account for subject level heteroscedasticity and for a complex survey design. Differences and similarities between the two software extensions are highlighted to provide a comparative picture of the available options for estimation, inclusion of random effects, convergence diagnosis, and graphical display. We provide detailed programming syntax, simulated and real data examples to facilitate the implementation of the MTP models for both SAS and Stata software users.

by Delia C. Voronca, Mulugeta Gebregziabher, Valerie Durkalski-Mauldin, Lei Liu, Leonard E. Egede at November 12, 2018 12:00 AM

SSpace: A Toolbox for State Space Modeling

SSpace is a MATLAB toolbox for state space modeling. State space modeling is in itself a powerful and flexible framework for dynamic system modeling, and SSpace is conceived in a way that tries to maximize this flexibility. One of the most salient features is that users implement their models by coding a MATLAB function. In this way, users have complete flexibility when specifying the systems, have absolute control on parameterizations, constraints among parameters, etc. Besides, the toolbox allows for some ways to implement either non-standard models or standard models with non-standard extensions, like heteroskedasticity, time-varying parameters, arbitrary nonlinear relations with inputs, transfer functions without the need of using explicitly the state space form, etc. The toolbox may be used on the basis of scratch state space systems, but is supplied with a number of templates for standard widespread models. A full help system and documentation are provided. The way the toolbox is built allows for extensions in many ways. In order to fuel such extensions and discussions an online forum has been launched.

by Marco A. Villegas, Diego J. Pedregal at November 12, 2018 12:00 AM

Rqc: A Bioconductor Package for Quality Control of High-Throughput Sequencing Data

As sequencing costs drop with the constant improvements in the field, next-generation sequencing becomes one of the most used technologies in biological research. Sequencing technology allows the detailed characterization of events at the molecular level, including gene expression, genomic sequence and structural variants. Such experiments result in billions of sequenced nucleotides and each one of them is associated to a quality score. Several software tools allow the quality assessment of whole experiments. However, users need to switch between software environments to perform all steps of data analysis, adding an extra layer of complexity to the data analysis workflow. We developed Rqc, a Bioconductor package designed to assist the analyst during assessment of high-throughput sequencing data quality. The package uses parallel computing strategies to optimize large data sets processing, regardless of the sequencing platform. We created new data quality visualization strategies by using established analytical procedures. That improves the ability of identifying patterns that may affect downstream procedures, including undesired sources technical variability. The software provides a framework for writing customized reports that integrates seamlessly to the R/Bioconductor environment, including publication-ready images. The package also offers an interactive tool to generate quality reports dynamically. Rqc is implemented in R and it is freely available through the Bioconductor project (https://bioconductor.org/packages/Rqc/) for Windows, Linux and Mac OS X operating systems.

by Wélliton de Souza, Benilton de Sá Carvalho, Iscia Lopes-Cendes at November 12, 2018 12:00 AM

November 10, 2018

Dirk Eddelbuettel

RcppArmadillo 0.9.200.4.0

armadillo image

A new RcppArmadillo release, now at 0.9.200.4.0, based on the new Armadillo release 9.200.4 from earlier this week, is now on CRAN, and should get to Debian very soon.

Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 532 (or 31 more since just the last release!) other packages on CRAN.

This release once again brings a number of improvements, see below for details.

Changes in RcppArmadillo version 0.9.200.4.0 (2018-11-09)

  • Upgraded to Armadillo release 9.200.4 (Carpe Noctem)

    • faster handling of symmetric positive definite matrices by rcond()

    • faster transpose of matrices with size ≥ 512x512

    • faster handling of compound sparse matrix expressions by accu(), diagmat(), trace()

    • faster handling of sparse matrices by join_rows()

    • expanded sign() to handle scalar arguments

    • expanded operators (*, %, +, -) to handle sparse matrices with differing element types (e.g. multiplication of a complex matrix by a real matrix)

    • expanded conv_to() to allow conversion between sparse matrices with differing element types

    • expanded solve() to optionally allow keeping solutions of systems singular to working precision

    • workaround for gcc and clang bug in C++17 mode

  • Commented-out a sparse matrix test consistently failing on the fedora-clang machine at CRAN, and only there. No fix without access.

  • The 'Unit test' vignette is no longer included.

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

November 10, 2018 08:01 PM

November 08, 2018

Dirk Eddelbuettel

Rcpp 1.0.0: The Tenth Birthday Release

As mentioned here two days ago, the Rcpp package turned ten on Monday—and we used the opportunity to mark the current version as 1.0.0! Thanks to everybody who liked and retweeted our tweet about this. And of course, once more a really big Thank You! to everybody who helped along this journey: Rcpp Core team, contributors, bug reporters, workshop and tutorial attendees and last but not least all those users – we did well. So let’s enjoy and celebrate this moment.

As indicated in Monday’s blog post, we had also planned to upload this version to CRAN, and this 1.0.0 release arrived on CRAN after the customary inspection and is now available. I will build the Debian package in a moment; it will then find its way to Ubuntu and to the CRAN-mirrored backports that Michael looks after so well.

While this release is of course marked as 1.0.0 signifying the feature and release stability we have had for some time, it also marks another regular release at the now-common bi-monthly schedule following nineteen releases since July 2016 in the 0.12.* series as well as another five in the preceding 0.11.* series.

Rcpp has become the most popular way of enhancing GNU R with C or C++ code. As of today, 1493 packages on CRAN depend on Rcpp for making analytical code go faster and further, along with another 150 in the (very recent) BioConductor release 3.8. Per the (partial) logs of CRAN downloads, we were reaching more than 900,000 downloads a month of late.

Once again, we have a number of nice pull requests from the usual gang of contributors in there, see below for details.

Changes in Rcpp version 1.0.0 (2018-11-05)

  • Happy tenth birthday to Rcpp, and hello release 1.0 !

  • Changes in Rcpp API:

    • The empty destructor for the Date class was removed to please g++-9 (prerelease) and -Wdeprecated-copy (Dirk).

    • The constructor for NumericMatrix(not_init(n,k)) was corrected (Romain in #904, Dirk in #905, and also Romain in #908 fixing #907).

    • Rcpp::String no longer silently drops embedded NUL bytes in strings but throws new Rcpp exception embedded_nul_in_string. (Kevin in #917 fixing #916).

  • Changes in Rcpp Deployment:

    • The Dockerfile for Continuous Integration sets the required test flag (for release versions) inside the container (Dirk).

    • Correct the R CMD check call to skip vignettes (Dirk).

  • Changes in Rcpp Attributes:

    • A new [[Rcpp::init]] attribute allows function registration for running on package initialization (JJ in #903).

    • Sort the files scanned for attributes in the C locale for stable output across systems (JJ in #912).

  • Changes in Rcpp Documentation:

    • The 'Rcpp Extending' vignette was corrected and refers to EXPOSED rather than EXPORTED (Ralf Stubner in #910).

    • The 'Unit test' vignette is no longer included (Dirk in #914).

Thanks to CRANberries, you can also look at a diff to the previous release. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

November 08, 2018 12:47 AM

October 31, 2018

Bioconductor Project Working Papers

Technical Considerations in the Use of the E-value

The E-value is defined as the minimum strength of association on the risk ratio scale that an unmeasured confounder would have to have with both the exposure and the outcome, conditional on the measured covariates, to explain away the observed exposure-outcome association. We have elsewhere proposed that the reporting of E-values for estimates and for the limit of the confidence interval closest to the null become routine whenever causal effects are of interest. A number of questions have arisen about the use of E-value including questions concerning the interpretation of the relevant confounding association parameters, the nature of the transformation from the risk ratio scale to the E-value scale, inference for and using E-values, and the relation to Rosenbaum’s notion of design sensitivity. Here we bring these various questions together and provide responses that we hope will assist in the interpretation of E-values and will further encourage their use.

by Tyler J. VanderWeele et al. at October 31, 2018 01:20 AM

October 28, 2018

Journal of the Royal Statistical Society: Series B

October 19, 2018

Bioconductor Project Working Papers

Analysis of Covariance (ANCOVA) in Randomized Trials: More Precision, Less Conditional Bias, and Valid Confidence Intervals, Without Model Assumptions

"Covariate adjustment" in the randomized trial context refers to an estimator of the average treatment effect that adjusts for chance imbalances between study arms in baseline variables (called "covariates"). The baseline variables could include, e.g., age, sex, disease severity, and biomarkers. According to two surveys of clinical trial reports, there is confusion about the statistical properties of covariate adjustment. We focus on the ANCOVA estimator, which involves fitting a linear model for the outcome given the treatment arm and baseline variables, and trials with equal probability of assignment to treatment and control. We prove the following new (to the best of our knowledge) robustness property of ANCOVA to arbitrary model misspecification: Not only is the ANCOVA point estimate consistent (as proved by Yang and Tsiatis (2001)) but so is its standard error. This implies that confidence intervals and hypothesis tests conducted as if the linear model were correct are still valid even when the linear model is arbitrarily misspecified, e.g., when the baseline variables are nonlinearly related to the outcome or there is treatment effect heterogeneity. We also give a simple, robust formula for the variance reduction (equivalently, sample size reduction) from using ANCOVA. By re-analyzing completed randomized trials for mild cognitive impairment, schizophrenia, and depression, we demonstrate how ANCOVA can reduce variance, reduce bias conditional on chance imbalance, and increase power even when by chance there is perfect balance across arms in the baseline variables.

by Bingkai Wang et al. at October 19, 2018 05:47 PM

October 14, 2018

Journal of the Royal Statistical Society: Series C

September 23, 2018

Journal of the Royal Statistical Society: Series A

August 31, 2018

Bioconductor Project Working Papers

Robust Inference for the Stepped Wedge Design

Based on a permutation argument, we derive a closed form expression for an estimate of the treatment effect, along with its standard error, in a stepped wedge design. We show that these estimates are robust to misspecification of both the mean and covariance structure of the underlying data-generating mechanism, thereby providing a robust approach to inference for the treatment effect in stepped wedge designs. We use simulations to evaluate the type I error and power of the proposed estimate and to compare the performance of the proposed estimate to the optimal estimate when the correct model specification is known. The limitations, possible extensions, and open problems regarding the method are discussed.

by James P. Hughes et al. at August 31, 2018 08:37 PM

July 26, 2018

Bioconductor Project Working Papers

Studying the Optimal Scheduling for Controlling Prostate Cancer under Intermittent Androgen Suppression

This retrospective study shows that the majority of patients’ correlations between PSA and testosterone during the on-treatment period are at least 0.90. Model-based duration calculations to control PSA levels during the off-treatment period are provided. There are two pairs of models. In one pair, the Generalized Linear Model and Mixed Model are both used to analyze the variability of PSA at the individual patient level by using the variable “Patient ID” as a repeated measure. In the second pair, Patient ID is not used as a repeated measure, but additional baseline variables are included to analyze the variability of PSA.

by Sunil K. Dhar et al. at July 26, 2018 03:45 PM

May 05, 2018

R you ready?

Remove password protection from Excel sheets using R

Most data scientists wish that all data lived neatly managed in some DB. In reality, however, Excel files are ubiquitous and a common way to disseminate results or data within many companies. Every now and then I found myself in the situation where I wanted to protect Excel sheets against users accidentally changing them, only to find a few months later that I had forgotten the password I used. The “good” thing is that protecting Excel sheets by password is far from safe, and access can be recovered quite easily. The following works for .xlsx files only (tested with Excel 2016 files), not for the older .xls files.

Before implementing the steps in R, I will outline how to remove the password protection “by hand”. The R way is simply the automation of these steps. The first thing one needs to understand is that a .xlsx file is just a collection of folders and files in a zip container. If you unzip a .xlsx file (e.g. using 7-Zip) you get the following folder structure (sorry, German UI):

In the folder ./xl/worksheets we find one XML file for each Excel sheet. The sheet’s password protection is encoded directly in this file. While former versions stored the plain password text in the XML, we now find the hashed password (the part marked in yellow below). To get rid of the password protection, we can simply remove the whole sheetProtection node from the XML in any text editor and save the file.

As the last step, we need to recreate the .xlsx file by creating a zip folder that contains our modified XML file (German UI again).

Finally, we change the file extension from .zip back to .xlsx and voilà, we get an Excel file without password-protected sheets. Programming the steps outlined above in R is quite straightforward. The steps are commented in the GitHub gist below.
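
The gist itself is not reproduced here, but a minimal sketch of the automation might look as follows (assuming the zip package is available for re-zipping; the function name remove_sheet_protection() is illustrative and not taken from the original gist):

remove_sheet_protection <- function(xlsx_in, xlsx_out = "unprotected.xlsx") {
    xlsx_out <- normalizePath(xlsx_out, mustWork = FALSE)
    tmp <- tempfile("xlsx_"); dir.create(tmp)
    utils::unzip(xlsx_in, exdir = tmp)              # a .xlsx file is just a zip container
    sheets <- list.files(file.path(tmp, "xl", "worksheets"),
                         pattern = "\\.xml$", full.names = TRUE)
    for (f in sheets) {
        xml <- readLines(f, warn = FALSE)
        # drop the sheetProtection node that holds the hashed password
        writeLines(gsub("<sheetProtection[^>]*/>", "", xml), f)
    }
    # re-zip the contents and give the archive its .xlsx extension back
    old_wd <- setwd(tmp); on.exit(setwd(old_wd), add = TRUE)
    zip::zipr(xlsx_out, files = list.files("."))
    invisible(xlsx_out)
}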

by markheckmann at May 05, 2018 11:07 PM

April 18, 2018

Rcpp Gallery

Performance considerations with sparse matrices in Armadillo

Introduction

Besides outstanding support for dense matrices, the Armadillo library also provides a great way to manipulate sparse matrices in C++. However, the performance characteristics of dealing with sparse matrices may be surprising if one is only familiar with dense matrices. This is a collection of observations on getting best performance with sparse matrices in Armadillo.

All the timings in this article were generated using Armadillo version 8.500. This version adds a number of substantial optimisations for sparse matrix operations, in some cases speeding things up by as much as two orders of magnitude.

General considerations: sparsity, row vs column access

Perhaps the most important thing to note is that the efficiency of sparse algorithms can depend strongly on the level of sparsity in the data. If your matrices and vectors are very sparse (most elements equal to zero), you will often see better performance even if the nominal sizes of those matrices remain the same. This isn’t specific to C++ or Armadillo; it applies to sparse algorithms in general, including the code used in the Matrix package for R. By contrast, algorithms for working with dense matrices usually aren’t affected by sparsity.

Similarly, the pattern of accessing elements, whether by rows or by columns, can have a significant impact on performance. This is due to caching, which modern CPUs use to speed up memory access: accessing elements that are already in the cache is much faster than retrieving them from main memory. If one iterates or loops over the elements of a matrix in Armadillo, one should try to iterate over columns first, then rows, to maximise the benefits of caching. This applies to both dense and sparse matrices. (Technically, at least for dense matrices, whether to iterate over rows or columns first depends on how the data is stored in memory. Both R and Armadillo store matrices in column-major order, meaning that elements in the same column are contiguous in memory. Sparse matrices are more complex but the advice to iterate by columns is basically the same; see below.)

Matrix multiplication

We start with a simple concrete example: multiplying two matrices together. In R, this can be done using the %*% operator which (via the Matrix package) is able to handle any combination of sparse and dense inputs. However, let us assume we want to do the multiplication in Armadillo: for example if the inputs are from other C++ functions, or if we want more precise control of the output.

In Armadillo, the * operator multiplies two matrices together, and this also works for any combination of sparse and dense inputs. However, the speed of the operation can vary tremendously, depending on which of those inputs is sparse. To see this, let us define a few simple functions:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::sp_mat mult_sp_sp_to_sp(const arma::sp_mat& a, const arma::sp_mat& b) {
    // sparse x sparse -> sparse
    arma::sp_mat result(a * b);
    return result;
}

// [[Rcpp::export]]
arma::sp_mat mult_sp_den_to_sp(const arma::sp_mat& a, const arma::mat& b) {
    // sparse x dense -> sparse
    arma::sp_mat result(a * b);
    return result;
}

// [[Rcpp::export]]
arma::sp_mat mult_den_sp_to_sp(const arma::mat& a, const arma::sp_mat& b) {
    // dense x sparse -> sparse
    arma::sp_mat result(a * b);
    return result;
}

The outputs of these functions are all the same, but they take different types of inputs: either two sparse matrices, or a sparse and a dense matrix, or a dense and a sparse matrix (the order matters). Let us call them on some randomly generated data:

library(Matrix)
set.seed(98765)
n <- 5e3
# 5000 x 5000 matrices, 99% sparse
a <- rsparsematrix(n, n, 0.01, rand.x=function(n) rpois(n, 1) + 1)
b <- rsparsematrix(n, n, 0.01, rand.x=function(n) rpois(n, 1) + 1)

a_den <- as.matrix(a)
b_den <- as.matrix(b)

system.time(m0 <- a %*% b)
   user  system elapsed 
  0.230   0.091   0.322 
system.time(m1 <- mult_sp_sp_to_sp(a, b))
   user  system elapsed 
  0.407   0.036   0.442 
system.time(m2 <- mult_sp_den_to_sp(a, b_den))
   user  system elapsed 
  1.081   0.100   1.181 
system.time(m3 <- mult_den_sp_to_sp(a_den, b))
   user  system elapsed 
  0.826   0.087   0.913 
all(identical(m0, m1), identical(m0, m2), identical(m0, m3))
[1] TRUE

While all the times are within an order of magnitude of each other, multiplying a dense and a sparse matrix takes about twice as long as multiplying two sparse matrices together, and multiplying a sparse and dense matrix takes about three times as long. (The sparse-dense multiply is actually one area where Armadillo 8.500 makes massive gains over previous versions. This operation used to take much longer due to using an unoptimised multiplication algorithm.)

Let us see if we can help the performance of the mixed-type functions by creating a temporary sparse copy of the dense input. This forces Armadillo to use the sparse-sparse version of matrix multiply, which as seen above is much more efficient. For example, here is the result of tweaking the dense-sparse multiply. Creating the sparse copy does take some extra time and memory, but not enough to affect the result.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::sp_mat mult_sp_den_to_sp2(const arma::sp_mat& a, const arma::mat& b) {
    // sparse x dense -> sparse
    // copy dense to sparse, then multiply
    arma::sp_mat temp(b);
    arma::sp_mat result(a * temp);
    return result;
}
system.time(m4 <- mult_sp_den_to_sp2(a, b_den))
   user  system elapsed 
  0.401   0.047   0.448 
identical(m0, m4)
[1] TRUE

This shows that mixing sparse and dense inputs can have significant effects on the efficiency of your code. To avoid unexpected slowdowns, consider sticking to either sparse or dense classes for all your objects. If one decides to mix them, it is worth remembering to test and profile the code.

Row vs column access

Consider another simple computation: multiply the elements of a matrix by their row number, then sum them up. (The multiply by row number is to make it not completely trivial.) That is, for a matrix x with elements x_{ij}, we want to obtain:

\sum_{i,j} i \, x_{ij}

Armadillo lets us extract individual rows and columns from a matrix, using the .row() and .col() member functions. We can use row extraction to do this computation:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::sp_mat row_slice(const arma::sp_mat& x, const int n) {
    return x.row(n - 1);
}
system.time({
    result <- sapply(1:nrow(a),
        function(i) i * sum(row_slice(a, i)))
    print(sum(result))
})
[1] 1248361320
   user  system elapsed 
  1.708   0.000   1.707 

For a large matrix, this takes a not-insignificant amount of time, even on a fast machine. To speed it up, we will make use of the fact that Armadillo uses the compressed sparse column (CSC) format for storing sparse matrices. This means that the data for a matrix is stored as three vectors: the nonzero elements; the row indices of these elements (ordered by column); and a vector of offsets for the first row index in each column. Since the vector of row indices is ordered by column, and we have the starting offsets for each column, it turns out that extracting a column slice is actually very fast. We only need to find the offset for that column, and then pull out the elements and row indices up to the next column offset. On the other hand, extracting a row is much more work; we have to search through the indices to find those matching the desired row.
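
For intuition, the same CSC layout can be inspected from R via the Matrix package, whose dgCMatrix class uses the same three-vector representation as Armadillo's sp_mat (a small illustrative example, not part of the original article):

library(Matrix)
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(10, 20, 30))
m@x   # nonzero values:         10 20 30
m@i   # zero-based row indices:  0  2  1   (ordered by column)
m@p   # column offsets into @i:  0  2  2  3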

We can put this knowledge to use on our problem. Row slicing a matrix is the same as transposing it and then slicing by columns, so let us define a new function to return a column slice. (Transposing the matrix takes only a tiny fraction of a second.)

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::sp_mat col_slice(const arma::sp_mat& x, const int n) {
    return x.col(n - 1);
}
a_t <- t(a)
system.time({
    result <- sapply(1:nrow(a_t),
        function(i) i * sum(col_slice(a_t, i)))
    print(sum(result))
})
[1] 1248361320
   user  system elapsed 
  0.766   0.000   0.766 

The time taken has come down by quite a substantial margin. This reflects the ease of obtaining column slices for sparse matrices.

The R–C++ interface

We can take the previous example further. Each time R calls a C++ function that takes a sparse matrix as input, it makes a copy of the data. Similarly, when the C++ function returns, its sparse outputs are copied into R objects. When the function itself is very simple—as it is here—all this copying and memory shuffling can be a significant proportion of the time taken.

Rather than calling sapply in R to iterate over rows, let us do the same entirely in C++:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double sum_by_row(const arma::sp_mat& x) {
    double result = 0;
    for (size_t i = 0; i < x.n_rows; i++) {
        arma::sp_mat row(x.row(i));
        for (arma::sp_mat::iterator j = row.begin(); j != row.end(); ++j) {
            result += *j * (i + 1);
        }
    }
    return result;
}
system.time(print(sum_by_row(a)))
[1] 1248361320
   user  system elapsed 
  0.933   0.000   0.935 

This is again a large improvement. But what if we do the same with column slicing?

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double sum_by_col(const arma::sp_mat& x) {
    double result = 0;
    for (size_t i = 0; i < x.n_cols; i++) {
        arma::sp_mat col(x.col(i));
        for (arma::sp_mat::iterator j = col.begin(); j != col.end(); ++j) {
            result += *j * (i + 1);
        }
    }
    return result;
}
system.time(print(sum_by_col(a_t)))
[1] 1248361320
   user  system elapsed 
  0.005   0.000   0.006 

Now the time is less than a tenth of a second, which is faster than the original code by roughly three orders of magnitude. The moral of the story is: rather than constantly switching between C++ and R, you should try to stay in one environment for as long as possible. If your code involves a loop with a C++ call inside (including functional maps like lapply and friends), consider writing the loop entirely in C++ and combining the results into a single object to return to R.

(It should be noted that this interface tax is less onerous for built-in Rcpp classes such as NumericVector or NumericMatrix, which do not require making copies of the data. Sparse data types are different, and in particular Armadillo’s sparse classes do not provide constructors that can directly use auxiliary memory.)

Iterators vs elementwise access

Rather than taking explicit slices of the data, let us try using good old-fashioned loops over the matrix elements. This is easily coded up in Armadillo, and it turns out to be quite efficient, relatively speaking. It is not as fast as using column slicing, but much better than row slicing.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double sum_by_element(const arma::sp_mat& x) {
    double result = 0;
    // loop over columns, then rows: see comments at the start of this article
    for (size_t j = 0; j < x.n_cols; j++) {
        for (size_t i = 0; i < x.n_rows; i++) {
            result += x(i, j) * (i + 1);
        }
    }
    return result;
}
system.time(print(sum_by_element(a)))
[1] 1248361320
   user  system elapsed 
  0.176   0.000   0.176 

However, we can still do better. In Armadillo, the iterators for sparse matrix classes iterate only over the nonzero elements. Let us compare this to our naive double loop:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double sum_by_iterator(const arma::sp_mat& x) {
    double result = 0;
    for (arma::sp_mat::const_iterator i = x.begin(); i != x.end(); ++i) {
        result += *i * (i.row() + 1);
    }
    return result;
}
system.time(print(sum_by_iterator(a)))
[1] 1248361320
   user  system elapsed 
  0.002   0.000   0.002 

This is the best time achieved so far, to the extent that system.time might have difficulty capturing it. The timings are now so low that we should use the microbenchmark package to get accurate measurements:

library(microbenchmark)
microbenchmark(col=sum_by_col(a_t), 
               elem=sum_by_element(a), 
               iter=sum_by_iterator(a),
               times=20)
Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
  col   4.78921   4.88444   5.05229   4.99184   5.18450   5.50579    20
 elem 172.84830 177.20431 179.87007 179.06447 182.08075 188.11256    20
 iter   1.02268   1.05447   1.12611   1.12627   1.16482   1.30800    20

Thus, using iterators represents a greater than three-order-of-magnitude speedup over the original row-slicing code.

April 18, 2018 12:00 AM

March 27, 2018

Rcpp Gallery

Suppressing Call Stack Info in Rcpp-Generated Errors and Warnings

Introduction

Rcpp has an elegant mechanism of exception handling whereby C++ exceptions are automatically translated to errors in R. For most projects, the Rcpp::stop wrapper (in conjunction with the BEGIN_RCPP and END_RCPP macros automatically inserted by RcppAttributes) is sufficient and easy to use, providing an Rcpp equivalent of base::stop.

By default, it captures the call stack and attaches it to the exception in R, giving informative error messages:

#include "Rcpp.h"
using namespace Rcpp; 

//[[Rcpp::export]]         
NumericVector add1(NumericVector x, NumericVector y){
    if(x.size() != y.size()){
        stop("x and y are not the same length!");
    }
    return x + y; 
}
add1(1:5, 1:3)
Error in add1(1:5, 1:3): x and y are not the same length!

This matches the default behavior of base::stop() which captures the call info.

For complex calling patterns (e.g., creating an argument list and calling the Rcpp function with do.call), the resulting error messages are less helpful:

#include "Rcpp.h"
using namespace Rcpp; 

// [[Rcpp::export]]
NumericVector internal_function_name(NumericVector x, NumericVector y){
    if(x.size() != y.size()){
        stop("x and y are not the same length!");
    }
    return x + y; 
}
add2 <- function(x, y){
    if(!is.numeric(x)){
        x <- as.numeric(x)
    }

    do.call(internal_function_name, list(x, y))
}

add2(1:5, 1:3)
Error in (function (x, y) : x and y are not the same length!

If the internal error were being generated in R code, we might choose to use the call.=FALSE argument to base::stop to suppress the unhelpful (function (x, y) part of the error message, but we don’t (immediately) have a corresponding option in Rcpp. In this gallery post, we show how to suppress the call-stack capture of Rcpp::stop to give cleaner error messages.

Error Messages

The key functionality was added to Rcpp by Jim Hester in Rcpp Pull Request #663. To generate an R-level exception without a call stack, we pass an optional false flag to Rcpp::exception. For example,

#include "Rcpp.h"
using namespace Rcpp; 

// [[Rcpp::export]]
NumericVector internal_function_name2(NumericVector x, NumericVector y){
    if(x.size() != y.size()){
        throw Rcpp::exception("x and y are not the same length!", false);
    }
    return x + y; 
}
add3 <- function(x, y){
    if(!is.numeric(x)){
        x <- as.numeric(x)
    }

    do.call(internal_function_name2, list(x, y))
}

add3(1:5, 1:3)
Error: x and y are not the same length!

This can’t capture the R level call stack, but it is at least cleaner than the error message from the previous example.

Note that here, as elsewhere in C++, we need to handle exceptions using a try/catch structure, but we do not add it explicitly because RcppAttributes automatically handles this for us.

Warnings

Similar to Rcpp::stop, Rcpp also provides a warning function to generate R level warnings. It has the same call-stack capture behavior as stop.

For the direct call case:

#include "Rcpp.h"
using namespace Rcpp; 

//[[Rcpp::export]]         
NumericVector add4(NumericVector x, NumericVector y){
    if(x.size() != y.size()){
        warning("x and y are not the same length!");
    }
    return x + y; 
}
add4(1:5, 1:3)
Warning in add4(1:5, 1:3): x and y are not the same length!
[1] 2 4 6 4 5

For the indirect call case:

#include "Rcpp.h"
using namespace Rcpp; 

// [[Rcpp::export]]
NumericVector internal_function_name3(NumericVector x, NumericVector y){
    if(x.size() != y.size()){
        warning("x and y are not the same length!");
    }
    return x + y; 
}
add5 <- function(x, y){
    if(!is.numeric(x)){
        x <- as.numeric(x)
    }

    do.call(internal_function_name3, list(x, y))
}

add5(1:5, 1:3)
Warning in (function (x, y) : x and y are not the same length!
[1] 2 4 6 4 5

If we want to suppress the call stack info in this warning, we have to drop down to the C-level R API. In particular, we use the Rf_warningcall function, which takes the call as the first argument. By passing a NULL, we suppress the call:

#include "Rcpp.h"
using namespace Rcpp; 

// [[Rcpp::export]]
NumericVector internal_function_name5(NumericVector x, NumericVector y){
    if(x.size() != y.size()){
        Rf_warningcall(R_NilValue, "x and y are not the same length!");
    }
    return x + y; 
}
add6 <- function(x, y){
    if(!is.numeric(x)){
        x <- as.numeric(x)
    }

    do.call(internal_function_name5, list(x, y))
}

add6(1:5, 1)
Warning: x and y are not the same length!
[1] 2 2 3 4 5

A C++11 Implementation

The above methods work, but they are not as clean as their Rcpp::stop and Rcpp::warning counterparts. We can take advantage of C++11 to provide similar functionality for our call-free versions.

Basing our implementation on the C++11 implementation of Rcpp::stop and Rcpp::warning, we can define our own stopNoCall and warningNoCall:

#include "Rcpp.h"
using namespace Rcpp; 

// [[Rcpp::plugins(cpp11)]]
template <typename... Args>
inline void warningNoCall(const char* fmt, Args&&... args ) {
    Rf_warningcall(R_NilValue, tfm::format(fmt, std::forward<Args>(args)... ).c_str());
}

template <typename... Args>
inline void NORET stopNoCall(const char* fmt, Args&&... args) {
    throw Rcpp::exception(tfm::format(fmt, std::forward<Args>(args)... ).c_str(), false);
}

// [[Rcpp::export]]
NumericVector internal_function_name6(NumericVector x, NumericVector y, bool warn){
    if(x.size() != y.size()){
        if(warn){
            warningNoCall("x and y are not the same length!");  
        } else {
            stopNoCall("x and y are not the same length!");
        }
        
    }
    return x + y; 
}
add7 <- function(x, y, warn=TRUE){
    if(!is.numeric(x)){
        x <- as.numeric(x)
    }

    do.call(internal_function_name6, list(x, y, warn))
}
add7(1:5, 1:3, warn=TRUE)
Warning: x and y are not the same length!
[1] 2 4 6 4 5
add7(1:5, 1:3, warn=FALSE)
Error: x and y are not the same length!

Note that we used C++11 variadic templates here – if we wanted to do something similar in C++98, we could use essentially the same pattern, but would need to implement each case individually.

March 27, 2018 12:00 AM

March 07, 2018

Rcpp Gallery

Introducing RcppArrayFire

Introduction

The RcppArrayFire package provides an interface from R to and from the ArrayFire library, an open source library that can make use of GPUs and other hardware accelerators via CUDA or OpenCL.

The official R bindings expose ArrayFire data structures as S4 objects in R, which would require a large amount of code to support all the methods defined in ArrayFire’s C/C++ API. RcppArrayFire, which is derived from RcppFire by Kazuki Fukui, instead follows the lead of packages like RcppArmadillo or RcppEigen to provide seamless communication between R and ArrayFire at the C++ level.

Installation

Please note that RcppArrayFire is developed and tested on Linux systems. There is preliminary support for Mac OS X.

In order to use RcppArrayFire you will need the ArrayFire library and header files. While ArrayFire has been packaged for Debian, I currently prefer using upstream’s binary installer or building from source.

RcppArrayFire is not on CRAN, but you can install the current version via drat:

#install.packages("drat")
drat::addRepo("daqana")
install.packages("RcppArrayFire")

If you have installed ArrayFire in a non-standard directory, you have to use the configure argument --with-arrayfire, e.g.:

install.packages("RcppArrayFire", configure.args = "--with-arrayfire=/opt/arrayfire-3")

A first example

Let’s look at the classical example of calculating π via simulation. The basic idea is to generate a large number of random points within the unit square. An approximation for π can then be calculated from the ratio of points within the unit circle to the total number of points. A vectorized implementation in R might look like this:

piR <- function(N) {
    x <- runif(N)
    y <- runif(N)
    4 * sum(sqrt(x^2 + y^2) < 1.0) / N
}

set.seed(42)
system.time(cat("pi ~= ", piR(10^6), "\n"))
pi ~=  3.13999 
   user  system elapsed 
  0.102   0.009   0.111 

A simple way to use C++ code in R is to use the inline package or cppFunction() from Rcpp, which are both possible with RcppArrayFire. An implementation in C++ using ArrayFire might look like this:

src <- '
double piAF (const int N) {
    array x = randu(N, f32);
    array y = randu(N, f32);
    return 4.0 * sum<float>(sqrt(x*x + y*y) < 1.0) / N;
}'
Rcpp::cppFunction(code = src, depends = "RcppArrayFire", includes = "using namespace af;")

RcppArrayFire::arrayfire_set_seed(42)
cat("pi ~= ", piAF(10^6), "\n") # also used for warm-up 
pi ~=  3.14279 
system.time(piAF(10^6))
   user  system elapsed 
  0.000   0.001   0.000 

Several things are worth noting:

  1. The syntax is almost identical. Besides the need for using types and a different function name when generating random numbers, the argument f32 to randu as well as the float type catches the eye. These instruct ArrayFire to use single precision floats, since not all devices support double precision floating point numbers. If you want to (and your device can) use double precision, you have to specify f64 and double.

  2. The results are not the same since ArrayFire uses a different random number generator.

  3. The speed-up can be quite impressive. However, the first invocation of a function is often not as fast as expected due to the just-in-time compilation used by ArrayFire. This can be circumvented by using a warm-up call with (normally) fewer computations.

Arrays as parameters

Up to now we have only considered simple types like double or int as function parameters and return values. However, we can also use arrays. Consider the case of a European put option that was recently handled with R, Rcpp and RcppArmadillo. The Armadillo-based function from this post reads:

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]

using arma::colvec;
using arma::log;
using arma::normcdf;
using std::sqrt;
using std::exp;


// [[Rcpp::export]]
colvec put_option_pricer_arma(colvec s, double k, double r, double y, double t, double sigma) {
  
  colvec d1 = (log(s / k) + (r - y + sigma * sigma / 2.0) * t) / (sigma * sqrt(t));
  colvec d2 = d1 - sigma * sqrt(t);
  
  // Notice the use of % to represent element wise multiplication
  colvec V = normcdf(-d2) * k * exp(-r * t) - s * exp(-y * t) % normcdf(-d1); 
  
  return V;
}

This function can be applied to a range of spot prices:

put_option_pricer_arma(s = 55:60, 60, .01, .02, 1, .05)
        [,1]
[1,] 5.52021
[2,] 4.58142
[3,] 3.68485
[4,] 2.85517
[5,] 2.11883
[6,] 1.49793

Porting this code to RcppArrayFire is straight forward:

#include <RcppArrayFire.h>

// [[Rcpp::depends(RcppArrayFire)]]

using af::array;
using af::log;
using af::erfc;
using std::sqrt;
using std::exp;

array normcdf(array x) {
  return erfc(-x / sqrt(2.0)) / 2.0;
}

// [[Rcpp::export]]
array put_option_pricer_af(RcppArrayFire::typed_array<f32> s, double k, double r,
                           double y, double t, double sigma) {
  
  array d1 = (log(s / k) + (r - y + sigma * sigma / 2.0) * t) / (sigma * sqrt(t));
  array d2 = d1 - sigma * sqrt(t);
  
  return normcdf(-d2) * k * exp(-r * t) - s * exp(-y * t) * normcdf(-d1); 
}

Compared with the implementations in R, Rcpp and RcppArmadillo the syntax is again almost the same. One exception is that ArrayFire does not contain a function for the cumulative normal distribution function. However, the closely related error function is available.

Since an object of type af::array can contain different data types, the templated wrapper class RcppArrayFire::typed_array<> is used to indicate the desired data type when converting from R to C++. Again single precision floats are used with ArrayFire, which leads to differences of the order of 1e-5 compared to the results from R, Rcpp and RcppArmadillo:

put_option_pricer_af(s = 55:60, 60, .01, .02, 1, .05)
[1] 5.52021 4.58143 3.68485 2.85516 2.11883 1.49793

Performance

The reason to use hardware accelerators is of course the quest for increased performance. How does ArrayFire fare in this respect? Using the same benchmark as in the R, Rcpp and RcppArmadillo comparison:

s <- matrix(seq(0, 100, by = .0001), ncol = 1)
rbenchmark::benchmark(Arma = put_option_pricer_arma(s, 60, .01, .02, 1, .05),
                      AF = put_option_pricer_af(s, 60, .01, .02, 1, .05),
                      order = "relative", 
                      replications = 100)[,1:4]
  test replications elapsed relative
2   AF          100   0.471    1.000
1 Arma          100   5.923   12.575

Here an Nvidia GeForce GT 1030 is used together with ArrayFire’s CUDA backend. With a built-in Intel HD Graphics 520 using the OpenCL backend, the ArrayFire solution is about 6 times faster. Even without a high performance GPU the performance boost from using ArrayFire can be quite impressive. However, the results change dramatically if fewer options are evaluated:

s <- matrix(seq(0, 100, by = 1), ncol = 1)
# use more replications to get run times of more than 10 ms
rbenchmark::benchmark(Arma = put_option_pricer_arma(s, 60, .01, .02, 1, .05),
                      AF = put_option_pricer_af(s, 60, .01, .02, 1, .05),
                      order = "relative", 
                      replications = 1000)[,1:4]
  test replications elapsed relative
1 Arma         1000   0.008    1.000
2   AF         1000   0.123   15.375

But is it realistic to process over a million options at once? Probably not in the way used in the benchmark, where only the spot price is allowed to vary. However, one can alter the function to process not only arrays of spot prices but also arrays of strikes, risk-free rates, etc.:

#include <RcppArrayFire.h>

// [[Rcpp::depends(RcppArrayFire)]]

using af::array;
using af::log;
using af::erfc;
using af::sqrt; // arrayfire function instead of standard function
using af::exp;  // arrayfire function instead of standard function

array normcdf(array x) {
  return erfc(-x / sqrt(2.0)) / 2.0;
}

// [[Rcpp::export]]
array put_option_pricer_af(RcppArrayFire::typed_array<f32> s,
                           RcppArrayFire::typed_array<f32> k, 
                           RcppArrayFire::typed_array<f32> r, 
                           RcppArrayFire::typed_array<f32> y, 
                           RcppArrayFire::typed_array<f32> t, 
                           RcppArrayFire::typed_array<f32> sigma) {
  
  array d1 = (log(s / k) + (r - y + sigma * sigma / 2.0) * t) / (sigma * sqrt(t));
  array d2 = d1 - sigma * sqrt(t);
  
  return normcdf(-d2) * k * exp(-r * t) - s * exp(-y * t) * normcdf(-d1); 
}

Note that ArrayFire does not recycle elements if arrays with non-matching dimensions are combined. In this particular case this means that all arrays must have the same length. One can ensure that by using a data frame for the values:

set.seed(42)
# 1000 * 21 * 3 * 3 * 3 * 3 = 1,701,000 different options
options <- expand.grid(
  s = rnorm(1000, mean = 60, sd = 20),
  k = 50:70,
  r = c(0.01, 0.005, 0.02),
  y = c(0.02, 0.01, 0.04),
  t = c(1, 0.5, 2),
  sigma = c(0.05, 0.025, 0.1)
)
head(within(options,
            p <- put_option_pricer_af(s, k, r, y, t, sigma)))
        s  k    r    y t sigma           p
1 87.4192 50 0.01 0.02 1  0.05 7.44673e-29
2 48.7060 50 0.01 0.02 1  0.05 2.09401e+00
3 67.2626 50 0.01 0.02 1  0.05 2.34556e-09
4 72.6573 50 0.01 0.02 1  0.05 6.84457e-14
5 68.0854 50 0.01 0.02 1  0.05 5.26653e-10
6 57.8775 50 0.01 0.02 1  0.05 2.57750e-03

Conclusion

The ArrayFire library provides a convenient way to use hardware accelerators without the need to write low-level OpenCL or CUDA code. The C++ syntax is actually quite similar to properly vectorized R code. The RcppArrayFire package makes this available to useRs. However, one still has to be careful: using hardware accelerators is not a “silver bullet” due to the inherent memory transfer overhead.

March 07, 2018 12:00 AM

February 28, 2018

RCpp Gallery

Using RcppArmadillo to price European Put Options

Introduction

In the quest for ever faster code, one generally begins exploring ways to integrate C++ with R using Rcpp. This post provides an example of multiple implementations of a European Put Option pricer. The implementations are done in pure R, pure Rcpp using some Rcpp sugar functions, and then in Rcpp using RcppArmadillo, which exposes the incredibly powerful linear algebra library, Armadillo.

Under the Black-Scholes model, the value of a European put option has the closed-form solution

\begin{equation} V = K e^{-rt}\, \Phi(-d_2) - S e^{-yt}\, \Phi(-d_1), \end{equation}

where

\begin{equation} d_1 = \frac{\ln(S/K) + (r - y + \sigma^2/2)\, t}{\sigma \sqrt{t}} \end{equation}

and

\begin{equation} d_2 = d_1 - \sigma \sqrt{t}, \end{equation}

with $S$ the spot price, $K$ the strike, $r$ the risk-free rate, $y$ the dividend yield, $t$ the time to maturity, $\sigma$ the volatility, and $\Phi$ the standard normal CDF.

Armed with the formulas, we can create the pricer using just R.

put_option_pricer <- function(s, k, r, y, t, sigma) {

    d1 <- (log(s / k) + (r - y + sigma^2 / 2) * t) / (sigma * sqrt(t))
    d2 <- d1 - sigma * sqrt(t)

    V <- pnorm(-d2) * k * exp(-r * t) - s * exp(-y * t) * pnorm(-d1)

    V
}

# Valuation with 1 stock price
put_option_pricer(s = 55, 60, .01, .02, 1, .05)
[1] 5.52021
# Valuation across multiple prices
put_option_pricer(s = 55:60, 60, .01, .02, 1, .05)
[1] 5.52021 4.58142 3.68485 2.85517 2.11883 1.49793

Let’s see what we can do with Rcpp. Besides explicitly stating the types of the variables, not much has to change. We can even use the sugar function, Rcpp::pnorm(), to keep the syntax as close to R as possible. Note how we are explicit about the symbols we import from the Rcpp namespace: the basic vector type as well as the (vectorized) ‘Rcpp Sugar’ functions log() and pnorm(). Similarly, we use sqrt() and exp() from the C++ Standard Library for the calls on atomic double variables. With these declarations the code itself is essentially identical to the R code (apart, of course, from requiring both static types and trailing semicolons).

#include <Rcpp.h>
                                        
using Rcpp::NumericVector;
using Rcpp::log;
using Rcpp::pnorm;
using std::sqrt;
using std::exp;

// [[Rcpp::export]]
NumericVector put_option_pricer_rcpp(NumericVector s, double k, double r, double y, double t, double sigma) {

    NumericVector d1 = (log(s / k) + (r - y + sigma * sigma / 2.0) * t) / (sigma * sqrt(t));
    NumericVector d2 = d1 - sigma * sqrt(t);
    
    NumericVector V = pnorm(-d2) * k * exp(-r * t) - s * exp(-y * t) * pnorm(-d1);
    return V;
}

We can call this from R as well:

# Valuation with 1 stock price
put_option_pricer_rcpp(s = 55, 60, .01, .02, 1, .05)
[1] 5.52021
# Valuation across multiple prices
put_option_pricer_rcpp(s = 55:60, 60, .01, .02, 1, .05)
[1] 5.52021 4.58142 3.68485 2.85517 2.11883 1.49793

Finally, let’s look at RcppArmadillo. Armadillo has a number of object types, including mat, colvec, and rowvec. Here, we just use colvec to represent a column vector of prices. By default in Armadillo, * represents matrix multiplication and % is used for element-wise multiplication. We need to make this change to element-wise multiplication in one place, but otherwise the changes are just switching out the types and the sugar functions for Armadillo-specific functions.

Note that the arma::normcdf() function is in the upcoming release of RcppArmadillo, which is 0.8.400.0.0 at the time of writing and still in CRAN’s incoming. It also requires the C++11 plugin.

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]

using arma::colvec;
using arma::log;
using arma::normcdf;
using std::sqrt;
using std::exp;


// [[Rcpp::export]]
colvec put_option_pricer_arma(colvec s, double k, double r, double y, double t, double sigma) {
  
    colvec d1 = (log(s / k) + (r - y + sigma * sigma / 2.0) * t) / (sigma * sqrt(t));
    colvec d2 = d1 - sigma * sqrt(t);
    
    // Notice the use of % to represent element wise multiplication
    colvec V = normcdf(-d2) * k * exp(-r * t) - s * exp(-y * t) % normcdf(-d1); 

    return V;
}

Use from R:

# Valuation with 1 stock price
put_option_pricer_arma(s = 55, 60, .01, .02, 1, .05)
        [,1]
[1,] 5.52021
# Valuation across multiple prices
put_option_pricer_arma(s = 55:60, 60, .01, .02, 1, .05)
        [,1]
[1,] 5.52021
[2,] 4.58142
[3,] 3.68485
[4,] 2.85517
[5,] 2.11883
[6,] 1.49793

Finally, we can run a speed test to see which comes out on top.

s <- matrix(seq(0, 100, by = .0001), ncol = 1)

rbenchmark::benchmark(R = put_option_pricer(s, 60, .01, .02, 1, .05),
                      Arma = put_option_pricer_arma(s, 60, .01, .02, 1, .05),
                      Rcpp = put_option_pricer_rcpp(s, 60, .01, .02, 1, .05), 
                      order = "relative", 
                      replications = 100)[,1:4]
  test replications elapsed relative
2 Arma          100   6.409    1.000
3 Rcpp          100   7.917    1.235
1    R          100   9.091    1.418

Interestingly, Armadillo comes out on top on this (multi-core) machine (as Armadillo uses OpenMP where available in newer versions). But the difference is slender, and there is certainly variation in repeated runs. And the nicest thing about all of this is that it shows off the “embarrassment of riches” that we have in the R and C++ ecosystem for multiple ways of solving the same problem.

February 28, 2018 12:00 AM

November 07, 2017

Alstatr

Julia: Installation and Editors

If you have been following this blog, you may have noticed that I haven't posted any update for more than a year now. The reason is that I've been busy with my research and my work, and I promised not to share anything here until I finished my degree (Master of Science in Statistics). Anyway, at this point I think it's time to share with you what I've learned in the past year. So far, it's been a good year for Statistics, especially in the Philippines: on November 15, 2016, a team of local data scientists made a huge step in Big Data by organizing the first-ever conference on the topic. A few months before that, the 13th National Convention on Statistics, organized by the Philippine Statistics Authority, invited a keynote speaker from Paris21 to tackle Big Data and its use in government.

So without further ado, in this post I would like to share a new programming language which I've used for several months now: Julia. This programming language is by far my favorite; it's a well-thought-out language, as many would say, for many reasons. The first of course is the speed, the second is the grammar, and there are many more. I can't list them all here, but I suggest you visit the official website and try it for yourself.

Installation

The installation of this program is straightforward: simply go to Julia's official download page and download the binaries for your operating system. Alternatively, you can install Julia by downloading JuliaPro from the Julia Computing products. This will set up everything you need, including the GitHub Atom editor out of the box. After installation, the first time you load the command-line version of the program, you'll have the following window:

Working with the command-line version is actually fun, and personally I think Julia has the best command-line interface compared to R and Python in terms of features. For example, you can shift to shell mode by simply pressing ; at the Julia prompt, and use ? to activate help mode. It also has autocompletion by pressing Tab after entering the first few letters of the syntax. The LaTeX UTF autocompletion is also one of the best features, and almost any symbol/character can be used as a variable, even emoticons, as shown below:

Editor

While Julia's command-line version is loaded with good features, working with huge projects calls for a better front-end editor. Like RStudio for R and PyCharm for Python, Julia can run in Jupyter (also available for R and Python), the GitHub Atom editor, and Microsoft Visual Studio Code.

Julia in Jupyter Notebook

Julia in Github Atom Editor

Julia in Microsoft Visual Studio Code

To install the Jupyter notebook, simply run the following codes:
In the screenshot above, I tweaked the theme of the notebook using the script from this repository. As mentioned, to set up Julia in the GitHub Atom editor, I recommend downloading JuliaPro, or you can follow the instructions on the Juno Lab website. After installation, you can add Atom extensions like Minimap, which is not available by default. In case you are interested, the syntax highlighter I used in the screenshot is Gruvbox Plus.

Further, to set up Julia in Microsoft Visual Studio Code, open the program, press Ctrl+P, paste ext install language-julia and hit Enter. This will install the Julia extension for Visual Studio Code. After installation, you can load the Julia REPL by pressing Ctrl+Shift+P (Windows) or Cmd+Shift+P (Mac), entering julia start repl, and pressing Enter. If there is an error, the path may need to be specified properly. To do this, go to Preferences > Settings. Then in the .json user file settings, enter the following:
Of course, you need to check the path properly by replacing Julia-0.6.0-rc3 (Windows) or Julia-0.6.app (Mac) with the desired version of your Julia, and C:/Users/MyName with your desired path. Further, I use the following setting in my .json file to adjust my Minimap similar to the screenshot above.
Lastly, to toggle the cursor's focus between the script pane and the integrated Julia terminal using Ctrl+`, I use the following keybindings (go to Preferences > Keyboard Shortcuts > keybindings.json).
For more on this topic visit the official GitHub page. The three editors above have their advantages and disadvantages. My primary editor, however, is Visual Studio Code, because it is fast and loaded with features as well. The major limitation of this editor is the LaTeX UTF autocompletion, which is available in the Atom editor. There are third-party packages like Unicode LaTeX that can do the job indirectly, or alternatively you can generate the LaTeX UTF using the console (the integrated Julia terminal in Visual Studio Code), but I think this is not a big deal, and maybe in the near future this capability will be added. On the other hand, the Atom editor has more features for Julia, for example the plot pane, the workspace, and many more. The only problem is that it's kind of slow, especially when working with several datasets in your workspace, plus plots, plus very long lines of code; scrolling through it is not smooth. Nevertheless, let's be positive and hope that more improvements are coming to these editors. Finally, for those who want to start using Julia, visit the Official Documentation and Learning Materials, ask questions on Julia Discourse, and join the Julia Gitter.

by Al Asaad (noreply@blogger.com) at November 07, 2017 04:35 AM

April 15, 2017

Alstatr

R and Python: Gradient Descent

One of the problems often dealt with in Statistics is the minimization of an objective function. And unlike linear models, there is no analytical solution for models that are nonlinear in the parameters, such as logistic regression, neural networks, and nonlinear regression models (like the Michaelis-Menten model). In this situation, we have to use mathematical programming or optimization. One popular optimization algorithm is gradient descent, which we're going to illustrate here. To start with, let's consider a simple function with a closed-form solution given by \begin{equation} f(\beta) \triangleq \beta^4 - 3\beta^3 + 2. \end{equation} We want to minimize this function with respect to $\beta$. The quick solution, as calculus taught us, is to compute the first derivative of the function, that is \begin{equation} \frac{\text{d}f(\beta)}{\text{d}\beta}=4\beta^3-9\beta^2. \end{equation} Setting this to 0 to obtain the stationary point gives us \begin{align} \frac{\text{d}f(\beta)}{\text{d}\beta}&\overset{\text{set}}{=}0\nonumber\\ 4\hat{\beta}^3-9\hat{\beta}^2&=0\nonumber\\ 4\hat{\beta}^3&=9\hat{\beta}^2\nonumber\\ 4\hat{\beta}&=9\nonumber\\ \hat{\beta}&=\frac{9}{4}. \end{align}
The following plot shows the minimum of the function at $\hat{\beta}=\frac{9}{4}$ (red line in the plot below).

Now let's consider minimizing this function using gradient descent with the following algorithm:
  1. Initialize $\mathbf{x}_{r},r=0$
  2. while $\lVert \mathbf{x}_{r}-\mathbf{x}_{r+1}\rVert > \nu$
  3.         $\mathbf{x}_{r+1}\leftarrow \mathbf{x}_{r} - \gamma\nabla f(\mathbf{x}_r)$
  4.         $r\leftarrow r + 1$
  5. end while
  6. return $\mathbf{x}_{r}$ and $r$
where $\nabla f(\mathbf{x}_r)$ is the gradient of the cost function, $\gamma$ is the learning-rate parameter of the algorithm, and $\nu$ is the precision parameter. For the function above, let the initial guess be $\hat{\beta}_0=4$ and $\gamma=.001$ with $\nu=.00001$. Then $\nabla f(\hat{\beta}_0)=112$, so that \[\hat{\beta}_1=\hat{\beta}_0-.001(112)=3.888.\] And $|\hat{\beta}_1 - \hat{\beta}_0| = 0.112> \nu$. Repeat the process until at some $r$, $|\hat{\beta}_{r}-\hat{\beta}_{r+1}| \ngtr \nu$. It will turn out that 350 iterations are needed to satisfy the desired inequality, the plot of which is in the following figure with estimated minimum $\hat{\beta}_{350}=2.250483\approx\frac{9}{4}$.
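
A minimal R sketch of this loop (not the original script linked from the post; the variable names are illustrative), using the same starting value, learning rate and precision:

grad_f <- function(b) 4 * b^3 - 9 * b^2   # gradient of f(beta) = beta^4 - 3*beta^3 + 2

beta_old <- 4          # initial guess beta_0
gamma    <- 0.001      # learning rate
nu       <- 0.00001    # precision
r        <- 0          # iteration counter

repeat {
  beta_new <- beta_old - gamma * grad_f(beta_old)   # beta_{r+1} = beta_r - gamma * grad f(beta_r)
  r <- r + 1
  if (abs(beta_new - beta_old) <= nu) break         # stop once the update falls below nu
  beta_old <- beta_new
}
c(estimate = beta_new, iterations = r)              # the estimate approaches 9/4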

Obviously the convergence is slow, and we can adjust this by tuning the learning-rate parameter. For example, if we increase it to $\gamma=.01$ (change gamma to .01 in the code above), the algorithm converges at the 42nd iteration. To support that claim, see the steps of its gradient in the plot below.

If we change the starting value from 4 to .1 (change beta_new to .1) with $\gamma=.01$, the algorithm converges at the 173rd iteration with estimate $\hat{\beta}_{173}=2.249962\approx\frac{9}{4}$ (see the plot below).

Now let's consider another function known as Rosenbrock defined as \begin{equation} f(\mathbf{w})\triangleq(1 - w_1) ^ 2 + 100 (w_2 - w_1^2)^2. \end{equation} The gradient is \begin{align} \nabla f(\mathbf{w})&=[-2(1 - w_1) - 400(w_2 - w_1^2) w_1]\mathbf{i}+200(w_2-w_1^2)\mathbf{j}\nonumber\\ &=\left[\begin{array}{c} -2(1 - w_1) - 400(w_2 - w_1^2) w_1\\ 200(w_2-w_1^2) \end{array}\right]. \end{align} Let the initial guess be $\hat{\mathbf{w}}_0=\left[\begin{array}{c}-1.8\\-.8\end{array}\right]$, $\gamma=.0002$, and $\nu=.00001$. Then $\nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -2914.4\\-808.0\end{array}\right]$. So that \begin{equation}\nonumber \hat{\mathbf{w}}_1=\hat{\mathbf{w}}_0-\gamma\nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -1.21712 \\-0.63840\end{array}\right]. \end{equation} And $\lVert\hat{\mathbf{w}}_0-\hat{\mathbf{w}}_1\rVert=0.6048666>\nu$. Repeat the process until at some $r$, $\lVert\hat{\mathbf{w}}_r-\hat{\mathbf{w}}_{r+1}\rVert\ngtr \nu$. It will turn out that 23,374 iterations are needed for the desired inequality with estimate $\hat{\mathbf{w}}_{23375}=\left[\begin{array}{c} 0.9464841 \\0.8956111\end{array}\right]$, the contour plot is depicted in the figure below.
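
The same loop carries over to the vector-valued case; a compact sketch for the Rosenbrock function with the settings above (again only an illustration, not the original script):

grad_f <- function(w) c(-2 * (1 - w[1]) - 400 * (w[2] - w[1]^2) * w[1],   # partial derivative w.r.t. w1
                        200 * (w[2] - w[1]^2))                            # partial derivative w.r.t. w2

w_old <- c(-1.8, -0.8)   # initial guess
gamma <- 0.0002          # learning rate
nu    <- 0.00001         # precision

repeat {
  w_new <- w_old - gamma * grad_f(w_old)
  if (sqrt(sum((w_new - w_old)^2)) <= nu) break   # Euclidean norm of the update
  w_old <- w_new
}
w_new   # ends up close to the reported estimate near (0.946, 0.896)
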
Notice that I did not use ggplot for the contour plot; this is because the plot needs to be updated 23,374 times just to accommodate the arrows for the trajectory of the gradient vectors, and ggplot is just slow for that. Finally, we can also visualize the gradient points on the surface, as shown in the following figure.
In a future blog post, I hope to apply this algorithm to statistical models like linear/nonlinear regression models for a simple illustration.

by Al Asaad (noreply@blogger.com) at April 15, 2017 12:31 PM

September 29, 2016

Statistical Modelling

Time-dependent ROC methodology to evaluate the predictive accuracy of semiparametric multi-state models in the presence of competing risks: An application to peritoneal dialysis programme

Abstract:

The evaluation of peritoneal dialysis (PD) programmes requires the use of statistical methods that suit the complexity of such programmes. Multi-state regression models taking competing risks into account are a good example of suitable approaches. In this work, multi-state structured additive regression (STAR) models combined with penalized splines (P-splines) are proposed to evaluate peritoneal dialysis programmes. These models are very flexible since they may consider smooth estimates of baseline transition intensities and the inclusion of time-varying and smooth covariate effects at each transition. A key issue in survival analysis is the quantification of the time-dependent predictive accuracy of a given regression model, which is typically assessed using receiver operating characteristic (ROC)-based methodologies. The main objective of the present study is to adapt the concept of the time-dependent ROC curve, and its corresponding area under the curve (AUC), to a multi-state competing risks framework. All statistical methodologies discussed in this work were applied to PD survival data. Using a multi-state competing risks framework, this study explored the effects of major clinical covariates on survival such as age, sex, diabetes and previous renal replacement therapy. Such a multi-state model was composed of one transient state (peritonitis) and several absorbing states (death, transfer to haemodialysis and renal transplantation). The application of STAR models combined with time-dependent ROC curves revealed important conclusions not previously reported in the nephrology literature when using standard statistical methodologies. For practical application, all the statistical methods proposed in this article were implemented in R and we wrote and made available a script named NestedCompRisks.

by Teixeira, L., Cadarso-Suarez, C., Rodrigues, A., Mendonca, D. at September 29, 2016 05:16 AM

A multivariate single-index model for longitudinal data

Abstract:

Index measures are commonly used in medical research and clinical practice, primarily for quantification of health risks in individual subjects or patients. The utility of an index measure is ultimately contingent on its ability to predict health outcomes. Construction of medical indices has largely been based on heuristic arguments, although the acceptance of a new index typically requires objective validation, preferably with multiple outcomes. In this article, we propose an analytical tool for index development and validation. We use a multivariate single-index model to ascertain the best functional form for risk index construction. Methodologically, the proposed model represents a multivariate extension of the traditional single-index models. Such an extension is important because it assures that the resultant index simultaneously works for multiple outcomes. The model is developed in the general framework of longitudinal data analysis. We use penalized cubic splines to characterize the index components while leaving the other subject characteristics as additive components. The splines are estimated directly by penalizing nonlinear least squares, and we show that the model can be implemented using existing software. To illustrate, we examine the formation of an adiposity index for prediction of systolic and diastolic blood pressure in children. We assess the performance of the method through a simulation study.

by Wu, J., Tu, W. at September 29, 2016 05:16 AM

Semi-parametric frailty model for clustered interval-censored data

Abstract:

The shared frailty model is a popular tool to analyze correlated right-censored time-to-event data. In the shared frailty model, the latent frailty is assumed to be shared by the members of a cluster and is assigned a parametric distribution, typically a gamma distribution due to its conjugacy. In the case of interval-censored time-to-event data, the inclusion of frailties results in complicated intractable likelihoods. Here, we propose a flexible frailty model for analyzing such data by assuming a smooth semi-parametric form for the conditional time-to-event distribution and a parametric or a flexible form for the frailty distribution. The results of a simulation study suggest that the estimation of regression parameters is robust to misspecification of the frailty distribution (even when the frailty distribution is multimodal or skewed). Given sufficiently large sample sizes and number of clusters, the flexible approach produces smooth and accurate posterior estimates for the baseline survival function and for the frailty density, and it can correctly detect and identify unusual frailty density forms. The methodology is illustrated using dental data from the Signal Tandmobiel® study.

by Yavuz, A. C., Lambert, P. at September 29, 2016 05:16 AM

Bayesian dynamic modelling to assess differential treatment effects on panic attack frequencies

Abstract:

To represent the complex structure of intensive longitudinal data of multiple individuals, we propose a hierarchical Bayesian Dynamic Model (BDM). This BDM is a generalized linear hierarchical model where the individual parameters do not necessarily follow a normal distribution. The model parameters can be estimated on the basis of relatively small sample sizes and in the presence of missing time points. We present the BDM and discuss the model identification, convergence and selection. The use of the BDM is illustrated using data from a randomized clinical trial to study the differential effects of three treatments for panic disorder. The data involves the number of panic attacks experienced weekly (73 individuals, 10–52 time points) during treatment. Presuming that the counts are Poisson distributed, the BDM considered involves a linear trend model with an exponential link function. The final model included a moving average parameter and an external variable (duration of symptoms pre-treatment). Our results show that cognitive behavioural therapy is less effective in reducing panic attacks than serotonin selective re-uptake inhibitors or a combination of both. Post hoc analyses revealed that males show a slightly higher number of panic attacks at the onset of treatment than females.

by Krone, T., Albers, C., Timmerman, M. at September 29, 2016 05:16 AM

July 18, 2016

R you ready?

Populating data frame cells with more than one value

Data frames are lists

Most R users will know that data frames are lists. You can easily verify that a data frame is a list by typing

d <- data.frame(id=1:2, name=c("Jon", "Mark"))
d
 id name
1 1 Jon
2 2 Mark
is.list(d)
[1] TRUE

However, data frames are lists with some special properties; for example, all entries in the list must have the same length (here 2). You can find a nice description of the differences between lists and data frames here. Accessing the first column of d, we find that it contains a vector (and a factor in the case of the column name). Note that [[ ]] is an operator to select a list element. As data frames are lists, it works here as well.

is.vector(d[[1]])
[1] TRUE

Data frame columns can contain lists

For a long time, I was unaware of the fact that data frames may also contain lists as columns instead of vectors. For example, let’s assume Jon’s children are Mary and James, and Mark’s children are called Greta and Sally. Their names are stored in a list with two elements. We can add them to the data frame like this:

d$children <-  list(c("Mary", "James"), c("Greta", "Sally"))
d
 id name children
1 1 Jon Mary, James
2 2 Mark Greta, Sally

A single data frame entry in the column children now contains more than one value. Given that the column is a list, not a vector, we cannot proceed as usual when modifying an entry of the column. For example, to change Jon’s children, we cannot do

> d[1 , "children"] <- c("Mary", "James", "Thomas")

Error in `[<-.data.frame`(`*tmp*`, 1, "children", value = c("Mary", "James", :
replacement has 3 rows, data has 1

Taking into account the list structure of the column, we can type the following to change the values in a single cell.

d[1 , "children"][[1]] <- list(c("Mary", "James", "Thomas"))

# or also

d$children[1] <- list(c("Mary", "James", "Thomas"))
d
 id name children
1 1 Jon Mary, James, Thomas
2 2 Mark Greta, Sally

You can also create a data frame having a list as a column using the data.frame function, but with a little tweak. The list column has to be wrapped inside the function I. This will protect it from several conversions taking place in data.frame (see the ?I documentation).

d <- data.frame(id = 1:2,
                   name = c("Jon", "Mark"),
                   children = I(list(c("Mary", "James"),
                                     c("Greta", "Sally")))
                )

This is an interesting feature, which gives me a deeper understanding of what a data frame is. But when exactly would I want to use it? I have not encountered the need to use it very often yet (though of course there may be plenty of situations where it makes sense). But today I had a case where this feature seemed particularly useful.

Converting lists and data frames to JSON

I had two separate types of information: one stored in a data frame and the other one in a list. Referring to the example above, I had

d <- data.frame(id=1:2, name=c("Jon", "Mark"))
d
 id name
1 1 Jon
2 2 Mark

and

ch <- list(c("Mary", "James"), c("Greta", "Sally"))
ch
[[1]]
[1] "Mary" "James"

[[2]]
[1] "Greta" "Sally"

I needed to return an array of JSON objects which look like this.

[
  {
    "id": 1,
    "name": "Jon",
    "children": ["Mary", "James"]
  }, 
  {
    "id": 2,
    "name": "Mark",
    "children": ["Greta", "Sally"]
  }
]

Working with the superb jsonlite package to convert R to JSON, I could do the following to get the result above.

library(jsonlite)

l <- split(d, seq(nrow(d))) # convert data frame rows to list
l <- unname(l)              # remove list names
for (i in seq_along(l))     # add element from ch to list
    l[[i]] <- c(l[[i]], children=ch[i])

toJSON(l, pretty=T, auto_unbox = T) # convert to JSON

The results are correct, but getting there involved quite a number of tedious steps. These can be avoided by directly placing the list into a column of the data frame. Then jsonlite::toJSON takes care of the rest.

d$children <- list(c("Mary", "James"), c("Greta", "Sally"))
toJSON(d, pretty=T, auto_unbox = T)
[
  {
    "id": 1,
    "name": "Jon",
    "children": ["Mary", "James"]
  },
  {
    "id": 2,
    "name": "Mark",
    "children": ["Greta", "Sally"]
  }
]

Nice :) What we do here is basically create the same nested list structure as above, only now it is disguised as a data frame. However, this approach is much more convenient.

by markheckmann at July 18, 2016 05:01 PM

December 27, 2015

Alstatr

R: Principal Component Analysis on Imaging

Ever wonder what's the mathematics behind face recognition on most gadgets like digital cameras and smartphones? Well, for the most part it has something to do with statistics. One statistical tool that is capable of doing such a feature is Principal Component Analysis (PCA). In this post, however, we will not do face recognition (sorry to disappoint you), as we reserve this for a future post while I'm still doing research on it. Instead, we go through its basic concept and use it for data reduction on the spectral bands of an image using R.

Let's view it mathematically

Consider a line $L$ in a parametric form described as a set of all vectors $k\cdot\mathbf{u}+\mathbf{v}$ parameterized by $k\in \mathbb{R}$, where $\mathbf{v}$ is a vector orthogonal to a normalized vector $\mathbf{u}$. Below is the graphical equivalent of the statement:
So if given a point $\mathbf{x}=[x_1,x_2]^T$, the orthogonal projection of this point on the line $L$ is given by $(\mathbf{u}^T\mathbf{x})\mathbf{u}+\mathbf{v}$. Graphically, we mean

$Proj$ is the projection of the point $\mathbf{x}$ on the line, whose position is defined by the scalar $\mathbf{u}^{T}\mathbf{x}$. Therefore, if we consider $\mathbf{X}=[X_1, X_2]^T$ to be a random vector, then the random variable $Y=\mathbf{u}^T\mathbf{X}$ describes the variability of the data in the direction of the normalized vector $\mathbf{u}$. So $Y$ is a linear combination of $X_i, i=1,2$. Principal component analysis identifies linear combinations of the original variables $\mathbf{X}$ that contain most of the information, in the sense of variability, contained in the data. The general assumption is that useful information is proportional to the variability. PCA is used for data dimensionality reduction and for interpretation of data. (Ref 1. Bajorski, 2012)

To better understand this, consider two dimensional data set, below is the plot of it along with two lines ($L_1$ and $L_2$) that are orthogonal to each other:
If we project the points orthogonally to both lines we have,

So if the normalized vector $\mathbf{u}_1$ defines the direction of $L_1$, then the variability of the points on $L_1$ is described by the random variable $Y_1=\mathbf{u}_1^T\mathbf{X}$. Likewise, if $\mathbf{u}_2$ is a normalized vector that defines the direction of $L_2$, then the variability of the points on this line is described by the random variable $Y_2=\mathbf{u}_2^T\mathbf{X}$. The first principal component is the one with maximum variability. In this case, we can see that $Y_2$ is more variable than $Y_1$, since the points projected on $L_2$ are more dispersed than those on $L_1$. In practice, the variances of the linear combinations $Y_i = \mathbf{u}_i^T\mathbf{X}, i=1,2,\cdots,p$ are maximized sequentially, so that $Y_1$ is the linear combination of the first principal component, $Y_2$ is the linear combination of the second principal component, and so on. Further, the estimate of the direction vector $\mathbf{u}$ is simply the normalized eigenvector $\mathbf{e}$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the original variable $\mathbf{X}$, and the variability explained by the principal component is the corresponding eigenvalue $\lambda$. For more details on the theory of PCA, refer to (Bajorski, 2012) in Reference 1 below.

As promised, we will do dimensionality reduction using PCA. We will use the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data from (Bajorski, 2012); you can also use other locations of AVIRIS data that can be downloaded here. However, since in most cases the AVIRIS data contains thousands of bands, for simplicity we will stick with the data given in (Bajorski, 2012), as it was cleaned down to 152 bands only.

What are spectral bands?

In imaging, spectral bands refer to the third dimension of the image, usually denoted as $\lambda$. For example, an RGB image contains red, green and blue bands, as shown below, along with the first two dimensions $x$ and $y$ that define the resolution of the image.

These are a few of the bands that are visible to our eyes; there are other bands that are not visible to us, like infrared, and many others in the electromagnetic spectrum. That is why in most cases AVIRIS data contains a huge number of bands, each capturing different characteristics of the image. Below is the proper description of the data.

Data

The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), is a sensor collecting spectral radiance in the range of wavelengths from 400 to 2500 nm. It has been flown on various aircraft platforms, and many images of the Earth’s surface are available. A 100 by 100 pixel AVIRIS image of an urban area in Rochester, NY, near the Lake Ontario shoreline is shown below. The scene has a wide range of natural and man-made material including a mixture of commercial/warehouse and residential neighborhoods, which adds a wide range of spectral diversity. Prior to processing, invalid bands (due to atmospheric water absorption) were removed, reducing the overall dimensionality to 152 bands. This image has been used in Bajorski et al. (2004) and Bajorski (2011a, 2011b). The first 152 values in the AVIRIS Data represent the spectral radiance values (a spectral curve) for the top left pixel. This is followed by spectral curves of the pixels in the first row, followed by the next row, and so on. (Ref. 1 Bajorski, 2012)

To load the data, run the following code:

The above code uses the EBImage package, which can be installed as described in my previous post.
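
As a rough sketch of that step (the file name and the plain-text layout are assumptions; the scene is 100 by 100 pixels with one 152-value spectral curve per pixel), the matrix dat.mat used below could be built like this:

dat     <- scan("aviris_rochester.txt")            # hypothetical file name for the AVIRIS data
dat.mat <- matrix(dat, ncol = 152, byrow = TRUE)   # one row per pixel, 152 spectral bands
dim(dat.mat)                                       # 10000 x 152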

Why do we need to reduce the dimension of the data?

Before we jump into our analysis, you may ask: why? Well, sometimes it's just difficult to do analysis on high-dimensional data, especially when it comes to interpreting it. This is because there are dimensions that aren't significant (redundancy, for example), which adds to the difficulty of the analysis. So in order to deal with this, we remove those nuisance dimensions and work with the significant ones.

To perform PCA in R, we use the function princomp as seen below:
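
A minimal sketch of that call (dat.mat is the pixel-by-band matrix from above; the object name pca is introduced here for illustration):

pca <- princomp(dat.mat)    # PCA via the eigendecomposition of the covariance matrix
str(pca, max.level = 1)     # list with sdev, loadings, center, scale, n.obs, scores, call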

The princomp object is a list; we will describe selected components below. The others can be found in the documentation of the function by executing ?princomp.
  • sdev - standard deviation, the square root of the eigenvalues $\lambda$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, dat.mat;
  • loadings - eigenvectors $\mathbf{e}$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, dat.mat;
  • scores - the principal component scores.
Recall that the objective of PCA is to find a linear combination $Y=\mathbf{u}^T\mathbf{X}$ that maximizes the variance $Var(Y)$. From the output, the estimate of the components of $\mathbf{u}$ is given by the entries of the loadings, which is a matrix of eigenvectors whose columns correspond to the eigenvectors of the successive principal components. That is, if the first principal component is given by $Y_1=\mathbf{u}_1^T\mathbf{X}$, then the estimate of $\mathbf{u}_1$, which is the eigenvector $\mathbf{e}_1$, is the set of coefficients obtained from the first column of the loadings. The explained variability of the first principal component is the square of the first standard deviation sdev, the explained variability of the second principal component is the square of the second standard deviation sdev, and so on. Now let's interpret the loadings (coefficients) of the first three principal components. Below is the plot of these,
Based on the plot above, the coefficients of the first principal component (PC1) are almost all negative. On closer inspection, the variability in this principal component is mainly explained by the weighted average of the radiance of spectral bands 35 to 100. Analogously, PC2 mainly represents the variability of the weighted average of the radiance of spectral bands 1 to 34. Further, the fluctuation of the coefficients of PC3 makes it difficult to tell which bands contribute most to its variability. Aside from examining the loadings, another way to see the impact of the PCs is through the impact plot, where the impact curves $\sqrt{\lambda_j}\mathbf{e}_j$ are plotted; I want you to explore that.

Moving on, let's investigate the percent of variability in $X_i$ explained by the $j$th principal component, given by \begin{equation}\nonumber \frac{\lambda_j\cdot e_{ij}^2}{s_{ii}}, \end{equation} where $s_{ii}$ is the estimated variance of $X_i$. Below is the percent of explained variability in $X_i$ for the first three principal components, including the cumulative percent variability (the sum of PC1, PC2, and PC3),
For the first 33 bands, PC2 accounts for about 90 percent of the explained variability, as seen in the above plot, and it still contributes substantially to bands 102 to 152. On the other hand, from bands 37 to 100, PC1 explains almost all the variability, with PC2 and PC3 explaining only 0 to 1 percent. The sum of the percentages of explained variability of these principal components is indicated by the orange line in the above plot, which is the cumulative percent variability.

To wrap up this section, here is the percentage of the explained variability of the first 10 PCs.

Table 1: Variability Explained by the First Ten Principal Components for the AVIRIS data.

   PC1     PC2    PC3    PC4    PC5    PC6    PC7    PC8    PC9   PC10
82.057  17.176  0.320  0.182  0.094  0.065  0.037  0.029  0.014  0.005

The above values were obtained by noting that the variability explained by a principal component is simply the eigenvalue (the square of the sdev) of the variance-covariance matrix $\mathbf{\Sigma}$ of the original variable $\mathbf{X}$. Hence the percentage of variability explained by the $j$th PC is equal to its corresponding eigenvalue $\lambda_j$ divided by the overall variability, which is the sum of the eigenvalues, $\sum_{j=1}^{p}\lambda_j$, as we see in the following code,
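
A sketch of that computation (pca is the princomp fit from above; eigenvalues is an illustrative name):

eigenvalues <- pca$sdev^2                               # the eigenvalues lambda_j
round(100 * eigenvalues / sum(eigenvalues), 3)[1:10]    # percent variability, cf. Table 1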

Stopping Rules

Given the list of percentage of variability explained by the PCs in Table 1, how many principal components should we take into account that would best represent the variability of the original data? To answer that, we introduce the following stopping rules that will guide us on deciding the number of PCs:
  1. Scree plot;
  2. Simple fair-share;
  3. Broken-stick; and,
  4. Relative broken-stick.
The scree plot is the plot of the variability of the PCs, that is, the plot of the eigenvalues, where we look for an elbow or a sudden drop of the eigenvalues. For our example we have
Based on the elbow shape, we therefore retain the first two principal components. However, if the eigenvalues differ by orders of magnitude, it is recommended to use the logarithmic scale, which is illustrated below,
Unfortunately, sometimes this won't work, as we can see here; it's just difficult to determine where the elbow is. The succeeding discussions on the last three stopping rules are based on (Bajorski, 2012). The simple fair-share stopping rule identifies the largest $k$ such that $\lambda_k$ is larger than its fair share, that is, larger than $(\lambda_1+\lambda_2+\cdots+\lambda_p)/p$. To illustrate this, consider the following:
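
A sketch of the simple fair-share rule (eigenvalues as computed above):

fair_share <- sum(eigenvalues) / length(eigenvalues)   # (lambda_1 + ... + lambda_p) / p
max(which(eigenvalues > fair_share))                   # largest k whose eigenvalue exceeds its fair share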

Thus, we need to stop at the second principal component.

If one is concerned that the above method produces too many principal components, a broken-stick rule can be used. The rule identifies the largest $k$ such that $\lambda_j/(\lambda_1+\lambda_2+\cdots +\lambda_p)>a_j$ for all $j\leq k$, where \begin{equation}\nonumber a_j = \frac{1}{p}\sum_{i=j}^{p}\frac{1}{i},\quad j =1,\cdots, p. \end{equation} Let's try it,
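
A sketch of the broken-stick rule (eigenvalues as above):

p  <- length(eigenvalues)
a  <- sapply(1:p, function(j) sum(1 / (j:p)) / p)   # thresholds a_j
ok <- eigenvalues / sum(eigenvalues) > a            # lambda_j / (lambda_1 + ... + lambda_p) > a_j ?
max(which(cumsum(!ok) == 0))                        # largest k with the rule holding for all j <= k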

The above result coincides with the first two stopping rules. The drawback of the simple fair-share and broken-stick rules is that they do not work well when the eigenvalues differ by orders of magnitude. In such cases, we use the relative broken-stick rule, where we analyze $\lambda_j$ as the first eigenvalue in the set $\lambda_j\geq \lambda_{j+1}\geq\cdots\geq\lambda_{p}$, where $j < p$. The dimensionality $k$ is chosen as the largest value such that $\lambda_j/(\lambda_j+\cdots +\lambda_p)>b_j$ for all $j\leq k$, where \begin{equation}\nonumber b_j = \frac{1}{p-j+1}\sum_{i=1}^{p-j+1}\frac{1}{i}. \end{equation} Applying this to the data we have,
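
And a sketch of the relative broken-stick rule (eigenvalues as above):

p   <- length(eigenvalues)
b   <- sapply(1:p, function(j) sum(1 / (1:(p - j + 1))) / (p - j + 1))   # thresholds b_j
rel <- sapply(1:p, function(j) eigenvalues[j] / sum(eigenvalues[j:p]))   # lambda_j over its tail sum
max(which(cumsum(!(rel > b)) == 0))                                      # largest k satisfying the rule
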
According to the numerical output, the first 34 principal components are enough to represent the variability of the original data.

Principal Component Scores

The principal component scores are the new data set obtained from the linear combinations $Y_j=\mathbf{e}_j^T(\mathbf{x}-\bar{\mathbf{x}}), j = 1,\cdots, p$. If we use the first three stopping rules, then below are the scores (shown as images) of PC1 and PC2,
If we base the decision on the relative broken-stick rule, then we retain the first 34 PCs; below are the corresponding scores (shown as images).

Residual Analysis

Of course, when doing PCA there are errors to be considered, unless one returns all the PCs, but that would not make any sense: why would someone apply PCA and still take into account all the dimensions? An overview of the errors in PCA, without going through the theory, is that the overall error is simply the variability explained by the principal components that were excluded, i.e. those beyond the retained ones.

Reference

by Al Asaad (noreply@blogger.com) at December 27, 2015 01:52 AM

R: k-Means Clustering on Imaging

Enough with the theory we recently published; let's take a break and have fun with an application of Statistics used in Data Mining and Machine Learning: k-Means Clustering.
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. (Wikipedia, Ref 1.)
We will apply this method to an image, wherein we group the pixels into k different clusters. Below is the image that we are going to use,
Colorful Bird From Wall321
We will utilize the following packages for input and output:
  1. jpeg - Read and write JPEG images; and,
  2. ggplot2 - An implementation of the Grammar of Graphics.

Download and Read the Image

Let's get started by downloading the image to our workspace and telling R that our data is a JPEG file.
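
As a rough sketch of that step (the local file name is a placeholder; download.file() could be used to fetch the image first):

library(jpeg)                   # read and write JPEG images
img <- readJPEG("bird.jpg")     # placeholder file name for the downloaded Wall321 image
dim(img)                        # rows x columns x 3 (RGB channels)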

Cleaning the Data

Extract the necessary information from the image and organize this for our computation:

The image is represented by a large array of pixels with dimensions rows by columns by channels: red, green, and blue, or RGB.
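
One convenient arrangement, sketched below with illustrative names, is one row per pixel holding its coordinates and RGB values:

img_dim <- dim(img)                            # rows, columns, channels
img_df  <- data.frame(
  x = rep(1:img_dim[2], each = img_dim[1]),    # column index of each pixel
  y = rep(img_dim[1]:1, img_dim[2]),           # row index, flipped so the image is upright
  R = as.vector(img[, , 1]),                   # red channel
  G = as.vector(img[, , 2]),                   # green channel
  B = as.vector(img[, , 3])                    # blue channel
)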

Plotting

Plot the original image using the following codes:
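
A ggplot2 sketch of this step (assuming the img_df layout above; plotting one point per pixel can be slow for large images):

library(ggplot2)
ggplot(img_df, aes(x = x, y = y)) +
  geom_point(colour = rgb(img_df$R, img_df$G, img_df$B)) +   # each pixel in its own colour
  coord_equal() +
  labs(title = "Original image")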

Clustering

Apply k-Means clustering on the image:
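
A sketch of the clustering step, run on the RGB values only (k = 6 here; the seed is arbitrary):

set.seed(101)                                           # arbitrary seed for reproducibility
k  <- 6
km <- kmeans(img_df[, c("R", "G", "B")], centers = k)   # cluster pixels in colour space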

Plot the clustered colours:
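
And a sketch of recolouring every pixel with the centre of its cluster:

centres <- km$centers[km$cluster, ]   # one RGB centre per pixel
ggplot(img_df, aes(x = x, y = y)) +
  geom_point(colour = rgb(centres[, "R"], centres[, "G"], centres[, "B"])) +
  coord_equal() +
  labs(title = paste("k-Means clustering with k =", k))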

Possible clusters of pixels on different k-Means:

Table 1: Different k-Means Clustering (original image and results for k = 6, 5, 4, 3, 2).

I suggest you try it!

Reference

  1. K-means clustering. Wikipedia. Retrieved September 11, 2014.

by Al Asaad (noreply@blogger.com) at December 27, 2015 01:52 AM

May 12, 2015

Chris Lawrence

That'll leave a mark

Here’s a phrase you never want to see in print (in a legal decision, no less) pertaining to your academic research: “The IRB process, however, was improperly engaged by the Dartmouth researcher and ignored completely by the Stanford researchers.”

Whole thing here; it’s a doozy.

by Chris Lawrence at May 12, 2015 12:00 AM

April 14, 2015

R you ready?

Beautiful plots while simulating loss in two-part procrustes problem

Today I was working on a two-part procrustes problem and wanted to find out why my minimization algorithm sometimes does not converge properly or renders unexpected results. The loss function to be minimized is

$\displaystyle
L(\mathbf{Q},c) = \| c \mathbf{A_1Q} - \mathbf{B_1} \|^2 + \| \mathbf{A_2Q} - \mathbf{B_2} \|^2 \rightarrow min
$

with \| \cdot \| denoting the Frobenius norm, c is an unknown scalar and \mathbf{Q} an unknown rotation matrix, i.e. \mathbf{Q}^T\mathbf{Q}=\mathbf{I}. \;\mathbf{A_1}, \mathbf{A_2}, \mathbf{B_1}, and \mathbf{B_2} are four real-valued matrices. The minimum for c is easily found by setting the partial derivative of L(\mathbf{Q},c) w.r.t. c equal to zero.

$\displaystyle
c = \frac {tr \; \mathbf{Q}^T \mathbf{A_1}^T \mathbf{B_1}}
{ \| \mathbf{A_1} \|^2 }
$
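
In case the intermediate step is useful, a short sketch: only the first summand of the loss depends on c, and the Frobenius norm is invariant under the orthogonal \mathbf{Q}, so

$\displaystyle
\frac{\partial L}{\partial c} = 2\, c \, \| \mathbf{A_1Q} \|^2 - 2\, tr \; \mathbf{Q}^T \mathbf{A_1}^T \mathbf{B_1} = 0, \qquad \| \mathbf{A_1Q} \| = \| \mathbf{A_1} \|,
$

which rearranges to the expression for c above.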

By plugging c into the loss function L(\mathbf{Q},c) we get a new loss function L(\mathbf{Q}) that only depends on \mathbf{Q}. This is the starting situation.

When trying to find out why the algorithm to minimize L(\mathbf{Q}) did not work as expected, I got stuck. So I decided to conduct a small simulation and generate random rotation matrices to study the relation between the parameter c and the value of the loss function L(\mathbf{Q}). Before looking at the results for the entire two-part procrustes problem from above, let’s visualize the results for the first part of the loss function only, i.e.

$\displaystyle
L(\mathbf{Q},c) = \| c \mathbf{A_1Q} - \mathbf{B_1} \|^2 \rightarrow min
$

Here, c has the same minimum as for the whole formula above. For the simulation I used

$
\mathbf{A_1}= \begin{pmatrix}
0.0 & 0.4 & -0.5 \\
-0.4 & -0.8 & -0.5 \\
-0.1 & -0.5 & 0.2 \\
\end{pmatrix} \mkern18mu \qquad \text{and} \qquad \mkern36mu \mathbf{B_1}= \begin{pmatrix}
-0.1 & -0.8 & -0.1 \\
0.3 & 0.2 & -0.9 \\
0.1 & -0.3 & -0.5 \\
\end{pmatrix} $

as input matrices. Generating many random rotation matrices \mathbf{Q} and plotting c against the value of the loss function yields the following plot.

This is a well-behaved relation: for each value of the scaling parameter c the loss is identical, regardless of which random rotation produced it. Now let’s look at the full two-part loss function. As input matrices I used

$\displaystyle
A1= \begin{pmatrix}
0.0 & 0.4 & -0.5 \\
-0.4 & -0.8 & -0.5 \\
-0.1 & -0.5 & 0.2 \\
\end{pmatrix} \mkern18mu , \mkern36mu B1= \begin{pmatrix}
-0.1 & -0.8 & -0.1 \\
0.3 & 0.2 & -0.9 \\
0.1 & -0.3 & -0.5 \\
\end{pmatrix} $
$
A2= \begin{pmatrix}
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
\end{pmatrix} \mkern18mu , \mkern36mu B2= \begin{pmatrix}
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
\end{pmatrix} $

and the following R-code.

# trace function
tr <- function(X) sum(diag(X))

# random matrix type 1
rmat_1 <- function(n=3, p=3, min=-1, max=1){
  matrix(runif(n*p, min, max), ncol=p)
}

# random matrix type 2, sparse
rmat_2 <- function(p=3) {
  diag(p)[, sample(1:p, p)]
}

# generate random rotation matrix Q. Based on Q find 
# optimal scaling factor c and calculate loss function value
#
one_sample <- function(n=2, p=2)
{
  Q <- mixAK::rRotationMatrix(n=1, dim=p) %*%         # random rotation matrix det(Q) = 1
    diag(sample(c(-1,1), p, rep=T))                   # additional reflections, so det(Q) in {-1,1}
  s <- tr( t(Q) %*% t(A1) %*% B1 ) / norm(A1, "F")^2  # scaling factor c
  rss <- norm(s*A1 %*% Q - B1, "F")^2 +               # get residual sum of squares
         norm(A2 %*% Q - B2, "F")^2 
  c(s=s, rss=rss)
}

# find c and rss for many random rotation matrices
#
set.seed(10)  # nice case for 3 x 3
n <- 3
p <- 3
A1 <- round(rmat_1(n, p), 1)
B1 <- round(rmat_1(n, p), 1)
A2 <- rmat_2(p)
B2 <- rmat_2(p)

x <- plyr::rdply(40000, one_sample(3,3)) 
plot(x$s, x$rss, pch=16, cex=.4, xlab="c", ylab="L(Q)", col="#00000010")

This time the result turns out to be very different and … beautiful :)

Here, we no longer have a one-to-one relation between the scaling parameter and the loss function. I do not quite know what to make of this yet. But for now I am happy that it has aesthetic value. Below you find some more beautiful graphics with different matrices as inputs.

Cheers!

by markheckmann at April 14, 2015 04:53 PM

February 24, 2015

Douglas Bates

RCall: Running an embedded R in Julia

I have used R (and S before it) for a couple of decades. In the last few years most of my coding has been in Julia, a language for technical computing that can provide remarkable performance for a dynamically typed language via Just-In-Time (JIT) compilation of functions and via multiple dispatch.

Nonetheless there are facilities in R that I would like to have access to from Julia. I created the RCall package for Julia to do exactly that. This IJulia notebook provides an introduction to RCall.

This is not a novel idea by any means. Julia already has PyCall and JavaCall packages that provide access to Python and to Java. These packages are used extensively and are much more sophisticated than RCall, at present. Many other languages have facilities to run an embedded instance of R. In fact, Python has several such interfaces.

The things I plan to do using RCall are to access datasets from R and R packages, to fit models that are not currently implemented in Julia, and to use R graphics, especially the ggplot2 and lattice packages. Unfortunately I am not currently able to start a graphics device from the embedded R, but I expect that to be fixed soon.

I can tell you the most remarkable aspect of RCall, although it may not mean much if you haven't tried to do this kind of thing: it is written entirely in Julia. There is absolutely no "glue" code written in a compiled language like C or C++. As I said, this may not mean much to you unless you have tried to do something like this, in which case it is astonishing.

by Douglas Bates (noreply@blogger.com) at February 24, 2015 11:05 PM

February 03, 2015

Romain Francois

January 16, 2015

Modern Toolmaking

caretEnsemble

My package caretEnsemble, for making ensembles of caret models, is now on CRAN.

Check it out, and let me know what you think! (Submit bug reports and feature requests to the issue tracker)

by Zachary Deane-Mayer (noreply@blogger.com) at January 16, 2015 10:22 PM

January 15, 2015

Gregor Gorjanc

cpumemlog: Monitor CPU and RAM usage of a process (and its children)

Long time no see ...

Today I pushed the cpumemlog script to GitHub https://github.com/gregorgorjanc/cpumemlog. Read more about this useful utility at the GitHub site.

by Gregor Gorjanc (noreply@blogger.com) at January 15, 2015 10:16 PM

December 15, 2014

R you ready?

QQ-plots in R vs. SPSS – A look at the differences


We teach two software packages, R and SPSS, in Quantitative Methods 101 for psychology freshmen at Bremen University (Germany). Sometimes confusion arises when the software packages produce different results. This may be due to specifics in the implementation of a method or, as in most cases, to different default settings. One of these situations occurs when the QQ-plot is introduced. Below we see two QQ-plots, produced by SPSS and R, respectively. The data used in the plots were generated by:

set.seed(0)
x <- sample(0:9, 100, rep=T)    

SPSS

QQ-plot in SPSS using Blom's method

R

qqnorm(x, datax=T)      # uses Blom's method by default
qqline(x, datax=T)

There are some obvious differences:

  1. The most obvious one is that the R plot seems to contain more data points than the SPSS plot. Actually, this is not the case. Some data points are plotted on top of each other in SPSS, while they are spread out vertically in the R plot. The reason for this difference is that SPSS uses a different approach to assigning probabilities to the values. We will explore the two approaches below.
  2. The scaling of the y-axis differs. R uses quantiles from the standard normal distribution. SPSS by default rescales these values using the mean and standard deviation from the original data. This allows a direct comparison of the original and theoretical values. This is a simple linear transformation and will not be explained any further here.
  3. The QQ-lines are not identical. R uses the 1st and 3rd quartile from both distributions to draw the line. This is different in SPSS, where a line is drawn for identical values on both axes. We will explore the differences below.

QQ-plots from scratch

To get a better understanding of the difference we will build the R and SPSS-flavored QQ-plot from scratch.

R type

In order to calculate theoretical quantiles corresponding to the observed values, we first need to find a way to assign a probability to each value of the original data. A lot of different approaches exist for this purpose (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012b). They usually build on the ranks of the observed data points to calculate corresponding p-values, i.e. the plotting positions for each point. The qqnorm function uses two formulae for this purpose, depending on the number of observations n (Blom’s method, see ?qqnorm; Blom, 1958). With r being the rank, for n > 10 it will use the formula p = (r - 1/2) / n, for n \leq 10 the formula p = (r - 3/8) / (n + 1/4) to determine the probability value p for each observation (see the help files for the functions qqnorm and ppoints). For simplicity, we will only implement the n > 10 case here.

n <- length(x)          # number of observations
r <- order(order(x))    # order of values, i.e. ranks without averaged ties
p <- (r - 1/2) / n      # assign to ranks using Blom's method
y <- qnorm(p)           # theoretical standard normal quantiles for p values
plot(x, y)              # plot empirical against theoretical values

Before we take a look at the code, note that our plot is identical to the plot generated by qqnorm above, except that the QQ-line is missing. The main point that makes the difference between R and SPSS is found in the command order(order(x)). The command calculates ranks for the observations using ordinal ranking. This means that all observations get different ranks and no average ranks are calculated for ties, i.e. for observations with equal values. Another approach would be to apply fractional ranking and calculate average values for ties. This is what the function rank does. The following code shows the difference between the two approaches to assigning ranks.

v <- c(1,1,2,3,3)
order(order(v))     # ordinal ranking used by R
## [1] 1 2 3 4 5
rank(v)             # fractional ranking used by SPSS
## [1] 1.5 1.5 3.0 4.5 4.5

R uses ordinal ranking and SPSS uses fractional ranking by default to assign ranks to values. Thus, the positions do not overlap in R as each ordered observation is assigned a different rank and therefore a different p-value. We will pick up the second approach again later, when we reproduce the SPSS-flavored plot in R.1

The second difference between the plots concerned the scaling of the y-axis and was already clarified above.

The last point to understand is how the QQ-line is drawn in R. Looking at the probs argument of qqline reveals that it uses the 1st and 3rd quartile of the original data and theoretical distribution to determine the reference points for the line. We will draw the line between the quartiles in red and overlay it with the line produced by qqline to see if our code is correct.

plot(x, y)                      # plot empirical against theoretical values
ps <- c(.25, .75)               # reference probabilities
a <- quantile(x, ps)            # empirical quantiles
b <- qnorm(ps)                  # theoretical quantiles
lines(a, b, lwd=4, col="red")   # our QQ line in red
qqline(x, datax=T)              # R QQ line

The reason for different lines in R and SPSS is that several approaches to fitting a straight line exist (for an overview see e.g. Castillo-Gutiérrez, Lozano-Aguilera, & Estudillo-Martínez, 2012a). Each approach has different advantages. The method used by R is more robust when we expect values to diverge from normality in the tails, and we are primarily interested in the normality of the middle range of our data. In other words, the method of fitting an adequate QQ-line depends on the purpose of the plot. An explanation of the rationale of the R approach can e.g. be found here.

SPSS type

The default SPSS approach also uses Blom’s method to assign probabilities to ranks (you may choose other methods in SPSS) and differs from the one above in the following aspects:

  • a) As already mentioned, SPSS uses ranks with averaged ties (fractional ranking), not the plain order ranks (ordinal ranking) as in R, to derive the corresponding probabilities for each data point. The rest of the code is identical to the one above, though I am not sure whether SPSS distinguishes between the n > 10 and n \leq 10 cases.
  • b) The theoretical quantiles are scaled to match the estimated mean and standard deviation of the original data.
  • c) The QQ-line goes through all quantiles with identical values on the x and y axis.
n <- length(x)                # number of observations
r <- rank(x)                  # a) ranks using fractional ranking (averaging ties)
p <- (r - 1/2) / n            # assign to ranks using Blom's method
y <- qnorm(p)                 # theoretical standard normal quantiles for p values
y <- y * sd(x) + mean(x)      # b) transform SND quantiles to mean and sd from original data
plot(x, y)                    # plot empirical against theoretical values

Lastly, let us add the line. As the scaling of both axes is the same, the line goes through the origin with a slope of 1.

abline(0,1)                   # c) slope 0 through origin

The comparison to the SPSS output shows that they are (visually) identical.

Function for SPSS-type QQ-plot

The whole point of this demonstration was to pinpoint and explain the differences between a QQ-plot generated in R and SPSS, so it will no longer be a reason for confusion. Note, however, that SPSS offers a whole range of options to generate the plot. For example, you can select the method to assign probabilities to ranks and decide how to treat ties. The plots above used the default setting (Blom’s method and averaging across ties). Personally I like the SPSS version. That is why I implemented the function qqnorm_spss in the ryouready package, which accompanies the course. The formulae for the different methods to assign probabilities to ranks can be found in Castillo-Gutiérrez et al. (2012b). The implementation is a preliminary version that has not yet been thoroughly tested. You can find the code here. Please report any bugs or suggestions for improvements (which are very welcome) in the GitHub issues section.

library(devtools) 
install_github("markheckmann/ryouready")                # install from github repo
library(ryouready)                                      # load package
library(ggplot2)
qq <- qqnorm_spss(x, method=1, ties.method="average")   # Blom's method with averaged ties
plot(qq)                                                # generate QQ-plot
ggplot(qq)                                              # use ggplot2 to generate QQ-plot

Literature


  1. Technical sidenote: Internally, qqnorm uses the function ppoints to generate the p-values. Type in stats:::qqnorm.default to the console to have a look at the code. 
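
For a quick illustration of what ppoints() returns (it uses a = 3/8 for n ≤ 10 and a = 1/2 otherwise):

ppoints(5)         # (1:5 - 3/8) / (5 + 1/4)
## [1] 0.1190476 0.3095238 0.5000000 0.6904762 0.8809524
ppoints(20)[1:3]   # (1:3 - 1/2) / 20
## [1] 0.025 0.075 0.125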

by markheckmann at December 15, 2014 08:55 AM

October 20, 2014

Modern Toolmaking

For faster R on a mac, use veclib

Update: The links to all my github gists on blogger are broken, and I can't figure out how to fix them.  If you know how to insert github gists on a dynamic blogger template, please let me know.

In the meantime, here are instructions with links to the code:
First of all, use homebrew to compile openblas.  It's easy!  Second of all, you can also use homebrew to install R! (But maybe stick with the CRAN version unless you really want to compile your own R binary)

To use openblas with R, follow these instructions:
https://gist.github.com/zachmayer/e591cf868b3a381a01d6#file-openblas-sh

To use veclib with R, follow these instructions:
https://gist.github.com/zachmayer/e591cf868b3a381a01d6#file-veclib-sh

OLD POST:

Inspired by this post, I decided to try using OpenBLAS for R on my mac.  However, it turns out there's a simpler option, using the vecLib BLAS library, which is provided by Apple as part of the accelerate framework.

If you are using R 2.15, follow these instructions to change your BLAS from the default to vecLib:


However, as noted in r-sig-mac, these instructions do not work for R 3.0.  You have to directly link to the accelerate framework's version of vecLib:


Finally, test your new blas using this script:
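
If the linked gist is unavailable, a rough stand-in is easy to write yourself; the following minimal timing sketch (my own, not the original script; the matrix size is an arbitrary choice) is enough to compare BLAS back-ends:

set.seed(42)
n <- 2000
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)
system.time(A %*% B)   # compare the elapsed time before and after switching BLAS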


On my system (a retina macbook pro), the default BLAS takes 141 seconds and vecLib takes 43 seconds, which is a significant speedup.  If you plan to use vecLib, note the following warning from the R development team "Although fast, it is not under our control and may possibly deliver inaccurate results."

So far, I have not encountered any issues using vecLib, but it's only been a few hours :-).

UPDATE: you can also install OpenBLAS on a mac:

If you do this, make sure to change the directories to point to the correct location on your system  (e.g. change /users/zach/source to whatever directory you clone the git repo into).  On my system, the benchmark script takes ~41 seconds when using openBLAS, which is a small but significant speedup.

by Zachary Deane-Mayer (noreply@blogger.com) at October 20, 2014 04:24 PM

September 19, 2014

Chris Lawrence

What could a federal UK look like?

Assuming that the “no” vote prevails in the Scottish independence referendum, the next question for the United Kingdom is to consider constitutional reform to implement a quasi-federal system and resolve the West Lothian question once and for all. In some ways, it may also provide an opportunity to resolve the stalled reform of the upper house as well. Here’s the rough outline of a proposal that might work.

  • Devolve identical powers to England, Northern Ireland, Scotland, and Wales, with the proviso that local self-rule can be suspended if necessary by the federal legislature (by a supermajority).

  • The existing House of Commons becomes the House of Commons for England, which (along with the Sovereign) shall comprise the English Parliament. This parliament would function much as the existing devolved legislatures in Scotland and Wales; the consociational structure of the Northern Ireland Assembly (requiring double majorities) would not be replicated.

  • The House of Lords is abolished, and replaced with a directly-elected Senate of the United Kingdom. The Senate will have authority to legislate on the non-devolved powers (in American parlance, “delegated” powers) such as foreign and European Union affairs, trade and commerce, national defense, and on matters involving Crown dependencies and territories, the authority to legislate on devolved matters in the event self-government is suspended in a constituent country, and dilatory powers including a qualified veto (requiring a supermajority) over the legislation proposed by a constituent country’s parliament. The latter power would effectively replace the review powers of the existing House of Lords; it would function much as the Council of Revision in Madison’s original plan for the U.S. Constitution.

    As the Senate will have relatively limited powers, it need not be as large as the existing Lords or Commons. To ensure the countries other than England have a meaningful voice, given that nearly 85% of the UK’s population is in England, two-thirds of the seats would be allocated proportionally based on population and one-third allocated equally to the four constituent countries. This would still result in a chamber with a large English majority (around 64.4%) but nonetheless would ensure the other three countries would have meaningful representation as well.

by Chris Lawrence at September 19, 2014 12:00 AM

June 18, 2014

Chris Lawrence

Soccer queries answered

Kevin Drum asks a bunch of questions about soccer:

  1. Outside the penalty area there’s a hemisphere about 20 yards wide. I can’t recall ever seeing it used for anything. What’s it for?
  2. On several occasions, I’ve noticed that if the ball goes out of bounds at the end of stoppage time, the referee doesn’t whistle the match over. Instead, he waits for the throw-in, and then immediately whistles the match over. What’s the point of this?
  3. Speaking of stoppage time, how has it managed to last through the years? I know, I know: tradition. But seriously. Having a timekeeper who stops the clock for goals, free kicks, etc. has lots of upside and no downside. Right? It wouldn’t change the game in any way, it would just make timekeeping more accurate, more consistent, and more transparent for the fans and players. Why keep up the current pretense?
  4. What’s the best way to get a better sense of what’s a foul and what’s a legal tackle? Obviously you can’t tell from the players’ reactions, since they all writhe around like landed fish if they so much as trip over their own shoelaces. Reading the rules provides the basics, but doesn’t really help a newbie very much. Maybe a video that shows a lot of different tackles and explains why each one is legal, not legal, bookable, etc.?

The first one’s easy: there’s a general rule that no defensive player can be within 10 yards of the spot of a direct free kick. A penalty kick (which is a type of direct free kick) takes place in the 18-yard box, and no players other than the player taking the kick and the goalkeeper are allowed in the box. However, owing to geometry, the 18 yard box and the 10 yard exclusion zone don’t fully coincide, hence the penalty arc. (That’s also why there are two tiny hash-marks on the goal line and side line 10 yards from the corner flag. And why now referees have a can of shaving cream to mark the 10 yards for other free kicks, one of the few MLS innovations that has been a good idea.)

Second one’s also easy: the half and the game cannot end while the ball is out of play.

Third one’s harder. First, keeping time inexactly forestalls the silly premature celebrations that are common in most US sports. You’d never see the Stanford-Cal play happen in a soccer game. Second, it allows some slippage for short delays and doesn’t require exact timekeeping; granted, this was more valuable before instant replays and fourth officials, but most US sports require a lot of administrative record-keeping by ancillary officials. A soccer game can be played with one official (and often is, particularly at the amateur level) without having to change timing rules;* in developing countries in particular this lowers the barriers to entry for the sport (along with the low equipment requirements) without changing the nature of the game appreciably. Perhaps most importantly, if the clock was allowed to stop regularly it would create an excuse for commercial timeouts and advertising breaks, which would interrupt the flow of the game and potentially reduce the advantages of better-conditioned and more skilled athletes. (MLS tried this, along with other exciting American ideas like “no tied games,” and it was as appealing to actual soccer fans as ketchup on filet mignon would be to a foodie, and perhaps more importantly didn’t make any non-soccer fans watch.)

Fourth, the key distinction is usually whether there was an obvious attempt to play the ball; in addition, in the modern game, even some attempts to play the ball are considered inherently dangerous (tackling from behind, many sliding tackles, etc.) and therefore are fouls even if they are successful in getting more ball than human.

* To call offside, you’d also probably need what in my day we called a “linesman.”

by Chris Lawrence at June 18, 2014 12:00 AM

May 07, 2014

Chris Lawrence

The mission and vision thing

Probably the worst-kept non-secret is that the next stage of the institutional evolution of my current employer is to some ill-defined concept of “university status,” which mostly involves the establishment of some to-be-determined master’s degree programs. In the context of the University System of Georgia, it means a small jump from the “state college” prestige tier (a motley collection of schools that largely started out as two-year community colleges and transfer institutions) to the “state university” tier (which is where most of the ex-normal schools hang out these days). What is yet to be determined is how that transition will affect the broader institution that will be the University of Middle Georgia.* People on high are said to be working on these things; in any event, here are my assorted random thoughts on what might be reasonable things to pursue:

  • Marketing and positioning: Unlike the situation facing many of the other USG institutions, the population of the two anchor counties of our core service area (Bibb and Houston) is growing, and Houston County in particular has a statewide reputation for the quality of its public school system. Rather than conceding that the most prepared students from these schools will go to Athens or Atlanta or Valdosta, we should strongly market our institutional advantages over these more “prestigious” institutions, particularly in terms of the student experience in the first two years and the core curriculum: we have no large lecture courses, no teaching assistants, no lengthy bus rides to and from class every day, and the vast majority of the core is taught by full-time faculty with terminal degrees. Not to mention costs to students are much lower, particularly in the case of students who do not qualify for need-based aid. Even if we were to “lose” these students as transfers to the top-tier institutions after 1–4 semesters, we’d still benefit from the tuition and fees they bring in and we would not be penalized in the upcoming state performance funding formula. Dual enrollment in Warner Robins in particular is an opportunity to showcase our institution as a real alternative for better prepared students rather than a safety school.
  • Comprehensive offerings at the bachelor’s level: As a state university, we will need to offer a comprehensive range of options for bachelor’s students to attract and retain students, both traditional and nontraditional. In particular, B.S. degrees in political science and sociology with emphasis in applied empirical skills would meet public and private employer demand for workers who have research skills and the ability to collect, manage, understand, and use data appropriately. There are other gaps in the liberal arts and sciences as well that need to be addressed to become a truly comprehensive state university.
  • Create incentives to boost the residential population: The college currently has a heavy debt burden inherited from the overbuilding of dorms at the Cochran campus. We need to identify ways to encourage students to live in Cochran, which may require public-private partnerships to try to build a “college town” atmosphere in the community near campus. We also need to work with wireless providers like Sprint and T-Mobile to ensure that students from the “big city” can fully use their cell phones and tablets in Cochran and Eastman without roaming fees or changing wireless providers.
  • Tie the institution more closely to the communities we serve: This includes both physical ties and psychological ties. The Macon campus in particular has poor physical links to the city itself for students who might walk or ride bicycles; extending the existing bike/walking trail from Wesleyan to the Macon campus should be a priority, as should pedestrian access and bike facilities along Columbus Road. Access to the Warner Robins campus is somewhat better but still could be improved. More generally, the institution is perceived as an afterthought or alternative of last resort in the community. Improving this situation and perception among community leaders and political figures may require a physical presence in or near downtown Macon, perhaps in partnership with the GCSU Graduate Center.

* There is no official name-in-waiting, but given that our former interim president seemed to believe he could will this name into existence by repeating it enough I’ll stick with it. The straw poll of faculty trivia night suggests that it’s the least bad option available, which inevitably means the regents will choose something else instead (if the last name change is anything to go by).

by Chris Lawrence at May 07, 2014 12:00 AM

February 17, 2014

Seth Falcon

Have Your SHA and Bcrypt Too

Fear

I've been putting off sharing this idea because I've heard the rumors about what happens to folks who aren't security experts when they post about security on the internet. If this blog is replaced with cat photos and rainbows, you'll know what happened.

The Sad Truth

It's 2014 and chances are you have accounts on websites that are not properly handling user passwords. I did no research to produce the following list of ways passwords are mishandled in decreasing order of frequency:

  1. Site uses a fast hashing algorithm, typically SHA1(salt + plain-password).
  2. Site doesn't salt password hashes
  3. Site stores raw passwords

We know that sites should be generating secure random salts and using an established slow hashing algorithm (bcrypt, scrypt, or PBKDF2). Why are sites not doing this?

While security issues deserve a top spot on any site's priority list, new features often trump addressing legacy security concerns. The immediacy of the risk is hard to quantify and it's easy to fall prey to a "nothing bad has happened yet, why should we change now" attitude. It's easy for other bugs, features, or performance issues to win out when measured by immediate impact. Fixing security or other "legacy" issues is the Right Thing To Do and often you will see no measurable benefit from the investment. It's like having insurance. You don't need it until you do.

Specific to the improper storage of user password data is the issue of the impact to a site imposed by upgrading. There are two common approaches to upgrading password storage. You can switch cold turkey to the improved algorithms and force password resets on all of your users. Alternatively, you can migrate incrementally such that new users and any user who changes their password gets the increased security.

The cold turkey approach is not a great user experience and sites might choose to delay an upgrade to avoid admitting to a weak security implementation and disrupting their site by forcing password resets.

The incremental approach is more appealing, but the security benefit is drastically diminished for any site with a substantial set of existing users.

Given the above migration choices, perhaps it's (slightly) less surprising that businesses choose to prioritize other work ahead of fixing poorly stored user password data.

The Idea

What if you could upgrade a site so that both new and existing users immediately benefited from the increased security, but without the disruption of password resets? It turns out that you can and it isn't very hard.

Consider a user table with columns:

userid
salt
hashed_pass

Where the hashed_pass column is computed using a weak fast algorithm, for example SHA1(salt + plain_pass).

The core of the idea is to apply a proper algorithm on top of the data we already have. I'll use bcrypt to make the discussion concrete. Add columns to the user table as follows:

userid
salt
hashed_pass
hash_type
salt2

Process the existing user table by computing bcrypt(salt2 + hashed_pass) and storing the result in the hashed_pass column (overwriting the less secure value); save the new salt value to salt2 and set hash_type to bcrypt+sha1.

To verify a user where hash_type is bcrypt+sha1, compute bcrypt(salt2 + SHA1(salt + plain_pass)) and compare to the hashed_pass value. Note that bcrypt implementations encode the salt as a prefix of the hashed value so you could avoid the salt2 column, but it makes the idea easier to explain to have it there.
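
To make the recipe concrete, here is a minimal sketch in R (the post itself is language-agnostic; I am assuming the CRAN digest and bcrypt packages and the column names from the table above, and I skip salt2 since bcrypt embeds its salt in the hash):

library(digest)   # digest(..., algo = "sha1") for the legacy hash
library(bcrypt)   # hashpw()/checkpw()

## legacy scheme: hashed_pass = SHA1(salt + plain_pass)
legacy_sha1 <- function(salt, plain) digest(paste0(salt, plain), algo = "sha1", serialize = FALSE)

## one-time migration of an existing row: wrap the stored SHA1 value in bcrypt
upgrade_row <- function(hashed_pass) list(hashed_pass = hashpw(hashed_pass), hash_type = "bcrypt+sha1")

## login check for rows with hash_type == "bcrypt+sha1"
check_login <- function(plain, salt, hashed_pass) checkpw(legacy_sha1(salt, plain), hashed_pass)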

You can take this approach further and have any user that logs in (as well as new users) upgrade to a "clean" bcrypt only algorithm since you can now support different verification algorithms using hash_type. With the proper application code changes in place, the upgrade can be done live.

This scheme will also work for sites storing non-salted password hashes as well as those storing plain text passwords (THE HORROR).

Less Sadness, Maybe

Perhaps this approach makes implementing a password storage security upgrade more palatable and more likely to be prioritized. And if there's a horrible flaw in this approach, maybe you'll let me know without turning this blog into a tangle of cat photos and rainbows.

February 17, 2014 07:08 PM

December 26, 2013

Seth Falcon

A Rebar Plugin for Locking Deps: Reproducible Erlang Project Builds For Fun and Profit

What's this lock-deps of which you speak?

If you use rebar to generate an OTP release project and want to have reproducible builds, you need the rebar_lock_deps_plugin plugin. The plugin provides a lock-deps command that will generate a rebar.config.lock file containing the complete flattened set of project dependencies each pegged to a git SHA. The lock file acts similarly to Bundler's Gemfile.lock file and allows for reproducible builds (*).

Without lock-deps you might rely on the discipline of using a tag for all of your application's deps. This is insufficient if any dep depends on something not specified as a tag. It can also be a problem if a third party dep doesn't provide a tag. Generating a rebar.config.lock file solves these issues. Moreover, using lock-deps can simplify the work of putting together a release consisting of many of your own repos. If you treat the master branch as shippable, then rather than tagging each subproject and updating rebar.config throughout your project's dependency chain, you can run get-deps (without the lock file), compile, and re-lock at the latest versions throughout your project repositories.

The reproducibility of builds when using lock-deps depends on the SHAs captured in rebar.config.lock. The plugin works by scanning the cloned repos in your project's deps directory and extracting the current commit SHA. This works great until a repository's history is rewritten with a force push. If you really want reproducible builds, you need to not nuke your SHAs and you'll need to fork all third party repos to ensure that someone else doesn't screw you over in this fashion either. If you make a habit of only depending on third party repos using a tag, assume that upstream maintainers are not completely bat shit crazy, and don't force push your master branch, then you'll probably be fine.

Getting Started

Install the plugin in your project by adding the following to your rebar.config file:

%% Plugin dependency
{deps, [
    {rebar_lock_deps_plugin, ".*",
     {git, "git://github.com/seth/rebar_lock_deps_plugin.git", {branch, "master"}}}
]}.

%% Plugin usage
{plugins, [rebar_lock_deps_plugin]}.

To test it out do:

rebar get-deps
# the plugin has to be compiled so you can use it
rebar compile
rebar lock-deps

If you'd like to take a look at a project that uses the plugin, take a look at CHEF's erchef project.

Bonus features

If you are building an OTP release project using rebar generate then you can use rebar_lock_deps_plugin to enhance your build experience in three easy steps.

  1. Use rebar bump-rel-version version=$BUMP to automate the process of editing rel/reltool.config to update the release version. The argument $BUMP can be major, minor, or patch (default) to increment the specified part of a semver X.Y.Z version. If $BUMP is any other value, it is used as the new version verbatim. Note that this function rewrites rel/reltool.config using ~p. I check-in the reformatted version and maintain the formatting when editing. This way, the general case of a version bump via bump-rel-version results in a minimal diff.

  2. Autogenerate a change summary commit message for all project deps. Assuming you've generated a new lock file and bumped the release version, use rebar commit-release to commit the changes to rebar.config.lock and rel/reltool.config with a commit message that summarizes the changes made to each dependency between the previously locked version and the newly locked version. You can get a preview of the commit message via rebar log-changed-deps.

  3. Finally, create an annotated tag for your new release with rebar tag-release which will read the current version from rel/reltool.config and create an annotated tag named with the version.

The dependencies, they are ordered

Up to version 2.0.1 of rebar_lock_deps_plugin, the dependencies in the generated lock file were ordered alphabetically. This was a side-effect of using filelib:wildcard/1 to list the dependencies in the top-level deps directory. In most cases, the order of the full dependency set does not matter. However, if some of the code in your project uses parse transforms, then it will be important for the parse transform to be compiled and on the code path before attempting to compile code that uses the parse transform.

This issue was recently discovered by a colleague who ran into build issues using the lock file for a project that had recently integrated lager for logging. He came up with the idea of maintaining the order of deps as they appear in the various rebar.config files along with a prototype patch proving out the idea. As of rebar_lock_deps_plugin 3.0.0, the lock-deps command will (mostly) maintain the relative order of dependencies as found in the rebar.config files.

The "mostly" is that when a dep is shared across two subprojects, it will appear in the expected order for the first subproject (based on the ordering of the two subprojects). The deps for the second subproject will not be in strict rebar.config order, but the resulting order should address any compile-time dependencies and be relatively stable (only changing when project deps alter their deps with larger impact when shared deps are introduced or removed).

Digression: fun with dependencies

There are times, as a programmer, when a real-world problem looks like a text book exercise (or an interview whiteboard question). Just the other day at work we had to design some manhole covers, but I digress.

Fixing the order of the dependencies in the generated lock file is (nearly) the same as finding an install order for a set of projects with inter-dependencies. I had some fun coding up the text book solution even though the approach doesn't handle the constraint of respecting the order provided by the rebar.config files. Onward with the digression.

We have a set of "packages" where some packages depend on others and we want to determine an install order such that a package's dependencies are always installed before the package. The set of packages and the relation "depends on" form a directed acyclic graph or DAG. The topological sort of a DAG produces an install order for such a graph. The ordering is not unique. For example, with a single package C depending on A and B, valid install orders are [A, B, C] and [B, A, C].

To setup the problem, we load all of the project dependency information into a proplist mapping each package to a list of its dependencies extracted from the package's rebar.config file.

read_all_deps(Config, Dir) ->
    TopDeps = rebar_config:get(Config, deps, []),
    Acc = [{top, dep_names(TopDeps)}],
    DepDirs = filelib:wildcard(filename:join(Dir, "*")),
    Acc ++ [
     {filename:basename(D), dep_names(extract_deps(D))}
     || D <- DepDirs ].

Erlang's standard library provides the digraph and digraph_utils modules for constructing and operating on directed graphs. The digraph_utils module includes a topsort/1 function which we can make use of for our "exercise". The docs say:

Returns a topological ordering of the vertices of the digraph Digraph if such an ordering exists, false otherwise. For each vertex in the returned list, there are no out-neighbours that occur earlier in the list.

To figure out which way to point the edges when building our graph, consider two packages A and B with A depending on B. We know we want to end up with an install order of [B, A]. Rereading the topsort/1 docs, we must want an edge B => A. With that, we can build our DAG and obtain an install order with the topological sort:

load_digraph(Config, Dir) ->
    AllDeps = read_all_deps(Config, Dir),
    G = digraph:new(),
    Nodes = all_nodes(AllDeps),
    [ digraph:add_vertex(G, N) || N <- Nodes ],
    %% If A depends on B, then we add an edge A <= B
    [ 
      [ digraph:add_edge(G, Dep, Item)
        || Dep <- DepList ]
      || {Item, DepList} <- AllDeps, Item =/= top ],
    digraph_utils:topsort(G).

%% extract a sorted unique list of all deps
all_nodes(AllDeps) ->
    lists:usort(lists:foldl(fun({top, L}, Acc) ->
                                    L ++ Acc;
                               ({K, L}, Acc) ->
                                    [K|L] ++ Acc
                            end, [], AllDeps)).

The digraph module manages graphs using ETS giving it a convenient API, though one that feels un-erlang-y in its reliance on side-effects.

The above gives an install order, but doesn't take into account the relative order of deps as specified in the rebar.config files. The solution implemented in the plugin is a bit less fancy, recursing over the deps and maintaining the desired ordering. The only tricky bit is that shared deps are ignored until the end, when the entire linearized list is de-duped. Here's the code:

order_deps(AllDeps) ->
    Top = proplists:get_value(top, AllDeps),
    order_deps(lists:reverse(Top), AllDeps, []).

order_deps([], _AllDeps, Acc) ->
    de_dup(Acc);
order_deps([Item|Rest], AllDeps, Acc) ->
    ItemDeps = proplists:get_value(Item, AllDeps),
    order_deps(lists:reverse(ItemDeps) ++ Rest, AllDeps, [Item | Acc]).

de_dup(AccIn) ->
    WithIndex = lists:zip(AccIn, lists:seq(1, length(AccIn))),
    UWithIndex = lists:usort(fun({A, _}, {B, _}) ->
                                     A =< B
                             end, WithIndex),
    Ans0 = lists:sort(fun({_, I1}, {_, I2}) ->
                              I1 =< I2
                      end, UWithIndex),
    [ V || {V, _} <- Ans0 ].

Conclusion and the end of this post

The great thing about posting to your blog is, you don't have to have a proper conclusion if you don't want to.

December 26, 2013 04:20 PM

December 09, 2013

Leandro Penz

Probabilistic bug hunting


Have you ever run into a bug that, no matter how carefully you try to reproduce it, only happens sometimes? And then you think you've finally got it and solved it - and tested a couple of times without any manifestation. How do you know that you have tested enough? Are you sure you were not "lucky" in your tests?

In this article we will see how to answer those questions and the math behind it without going into too much detail. This is a pragmatic guide.

The Bug

The following program is supposed to generate two random 8-bit integers and print them on stdout:

  
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  
  /* Returns -1 if error, other number if ok. */
  int get_random_chars(char *r1, char*r2)
  {
  	int f = open("/dev/urandom", O_RDONLY);
  
  	if (f < 0)
  		return -1;
  	if (read(f, r1, sizeof(*r1)) < 0)
  		return -1;
  	if (read(f, r2, sizeof(*r2)) < 0)
  		return -1;
  	close(f);
  
  	return *r1 & *r2;
  }
  
  int main(void)
  {
  	char r1;
  	char r2;
  	int ret;
  
  	ret = get_random_chars(&r1, &r2);
  
  	if (ret < 0)
  		fprintf(stderr, "error");
  	else
  		printf("%d %d\n", r1, r2);
  
  	return ret < 0;
  }
  

On my architecture (Linux on IA-32) it has a bug that makes it print "error" instead of the numbers sometimes.

The Model

Every time we run the program, the bug can either show up or not. It has a non-deterministic behaviour that requires statistical analysis.

We will model a single program run as a Bernoulli trial, with success defined as "seeing the bug", as that is the event we are interested in. We have the following parameters when using this model:

  • \(n\): the number of tests made;
  • \(k\): the number of times the bug was observed in the \(n\) tests;
  • \(p\): the unknown (and, most of the time, unknowable) probability of seeing the bug.

Since each run is a Bernoulli trial, the number of errors \(k\) in \(n\) runs of the program follows a binomial distribution, \(k \sim B(n,p)\). We will use this model to estimate \(p\) and, after fixing the bug in whichever way we can, to confirm the hypothesis that it no longer exists.

By using this model we are implicitly assuming that all our tests are performed independently and identically. In other words: if the bug happens more often in one environment, we either always test in that environment or never do; if the bug gets more and more frequent the longer the computer is running, we reset the computer after each trial. If we don't do that, we are effectively estimating the value of \(p\) with trials from different experiments, while in truth each experiment has its own \(p\). We will find a single value anyway, but it has no meaning and can lead us to wrong conclusions.

Physical analogy

Another way of thinking about the model and the strategy is by creating a physical analogy with a box that has an unknown number of green and red balls:

  • Bernoulli trial: taking a single ball out of the box and looking at its color - if it is red, we have observed the bug, otherwise we haven't. We then put the ball back in the box.
  • \(n\): the total number of trials we have performed.
  • \(k\): the total number of red balls seen.
  • \(p\): the total number of red balls in the box divided by the total number of balls in the box.

Some things become clearer when we think about this analogy:

  • If we open the box and count the balls, we can know \(p\), in contrast with our original problem.
  • Without opening the box, we can estimate \(p\) by repeating the trial. As \(n\) increases, our estimate for \(p\) improves. Mathematically: \[p = \lim_{n\to\infty}\frac{k}{n}\]
  • Performing the trials in different conditions is like taking balls out of several different boxes. The results tell us nothing about any single box.

Estimating \(p\)

Before we try fixing anything, we have to know more about the bug, starting with the probability \(p\) of reproducing it. We can estimate this probability by dividing the number of times we see the bug \(k\) by the number of times we tested for it \(n\). Let's try that with our sample bug:

  $ ./hasbug
  67 -68
  $ ./hasbug
  79 -101
  $ ./hasbug
  error

We know from the source code that \(p=25\%\), but let's pretend that we don't, as will be the case with practically every non-deterministic bug. We tested 3 times, so \(k=1, n=3 \Rightarrow p \approx 33\%\), right? It would be better if we tested more, but how much more, and what exactly would be "better"?

\(p\) precision

Let's go back to our box analogy: imagine that there are 4 balls in the box, one red and three green. That means that \(p = 1/4\). What are the possible results when we test three times?

Red balls Green balls \(p\) estimate
0 3 0%
1 2 33%
2 1 66%
3 0 100%

The less we test, the coarser our estimate is. Roughly, the precision of our estimate of \(p\) is limited to \(1/n\) - in this case, 33%. That's both the step between the values we can find for \(p\) and the smallest nonzero value we can detect.

Testing more improves the precision of our estimate.

\(p\) likelihood

Let's now approach the problem from another angle: if \(p = 1/4\), what are the odds of seeing one error in four tests? Let's name the 4 balls as 0-red, 1-green, 2-green and 3-green:

Enumerating all possible results of taking 4 balls out of the box (with replacement) gives \(4^4=256\) rows, generated by this python script. The same script counts the number of red balls in each row and outputs the following table:

k rows %
0 81 31.64%
1 108 42.19%
2 54 21.09%
3 12 4.69%
4 1 0.39%

That means that, for \(p=1/4\), we see 1 red ball and 3 green balls only 42% of the time when getting out 4 balls.

What if \(p = 1/3\) - one red ball and two green balls? We would get the following table:

k rows %
0 16 19.75%
1 32 39.51%
2 24 29.63%
3 8 9.88%
4 1 1.23%

What about \(p = 1/2\)?

k rows %
0 1 6.25%
1 4 25.00%
2 6 37.50%
3 4 25.00%
4 1 6.25%
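
All three tables are just the binomial probability mass function evaluated at k = 0..4; a one-line check in R (my addition, not part of the original post):

round(100 * dbinom(0:4, size = 4, prob = 1/4), digits = 2)   # 31.64 42.19 21.09 4.69 0.39
round(100 * dbinom(0:4, size = 4, prob = 1/3), digits = 2)   # 19.75 39.51 29.63 9.88 1.23
round(100 * dbinom(0:4, size = 4, prob = 1/2), digits = 2)   #  6.25 25.00 37.50 25.00 6.25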

So, let's assume that you've seen the bug once in 4 trials. What is the value of \(p\)? You know that can happen 42% of the time if \(p=1/4\), but you also know it can happen 39% of the time if \(p=1/3\), and 25% of the time if \(p=1/2\). Which one is it?

The graph below shows the discrete likelihood, for each integer percentage value of \(p\), of getting 1 red and 3 green balls:

The fact is that, given the data, the estimate for \(p\) follows a beta distribution \(Beta(k+1, n-k+1) = Beta(2, 4)\) (1) The graph below shows the probability distribution density of \(p\):

The R script used to generate the first plot is here, the one used for the second plot is here.
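
The second plot boils down to the beta density stated above; a minimal version in R (my sketch, not the linked script):

curve(dbeta(x, shape1 = 2, shape2 = 4), from = 0, to = 1,
      xlab = "p", ylab = "density")   # Beta(k+1, n-k+1) with k = 1, n = 4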

Increasing \(n\), narrowing down the interval

What happens when we test more? We obviously increase our precision, as it is at most \(1/n\), as we said before - there is no way to estimate that \(p=1/3\) when we only test twice. But there is also another effect: the distribution for \(p\) gets taller and narrower around the observed ratio \(k/n\):

Investigation framework

So, which value will we use for \(p\)?

  • The smaller the value of \(p\), the more we have to test to reach a given confidence in the bug solution.
  • We must, then, choose the probability of error that we want to tolerate, and take the smallest value of \(p\) that we can.

    A usual value for the probability of error is 5% (2.5% on each side).
  • That means that we take the value of \(p\) that leaves 2.5% of the area of the density curve out on the left side. Let's call this value \(p_{min}\).
  • That way, if the observed \(k/n\) remains somewhat constant, \(p_{min}\) will rise, converging to the "real" \(p\) value.
  • As \(p_{min}\) rises, the amount of testing we have to do after fixing the bug decreases.

By using this framework we have direct, visual and tangible incentives to test more. We can objectively measure the potential contribution of each test.

In order to calculate \(p_{min}\) with the mentioned properties, we have to solve the following equation:

\[\sum_{i=0}^{k}{n\choose{i}}p_{min}^i(1-p_{min})^{n-i}=\frac{\alpha}{2} \]

\(\alpha\) here is twice the error we want to tolerate: 5% for an error of 2.5%.

That's not a trivial equation to solve for \(p_{min}\). Fortunately, that's the formula for the confidence interval of the binomial distribution, and there are a lot of sites that can calculate it:
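
In R, binom.test() computes the exact (Clopper-Pearson) binomial interval directly, and its lower end can serve as a conservative \(p_{min}\) (my illustration, not from the original post):

round(binom.test(x = 1, n = 4)$conf.int, digits = 3)   # roughly 0.006 to 0.806 for 1 sighting in 4 runs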

Is the bug fixed?

So, you have tested a lot and calculated \(p_{min}\). The next step is fixing the bug.

After fixing the bug, you will want to test again, in order to confirm that the bug is fixed. How much testing is enough testing?

Let's say that \(t\) is the number of times we test the bug after it is fixed. Then, if our fix is not effective and the bug still presents itself with a probability greater than the \(p_{min}\) that we calculated, the probability of not seeing the bug after \(t\) tests is at most:

\[\alpha = (1-p_{min})^t \]

Here, \(\alpha\) is also the probability of making a type I error, while \(1 - \alpha\) is the statistical significance of our tests.
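
Solving \(\alpha = (1-p_{min})^t\) for \(t\) gives the number of post-fix tests needed for a target significance; a small helper in R (my sketch, not from the original post):

## smallest t such that (1 - p_min)^t <= alpha
tests_after_fix <- function(p_min, alpha = 0.05) ceiling(log(alpha) / log(1 - p_min))
tests_after_fix(p_min = 0.12)   # 24, matching the 20-test example further down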

We now have two options:

  • arbitrarily determining a standard statistical significance and testing enough times to assert it.
  • test as much as we can and report the achieved statistical significance.

Both options are valid. The first one is not always feasible, as the cost of each trial can be high in time and/or other kind of resources.

The industry-standard choice is \(\alpha = 5\%\) (i.e., a statistical significance of 95%); we recommend either that or less.

Formally, this is very similar to statistical hypothesis testing.

Back to the Bug

Testing 20 times

This file has the results found after running our program 5000 times. We must never throw out data, but let's pretend that we have tested our program only 20 times. The observed \(k/n\) ratio and the calculated \(p_{min}\) evolved as shown in the following graph:

After those 20 tests, our \(p_{min}\) is about 12%.

Suppose that we fix the bug and test it again. The following graph shows the statistical significance corresponding to the number of tests we do:

In words: we have to test 24 times after fixing the bug to reach 95% statistical significance, and 35 to reach 99%.

Now, what happens if we test more before fixing the bug?

Testing 5000 times

Let's now use all the results and assume that we tested 5000 times before fixing the bug. The graph below shows \(k/n\) and \(p_{min}\):

After those 5000 tests, our \(p_{min}\) is about 23% - much closer to the real \(p\).

The following graph shows the statistical significance corresponding to the number of tests we do after fixing the bug:

We can see in that graph that after about 11 tests we reach 95%, and after about 16 we get to 99%. As we have tested more before fixing the bug, we found a higher \(p_{min}\), and that allowed us to test less after fixing the bug.

Optimal testing

We have seen that we decrease \(t\) as we increase \(n\), as that potentially increases our lower estimate for \(p\). Of course, that value can decrease as we test, but that means that we "got lucky" in the first trials and we are getting to know the bug better - the estimate is approaching the real value in a non-deterministic way, after all.

But, how much should we test before fixing the bug? Which value is an ideal value for \(n\)?

To define an optimal value for \(n\), we will minimize the sum \(n+t\). This objective gives us the benefit of minimizing the total amount of testing without compromising our guarantees. Minimizing the testing can be fundamental if each test costs significant time and/or resources.

The graph below shows us the evolution of the values of \(t\) and \(t+n\) using the data we generated for our bug:

We can see clearly that there are some low values of \(n\) and \(t\) that give us the guarantees we need. Those values are \(n = 15\) and \(t = 24\), which gives us \(t+n = 39\).

While you can use this technique to minimize the total number of tests performed (even more so when testing is expensive), testing more is always a good thing, as it always improves our guarantee, be it in \(n\) by providing us with a better \(p\) or in \(t\) by increasing the statistical significance of the conclusion that the bug is fixed. So, before fixing the bug, test until you see the bug at least once, and then at least the amount specified by this technique - but also test more if you can; there is no upper bound, especially after fixing the bug. You can then report a higher confidence in the solution.

Conclusions

When a programmer finds a bug that behaves in a non-deterministic way, he knows he should test enough to know more about the bug, and then even more after fixing it. In this article we have presented a framework that provides criteria to define numerically how much testing is "enough" and "even more." The same technique also provides a method to objectively measure the guarantee that the amount of testing performed provides, when it is not possible to test "enough."

We have also provided a real example (even though the bug itself is artificial) where the framework is applied.

As usual, the source code of this page (R scripts, etc.) can be found and downloaded at https://github.com/lpenz/lpenz.github.io

December 09, 2013 12:00 AM

December 01, 2013

Gregor Gorjanc

Read line by line of a file in R

Are you using R for data manipulation for later use with other programs, i.e., a workflow something like this:
  1. read data sets from a disk,
  2. modify the data, and
  3. write it back to a disk.
All fine, but if the data set is really big, you will soon stumble on memory issues. If the processing is simple and you can read the data in chunks, say line by line, then the following might be useful:

## File
file <- "myfile.txt"
 
## Create connection
con <- file(description=file, open="r")
 
## Hopefully you know the number of lines from some other source,
## or count them, e.g., with wc -l
com <- paste("wc -l ", file, " | awk '{ print $1 }'", sep="")
n <- as.integer(system(command=com, intern=TRUE))
 
## Loop over a file connection
for(i in 1:n) {
  tmp <- scan(file=con, nlines=1, quiet=TRUE)
  ## do something on a line of data
}
close(con)
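
If you do not know the number of lines up front, a while loop over readLines() does the same job without the wc -l call (a variation of mine, not from the original post):

con <- file(description=file, open="r")
while(length(line <- readLines(con, n=1)) > 0) {
  ## 'line' holds one line as a character string - do something with it
}
close(con)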

by Gregor Gorjanc (noreply@blogger.com) at December 01, 2013 10:55 PM

August 13, 2013

Gregor Gorjanc

Setup up the inverse of additive relationship matrix in R

Additive genetic covariance between individuals is one of the key concepts in (quantitative) genetics. When predicting additive genetic values for pedigree members, we need the inverse of the so-called numerator relationship matrix (NRM), or simply A. Matrix A has off-diagonal entries equal to the numerator of Wright's relationship coefficient and diagonal elements equal to 1 + the inbreeding coefficient. I have blogged before about setting up such an inverse in R using a routine from the ASReml-R program or importing the inverse from the CFC program. However, this is not the only way to "skin this cat" in R. I am aware of the following attempts to provide this feature in R for various purposes (the list is probably incomplete and I would be grateful if you point me to other implementations):
  • pedigree R package has functions makeA() and makeAinv() with obvious meanings; there is also calcG() if you have a lot of marker data instead of pedigree information; there are also some other very handy functions such as calcInbreeding(), orderPed(), trimPed(), etc.
  • pedigreemm R package does not have a direct implementation to get the A inverse, but it has all the needed ingredients, which makes the package even more interesting
  • MCMCglmm R package has function inverseA(), which works with pedigree or phylo objects; there are also handy functions such as prunePed(), rbv(), sm2asreml(), etc.
  • kinship and kinship2 R packages have function kinship() to set up the kinship matrix, which is equal to half of A; there are also nice functions for plotting pedigrees, etc. (see also here)
  • see also a series of R scripts for relationship matrices 
As I described before, the interesting thing is that setting up the inverse of A is easier and cheaper than setting up A and inverting it. This is very important for large applications. This is an old result based on the following matrix theory. We can decompose a symmetric positive definite matrix as A = LU = LL' (Cholesky decomposition) or as A = LDU = LDL' (generalized Cholesky decomposition), where L (U) is lower (upper) triangular and D is a diagonal matrix. Note that L and U in the previous two equations are not the same thing (L from the Cholesky is not equal to L from the generalized Cholesky decomposition)! Sorry for the sloppy notation. To confuse you even more, note that Henderson usually wrote A = TDT'. We can even do A = LSSU, where the diagonal of S is equal to the square root of the diagonal of D. This gets us back to the A = LU = LL' form, since LSSU = LSSL' = LSS'L' = LS(LS)' (beware of the sloppy notation: this "L" is now LS)!

The inverse rule says that inv(A) = inv(LDU) = inv(U) inv(D) inv(L) = inv(L)' inv(D) inv(L) = inv(L)' inv(S) inv(S) inv(L). I thank Martin Maechler for pointing out the last (obvious) bit to me. In Henderson's notation this would be inv(A) = inv(T)' inv(D) inv(T) = inv(T)' inv(S) inv(S) inv(T). Uf ...

The important bit is that with the NRM (aka A), inv(L) has a nice simple structure: it shows the directed graph of additive genetic values in the pedigree, while inv(D) tells us about the precision (inverse variance) of additive genetic values given the additive genetic values of the parents, and therefore depends on knowledge of the parents and their inbreeding (the more they are inbred, the less variation we can expect in their progeny). Both inv(L) and inv(D) are easier to set up.

Packages MCMCglmm and pedigree give us inv(A) directly (we can also get inv(D) in MCMCglmm), but pedigreemm enables us to play around with the above matrix algebra and graph theory. First we need a small example pedigree. Below is an example with 10 members; there is some inbreeding, and individuals have both, one, or no parents known. It is hard to see the inbreeding directly from the table, but we will improve on that later (see also here).

ped <- data.frame( id=c(  1,   2,   3,   4,   5,   6,   7,   8,   9,  10),
                  fid=c( NA,  NA,   2,   2,   4,   2,   5,   5,  NA,   8),
                  mid=c( NA,  NA,   1,  NA,   3,   3,   6,   6,  NA,   9))

Now we will create an object of a pedigree class and show the A = U'U stuff:

## install.packages(pkgs="pedigreemm")
library(package="pedigreemm")
 
ped2 <- with(ped, pedigree(sire=fid, dam=mid, label=id))
 
U <- relfactor(ped2)
A <- crossprod(U)
 
round(U, digits=2)
## 10 x 10 sparse Matrix of class "dtCMatrix"
## [1,] 1 . 0.50 . 0.25 0.25 0.25 0.25 . 0.12
## [2,] . 1 0.50 0.50 0.50 0.75 0.62 0.62 . 0.31
## [3,] . . 0.71 . 0.35 0.35 0.35 0.35 . 0.18
## [4,] . . . 0.87 0.43 . 0.22 0.22 . 0.11
## [5,] . . . . 0.71 . 0.35 0.35 . 0.18
## [6,] . . . . . 0.71 0.35 0.35 . 0.18
## [7,] . . . . . . 0.64 . . .
## [8,] . . . . . . . 0.64 . 0.32
## [9,] . . . . . . . . 1 0.50
## [10,] . . . . . . . . . 0.66

## To check
U - chol(A)

round(A, digits=2)
## 10 x 10 sparse Matrix of class "dsCMatrix"
## [1,] 1.00 . 0.50 . 0.25 0.25 0.25 0.25 . 0.12
## [2,] . 1.00 0.50 0.50 0.50 0.75 0.62 0.62 . 0.31
## [3,] 0.50 0.50 1.00 0.25 0.62 0.75 0.69 0.69 . 0.34
## [4,] . 0.50 0.25 1.00 0.62 0.38 0.50 0.50 . 0.25
## [5,] 0.25 0.50 0.62 0.62 1.12 0.56 0.84 0.84 . 0.42
## [6,] 0.25 0.75 0.75 0.38 0.56 1.25 0.91 0.91 . 0.45
## [7,] 0.25 0.62 0.69 0.50 0.84 0.91 1.28 0.88 . 0.44
## [8,] 0.25 0.62 0.69 0.50 0.84 0.91 0.88 1.28 . 0.64
## [9,] . . . . . . . . 1.0 0.50
## [10,] 0.12 0.31 0.34 0.25 0.42 0.45 0.44 0.64 0.5 1.00

Note that the pedigreemm package uses Matrix classes in order to store only what we need to store, e.g., matrix U is triangular (t in "dtCMatrix") and matrix A is symmetric (s in "dsCMatrix"). To show the generalized Cholesky A = LDU (or using Henderson notation A = TDT') we use gchol() from the bdsmatrix R package. Matrix T shows the "flow" of genes in the pedigree.

## install.packages(pkgs="bdsmatrix")
library(package="bdsmatrix")
tmp <- gchol(as.matrix(A))
D <- diag(tmp)
(T <- as(as.matrix(tmp), "dtCMatrix"))
## 10 x 10 sparse Matrix of class "dtCMatrix"
## [1,] 1.000 . . . . . . . . .
## [2,] . 1.0000 . . . . . . . .
## [3,] 0.500 0.5000 1.00 . . . . . . .
## [4,] . 0.5000 . 1.000 . . . . . .
## [5,] 0.250 0.5000 0.50 0.500 1.00 . . . . .
## [6,] 0.250 0.7500 0.50 . . 1.00 . . . .
## [7,] 0.250 0.6250 0.50 0.250 0.50 0.50 1 . . .
## [8,] 0.250 0.6250 0.50 0.250 0.50 0.50 . 1.0 . .
## [9,] . . . . . . . . 1.0 .
## [10,] 0.125 0.3125 0.25 0.125 0.25 0.25 . 0.5 0.5 1

## To check
L <- T %*% diag(sqrt(D))
L - t(U)
Now the A inverse part: inv(A) = inv(T)' inv(D) inv(T) = inv(T)' inv(S) inv(S) inv(T), using Henderson's notation. The nice thing is that the pedigreemm authors provided functions to get inv(T) and D.

(TInv <- as(ped2, "sparseMatrix"))
## 10 x 10 sparse Matrix of class "dtCMatrix" (unitriangular)
## 1 1.0 . . . . . . . . .
## 2 . 1.0 . . . . . . . .
## 3 -0.5 -0.5 1.0 . . . . . . .
## 4 . -0.5 . 1.0 . . . . . .
## 5 . . -0.5 -0.5 1.0 . . . . .
## 6 . -0.5 -0.5 . . 1.0 . . . .
## 7 . . . . -0.5 -0.5 1 . . .
## 8 . . . . -0.5 -0.5 . 1.0 . .
## 9 . . . . . . . . 1.0 .
## 10 . . . . . . . -0.5 -0.5 1
round(DInv <- Diagonal(x=1/Dmat(ped2)), digits=2)
## 10 x 10 diagonal matrix of class "ddiMatrix"
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 . . . . . . . . .
## [2,] . 1 . . . . . . . .
## [3,] . . 2 . . . . . . .
## [4,] . . . 1.33 . . . . . .
## [5,] . . . . 2 . . . . .
## [6,] . . . . . 2 . . . .
## [7,] . . . . . . 2.46 . . .
## [8,] . . . . . . . 2.46 . .
## [9,] . . . . . . . . 1 .
## [10,] . . . . . . . . . 2.33
 
round(t(TInv) %*% DInv %*% TInv, digits=2)
## 10 x 10 sparse Matrix of class "dgCMatrix"
## ...

round(crossprod(sqrt(DInv) %*% TInv), digits=2)
## 10 x 10 sparse Matrix of class "dsCMatrix"
##  [1,]  1.5  0.50 -1.0  .     .     .     .     .     .     .   
## [2,] 0.5 2.33 -0.5 -0.67 . -1.00 . . . .
## [3,] -1.0 -0.50 3.0 0.50 -1.00 -1.00 . . . .
## [4,] . -0.67 0.5 1.83 -1.00 . . . . .
## [5,] . . -1.0 -1.00 3.23 1.23 -1.23 -1.23 . .
## [6,] . -1.00 -1.0 . 1.23 3.23 -1.23 -1.23 . .
## [7,] . . . . -1.23 -1.23 2.46 . . .
## [8,] . . . . -1.23 -1.23 . 3.04 0.58 -1.16
## [9,] . . . . . . . 0.58 1.58 -1.16
## [10,] . . . . . . . -1.16 -1.16 2.33

## To check
solve(A) - crossprod(sqrt(DInv) %*% TInv)

The second method (using crossprod) is preferred as it leads directly to a symmetric matrix (dsCMatrix), which stores only the upper or lower triangle. And make sure you do not do crossprod(TInv %*% sqrt(DInv)), as that is the wrong order of matrices.

As promised, we will display (plot) the pedigree by converting the matrix objects to graph objects with the following code. Two examples are provided, using the graph and igraph packages. The former does a very good job on this example, but otherwise igraph seems to have much nicer support for editing etc.

## source("http://www.bioconductor.org/biocLite.R")
## biocLite(pkgs=c("graph", "Rgraphviz"))
library(package="graph")
library(package="Rgraphviz")
g <- as(t(TInv), "graph")
plot(g)



## install.packages(pkgs="igraph")
library(package="igraph")
i <- igraph.from.graphNEL(graphNEL=g)
V(i)$label <- 1:10
plot(i, layout=layout.kamada.kawai)
## tkplot(i)

by Gregor Gorjanc (noreply@blogger.com) at August 13, 2013 02:28 PM

July 02, 2013

Gregor Gorjanc

Parse arguments of an R script

R can also be used as a scripting tool. We just need to add a shebang in the first line of a file (script):

#!/usr/bin/Rscript

and then the R code should follow.

Often we want to pass arguments to such a script, which can be collected in the script by the commandArgs() function. Then we need to parse the arguments and, depending on them, do something. I came up with a rather general way of parsing these arguments using just these few lines:

## Collect arguments
args <- commandArgs(TRUE)
 
## Default setting when no arguments passed
if(length(args) < 1) {
args <- c("--help")
}
 
## Help section
if("--help" %in% args) {
cat("
The R Script
 
Arguments:
--arg1=someValue - numeric, blah blah
--arg2=someValue - character, blah blah
--arg3=someValue - logical, blah blah
--help - print this text
 
Example:
./test.R --arg1=1 --arg2="
output.txt" --arg3=TRUE \n\n")
 
q(save="no")
}
 
## Parse arguments (we expect the form --arg=value)
parseArgs <- function(x) strsplit(sub("^--", "", x), "=")
argsDF <- as.data.frame(do.call("rbind", parseArgs(args)))
argsL <- as.list(as.character(argsDF$V2))
names(argsL) <- argsDF$V1
 
## Arg1 default
if(is.null(argsL$arg1)) {
## do something
}
 
## Arg2 default
if(is.null(argsL$arg2)) {
## do something
}
 
## Arg3 default
if(is.null(argsL$arg3)) {
## do something
}
 
## ... your code here ...
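
For example, simulating the call from the help text (and reusing the parseArgs() helper above) shows what argsL ends up holding; note that all values come back as character strings, so convert with as.numeric()/as.logical() as needed (my illustration, not part of the original script):

args <- c("--arg1=1", "--arg2=output.txt", "--arg3=TRUE")
argsDF <- as.data.frame(do.call("rbind", parseArgs(args)))
argsL <- as.list(as.character(argsDF$V2))
names(argsL) <- argsDF$V1
str(argsL)
## List of 3
##  $ arg1: chr "1"
##  $ arg2: chr "output.txt"
##  $ arg3: chr "TRUE"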

It is some work, but I find it pretty neat and have been using it for quite a while now. I do wonder what others have come up with for this task. I hope I did not miss some very general solution.

by Gregor Gorjanc (noreply@blogger.com) at July 02, 2013 04:55 PM

March 24, 2013

Romain Francois

Moving


This blog is moving to blog.r-enthusiasts.com. The new one is powered by wordpress and gets a subdomain of r-enthusiasts.com.

See you there

by romain francois at March 24, 2013 03:52 PM